Like other AWS services, AWS IoT provides useful CloudWatch metrics, but it does not provide latency-related metrics such as how long it takes to find matching rules or how long it takes to start and complete the rules' actions. In this blog post, I'd like to show how we can derive those metrics by analyzing AWS IoT CloudWatch Logs.
First, we need to fulfill a few prerequisites. Start by enabling IoT CloudWatch Logs, since it is disabled by default. Go to the "IoT Core" console and click the "Settings" menu at the bottom of the left menu bar. In the "Logs" section (shown below), click "Edit."
On the "Configure role setting" page, change "Level of verbosity" to "Debug (most verbose)" and click "Create new" to create a role that permits logging. In the "Create a new role" pop-up window, enter the role name you want and click the "Create role" button. Once done, click the "Update role" link and then the "Update" button to finish. The "Logs" section will now show that logging is enabled.
After this, if you see the following message, click ‘Upgrade.’
In addition to enabling the logs, we need to enable event-based messages. In the next section of the same page, "Event-based messages," click the "Edit" button and enable the first two "Job"-related events by clicking their checkboxes. Click "Save," and you should now see that those two events are enabled (see below).
While you're on the "Settings" page, make a note of the "Endpoint" value for future use. It's located in the first section, "Custom endpoint."
The CloudFormation stack we create below requires the CloudWatch Log Group to exist, so let's initialize it by sending a random message to a dummy topic. Click the "Test" menu in the same left menu bar of the "IoT Core" console and you will see the "Publish" section as below. Enter any string as the topic and click "Publish to topic" to send a message to that dummy topic, which initializes the Log Group.
Now go to the "CloudWatch" console and click the "Logs" menu in the left menu bar. Confirm that you can find the "AWSIotLogsV2" Log Group. (If you don't see it, verify that "Level of verbosity" is "Debug (most verbose)" in the "Settings" of the IoT Core console.)
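Alternatively, if you have the AWS CLI configured for the same region, the one-off message can be published from the command line. This is a sketch: the topic name and payload are arbitrary, and the `--cli-binary-format` flag is only needed with AWS CLI v2.

```shell
# Publish a throwaway message so AWS IoT creates the log group
$ aws iot-data publish --topic dummy \
    --payload '{"msg": "hello"}' --cli-binary-format raw-in-base64-out

# Verify that the log group now exists
$ aws logs describe-log-groups --log-group-name-prefix AWSIotLogsV2
```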
Once all of the above preparation is complete, create a CloudFormation stack to deploy the log analyzer along with some sample IoT topics that will trigger rule actions, whose logs will be analyzed for latencies.
Please click the button below to create a stack using CloudFormation:
- In the 'Select Template' page, click 'Next.'
- In the 'Specify Details' page, leave the default values for the stack name and parameters, then click 'Next.'
- In the 'Options' page, click 'Next.'
- In the 'Review' page, check all checkboxes in 'Capabilities,' click the 'Create Change Set' button to transform the SAM (Serverless Application Model) template into a standard CloudFormation template, and then click 'Execute.'
Once the stack is successfully created, choose the newly created stack, 'sungardas-iot-analyzer,' and click the 'Outputs' tab below. Write down the values of "LogAnalyzerAttrStreamArn" and "LogAnalyzerShadowStreamArn," which will be used when we register IoT devices and send messages.
We've created four different IoT action rules to compare their latencies. Click the "Act" menu in the left menu bar of the "IoT Core" console and you'll see them as below.
- lambda: This rule sends messages to a Lambda function that feeds them into a Kinesis stream.
- shadow: This rule finds the target Kinesis stream in the sender device's shadow document and feeds the messages into that stream.
- attr: This rule finds the target Kinesis stream in the sender device's attributes and feeds the messages into that stream.
- direct: This rule feeds the messages directly into a pre-defined Kinesis stream.
Click each rule and check the detail for more information.
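For reference, the SQL statement behind a topic rule of the 'direct' kind might look something like the following. The topic name here is hypothetical; check each rule's detail page for the actual statement and the attached Kinesis action.

```sql
-- Hypothetical topic rule SQL: forward every message published on
-- this topic, unchanged, to the rule's Kinesis action.
SELECT * FROM 'iot/direct'
```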
Next, let's register the devices that will send messages to trigger the four rules above. To get the registration script, clone this GitHub repository:
$ git clone https://github.com/SungardAS/iot-log-analyzer.git
Once completed, move into the "iot-log-analyzer" folder. Go to the "register" folder and run the commands below to register the devices with AWS IoT after installing the AWS IoT Python SDK. When you run 'register.py', set the same AWS region as the one in which the stack was created.
$ pip install -r requirements.txt
$ python register.py -e <iot_rest_api_endpoint> -r principals/root-CA.crt -ak <attr_kinesis_stream_arn> -sk <shadow_kinesis_stream_arn>
Here,
- <iot_rest_api_endpoint>: the "Endpoint" value you noted, like "xxx.iot.xxx.amazonaws.com"
- <attr_kinesis_stream_arn>: the "LogAnalyzerAttrStreamArn" value in the stack's Outputs
- <shadow_kinesis_stream_arn>: the "LogAnalyzerShadowStreamArn" value in the stack's Outputs
Once the command completes successfully, you will see the registered devices' certificates and private keys under the "principals" folder. The script registers four devices, which will send messages to four different topics that trigger the four action rules we reviewed above. You can see those devices under "Manage" > "Things" in the "IoT Core" console, as below.
Click each device and check the detail for more information.
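For reference, registering one device along these lines can be sketched with the AWS CLI as follows. The thing name, output file names, and attribute key are hypothetical, and the actual 'register.py' script does more (for example, policies and shadow documents).

```shell
# Create an active certificate and key pair for the device
$ aws iot create-keys-and-certificate --set-as-active \
    --certificate-pem-outfile device.cert.pem \
    --private-key-outfile device.private.key

# Create the thing with an attribute pointing at the target stream
$ aws iot create-thing --thing-name device_attr \
    --attribute-payload 'attributes={stream=<attr_kinesis_stream_arn>}'

# Attach the certificate to the thing
$ aws iot attach-thing-principal --thing-name device_attr \
    --principal <certificate_arn>
```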
Everything should now be set up, so it's time to fire some messages. Once messages are fired, the analyzer will parse the IoT logs and create three CloudWatch metrics for each topic, as below.
This metric stores the time in milliseconds from when a message is published to when a matching rule is found.
- RuleMatch_StartingRuleExecution
This metric stores the time in milliseconds from when a matching rule is found to when the rule's action starts.
- StartingRuleExecution_RuleExecution
This metric stores the time in milliseconds that the rule action takes to complete.
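To make the three measurements concrete, here is a minimal sketch of the delta computation the analyzer performs. The event names, timestamps, and the first metric's name ("Publish_RuleMatch") are illustrative stand-ins, not the actual AWSIotLogsV2 schema.

```python
from datetime import datetime, timedelta

# Hypothetical, simplified view of one message's lifecycle events
# as they might appear in the IoT logs (timestamps are made up).
log_events = [
    ("2019-01-01 00:00:00.000", "Publish-In"),
    ("2019-01-01 00:00:00.015", "RuleMatch"),
    ("2019-01-01 00:00:00.065", "StartingRuleExecution"),
    ("2019-01-01 00:00:00.090", "RuleExecution"),
]

FMT = "%Y-%m-%d %H:%M:%S.%f"

def ms_between(t_start, t_end):
    """Elapsed milliseconds between two log timestamps."""
    delta = datetime.strptime(t_end, FMT) - datetime.strptime(t_start, FMT)
    return delta / timedelta(milliseconds=1)

# Index the events by name, then compute the three latency metrics.
times = {event: ts for ts, event in log_events}
metrics = {
    "Publish_RuleMatch": ms_between(times["Publish-In"], times["RuleMatch"]),
    "RuleMatch_StartingRuleExecution": ms_between(times["RuleMatch"], times["StartingRuleExecution"]),
    "StartingRuleExecution_RuleExecution": ms_between(times["StartingRuleExecution"], times["RuleExecution"]),
}
print(metrics)
# → {'Publish_RuleMatch': 15.0, 'RuleMatch_StartingRuleExecution': 50.0, 'StartingRuleExecution_RuleExecution': 25.0}
```

The real analyzer then pushes each value to CloudWatch as a custom metric with the topic name as a dimension.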
To fire messages, stay in the same "register" folder under "iot-log-analyzer" and run the command below, which sends messages to the four different topics every 20 seconds.
$ python sender.py -e <iot_rest_api_endpoint> -r principals/root-CA.crt
Here, <iot_rest_api_endpoint> is the same "Endpoint" value used with 'register.py'.
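The metric data can also be pulled with the AWS CLI rather than the console. A sketch: the namespace 'IoTEvent' and dimension 'TopicName' are as they appear in the console, while the topic value and time range are placeholders to fill in.

```shell
$ aws cloudwatch get-metric-statistics \
    --namespace IoTEvent \
    --metric-name RuleMatch_StartingRuleExecution \
    --dimensions Name=TopicName,Value=<topic_name> \
    --start-time 2019-01-01T00:00:00Z --end-time 2019-01-01T01:00:00Z \
    --period 60 --statistics Average
```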
While the script is sending messages, let's look at the generated metrics in the CloudWatch console. In the console, select the "Metrics" menu in the left menu bar and you'll see a custom namespace called 'IoTEvent' in the metrics list on the right. After choosing it, click "TopicName" and it will display the three new metrics above for each topic. To see the metric data, select all of the metrics, move to the "Graphed metrics (12)" tab, and change the "Period" of all metrics to "1 Second" for quicker updates, as below.
The metrics graph will display all of the latency information at the top of the page. As an example, uncheck the metrics, check only the "RuleMatch_StartingRuleExecution" metric of each topic, and hover the mouse over the graph. It will show the actual latency values, as below.
As you can see at the top right, the latency values of 'attr' and 'shadow' (115 and 73.0, in the red box) are much higher than those of 'lambda' and 'direct.' This is because the 'attr' and 'shadow' rules must look up their Kinesis stream information in the device's attributes and shadow document, respectively, before they trigger the action. Because this kind of data is not available in the current IoT CloudWatch metrics, the custom metrics we generate by analyzing the IoT logs help trace and monitor these specific latencies.
In summary, we've built a system that generates custom CloudWatch metrics related to the latencies of IoT rule action execution by analyzing IoT logs, and I showed one example of useful latency metric data. Please check the open source project site, https://github.com/SungardAS/iot-log-analyzer, for all of the resources, including the SAM templates and code.
Alex is a CTO Architect in the CTO Architecture Team, which is responsible for researching and evaluating innovative technologies and processes. He has been working at Sungard Availability Services for more than 14 years.
Before joining this team, he worked on developing a public/private cloud orchestration platform and integrating various applications to automate processes at managed hosting service companies, including Sungard Availability Services. Prior to that, he worked at Samsung for 9 years, developing a reporting tool and an RDBMS engine.