The AWS Lambda (Lambda) compute service, built to automatically scale applications or backend services in response to events, can play a major role in implementing, as well as re-engineering, the big data solutions popularly associated with the Hadoop framework. Lambda is not a complete substitute for the evolving Hadoop framework; rather, it can be a powerful alternative to some components in a big data architecture.

Popular and commonly implemented big data use cases fall under behavioral analytics, predictive analytics, customer sentiment analysis, customer segmentation, and fraud detection. Social networks such as Facebook, Google+, Twitter, and YouTube are major data sources, along with website and mobile application interactions.

Typically, the solution can be divided into four stages: collect, store, process, and analyze, with different AWS services and flows joining them together. Lambda can be used in all four stages; however, it plays its most significant role in the store and process stages, combined with other components in the architecture.

Here are some of the top features and benefits that make Lambda worth considering in an enterprise big data solution architecture:

  • Eliminating the need to provision infrastructure for large-scale distributed processing, which is traditionally handled by the Master-Slave-Task instances of the Hadoop framework.
  • Event-handler application code, written in a number of industry-standard programming languages, replacing the traditional map-reduce style of programming.
  • Ability to trigger on events, including real-time streaming, integrated with services such as Amazon DynamoDB, S3, Kinesis streams, SNS, and CloudWatch.
  • On-demand service, scalability, elasticity, pay-per-use (for capex reduction), security, overall processing cost, and the ability to support almost unlimited volume and velocity are what attract many companies to adopt or try out big data solutions on the AWS platform. Typically, the AWS big data solutions platform revolves around Amazon EMR (EMR); however, Lambda is by design engineered to meet these requirements as well.
  • AWS Lambda and Amazon Kinesis can be used to process real-time streaming data for application activity tracking, transaction order processing, click stream analysis, data cleansing, metrics generation, log filtering, indexing, social media analysis and IoT device data processing.
  • NoOps – Lambda functions can also be invoked on demand, and all the compute capacity and resources needed are managed automatically by AWS, which spins up the necessary infrastructure and manages code deployment and execution.
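To make the event-handler model in the list above concrete, here is a minimal sketch of a Lambda handler for Amazon Kinesis stream events. Kinesis delivers records base64-encoded under `event['Records'][i]['kinesis']['data']`; the handler name, the JSON payload format, and the returned summary are illustrative assumptions, not part of the original example.

```python
import base64
import json

def handler(event, context):
    """Minimal sketch of a Lambda handler for Kinesis events (illustrative).

    Each record's payload arrives base64-encoded; this handler decodes it,
    parses it as JSON (an assumption about the producer), and counts the
    records processed.
    """
    processed = 0
    for record in event['Records']:
        payload = base64.b64decode(record['kinesis']['data'])
        data = json.loads(payload)  # assumes the producer sends JSON
        # ... application-specific processing of `data` would go here ...
        processed += 1
    return {'processed': processed}

# A hand-built sample event with two base64-encoded JSON records:
sample = {'Records': [
    {'kinesis': {'data': base64.b64encode(json.dumps({'click': i}).encode()).decode()}}
    for i in range(2)
]}
result = handler(sample, None)
```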

A simple word count test was conducted using EMR as well as Lambda, using input from Twitter. Some comparison information between the two approaches is documented below.

Word Count Test

Process the input file(s) in one S3 bucket and summarize each word and its count in another bucket. Another Lambda function was used to collect the data from Twitter, retrieving the latest 1,000 tweets on the status timeline and storing them in the S3 bucket used as input for this processing.

Option 1: Simple Reference Architecture with AWS Lambda
The following diagram depicts a simple use case of a social media or log file analysis in which the data text files are available in an S3 bucket as input and processed by Lambda functions, with the processed results being written to another S3 bucket. Typically, this can be done as a two-step process similar to EMR, with a Map-Scatter function to distribute the inputs and a Reduce-Gather function to process them. The code snippets below provide an alternative by doing both in the same function.
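The two-step Map-Scatter/Reduce-Gather flow mentioned above can be sketched in plain Python to show the shape of the split; the function names and the idea of each chunk mapping to its own Lambda invocation are illustrative assumptions, not code from the original post.

```python
from collections import Counter

def map_scatter(chunk):
    """Map step: turn one chunk of text into per-word counts."""
    return Counter(word.lower() for word in chunk.split())

def reduce_gather(partial_counts):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# In the two-step version, each chunk could be handled by its own Lambda
# invocation, with a final invocation merging the partial results. The
# single-function alternative used below simply runs both steps together.
chunks = ["to be or not to be", "to see or not to see"]
result = reduce_gather(map_scatter(c) for c in chunks)
```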


Lambda function-1: Collect data from Twitter

For this function, select Lambda, click New function, and skip the blueprint to get to the configure screen. These are the fields that need to be filled in:

Name - load_from_twitter_status
Runtime – Node.js
Code entry type – Upload .ZIP file
Handler -   index.handler
Role – You can create a new role, “s3 execution role,” or select an existing role. In both cases, ensure the required (or all) S3 actions are available to create objects in an S3 bucket.
Memory – Leave the default 128 MB.
Timeout – Allow sufficient time; 0 minutes and 30 seconds, or up to 59 seconds, can be a good start.
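As an illustration of what the “s3 execution role” above might grant, here is a minimal policy sketch. The bucket name `myBucket` is a placeholder, and the CloudWatch Logs statement is the usual addition so the function can write its logs; scope the actions and resources to your own setup.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::myBucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```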

Node zip instructions:

The zip file should be created from inside the Node project directory so that index.js, package.json, and node_modules are at the top level of the archive, alongside any other files, as needed.

Here are some snippets from index.js that can be useful in building this function.

// The exported handler is the entry point Lambda invokes for this function
var AWS = require('aws-sdk');
var twitter = require('twitter');

exports.handler = function(event, context) {
    // Obtain Twitter application authorization details (credentials redacted)
    var twit = new twitter({
        consumer_key: 'abc...',
        consumer_secret: 'epE...',
        access_token_key: '3216...-Qza...',
        access_token_secret: 'jn...'
    });

    var s3bucket = new AWS.S3();

    doCall(twit, function(response) {
        // Convert the response to a String (or another data type allowed by S3 putObject)
        var twitoutput = JSON.stringify(response);
        // Set the name of the object that gets generated; the "sub-bucket" is a key prefix
        var s3key = 'input/' + 'objectname...';
        var params = {Bucket: 'myBucket', Key: s3key, Body: twitoutput};
        s3bucket.putObject(params, function(err, data) {
            console.log("Entered put bucket function");
            if (err) {
                console.log("Error uploading data: ", err);
            } else {
                console.log("Successfully uploaded data to bucket/sub-bucket/");
            }
            // context.done() allows the S3 operation to complete; context.succeed()
            // might end up closing the function before the feedback arrives from S3
            context.done();
        });
    });
};

function doCall(twit, callback) {
    // Call the Twitter API and return the required tweets
    var params = {count: 1000};
    twit.get('statuses/home_timeline', params, function(error, tweets, response) {
        if (!error) {
            return callback(tweets);
        }
        var fail = "failed";
        return callback(fail);
    });
}
Function execution time: REPORT RequestId: 0b93ed00-cc72-11e5-a52c-75fb3dd236ce Duration: 8848.41 ms Billed Duration: 8900 ms Memory Size: 128 MB Max Memory Used: 36 MB


Lambda function-2: Process data word count

In Lambda, click New function and choose the blueprint s3-get-object-python.

Choose the S3 bucket for input files and Event Type – Object created (All)

Name - load_from_twitter_status
Runtime – Python 2.7
Code entry type – Edit code inline
Handler -   lambda_function.lambda_handler
Role – You can create a new role, “s3 execution role,” or select an existing role. In both cases, ensure the required (or all) S3 actions are available to create objects in an S3 bucket.
Memory – Leave the default 128 MB
Timeout – Allow sufficient time; 0 minutes and 30 seconds, or up to 59 seconds, can be a good start.

from __future__ import print_function
from time import strftime
import json
import urllib
import boto3
import re

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        data = response['Body'].read()
        pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
        wordcount = {}
        for word in pattern.findall(data):
            lowerword = word.lower()
            if lowerword not in wordcount:
                wordcount[lowerword] = 0
            wordcount[lowerword] += 1

        y = ""
        words = sorted(wordcount.keys())
        for word in words:
            print("%-10s %d" % (word, wordcount[word]))
        for word in words:
            y = y + "{}\t{}\n".format(word, wordcount[word])
        timekey = strftime("%Y-%m-%d %H:%M:%S")
        s3r.Bucket('my-target-bucket-name').put_object(Key=timekey, Body=y)
    except Exception as e:
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

Function execution time: REPORT RequestId: 10920cac-cdc0-11e5-aa64-e7de12986527 Duration: 1623.36 ms Billed Duration: 1700 ms Memory Size: 128 MB Max Memory Used: 31 MB


Option 2: EMR processing

The following diagram depicts a simple use case of log file analysis in which the data text files are available in an Amazon S3 bucket as input and processed by an Amazon EMR cluster, with the processed data being written to another S3 bucket.


In the AWS console, click EMR, click “Create Cluster,” then click “Go to advanced options.”

Leave the software configuration defaults as is, select the streaming program in the Add steps option, and click Configure.




Accept all defaults in the General options screen

EC2 key pair: Accept the default or choose one of your keys


Step execution details (from the controller log file):

2016-02-06T03:01:44.084Z INFO Step created jobs: job_1454727481331_0001

2016-02-06T03:01:44.084Z INFO Step succeeded with exitCode 0 and took 66 seconds

Tasks for: s-2EOXKPVVPA3TU, Job 1454727481331_0001

Task summary: 23 total tasks – 23 completed (17 Map/6 Reduce), 0 running, 0 failed, 0 pending, 0 cancelled.



The table below summarizes the comparison from the two approaches:


Storage
  EMR approach: S3 buckets for input and output
  Lambda approach: S3 buckets for input and output

Processing
  EMR approach: 3 m3.xlarge EC2 instances (1 master and 2 core instances), 2 security groups for master and core, 1 key for accessing the instances
  Lambda approach: 1 Lambda function for processing the input and storing the output, and 1 role with a policy allowing reads from and writes to S3 (suggest prefixing the bucket name to allow all S3 actions)

Cost model
  EMR approach: Charges are incurred for all the master and core instances in the EMR cluster
  Lambda approach: No administration cost; charges accrue only for the compute time consumed while the code is running

Time for processing
  EMR approach: 66 seconds, with 17 map jobs and 6 reduce jobs
  Lambda approach: Max 2200 ms (about 2 seconds)

Cost comparison
  EMR approach: For the standard 3 instances at 25% utilization in a month, m3.xlarge costs $192 (or m1.large, $120); for 2 instances it will be 2/3 of the cost
  Lambda approach: 10,950 requests, assuming 3000 ms each with 128 MB of memory, will cost less than $1
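The Lambda figure in the cost comparison can be checked with back-of-the-envelope arithmetic. The per-request and per-GB-second rates below are assumptions based on AWS's published Lambda pricing at the time of writing ($0.20 per million requests and $0.00001667 per GB-second), ignoring the monthly free tier.

```python
# Back-of-the-envelope check of the Lambda cost claim in the table.
# Pricing figures are assumptions from AWS's public rate card:
REQUEST_PRICE = 0.20 / 1000000.0   # dollars per request
GB_SECOND_PRICE = 0.00001667       # dollars per GB-second

def lambda_cost(requests, duration_ms, memory_mb):
    """Estimate a Lambda bill for a given request volume and duration."""
    gb_seconds = requests * (duration_ms / 1000.0) * (memory_mb / 1024.0)
    return requests * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# 10,950 requests at 3000 ms each with 128 MB, as in the table:
cost = lambda_cost(10950, 3000, 128)
```

Under these assumed rates the total comes to roughly seven cents, comfortably under the “less than $1” claim.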

Note: Typically, for huge volumes of time-sensitive real-time streaming data, tools like Amazon Kinesis can be used in combination with Lambda. Similarly, storage choices can include S3 for files, Kinesis for streams, RDS (Aurora) and DynamoDB for transactional data, or Redshift for supporting analysis.

The AWS Lambda service, with its unique features of NoOps, serverless computing, and continuous scaling, can be disruptive in big data solutions. Depending on the use case, it can be very cost effective as well. It is therefore worthwhile to consider Lambda during the architecture decision and design processes for big data use cases.



Currently a CTO Architect with Sungard Availability Services, I have over 28 years of IT experience, including full life cycle application delivery, consulting services, technical project management, and solutions architecture. As a cross-sector, cross-platform architect, I have been exposed to many IT technologies through their evolution and have worked with a number of enterprise clients.


  • Roy ben-alta

    Hi Abay,

    Great post. One correction regarding AWS Lambda and its integration points: you won't be able to integrate events from Kafka with Lambda, only from Amazon Kinesis.
    More details can be found here:


    • Abay Radhakrishnan

      Thanks Roy. For the on-prem or SaaS type of integration, my thought was that we could leverage Apache Kafka and route one of its events into S3, SNS, or HTTPS, which in turn triggers Lambda. Since there is no direct integration between Kafka and Lambda, as you point out, I will remove or rephrase it soon.


  • shantanu oak

    I could not copy-paste the code correctly. Can you please share it on github (or gist)?


    • Abay Radhakrishnan

      Shantanu – Thanks for the feedback. I will share the code snippets within a week's time.
