Remember to make your Lambda functions idempotent

Today's post is about an AWS service I have been having some fun with: Lambda.

Essentially, Lambda is a service that executes your code within milliseconds of an "event" happening. An event may be your own action, or it can be triggered by actions in other AWS services such as S3, DynamoDB or Kinesis. The great thing is that there is no infrastructure to build or run, and you pay only for the requests served and the compute time required to run your code. Billing is metered in increments of 100 milliseconds! It's "way cool". You can read all about it on the product page if you need an introduction. But this post is not about what's so cool about Lambda.

What I wanted to cover is that you need to make sure the functions you write are idempotent. Idempotency in software "describes an operation that will produce the same results if executed once or multiple times". "It means that an operation can be repeated or retried as often as necessary without causing unintended effects."
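To make the definition concrete, here is a minimal sketch in plain Node.js. The function and variable names are made up for illustration and have nothing to do with the Lambda API. Setting a key to a value is idempotent; appending to a list is not.

```javascript
// Hypothetical example: neither function is part of any AWS API.

// Idempotent: running it once or many times leaves the same state behind.
function setStatus(store, id, status) {
    store[id] = status;
    return store;
}

// Not idempotent: every extra run changes the state again.
function appendStatus(log, status) {
    log.push(status);
    return log;
}

var store = {};
setStatus(store, 'tweet-1', 'ON');
setStatus(store, 'tweet-1', 'ON'); // a retry changes nothing

var log = [];
appendStatus(log, 'ON');
appendStatus(log, 'ON'); // a retry adds a duplicate entry

console.log(store);      // one entry, however many times we retried
console.log(log.length); // grows with every retry
```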

Why is this important to remember with Lambda? Well there is some text in the documentation and FAQ that sort of explains why.

From the documentation. [highlight is mine]

Your Lambda function code must be written in a stateless style, and have no affinity with the underlying compute infrastructure. Your code should expect local file system access, child processes, and similar artifacts to be limited to the lifetime of the request, and store any persistent state in Amazon S3, Amazon DynamoDB, or another cloud storage service. Requiring functions to be stateless enables AWS Lambda to launch as many copies of a function as needed to scale to the incoming rate of events and requests. These functions may not always run on the same compute instance from request to request, and a given instance of your Lambda function may be used more than once by AWS Lambda.

Also from the FAQ.

Q: Will AWS Lambda reuse function instances?
To improve performance, AWS Lambda may choose to retain an instance of your function and reuse it to serve a subsequent request, rather than creating a new copy. Your code should not assume that this will always happen.

Today Lambda functions are written in Node.js. Here is my Lambda function, which returns Twitter data combined with Amazon Machine Learning predictions to tell me whether those tweets are on topic or not (aka spam). My use case was creating a tweet board that filtered junk messages based on machine learning. It actually worked really well. But back to our code: you may want to jump right to the end, there is no need to read it all.

// Declared with var so we do not create an implicit global.
var getTweetsError = function (err, response, body) {
    console.log('ERROR [%s]', err);
};

function retrieveATweetPrediction(tweet) {

    // This is an async operation and we are going to have lots. Therefore we
    // will use a promise which we will
    // return for our caller to track. When we do our actual work we will mark
    // our little promise as resolved.

    var deferred = Q.defer();

    var req = aml.predict({
        MLModelId: '',
        PredictEndpoint: 'https://realtime.machinelearning.us-east-1.amazonaws.com',
        Record: {
            text: tweet['text'].toString(),
            id: tweet['id'].toString(),
            followers: tweet['user']['followers_count'].toString(),
            favourites: tweet['favorite_count'].toString(),
            friends: tweet['user']['friends_count'].toString(),
            lists: tweet['user']['listed_count'].toString(),
            retweets: tweet['retweet_count'].toString(),
            tweets: tweet['user']['statuses_count'].toString(),
            user: tweet['user']['screen_name'].toString(),
            source: tweet['source'].toString()
        }
    });

    // We did not pass a function to predict so we can call the .on function and
    // get access to the complete response data. This allows us to look up the original request and
    // tie this async call back to our original data. If we call it the normal way we don't have access
    // to that, just the response, and can't tie it back!
    req.on('success', function(response) {
        if (response.error) {
            console.log(response.error);
            deferred.resolve(); // Still mark the task as done so the caller is not left waiting
            return;
        }

        var id = response.request.params.Record.id;
        var prediction = response.data.Prediction;

        // A predicted label of "0" is recorded as ON, anything else as OFF.
        returnData[id]['prediction'] = (prediction.predictedLabel === "0") ? 'ON' : 'OFF';

        // Flip scores below 0.5 so the reported value is always at least 50%.
        var val = prediction.predictedScores[prediction.predictedLabel];
        if (val < 0.5) {
            val = 1 - val;
        }
        returnData[id]['probability'] = Math.round(val*100000)/1000;
        deferred.resolve(); // This task can now be marked as done
    });

    // Failed requests fire 'error' instead of 'success'. Log it and still resolve,
    // so one failed prediction does not leave the handler waiting forever.
    req.on('error', function(err) {
        console.log(err);
        deferred.resolve();
    });
    req.send();
    return deferred.promise;
};

function extractTweets() {

    var deferred = Q.defer();

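    twitter.getSearch({'q':'#aws','count': 15},

        // On error, log it and reject so the handler is not left waiting on this promise.
        function (err, response, body) {
            getTweetsError(err, response, body);
            deferred.reject(err);
        },
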
        function (data) {

            var tweets = JSON.parse(data)['statuses'];

            // We need to create a list of tasks as we are going to fire off a bunch of async calls to 
            // do a prediction for each tweet.
            var tasks = [];

            for (var i in tweets) {

                var id = tweets[i]['id'];
                returnData[id] = {}; 
                returnData[id]['text']       = tweets[i]['text'];
                returnData[id]['name']       = tweets[i]['user']['name'];
                returnData[id]['screen_name']= tweets[i]['user']['screen_name'];
                returnData[id]['followers']  = tweets[i]['user']['followers_count'];
                returnData[id]['friends']    = tweets[i]['user']['friends_count'];
                returnData[id]['listed']     = tweets[i]['user']['listed_count'];
                returnData[id]['statuses']   = tweets[i]['user']['statuses_count'];
                returnData[id]['retweets']   = tweets[i]['retweet_count'];
                returnData[id]['favourites'] = tweets[i]['favorite_count'];
                returnData[id]['source']     = tweets[i]['source'];
                returnData[id]['image_url']  = tweets[i]['user']['profile_image_url'];

                // The prediction returns a promise which we will push into our list of tasks.
                // When the prediction is returned it will mark its little task as resolved.
                tasks.push(retrieveATweetPrediction(tweets[i]));
            }

            // We have a list of tasks which are happening. Let's wait till ALL of them are done.
            Q.all(tasks).then(function(result) { 
                // Woot woot, all predictions are returned and we have our data!
                // We are therefore resolved ourselves now. Whoever is waiting on us is going to 
                // now get some further stuff done.
                deferred.resolve();
            });
        }
    );
    return deferred.promise;
};

// End of functions, let's look at our main bit of code.

// Setup AWS SDK
var aws = require('aws-sdk');
aws.config.region = 'us-east-1';
var aml = new aws.MachineLearning();

// Setup Twitter SDK
var Twitter = require('twitter-node-client').Twitter;
var twitter = new Twitter({
    "consumerKey": "",
    "consumerSecret": "",
    "accessToken": "",
    "accessTokenSecret": "",
    "callBackUrl": ""
});

// Setup Q for our promises, we have lots of calls to make and we need to track when they are all done!
var Q = require('q');

var returnData = {};

// This is the function required by Lambda
exports.handler = function(event, context) {

    returnData = {}; // We may be reincarnated so ensure we are idempotent 
    
    Q.allSettled([extractTweets()]).then(
        function(result){
            // Return our data and end the Lambda function
            context.succeed(returnData);
        },
        function(reason){
            console.log("Oops : " + reason);
            context.fail(reason); // End the invocation instead of leaving it to time out
        });

};

See how there are lots of functions, then some code which sets up some variables (Q and returnData), and then the main function which Lambda calls when an event occurs, exports.handler. Notice how I am not a great coder and I used a global variable to store data that is used by all of the functions. Well, if exports.handler gets called over and over again in the same environment, those global variables will not be re-created or cleared, so returnData can still hold entries from an earlier invocation. I did not quite realize this at first and wondered why I was sometimes getting weird data back from Lambda: not always, just sometimes.
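The effect is easy to reproduce without Lambda at all. The sketch below is a stripped-down stand-in for my function: the event shape is made up, the tweets are hard-coded instead of coming from Twitter, and `fakeContext` is a hypothetical replacement for the real Lambda context. Calling the handler twice in the same process is roughly what happens when Lambda reuses an instance.

```javascript
// Module-level state: created once when the code is loaded, NOT once per request.
var returnData = {};

// Simplified stand-in for exports.handler; the event shape is made up.
function leakyHandler(event, context) {
    event.tweets.forEach(function (tweet) {
        returnData[tweet.id] = { text: tweet.text };
    });
    context.succeed(returnData);
}

// Hypothetical stand-in for the Lambda context: it records which keys were returned.
function fakeContext(results) {
    return {
        succeed: function (data) {
            results.push(Object.keys(data).sort());
        }
    };
}

var results = [];

// Two "requests" served by the same reused instance.
leakyHandler({ tweets: [{ id: 'a', text: 'first' }] }, fakeContext(results));
leakyHandler({ tweets: [{ id: 'b', text: 'second' }] }, fakeContext(results));

console.log(results[0]); // keys returned for the first request
console.log(results[1]); // the second request also returns the first request's tweet
```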

To fix my problem I simply ensured that I cleared the key variable each time the handler function is called; you can see that the first thing it does above is "returnData = {}; // We may be reincarnated so ensure we are idempotent". Fixed. Of course I know I could just code better, but this was my first ever time writing Node.js. You can tell me how to improve my function in the comments.
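One possible improvement, sketched here rather than tested against the real services, is to create the state inside the handler and pass it to the helpers, so nothing is shared between invocations in the first place. In this sketch `fetchTweets` and `fakeContext` are hypothetical stand-ins for the Twitter call and the Lambda context, and it uses the built-in Promise rather than Q to stay self-contained.

```javascript
// Hypothetical stand-in for the asynchronous Twitter search.
function fetchTweets(query) {
    return Promise.resolve([{ id: query + '-1', text: 'about ' + query }]);
}

// The helper fills in the object it is given instead of touching a global.
function extractTweets(query, data) {
    return fetchTweets(query).then(function (tweets) {
        tweets.forEach(function (tweet) {
            data[tweet.id] = { text: tweet.text };
        });
    });
}

// Each invocation gets its own fresh object, so a reused instance has nothing stale to leak.
function handler(event, context) {
    var data = {};
    return extractTweets(event.query, data).then(
        function () { context.succeed(data); },
        function (reason) { context.fail(reason); }
    );
}

// Hypothetical stand-in for the Lambda context.
function fakeContext(label) {
    return {
        succeed: function (data) { console.log(label, Object.keys(data)); },
        fail: function (reason) { console.log(label, 'failed:', reason); }
    };
}

handler({ query: 'aws' }, fakeContext('first'));
handler({ query: 'lambda' }, fakeContext('second'));
```

Because `data` lives only for the duration of one call, there is no reset line to forget.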

I will probably do another writeup on my Amazon Machine Learning experiment and how I trained it to filter tweets. It was really easy and there are no servers involved, thanks to Lambda executing my application logic: I just have S3, Lambda and AML Live Prediction for a highly scalable site.

Hopefully you won't get caught by the same mistake.

Rodos

Musings of Rodos

Rodney Haywood
