Serverless Reference Architecture: Real-time File Processing

README Languages: DE | ES | FR | IT | JP | KR | PT | RU | CN | TW

The Real-time File Processing reference architecture is a general-purpose, event-driven, parallel data processing architecture that uses AWS Lambda. It is ideal for workloads that need more than one data derivative of an object. The architecture is described in this diagram and in the "Fanout S3 Event Notifications to Multiple Endpoints" post on the AWS Compute Blog. The sample application demonstrates a Markdown conversion workflow in which Lambda converts Markdown files to HTML and plain text.

Running the Example

You can use the provided AWS CloudFormation template to launch a stack that demonstrates the Lambda file processing reference architecture. Details about the resources created by this template are provided in the CloudFormation Template Resources section of this document.

Important: Because the AWS CloudFormation stack name is used in the names of the Amazon Simple Storage Service (Amazon S3) buckets, the stack name must contain only lowercase letters. The provided CloudFormation template retrieves its Lambda code from a bucket in the us-east-1 region. To launch this sample in another region, modify the template and upload the Lambda code to a bucket in that region.

Choose Launch Stack to launch the template in the us-east-1 region in your account:

Launch Lambda File Processing into North Virginia with CloudFormation

Alternatively, you can use the following command to launch the stack using the AWS CLI. This assumes you have already installed the AWS CLI.

aws cloudformation create-stack \
    --stack-name lambda-file-processing \
    --template-url https://s3.amazonaws.com/awslambda-reference-architectures/file-processing/lambda_file_processing.template \
    --capabilities CAPABILITY_IAM
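
Stack creation takes a few minutes. As a sketch, you can wait for the stack to finish creating and confirm its status with the following commands (assuming the stack name lambda-file-processing used above):

aws cloudformation wait stack-create-complete --stack-name lambda-file-processing
aws cloudformation describe-stacks --stack-name lambda-file-processing --query "Stacks[0].StackStatus" --output text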

Using SAM

Install the dependencies for the Lambda functions:

cd src/data-processor-1 && npm install async marked
cd ../data-processor-2 && npm install async marked

Package the template with aws cloudformation package (equivalent to sam package):

aws cloudformation package \
    --template-file lambda_file_processing.yml \
    --s3-bucket sam-stuff \
    --output-template-file post-sam.yml
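
The sam-stuff value passed to --s3-bucket above is a placeholder for an S3 bucket you own, where the packaged Lambda code will be uploaded. If you do not have one yet, a minimal sketch for creating it (bucket names must be globally unique, so replace the placeholder with your own name):

aws s3 mb s3://sam-stuff --region us-east-1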

Deploy the SAM template

aws cloudformation deploy \
    --template-file ./post-sam.yml \
    --stack-name lambda-file-refarch \
    --capabilities CAPABILITY_IAM
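
After the deploy command completes, you can list the resources the stack created (a sketch, assuming the stack name lambda-file-refarch used above):

aws cloudformation describe-stack-resources \
    --stack-name lambda-file-refarch \
    --query "StackResources[].[LogicalResourceId,ResourceType,ResourceStatus]" \
    --output table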

Testing the Example

After you have created the stack using the CloudFormation template, you can test the system by uploading a Markdown file to the InputBucket that was created in the stack. You can use the README.md file in this repository as an example file. After the file has been uploaded, you can see the resulting HTML and plain text files in the output bucket of your stack. You can also view the CloudWatch logs for each of the functions to see the details of their execution.

You can use the following commands to copy a sample file from the provided S3 bucket into the input bucket of your stack.

BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-processing --logical-resource-id InputBucket --query "StackResourceDetail.PhysicalResourceId" --output text)
aws s3 cp s3://awslambda-reference-architectures/file-processing/example.md s3://$BUCKET/example.md

After the file has been uploaded to the input bucket, you can inspect the output bucket to see the rendered HTML and plain text output files created by the Lambda functions.
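
As a sketch, you can resolve the output bucket name the same way as the input bucket and list its contents:

OUTPUT_BUCKET=$(aws cloudformation describe-stack-resource --stack-name lambda-file-processing --logical-resource-id OutputBucket --query "StackResourceDetail.PhysicalResourceId" --output text)
aws s3 ls s3://$OUTPUT_BUCKET/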

You can also view the CloudWatch logs generated by the Lambda functions.
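
One way to do this from the command line is sketched below; Lambda writes its logs to log groups named /aws/lambda/<function-name>, and the physical function name can be resolved from the stack:

FUNCTION_ONE=$(aws cloudformation describe-stack-resource --stack-name lambda-file-processing --logical-resource-id ProcessorFunctionOne --query "StackResourceDetail.PhysicalResourceId" --output text)
aws logs filter-log-events --log-group-name /aws/lambda/$FUNCTION_ONE --limit 20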

Cleaning Up the Example Resources

To remove all resources created by this example, do the following (example AWS CLI commands are sketched after the list):

  1. Delete all objects in the input and output buckets.
  2. Delete the CloudFormation stack.
  3. Delete the CloudWatch log groups that contain the execution logs for the two processor functions.
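
A sketch of these steps with the AWS CLI, assuming the stack name lambda-file-processing and the BUCKET, OUTPUT_BUCKET, and FUNCTION_ONE variables resolved in the testing steps above:

# Resolve the second function's name before the stack is deleted
FUNCTION_TWO=$(aws cloudformation describe-stack-resource --stack-name lambda-file-processing --logical-resource-id ProcessorFunctionTwo --query "StackResourceDetail.PhysicalResourceId" --output text)

# 1. Empty the input and output buckets
aws s3 rm s3://$BUCKET --recursive
aws s3 rm s3://$OUTPUT_BUCKET --recursive

# 2. Delete the CloudFormation stack
aws cloudformation delete-stack --stack-name lambda-file-processing
aws cloudformation wait stack-delete-complete --stack-name lambda-file-processing

# 3. Delete the log groups for the two processor functions
aws logs delete-log-group --log-group-name /aws/lambda/$FUNCTION_ONE
aws logs delete-log-group --log-group-name /aws/lambda/$FUNCTION_TWO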

CloudFormation Template Resources

Parameters

  • CodeBucket - Name of the S3 bucket in the stack's region that contains the code for the two Lambda functions, ProcessorFunctionOne and ProcessorFunctionTwo. Defaults to the managed bucket awslambda-reference-architectures.

  • CodeKeyPrefix - The key prefix for the Lambda function code relative to CodeBucket. Defaults to file-processing.

Resources

The provided template creates the following resources:

  • InputBucket - An S3 bucket that holds the raw Markdown files. Uploading a file to this bucket will trigger both processing functions.

  • OutputBucket - An S3 bucket that is populated by the processor functions with the transformed files.

  • InputNotificationTopic - An Amazon Simple Notification Service (Amazon SNS) topic used to invoke multiple Lambda functions in response to each object creation notification.

  • NotificationPolicy - An Amazon SNS topic policy which permits InputBucket to call the Publish action on the topic.

  • ProcessorFunctionOne - An AWS Lambda function that converts Markdown files to HTML. The deployment package for this function must be located at s3://[CodeBucket]/[CodeKeyPrefix]/data-processor-1.zip.

  • ProcessorFunctionTwo - An AWS Lambda function that converts Markdown files to plain text. The deployment package for this function must be located at s3://[CodeBucket]/[CodeKeyPrefix]/data-processor-2.zip.

  • LambdaExecutionRole - An AWS Identity and Access Management (IAM) role used by the two Lambda functions.

  • RolePolicy - An IAM policy associated with LambdaExecutionRole that allows the functions to get objects from InputBucket, put objects into OutputBucket, and write logs to Amazon CloudWatch.

  • LambdaInvokePermissionOne - A policy that enables Amazon SNS to invoke ProcessorFunctionOne based on notifications from InputNotificationTopic.

  • LambdaInvokePermissionTwo - A policy that enables Amazon SNS to invoke ProcessorFunctionTwo based on notifications from InputNotificationTopic.
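
If you want to verify how the S3-to-SNS-to-Lambda fanout described above is wired up in a deployed stack, a sketch using the AWS CLI (reusing the BUCKET variable from the testing steps and assuming the stack name lambda-file-processing):

aws s3api get-bucket-notification-configuration --bucket $BUCKET
TOPIC_ARN=$(aws cloudformation describe-stack-resource --stack-name lambda-file-processing --logical-resource-id InputNotificationTopic --query "StackResourceDetail.PhysicalResourceId" --output text)
aws sns list-subscriptions-by-topic --topic-arn $TOPIC_ARN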

License

This reference architecture sample is licensed under Apache 2.0.