
Demo project showing how to create a simple web scraping service using AWS Lambda and API Gateway


Lambda Scraper Queue

This is a demo project which implements a trivial REST service for queuing web scraping jobs.

It is completely "serverless" and is built on Amazon services: Lambda, API Gateway, DynamoDB, S3, CloudWatch Logs, and CloudFormation.

The Lambda functions are written in ES6, with async/await, transpiled using Babel, and bundled using Webpack.
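As a rough illustration of that style, here is a minimal sketch of a callback-style Lambda handler that delegates to an async function. The names `queueJob` and `handler` and the shape of the event are assumptions made for this example, not the project's actual code. Native `async` functions need a newer Node.js than the versions listed under Prerequisites, which is why the project transpiles with Babel.

```javascript
// Hypothetical job logic; the project's real functions live in
// src/server/lambdaFunctions and will differ from this.
async function queueJob (event) {
  if (!event || !event.url) {
    throw new Error('Missing "url" parameter')
  }
  // A real implementation would record the job somewhere durable here.
  return { status: 'queued', url: event.url }
}

// Callback-style handler: Lambda passes (event, context, callback).
// In a real bundle this function would be exported as the handler.
const handler = (event, context, callback) => {
  queueJob(event).then(
    result => callback(null, result),
    err => callback(err)
  )
}

handler({ url: 'http://example.com/' }, {}, (err, result) => {
  console.log(err ? err.message : result)
})
```

Passing both the success and failure callbacks to a single `then` keeps an exception thrown inside `callback` from triggering a second call to it.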

The AWS resources are provisioned with CloudFormation, plus an add-on custom resource handler that allocates the API Gateway resources (which CloudFormation doesn't support natively yet).

Additionally, we use Apex to simplify the uploading of the Lambda functions.

Costs

It should cost very little to run.

  • DynamoDB - only provisioned for 1 read capacity unit, 1 write capacity unit (which limits it to 1 job per second)
  • S3 - storage for retrieved files and JSON, plus data transfer
  • CloudWatch logs
  • Lambda invocations
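The one-job-per-second limit follows from how DynamoDB meters writes: one write capacity unit covers one standard write per second for an item of up to 1 KB. Here is a rough sketch of that arithmetic, assuming each job is stored with a single write. The helper names are illustrative and not part of this project.

```javascript
const WCU_ITEM_BYTES = 1024

// Write capacity units consumed by one standard write of the given item size.
function writeUnitsPerItem (itemBytes) {
  return Math.max(1, Math.ceil(itemBytes / WCU_ITEM_BYTES))
}

// Sustained writes per second for a given provisioned write capacity.
function maxJobsPerSecond (provisionedWcu, itemBytes) {
  return provisionedWcu / writeUnitsPerItem(itemBytes)
}

console.log(maxJobsPerSecond(1, 300)) // 1
console.log(writeUnitsPerItem(2500)) // 3, so about one job every 3 seconds
```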

Demo Instance

API: https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod

Web Interface: Under construction

API

Submit a job

curl -X POST -d url=http://jimpick.com/ https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod/jobs
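The same request can be built from JavaScript. The sketch below only constructs the request and does not send it. `buildJobRequest` is a hypothetical helper, and it assumes the API accepts a standard form-encoded body. Unlike curl, which sends the value exactly as typed, `URLSearchParams` percent-encodes it; a form decoder should recover the same URL either way.

```javascript
// Build the form-encoded POST that corresponds to `curl -X POST -d url=...`.
function buildJobRequest (apiBase, targetUrl) {
  const body = new URLSearchParams({ url: targetUrl }).toString()
  return {
    url: apiBase.replace(/\/+$/, '') + '/jobs',
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: body
  }
}

const demoRequest = buildJobRequest(
  'https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod',
  'http://jimpick.com/'
)
console.log(demoRequest.url)
// https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod/jobs
console.log(demoRequest.body)
// url=http%3A%2F%2Fjimpick.com%2F
```

On a Node.js version with a built-in `fetch`, the result could be sent with `fetch(demoRequest.url, demoRequest)`.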

Deployment Instructions

Prerequisites

  • You will need an AWS Account
  • You will need OS X, Linux, *BSD or another Unix-based OS (scripts will need some modifications for Windows)
  • Install the AWS CLI and ensure credentials are set up under ~/.aws/credentials (Instructions)
  • Install Node.js (tested with v4.2.6 and v5.7.0)
  • git clone https://github.com/jimpick/lambda-scraper-queue.git (https)
    or
    git clone git@github.com:jimpick/lambda-scraper-queue.git (git)
  • cd lambda-scraper-queue
  • npm install

Set up IAM permissions

Note: These instructions are copied from: https://github.com/carlnordenfelt/aws-api-gateway-for-cloudformation#setup-iam-permissions

To install the Custom Resource library, you need a set of permissions. Configure your IAM user with the following policy, and make sure that you have configured the AWS CLI with your access key and secret key.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:DescribeStacks",
        "iam:CreateRole",
        "iam:CreatePolicy",
        "iam:AttachRolePolicy",
        "iam:GetRole",
        "iam:PassRole",
        "lambda:CreateFunction",
        "lambda:UpdateFunctionCode",
        "lambda:GetFunctionConfiguration",

        "cloudformation:DeleteStack",
        "lambda:DeleteFunction",
        "iam:ListPolicyVersions",
        "iam:DetachRolePolicy",
        "iam:DeletePolicy",
        "iam:DeleteRole"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

Install the Custom Resource Library

This installs a special AWS Lambda function so that the CloudFormation recipe can provision the API Gateway using custom resources from Carl Nordenfelt's API Gateway for CloudFormation project.

npm run deploy-custom-resource

If successful, a 'service token' will be saved to deploy/state/SERVICE_TOKEN

Configuration

Copy config.template.js to config.js and customize it.

cp config.template.js config.js

The default config.template.js is:

export default {
  cloudFormation: 'lambdaScraperQueue',
  region: 'us-west-2',
  stage: 'prod'
}

Parameters

cloudFormation: The name of the CloudFormation stack

region: The AWS region

stage: The API Gateway stage to create
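The region and stage values end up in the public URL of the deployed API, as the demo instance URL above shows. Here is a small sketch of that relationship. `apiBaseUrl` is a hypothetical helper, and the REST API ID is assigned by API Gateway rather than set in the config.

```javascript
// Mirrors the fields in config.template.js.
const config = {
  cloudFormation: 'lambdaScraperQueue',
  region: 'us-west-2',
  stage: 'prod'
}

// restApiId is the ID that API Gateway assigns to the deployed API.
function apiBaseUrl (restApiId, cfg) {
  return 'https://' + restApiId + '.execute-api.' + cfg.region +
    '.amazonaws.com/' + cfg.stage
}

console.log(apiBaseUrl('3m7171w3c9', config))
// https://3m7171w3c9.execute-api.us-west-2.amazonaws.com/prod
```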

Use CloudFormation to create the AWS resources

npm run create-cloudformation

The command returns immediately, but the stack takes a while to complete because it deploys a lot of resources. It's a good idea to watch the CloudFormation task in the AWS Web Console to ensure that it completes without errors.

Note: When working with the CloudFormation recipe, you can also use npm run update-cloudformation and npm run delete-cloudformation

Manually create the "prod" deployment stage in API Gateway

Do this step once the CloudFormation stack from the previous step has been successfully provisioned (check the AWS Web Console).

The Custom Resource library currently doesn't support this from CloudFormation, so, for now, we need to do it manually.

Go to "API Gateway" in the Amazon web console, and select the desired API. Click the Deploy API button, and under Deployment Stage, select New Stage. Enter prod for the Stage Name, and click the Deploy button.

Save the references to the provisioned CloudFormation resources

npm run save-cloudformation

This will create a file in deploy/state/cloudFormation.json
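The exact layout of that file is not described here. As a loose sketch of the idea, the function below turns data shaped like a CloudFormation DescribeStackResources response into a map from logical to physical resource names. Whether the project's script stores the data this way is an assumption, and the sample resource names are made up.

```javascript
// Input follows the shape of a CloudFormation DescribeStackResources response.
function resourceMap (response) {
  const map = {}
  for (const resource of response.StackResources) {
    map[resource.LogicalResourceId] = resource.PhysicalResourceId
  }
  return map
}

// Made-up sample data, for illustration only.
const sampleResponse = {
  StackResources: [
    {
      LogicalResourceId: 'JobsTable',
      PhysicalResourceId: 'example-jobs-table',
      ResourceType: 'AWS::DynamoDB::Table'
    },
    {
      LogicalResourceId: 'ResultsBucket',
      PhysicalResourceId: 'example-results-bucket',
      ResourceType: 'AWS::S3::Bucket'
    }
  ]
}

console.log(JSON.stringify(resourceMap(sampleResponse), null, 2))
```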

Set up the Apex build directory

npm run setup-apex

This generates build/apex/project.json

Compile the Lambda scripts using Babel

npm run compile-lambda

This will use Webpack and Babel to compile the source code in src/server/lambdaFunctions into build/apex/functions

The webpack configuration is in deploy/apex/webpack.config.es6.js

Deploy the lambda functions

npm run deploy-lambda

This will run apex deploy in the build/apex directory to upload the compiled Lambda functions.

Alternatively, if you want to execute the compile and deploy steps in one command, you can run: npm run deploy

Run the test suite

npm run test

This will run both the local tests and the remote tests, which exercise the deployed API and Lambda functions.

The local tests can be run as npm run test-local, and the remote tests can be run as npm run test-remote.

View logs

You can tail the CloudWatch logs:

npm run logs

This just executes apex logs -f in build/apex

Submit a job

npm run post-url

Submits a job to the API that scrapes http://jimpick.com/

You should be able to see the Lambda output in the logs (after a delay of a few seconds). You should also be able to see the files in S3 via the AWS Web Console.

To Do List

  • Integration test
  • Better error handling
  • Handle DynamoDB ProvisionedThroughputExceededException
  • Status subsystem (API + Firebase)
  • Web interface
  • Quotas / Whitelists for public demo
  • Blog post
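For the ProvisionedThroughputExceededException item, one common approach is to retry the throttled call with exponential backoff. The sketch below is not the project's implementation. `withBackoff`, its options, and the default values are assumptions, and `operation` stands in for whatever function performs the DynamoDB write. The AWS SDK also performs some retries of its own, so a wrapper like this would only matter once those are exhausted.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

const THROTTLED = 'ProvisionedThroughputExceededException'

// AWS SDK v2 reports the error type in `code`; v3 reports it in `name`.
function isThrottled (err) {
  return Boolean(err) && (err.code === THROTTLED || err.name === THROTTLED)
}

// Retry `operation` (a function returning a promise) with exponential backoff.
// Only throttling errors are retried; anything else is rethrown immediately.
async function withBackoff (operation, options) {
  const maxAttempts = (options && options.maxAttempts) || 5
  const baseDelayMs = (options && options.baseDelayMs) || 100
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation()
    } catch (err) {
      if (!isThrottled(err) || attempt >= maxAttempts) throw err
      // Delays grow as baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * Math.pow(2, attempt - 1))
    }
  }
}
```

A call would look like `withBackoff(() => saveJob(job))`, where `saveJob` is a hypothetical function that writes the job to DynamoDB and returns a promise.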

Similar Work

I'm using Apex, but just for uploading the functions. I haven't investigated the other projects yet.