utsusemi

utsusemi = "空蝉"

A tool to generate a static website by crawling the original site.

Framework

  • Serverless Framework ⚡

How to deploy

:octocat: STEP 1. Clone

$ git clone https://github.com/k1LoW/utsusemi.git
$ cd utsusemi
$ npm install

📝 STEP 2. Edit config

Copy config.example.yml to config.yml and edit it.
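
A minimal sketch of what config.yml might contain, assuming at least a targetHost setting (the host the crawler mirrors, referenced later in this README). The real key names and any additional options are defined by config.example.yml, so treat this as illustrative only.

# config.yml -- illustrative sketch; see config.example.yml for the real schema
targetHost: www.example.com   # origin site to crawl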

🚀 STEP 3. Deploy to AWS

$ AWS_PROFILE=XXxxXXX npm run deploy

The deploy output prints the API endpoint URLs and the UtsusemiWebsiteURL; note both.
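
If you need these values again later, the Serverless Framework can reprint the service information and stack outputs (assuming the serverless CLI is available from the project's dependencies via npx):

$ AWS_PROFILE=XXxxXXX npx serverless info --verbose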

💣 Destroy utsusemi

  1. Call the API /delete?path=/ to remove the generated S3 objects.
  2. Run the following command.

$ AWS_PROFILE=XXxxXXX npm run destroy
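
Step 1 uses the delete endpoint described under Usage below; with the placeholder API host used throughout this README it looks like:

$ curl https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/delete?path=/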

Usage

Start crawling /in?path={startPath}&depth={crawlDepth}

Starts crawling the configured targetHost. Quote the URL so the shell does not treat & in the query string as a command separator.

$ curl 'https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/in?path=/&depth=3'

Then access the UtsusemiWebsiteURL.
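
For example, to crawl only a subsection of the site to a shallower depth (the /blog/ start path here is purely illustrative):

$ curl 'https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/in?path=/blog/&depth=2'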

force option

Add force=1 to disable the cache and force pages to be re-fetched.

$ curl 'https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/in?path=/&depth=3&force=1'

Purge crawling queue /purge

Cancel crawling.

$ curl https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/purge

Delete utsusemi content objects /delete?prefix={objectPrefix}

Deletes the matching S3 objects.

$ curl https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/delete?path=/
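
The / in the example above removes all generated content; passing a narrower value deletes only part of it (the /blog/ prefix here is purely illustrative):

$ curl https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/delete?path=/blog/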

Show crawling queue status /status

$ curl https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/status

Set N crawling actions POST /nin

Starts crawling the configured targetHost with N crawl actions submitted in a single request (see nin-sample.json in the repository for the request body format).

$ curl -X POST -H "Content-Type: application/json" -d @nin-sample.json https://xxxxxxxxxx.execute-api.ap-northeast-1.amazonaws.com/v0/nin

Architecture

Crawling rule

  • HTML -> depth = depth - 1
  • CSS -> resources referenced from the CSS do not consume depth
  • Other content -> end of crawl ( depth = 0 )
  • 403, 404, 410 -> delete the corresponding S3 object
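
A minimal JavaScript sketch of the depth rules above, only to make the bookkeeping concrete. This is not the actual worker code; the function name and content-type checks are assumptions made for illustration.

// Illustrative only: how the remaining depth of a child request could be
// derived from the parent's depth and the parent's content type.
function childDepth(parentDepth, parentContentType) {
  if (parentContentType.includes('text/html')) {
    // Links found in HTML consume one level of depth.
    return parentDepth - 1;
  }
  if (parentContentType.includes('text/css')) {
    // url(...) references inside CSS do not consume depth.
    return parentDepth;
  }
  // Any other content type is a leaf: nothing further is queued.
  return 0;
}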