Lynks Scraper

Scraping component for the Lynks project to extract content from specified webpages. Provides a simple web API to generate a number of resources or page metadata from a given URL.

Key Features

extract thumbnail and preview images from headless Chrome sessions
generate full page screenshots and PDF documents from pages
extract key page content (articles) and generate 'readable' versions
extract metadata from a given page, including title, keywords, description, author, publish date etc

Libraries Used

Express - minimalist web framework
Puppeteer (and plugins from extra) - headless Chrome sessions
JSDom and DomPurify - create and sanitize DOM's in Node.js
Readability - create 'reader view' versions of webpages without the clutter
Sharp - fast image resizing
Winston - logging
Jest and Supertest - testing

Routes

Available Resource Types

Pass in as payload into each endpoint to define which resources should be generated:

SCREENSHOT - full page screenshot as PNG
THUMBNAIL - primary extracted image from page or small screenshot as 320x180 JPG
PREVIEW - partial page screenshot from primary image or headless session as 640x360 JPG
PAGE - full HTML page after sanitization
DOCUMENT - full page PDF
READABLE_DOC - reader view of page as HTML (with images)
READABLE_TEXT- reader view of page as text only

Suggest

Extract key metadata and simple preview/thumbnail to support the Lynks suggest functionality. This endpoint is user-facing and so does not use headless Chrome for performance purposes. Instead, the full page is downloaded as HTML and then analyzed. As no session is created, the thumbnail and preview images are extracted from the primary page image (if present). Only a subset of the overall resource types are available.

POST /api/suggest --> generate a suggestion and number of resources from a given URL

{
  "url": "https://foojay.io/today/demystifying-jvm-memory-management/",
  "resourceTypes": [
    "PREVIEW",
    "THUMBNAIL",
    "READABLE_TEXT"
  ],
  "targetPath": "/where/to/save/resources"
}

url - the URL to generate a suggestion for
resourceTypes - a subset of resource types to generate and save to the specified path. Headless Chrome is not used to generate these resources. Only PREVIEW, THUMBNAIL and READABLE_TEXT types are available
targetPath - the absolute path to the directory in which to save the specified resources to. The directory will be created if it doesn't already exist

Response is a details object containing metadata alongside list of resource objects with type, extension and the absolute path to the target location on the filesystem:

{
  "details": {
    "url": "https://deepu.tech/memory-management-in-jvm/",
    "title": "Demystifying Java Virtual Machine Memory Management",
    "keywords": [
      "java",
      "memory",
      "management"
    ],
    "description": "I aim to demystify the concepts behind memory management and take a look at memory management in some of the modern programming languages.",
    "image": "https://i.imgur.com/Kv9ichJ.gif",
    "author": "Deepu K Sasidharan",
    "published": "2021-05-20T07:26:37+00:00"
  },
  "resources": [
    {
      "resourceType": "PREVIEW",
      "targetPath": "/where/to/save/resources/preview.jpg",
      "extension": "jpg"
    },
    {
      "resourceType": "THUMBNAIL",
      "targetPath": "/where/to/save/resources/thumbnail.jpg",
      "extension": "jpg"
    },
    {
      "resourceType": "READABLE_TEXT",
      "targetPath": "/where/to/save/resources/readable_text.txt",
      "extension": "txt"
    }
  ]
}

Scrape

Perform a full scrape of the target URL using a headless Chrome session. Output a set of resources, including screenshots, readable versions, previews and PDF's to the provided target path.

POST /api/scrape --> scrape the target URL and generate a number of resources

{
  "url": "https://foojay.io/today/demystifying-jvm-memory-management",
  "resourceTypes": [
    "SCREENSHOT",
    "PREVIEW",
    "DOCUMENT",
    "READABLE_TEXT",
    "READABLE_DOC",
    "PAGE",
    "THUMBNAIL"
  ],
  "targetPath": "/where/to/save/resources"
}

url - the URL to scrape
resourceTypes - a subset of resource types to generate and save to the specified path. Headless Chrome will be used to generate these resources
targetPath - the absolute path to the directory in which to save the specified resources to. The directory will be created if it doesn't already exist

Response is a list of resource objects with type, extension and the absolute path to the target location on the filesystem:

[
  {
    "resourceType": "SCREENSHOT",
    "targetPath": "/where/to/save/resources/screenshot.png",
    "extension": "png"
  },
  {
    "resourceType": "PREVIEW",
    "targetPath": "/where/to/save/resources/preview.jpg",
    "extension": "jpg"
  },
  {
    "resourceType": "THUMBNAIL",
    "targetPath": "/where/to/save/resources/thumbnail.jpg",
    "extension": "jpg"
  },
  {
    "resourceType": "DOCUMENT",
    "targetPath": "/where/to/save/resources/document.pdf",
    "extension": "pdf"
  },
  {
    "resourceType": "PAGE",
    "targetPath": "/where/to/save/resources/page.html",
    "extension": "html"
  },
  {
    "resourceType": "READABLE_TEXT",
    "targetPath": "/where/to/save/resources/readable_text.txt",
    "extension": "txt"
  },
  {
    "resourceType": "READABLE_DOC",
    "targetPath": "/where/to/save/resources/readable_doc.html",
    "extension": "html"
  }
]