/lynks-scraper

Web scraper for Lynks project. Extract screenshots, thumbnails, PDF's and metadata from given URL's

Primary LanguageJavaScript

Build codecov

Lynks Scraper

Scraping component for the Lynks project to extract content from specified webpages. Provides a simple web API to generate a number of resources or page metadata from a given URL.

Key Features

  • extract thumbnail and preview images from headless Chrome sessions
  • generate full page screenshots and PDF documents from pages
  • extract key page content (articles) and generate 'readable' versions
  • extract metadata from a given page, including title, keywords, description, author, publish date etc

Libraries Used

Routes

Available Resource Types

Pass in as payload into each endpoint to define which resources should be generated:

  • SCREENSHOT - full page screenshot as PNG
  • THUMBNAIL - primary extracted image from page or small screenshot as 320x180 JPG
  • PREVIEW - partial page screenshot from primary image or headless session as 640x360 JPG
  • PAGE - full HTML page after sanitization
  • DOCUMENT - full page PDF
  • READABLE_DOC - reader view of page as HTML (with images)
  • READABLE_TEXT- reader view of page as text only

Suggest

Extract key metadata and simple preview/thumbnail to support the Lynks suggest functionality. This endpoint is user-facing and so does not use headless Chrome for performance purposes. Instead, the full page is downloaded as HTML and then analyzed. As no session is created, the thumbnail and preview images are extracted from the primary page image (if present). Only a subset of the overall resource types are available.

POST /api/suggest --> generate a suggestion and number of resources from a given URL

{
  "url": "https://foojay.io/today/demystifying-jvm-memory-management/",
  "resourceTypes": [
    "PREVIEW",
    "THUMBNAIL",
    "READABLE_TEXT"
  ],
  "targetPath": "/where/to/save/resources"
}
  • url - the URL to generate a suggestion for
  • resourceTypes - a subset of resource types to generate and save to the specified path. Headless Chrome is not used to generate these resources. Only PREVIEW, THUMBNAIL and READABLE_TEXT types are available
  • targetPath - the absolute path to the directory in which to save the specified resources to. The directory will be created if it doesn't already exist

Response is a details object containing metadata alongside list of resource objects with type, extension and the absolute path to the target location on the filesystem:

{
  "details": {
    "url": "https://deepu.tech/memory-management-in-jvm/",
    "title": "Demystifying Java Virtual Machine Memory Management",
    "keywords": [
      "java",
      "memory",
      "management"
    ],
    "description": "I aim to demystify the concepts behind memory management and take a look at memory management in some of the modern programming languages.",
    "image": "https://i.imgur.com/Kv9ichJ.gif",
    "author": "Deepu K Sasidharan",
    "published": "2021-05-20T07:26:37+00:00"
  },
  "resources": [
    {
      "resourceType": "PREVIEW",
      "targetPath": "/where/to/save/resources/preview.jpg",
      "extension": "jpg"
    },
    {
      "resourceType": "THUMBNAIL",
      "targetPath": "/where/to/save/resources/thumbnail.jpg",
      "extension": "jpg"
    },
    {
      "resourceType": "READABLE_TEXT",
      "targetPath": "/where/to/save/resources/readable_text.txt",
      "extension": "txt"
    }
  ]
}

Scrape

Perform a full scrape of the target URL using a headless Chrome session. Output a set of resources, including screenshots, readable versions, previews and PDF's to the provided target path.

POST /api/scrape --> scrape the target URL and generate a number of resources

{
  "url": "https://foojay.io/today/demystifying-jvm-memory-management",
  "resourceTypes": [
    "SCREENSHOT",
    "PREVIEW",
    "DOCUMENT",
    "READABLE_TEXT",
    "READABLE_DOC",
    "PAGE",
    "THUMBNAIL"
  ],
  "targetPath": "/where/to/save/resources"
}
  • url - the URL to scrape
  • resourceTypes - a subset of resource types to generate and save to the specified path. Headless Chrome will be used to generate these resources
  • targetPath - the absolute path to the directory in which to save the specified resources to. The directory will be created if it doesn't already exist

Response is a list of resource objects with type, extension and the absolute path to the target location on the filesystem:

[
  {
    "resourceType": "SCREENSHOT",
    "targetPath": "/where/to/save/resources/screenshot.png",
    "extension": "png"
  },
  {
    "resourceType": "PREVIEW",
    "targetPath": "/where/to/save/resources/preview.jpg",
    "extension": "jpg"
  },
  {
    "resourceType": "THUMBNAIL",
    "targetPath": "/where/to/save/resources/thumbnail.jpg",
    "extension": "jpg"
  },
  {
    "resourceType": "DOCUMENT",
    "targetPath": "/where/to/save/resources/document.pdf",
    "extension": "pdf"
  },
  {
    "resourceType": "PAGE",
    "targetPath": "/where/to/save/resources/page.html",
    "extension": "html"
  },
  {
    "resourceType": "READABLE_TEXT",
    "targetPath": "/where/to/save/resources/readable_text.txt",
    "extension": "txt"
  },
  {
    "resourceType": "READABLE_DOC",
    "targetPath": "/where/to/save/resources/readable_doc.html",
    "extension": "html"
  }
]