Scraping component for the Lynks project to extract content from specified webpages. Provides a simple web API to generate a number of resources or page metadata from a given URL.
- extract thumbnail and preview images from headless Chrome sessions
- generate full page screenshots and PDF documents from pages
- extract key page content (articles) and generate 'readable' versions
- extract metadata from a given page, including title, keywords, description, author, publish date, etc.
- Express - minimalist web framework
- Puppeteer (and plugins from extra) - headless Chrome sessions
- JSDom and DomPurify - create and sanitize DOMs in Node.js
- Readability - create 'reader view' versions of webpages without the clutter
- Sharp - fast image resizing
- Winston - logging
- Jest and Supertest - testing
Resource types are passed in the payload of each endpoint to define which resources should be generated:
- SCREENSHOT - full page screenshot as PNG
- THUMBNAIL - primary extracted image from page or small screenshot as 320x180 JPG
- PREVIEW - partial page screenshot from primary image or headless session as 640x360 JPG
- PAGE - full HTML page after sanitization
- DOCUMENT - full page PDF
- READABLE_DOC - reader view of page as HTML (with images)
- READABLE_TEXT - reader view of page as text only
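
On the client side, the resource types listed above can be modelled as a small lookup table. This is only a sketch: `ResourceType`, `ResourceFormat`, `RESOURCE_FORMATS` and `isResourceType` are hypothetical names, and the extensions are taken from the example responses in this document rather than from the service's source.

```typescript
// Hypothetical client-side model of the resource types listed above.
type ResourceType =
  | "SCREENSHOT"
  | "THUMBNAIL"
  | "PREVIEW"
  | "PAGE"
  | "DOCUMENT"
  | "READABLE_DOC"
  | "READABLE_TEXT";

interface ResourceFormat {
  // File extension as it appears in the example responses.
  extension: string;
  // Fixed output size in pixels, only where the list above states one.
  size?: { width: number; height: number };
}

const RESOURCE_FORMATS: Record<ResourceType, ResourceFormat> = {
  SCREENSHOT: { extension: "png" },
  THUMBNAIL: { extension: "jpg", size: { width: 320, height: 180 } },
  PREVIEW: { extension: "jpg", size: { width: 640, height: 360 } },
  PAGE: { extension: "html" },
  DOCUMENT: { extension: "pdf" },
  READABLE_DOC: { extension: "html" },
  READABLE_TEXT: { extension: "txt" },
};

// Type guard: narrows an arbitrary string to a known resource type.
function isResourceType(value: string): value is ResourceType {
  return Object.prototype.hasOwnProperty.call(RESOURCE_FORMATS, value);
}
```

A guard like `isResourceType` lets a caller reject a misspelled type before any request is sent.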
Extract key metadata and simple preview/thumbnail to support the Lynks suggest functionality. This endpoint is user-facing and so does not use headless Chrome for performance purposes. Instead, the full page is downloaded as HTML and then analyzed. As no session is created, the thumbnail and preview images are extracted from the primary page image (if present). Only a subset of the overall resource types are available.
POST /api/suggest
--> generate a suggestion and a number of resources from a given URL
{
  "url": "https://foojay.io/today/demystifying-jvm-memory-management/",
  "resourceTypes": [
    "PREVIEW",
    "THUMBNAIL",
    "READABLE_TEXT"
  ],
  "targetPath": "/where/to/save/resources"
}
- url - the URL to generate a suggestion for
- resourceTypes - a subset of resource types to generate and save to the specified path. Headless Chrome is not used to generate these resources. Only PREVIEW, THUMBNAIL and READABLE_TEXT types are available
- targetPath - the absolute path to the directory in which to save the specified resources. The directory will be created if it doesn't already exist
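
Because the suggest endpoint only accepts three resource types, a client may want to check the payload before sending it. The following is a minimal sketch under that assumption: `SUGGEST_TYPES`, `isSuggestType` and `buildSuggestRequest` are hypothetical helpers, not part of the service, and how the service itself reacts to an unsupported type is not described here.

```typescript
// Resource types accepted by /api/suggest, per the description above.
const SUGGEST_TYPES = ["PREVIEW", "THUMBNAIL", "READABLE_TEXT"] as const;
type SuggestResourceType = (typeof SUGGEST_TYPES)[number];

interface SuggestRequest {
  url: string;
  resourceTypes: SuggestResourceType[];
  targetPath: string;
}

// Type guard: true only for the three types the suggest endpoint supports.
function isSuggestType(value: string): value is SuggestResourceType {
  return SUGGEST_TYPES.some((t) => t === value);
}

// Hypothetical helper: rejects unsupported types, then returns the payload.
function buildSuggestRequest(
  url: string,
  resourceTypes: string[],
  targetPath: string
): SuggestRequest {
  const unsupported = resourceTypes.filter((t) => !isSuggestType(t));
  if (unsupported.length > 0) {
    throw new Error(`Not available for /api/suggest: ${unsupported.join(", ")}`);
  }
  return { url, resourceTypes: resourceTypes.filter(isSuggestType), targetPath };
}
```

The returned object can be serialised with `JSON.stringify` and sent as the body of `POST /api/suggest`, presumably with a JSON content type. The host and port depend on how the service is deployed and are not given here.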
Response is a details object containing metadata alongside a list of resource objects with type, extension and the absolute path to the target location on the filesystem:
{
  "details": {
    "url": "https://deepu.tech/memory-management-in-jvm/",
    "title": "Demystifying Java Virtual Machine Memory Management",
    "keywords": [
      "java",
      "memory",
      "management"
    ],
    "description": "I aim to demystify the concepts behind memory management and take a look at memory management in some of the modern programming languages.",
    "image": "https://i.imgur.com/Kv9ichJ.gif",
    "author": "Deepu K Sasidharan",
    "published": "2021-05-20T07:26:37+00:00"
  },
  "resources": [
    {
      "resourceType": "PREVIEW",
      "targetPath": "/where/to/save/resources/preview.jpg",
      "extension": "jpg"
    },
    {
      "resourceType": "THUMBNAIL",
      "targetPath": "/where/to/save/resources/thumbnail.jpg",
      "extension": "jpg"
    },
    {
      "resourceType": "READABLE_TEXT",
      "targetPath": "/where/to/save/resources/readable_text.txt",
      "extension": "txt"
    }
  ]
}
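
The shape of the suggest response can be written down as types for a client to consume. This sketch mirrors the field names of the example response only. The interface names and the `pathsByType` helper are hypothetical, and whether any metadata field can be absent for a given page is not stated, so every field is treated as present here.

```typescript
interface Resource {
  resourceType: string;
  targetPath: string;
  extension: string;
}

interface SuggestDetails {
  url: string;
  title: string;
  keywords: string[];
  description: string;
  image: string;
  author: string;
  published: string; // ISO 8601 timestamp in the example response
}

interface SuggestResponse {
  details: SuggestDetails;
  resources: Resource[];
}

// Hypothetical helper: maps each resource type to the path it was saved to.
function pathsByType(resources: Resource[]): Record<string, string> {
  const paths: Record<string, string> = {};
  for (const resource of resources) {
    paths[resource.resourceType] = resource.targetPath;
  }
  return paths;
}
```

With the response parsed into a `SuggestResponse`, `pathsByType(response.resources)["PREVIEW"]` gives the location of the preview image, and `new Date(response.details.published)` turns the publish timestamp into a date object.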
Perform a full scrape of the target URL using a headless Chrome session. Output a set of resources, including screenshots, readable versions, previews and PDFs, to the provided target path.
POST /api/scrape
--> scrape the target URL and generate a number of resources
{
  "url": "https://foojay.io/today/demystifying-jvm-memory-management",
  "resourceTypes": [
    "SCREENSHOT",
    "PREVIEW",
    "DOCUMENT",
    "READABLE_TEXT",
    "READABLE_DOC",
    "PAGE",
    "THUMBNAIL"
  ],
  "targetPath": "/where/to/save/resources"
}
- url - the URL to scrape
- resourceTypes - a subset of resource types to generate and save to the specified path. Headless Chrome will be used to generate these resources
- targetPath - the absolute path to the directory in which to save the specified resources. The directory will be created if it doesn't already exist
Response is a list of resource objects with type, extension and the absolute path to the target location on the filesystem:
[
  {
    "resourceType": "SCREENSHOT",
    "targetPath": "/where/to/save/resources/screenshot.png",
    "extension": "png"
  },
  {
    "resourceType": "PREVIEW",
    "targetPath": "/where/to/save/resources/preview.jpg",
    "extension": "jpg"
  },
  {
    "resourceType": "THUMBNAIL",
    "targetPath": "/where/to/save/resources/thumbnail.jpg",
    "extension": "jpg"
  },
  {
    "resourceType": "DOCUMENT",
    "targetPath": "/where/to/save/resources/document.pdf",
    "extension": "pdf"
  },
  {
    "resourceType": "PAGE",
    "targetPath": "/where/to/save/resources/page.html",
    "extension": "html"
  },
  {
    "resourceType": "READABLE_TEXT",
    "targetPath": "/where/to/save/resources/readable_text.txt",
    "extension": "txt"
  },
  {
    "resourceType": "READABLE_DOC",
    "targetPath": "/where/to/save/resources/readable_doc.html",
    "extension": "html"
  }
]
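
The example response lists its resources in a different order from the request, so a client is better off matching entries by `resourceType` than by position. The sketch below shows one way to confirm that every requested type came back. `missingResources` is a hypothetical helper, and whether the service ever omits a requested resource (for example when generation fails) is an assumption, not something stated here.

```typescript
interface Resource {
  resourceType: string;
  targetPath: string;
  extension: string;
}

// Hypothetical helper: lists requested types that have no entry in the response.
function missingResources(requested: string[], resources: Resource[]): string[] {
  return requested.filter(
    (type) => !resources.some((resource) => resource.resourceType === type)
  );
}
```

An empty result means each requested type has a matching entry, and its `targetPath` can then be read from the filesystem.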