A configurable web scraper with a visual interface to monitor and control the scraping process.

- Install dependencies:
npm install
- Start the application:
npm run dev
This starts the development server (port 3000) and the client interface; open the web interface in your browser to monitor and control scraping. When you are finished, run npm run build:post to move the generated files to the dist folder.
The scraper operates using four sequential queues (a minimal sketch follows this list):
- Request Queue: Validates and initiates new URL requests
- Fetch Queue: Downloads content from validated URLs
- Parse Queue: Processes downloaded content
- Write Queue: Saves processed content to disk
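As a rough mental model, the pipeline can be pictured as four in-memory queues, each draining into the next. The sketch below is an illustration only; the Queue class, Job shape, and handler logic are assumptions, not the project's actual implementation.

```ts
// Minimal sketch of the four-stage pipeline; the Queue class and job shape
// here are assumptions for illustration, not the project's actual code.
type Job = { url: string; body?: string; parsed?: string };

class Queue {
  private jobs: Job[] = [];
  constructor(private name: string, private handler: (job: Job) => Promise<void>) {}

  add(job: Job): void {
    this.jobs.push(job);
  }

  // Process jobs one at a time; handlers may enqueue into the next stage.
  async drain(): Promise<void> {
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      console.log(`[${this.name}] ${job.url}`);
      await this.handler(job);
    }
  }
}

// Wire the stages back-to-front so each handler can push to its successor.
const writeQueue = new Queue("write", async (job) => {
  console.log(`write: saved ${job.url}`); // stand-in for writing to disk
});
const parseQueue = new Queue("parse", async (job) => {
  job.parsed = (job.body ?? "").trim(); // placeholder parse step
  writeQueue.add(job);
});
const fetchQueue = new Queue("fetch", async (job) => {
  job.body = `<html>content of ${job.url}</html>`; // stand-in for an HTTP fetch
  parseQueue.add(job);
});
const requestQueue = new Queue("request", async (job) => {
  if (job.url.startsWith("https://")) fetchQueue.add(job); // basic validation
});

async function main(): Promise<void> {
  requestQueue.add({ url: "https://example.com/" });
  // Drain each stage in order: request → fetch → parse → write.
  for (const q of [requestQueue, fetchQueue, parseQueue, writeQueue]) {
    await q.drain();
  }
}

main();
```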
The frontend interface allows you to:
- Start a new scrape: "Download to Cache" button
- Write cached content: "Write from Cache to Output" button
- Monitor active jobs in each queue
- View job history and details
- Filter and search through completed jobs
- Clear history and cache
The queues process jobs sequentially, with each configurable module performing a specific validation or transformation task (a hypothetical module signature is sketched below). Failed jobs can be monitored and retried through the frontend interface.
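To make the module idea concrete, a module can be thought of as a small async function that either passes a job along or drops it. The ScrapeJob and QueueModule names and the option-passing style below are assumptions for illustration, not the project's actual API.

```ts
// Hypothetical module shape: names and options here are illustrative only.
interface ScrapeJob {
  url: string;
  body?: string;
}

// A module inspects a job and reports whether the pipeline should continue.
type QueueModule = (job: ScrapeJob) => Promise<boolean>;

// Modules can be configured at setup time, e.g. with path rules.
function makeIsPathValid(allowedPrefixes: string[]): QueueModule {
  return async (job) => {
    const { pathname } = new URL(job.url);
    return allowedPrefixes.some((prefix) => pathname.startsWith(prefix));
  };
}
```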
The following modules are available:
- isDomainValid: Checks if the URL matches allowed domains
- isPathValid: Validates the URL path against rules
- isAlreadyRequested: Prevents duplicate requests
- addFetchJob: Adds the URL to the fetch queue
- isCached: Checks if content already exists in the cache
- fetchHttp: Downloads content from the URL
- addParseJob: Queues content for parsing
- guessMimeType: Determines the content type
- parseFiles: Processes content based on type
- isAlreadyWritten: Prevents duplicate writes
- handleRedirected: Manages redirected URLs
- writeOutput: Saves processed content to disk
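For instance, isDomainValid might be implemented along these lines, using the hypothetical QueueModule shape from the sketch above; the factory function, its allow-list parameter, and the subdomain handling are all assumptions:

```ts
// Repeating the hypothetical QueueModule type so this block stands alone.
type QueueModule = (job: { url: string }) => Promise<boolean>;

// Possible shape of isDomainValid: accept exact matches and subdomains
// of any allowed domain (the allow-list option is an assumption).
function makeIsDomainValid(allowedDomains: string[]): QueueModule {
  return async (job) => {
    const { hostname } = new URL(job.url);
    return allowedDomains.some(
      (domain) => hostname === domain || hostname.endsWith("." + domain)
    );
  };
}

// Usage: only example.com URLs pass validation.
const isDomainValid = makeIsDomainValid(["example.com"]);
isDomainValid({ url: "https://sub.example.com/page" }).then(console.log); // true
isDomainValid({ url: "https://other.test/page" }).then(console.log);      // false
```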