Web Content Scraper - Scrapeheap replica but mine own

Primary language: PHP

Scrapedeep

Originally from Dan Devine. Scrapedeep is a replica of Scrapeheap, a web content scraping tool.

 

Overview

Provide the URL of the site you'd like to scrape. Content files are generated for the pages of that site. The main body text of each page is dumped WITHOUT any formatting (e.g. line breaks, font sizes, font styles).

This project is suited for locally hosted websites. For instance, if you aim to alter the overall appearance of your site while keeping its content unchanged, this tool extracts the text content for you.
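
The formatting-free dump described above can be sketched roughly as follows. This is a minimal illustration, not Scrapedeep's actual code: the function name extractPlainText is hypothetical, and it assumes PHP's bundled DOM extension is available. It drops script and style elements, takes the text of the page body, and collapses line breaks and repeated spaces.

```php
<?php
// Hypothetical helper for illustration; not taken from the Scrapedeep source.
function extractPlainText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The XML prefix is a common workaround that makes loadHTML treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove elements whose text is not page content.
    foreach (['script', 'style'] as $tag) {
        // Copy to an array first: the live node list shrinks as nodes are removed.
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? $body->textContent : $doc->textContent;

    // Collapse every whitespace run (including line breaks) into one space.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

// Tags, extra spaces, the line break, and the script are all dropped.
echo extractPlainText("<p>Hello   <b>bold</b>\n world</p><script>x()</script>");
// prints: Hello bold world
```

Parsing with DOMDocument rather than calling strip_tags is a design choice in this sketch: strip_tags would keep the text inside script and style elements.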

IMPORTANT: Disable basic auth for the site you're scraping; otherwise, this scraper won't work for it.

 

Instructions

Right now, this version of Scrapedeep is in its infancy. Don't expect any fancy user interface. Just pop a URL in, pick the file type for your content, and expect results. These are written to a folder called output/. Simple as that.

 

Latest Updates

  1. See the dump of content as your scraper works
  2. Saves HTML & MD files in separate folders
  3. Adds some helpful text so that, if you want to scrape again, you can just go ahead
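
The output layout mentioned in this README (an output/ folder, with HTML and MD files in separate folders) could look something like the sketch below. The function names outputPath and saveContent, the html/ and md/ subfolder names, and the way file names are derived from the URL path are all assumptions made for illustration; the real project may organize its files differently.

```php
<?php
// Hypothetical helpers; the naming scheme is an assumption, not Scrapedeep's actual layout.
function outputPath(string $url, string $type, string $baseDir = 'output'): string
{
    $type = strtolower($type);
    if (!in_array($type, ['html', 'md'], true)) {
        throw new InvalidArgumentException("Unsupported file type: $type");
    }

    // Build the file name from the URL path; the site root becomes "index".
    $path = (string) parse_url($url, PHP_URL_PATH);
    $slug = trim(preg_replace('/[^A-Za-z0-9]+/', '-', $path) ?? '', '-');
    if ($slug === '') {
        $slug = 'index';
    }

    return "$baseDir/$type/$slug.$type";
}

function saveContent(string $url, string $type, string $content): string
{
    $file = outputPath($url, $type);
    // Create output/html or output/md on first use.
    if (!is_dir(dirname($file))) {
        mkdir(dirname($file), 0777, true);
    }
    file_put_contents($file, $content);

    return $file;
}
```

With these assumptions, outputPath('http://example.test/about/team', 'md') returns output/md/about-team.md, and a site root URL maps to index.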

 

Local Deployment

  1. Download/Clone the project
  2. Install dependencies by running composer install && npm install
  3. Place the project in the directory where Valet is parked
  4. Access the project locally via Valet at http://scrapedeep.test

This assumes you have Valet installed and properly configured for your project. If not, please refer to the Valet documentation for setup instructions.

 
