A versatile web scraper built with Flask, BeautifulSoup, and SQLite. This tool allows users to define scraping tasks, which can be stored in a database and executed via a simple API. It also includes functionality for one-time scraping tasks and JSON data filtering using JMESPath.
- CRUD Operations: Create, read, update, and delete scraping definitions in a SQLite database.
- Dynamic Scraping: Execute scraping tasks based on stored definitions.
- JSON Data Filtering: Use JMESPath queries to filter and transform scraped data.
- One-time Execution: Test scraping tasks without saving definitions.
- Insert and Execute: Add a new scraping definition and execute it in a single step.
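Since the CRUD operations above all revolve around stored scraping definitions, a single SQLite table is enough to back them. The sketch below shows one plausible shape for that table — the table and column names are illustrative assumptions mirroring the definition payload fields, not necessarily the actual schema in `app.py`:

```python
import sqlite3

# Hypothetical schema mirroring the definition payload fields
# (endpoint, url, element_selector, config, filter_expression).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS definitions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        endpoint TEXT UNIQUE NOT NULL,
        url TEXT NOT NULL,
        element_selector TEXT NOT NULL,
        config TEXT,             -- request config stored as a JSON string
        filter_expression TEXT   -- optional JMESPath expression
    )
""")
conn.execute(
    "INSERT INTO definitions (endpoint, url, element_selector, config) VALUES (?, ?, ?, ?)",
    ("example_scrape", "https://example.com", ".example-class", '{"method": "GET"}'),
)
row = conn.execute("SELECT endpoint, url FROM definitions").fetchone()
print(row)  # ('example_scrape', 'https://example.com')
```

Storing `config` as a JSON string keeps the table flat while still allowing arbitrary per-definition request options.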
Clone the repository:

```bash
git clone [repository-url]
```

Install dependencies (`sqlite3` ships with the Python standard library, so it is not installed via pip):

```bash
pip install flask beautifulsoup4 requests jmespath
```

Run the application:

```bash
python app.py
```
Create a New Scraping Definition:
- Endpoint: `/definition`
- Method: POST
- Payload (`filter_expression` is an optional JMESPath expression; JSON does not allow comments, so it is simply omitted when unused):

```json
{
  "endpoint": "example_scrape",
  "url": "https://example.com",
  "element_selector": ".example-class",
  "config": {
    "method": "GET",
    "headers": { "User-Agent": "Mozilla/5.0" }
  },
  "filter_expression": "expression"
}
```
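The same create call can be issued from Python with only the standard library. This sketch builds the request without sending it; the commented `urlopen` line assumes the Flask app is running locally on port 5000:

```python
import json
import urllib.request

payload = {
    "endpoint": "example_scrape",
    "url": "https://example.com",
    "element_selector": ".example-class",
    "config": {"method": "GET", "headers": {"User-Agent": "Mozilla/5.0"}},
}

# Build a POST request carrying the definition as a JSON body.
req = urllib.request.Request(
    "http://localhost:5000/definition",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the server running, send it with:
#   urllib.request.urlopen(req)
print(req.get_method())  # POST
```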
Update an Existing Definition:
- Endpoint: `/definition/[id]`
- Method: PUT
Delete a Definition:
- Endpoint: `/definition/[id]`
- Method: DELETE
Retrieve All Definitions:
- Endpoint: `/getdefs`
- Method: GET
Execute a Defined Scraping Task:
- Endpoint: `/scrape/[endpoint]`
- Method: GET
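At its core, executing a scrape means fetching the stored URL and pulling out the elements matched by `element_selector`. The project does this with BeautifulSoup; the sketch below approximates the same extraction step using only the standard library — a fixed HTML string instead of a live fetch, and a bare class-name match instead of full CSS selector support:

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect the text of elements carrying a given class attribute —
    a crude stand-in for BeautifulSoup's select(".example-class")."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0        # >0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.depth += 1
            self.results.append("")   # start a new result entry
        elif self.depth:
            self.depth += 1           # nested tag inside a match

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data  # accumulate text of the match

html = '<div class="example-class">Hello</div><div class="other">skip</div>'
parser = ClassTextExtractor("example-class")
parser.feed(html)
print(parser.results)  # ['Hello']
```

BeautifulSoup handles malformed markup and the full CSS selector grammar, which is why the project depends on it rather than on a hand-rolled parser like this one.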
One-time Execution:
- Endpoint: `/test`
- Method: POST
- Payload: Same as the create operation, but the definition is not stored in the database.
Insert and Execute:
- Endpoint: `/insertexecute`
- Method: POST
- Payload: Same as the create operation; the definition is stored and then executed immediately.
Creating a Scraping Task:

```bash
curl -X POST http://localhost:5000/definition \
  -H "Content-Type: application/json" \
  -d '{ "endpoint": "wiki_scrape", "url": "https://en.wikipedia.org/wiki/Main_Page", "element_selector": "#mp-upper .mp-h2", "config": { "method": "GET", "headers": { "User-Agent": "Mozilla/5.0" } } }'
```

Executing a Scraping Task:

```bash
curl http://localhost:5000/scrape/wiki_scrape
```

One-time Scraping Execution:

```bash
curl -X POST http://localhost:5000/test \
  -H "Content-Type: application/json" \
  -d '{ "url": "https://example.com", "element_selector": ".example-class", "config": { "method": "GET" } }'
```
JMESPath expressions can be used to filter and transform the JSON data returned by a scrape. For more information on JMESPath syntax, see the JMESPath Tutorial at https://jmespath.org/tutorial.html.