NewsExtractor

API designed to extract large numbers of articles from any URL or website using CSS selectors, documented with Swagger (OpenAPI 3).

Primary language: Java. License: GNU General Public License v3.0 (GPL-3.0).

NewsExtractor v 1.0.0

API dedicated to extracting news using CSS query selectors, documented with OpenAPI 3.

About

The API applies a pattern of CSS query selectors to extract news in at most three phases; the number of phases can be reduced as needed.
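The phased approach can be pictured with a minimal sketch. It is not the project's actual code: the class name `PhasedExtractionSketch`, the `run` method, and the `select` function are all hypothetical, and the idea that each phase feeds its matches into the next (for example sections, then article links, then article content) is an assumption about how the phases relate. Selector evaluation is left to a pluggable function so the sketch needs only the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical sketch of a selector pattern applied in up to three phases. */
final class PhasedExtractionSketch {

    /** Maximum number of phases, as stated above. */
    static final int MAX_PHASES = 3;

    /**
     * Runs one selector per phase, in order. The values matched in one phase
     * become the inputs of the next phase (an assumption of this sketch).
     *
     * @param start     starting inputs, e.g. the newspaper URL
     * @param selectors one CSS selector per phase, between 1 and MAX_PHASES
     * @param select    assumed engine: (input, selector) -> matched values
     */
    static List<String> run(List<String> start,
                            List<String> selectors,
                            BiFunction<String, String, List<String>> select) {
        if (selectors.isEmpty() || selectors.size() > MAX_PHASES) {
            throw new IllegalArgumentException(
                    "expected 1 to " + MAX_PHASES + " selectors");
        }
        List<String> current = start;
        for (String selector : selectors) {
            List<String> next = new ArrayList<>();
            for (String input : current) {
                next.addAll(select.apply(input, selector));
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in engine that only labels its input; a real engine would
        // fetch the page and evaluate the CSS selector against its HTML.
        BiFunction<String, String, List<String>> fake =
                (input, selector) -> List.of(input + " > " + selector);
        System.out.println(run(List.of("https://example.com"),
                List.of("nav a", "article a", "h1"), fake));
    }
}
```

Passing fewer selectors simply runs fewer phases, which is one way the "reduced as needed" behaviour could be modelled.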

This API uses the following dependencies:

Pre-requisites

Usage

  • Important: Check ApplicationConfig.java, because the database configuration uses environment variables.
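As an illustration of what environment-based database configuration usually looks like, here is a minimal sketch. It does not reproduce ApplicationConfig.java: the variable names `DB_URL`, `DB_USER`, and `DB_PASSWORD`, and the classes `DatabaseEnvSketch` and `DbSettings`, are assumptions made for the example, so check the real file for the names the project expects.

```java
import java.util.Map;

/** Hypothetical sketch: database settings read from environment variables. */
final class DatabaseEnvSketch {

    /** Simple immutable holder for the three settings used in this sketch. */
    static final class DbSettings {
        final String url;
        final String user;
        final String password;

        DbSettings(String url, String user, String password) {
            this.url = url;
            this.user = user;
            this.password = password;
        }
    }

    /** Reads the settings from the given environment, failing fast if one is missing. */
    static DbSettings fromEnv(Map<String, String> env) {
        return new DbSettings(
                require(env, "DB_URL"),       // assumed variable name
                require(env, "DB_USER"),      // assumed variable name
                require(env, "DB_PASSWORD")); // assumed variable name
    }

    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        try {
            DbSettings settings = fromEnv(System.getenv());
            System.out.println("Database URL: " + settings.url);
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Failing fast on a missing variable makes a misconfigured environment visible at startup rather than at the first database call.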

Install the prerequisites, then run the following from the project folder:

- mvn clean install
- java -jar target/NewsExtractor.jar

Next, check this endpoint.

Features:

- You can add specific sections.
- You can add specific articles or article sources.
- You can search for words common to all newspaper articles.
- If the page to be extracted does not have connection problems,
  extracting and saving takes up to 0.80 seconds per article.
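The common-word search can be thought of as a set intersection over the words of each article. The sketch below shows that idea only; it is not the API's implementation, and the class `CommonWordsSketch` and its methods are hypothetical. It lowercases the text, splits on anything that is not a letter or digit, and keeps the words present in every article.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: words that appear in every article of a collection. */
final class CommonWordsSketch {

    /** Splits text into distinct lowercase words (letters and digits only). */
    static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Returns the words shared by all articles, sorted alphabetically. */
    static Set<String> commonWords(List<String> articles) {
        if (articles.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> common = new TreeSet<>(words(articles.get(0)));
        for (String article : articles.subList(1, articles.size())) {
            common.retainAll(words(article)); // keep only words seen in this article too
        }
        return common;
    }

    public static void main(String[] args) {
        System.out.println(commonWords(List.of(
                "The quick fox", "A quick brown FOX!", "fox, quick.")));
    }
}
```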

Note:

Minimal use requires a newspaper source and a pattern with at least one specific selector.

FAQ:

  • Does the API work completely? Not yet. Tests have not been integrated, and for pages without the correct adjustment it is not currently possible to extract all the news, because some pages limit requests or have sources that take a long time to return a response.

  • If you have a suggestion or advice, feel free to send me an email.