/Utility-Text-Scrapper

The prime objective of this repository is to facilitate the extraction of texts from web URLs and also parallelize the extraction process so as to reduce the processing time

Primary LanguagePythonMIT LicenseMIT

Utility-Text-Scrapper

The prime objective of this repository is to facilitate the extraction of texts from web URLs and also parallelize the extraction process so as to reduce the processing time

Table of Contents

General Information

  • The prime objective of this repository is to help developers / researchers / data enthusiasts in validating a list of URLs, classify them as active, inactive and unreachable links and also provide a way to scrap all textual content within those active URL links.
  • Use built in python libaries and apply concurrent workers (asyncronous operation) to perform validation as well as scrapping operations which would result in speeding up both these operations.
  • The validation process makes use of requests library whereas scrapping process is supported by Trafilatura and Beautiful Soup libaries. The features of these libraries are mentioned below if you wnat to know as to why these were choosen.

Technologies Used

  • Python
  • Trafilatura
  • Beautiful Soup
  • Concurrent Tasks

Features

  • General features:
    1. Leverage 3rd party libraries and helps in optimum level of text scrapping even if any one of the library fails to perform.
    2. Asynchronous URL validation and text scrapping resulting in reduced operational time.
  • Trafilatura features (as per official documentation of Trafilatura):
    1. Web crawling and text discovery
      • Focused crawling and politeness rules
      • Support for sitemaps (TXT, XML) and feeds (ATOM, JSON, RSS)
      • URL management (blacklists, filtering and de-duplication)
    2. Seamless and parallel processing, online and offline
      • URLs, HTML files or parsed HTML trees usable as input
      • Efficient and polite processing of download queues
      • Conversion of previously downloaded files
    3. Robust and efficient extraction
      • Main text (with LXML, common patterns and generic algorithms: jusText, fork of readability-lxml)
      • Metadata (title, author, date, site name, categories and tags)
      • Formatting and structural elements: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
      • Comments (if applicable)
    4. Output formats
      • Text (minimal formatting or Markdown)
      • CSV (with metadata, tab-separated values)
      • JSON (with metadata)
      • XML (with metadata, text formatting and page structure) and TEI-XML
    5. Optional add-ons
      • Language detection on extracted content
      • Graphical user interface (GUI)
      • Speed optimizations

Screenshots

Setup:

  • git clone https://github.com/ManashJKonwar/Utility-Text-Scrapper.git (Clone the repository)
  • python3 -m venv scrappingVenv (Create virtual environment from existing python3)
  • activate the "scrappingVenv" (Activating the virtual environment)
  • pip install -r requirements.txt (Install all required python modules)

Usage

Project Status

Project is: _in progress

Room for Improvement

References

[1] XXX - Paper Link
[2] YYY - Paper Link

Acknowledgements

Contact

Created by @ManashJKonwar - feel free to contact me!