Content Scraping Utils

This repository collects content-scraping utility functions and example code to make scraping workflows easier to implement.

Repository Structure

The repository structure is as follows:

.
├── examples
│   └── for_cnn.py
├── Makefile
├── output
├── README.md
├── requirements.txt
├── src
│   ├── __init__.py
│   └── utility_functions.py
└── tests
    └── utility_functions_test.py

4 directories, 7 files

How?

Prerequisite and Environment

  • Code is written in Python 3.11.1
  • Libraries involved:
beautifulsoup4==4.12.3
lxml==5.1.0
requests==2.31.0
soupsieve==2.5
urllib3==2.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
idna==3.6

Installation Steps

Prepare environment with virtualenv

$ virtualenv .venv --python=python3.11

Activate the Virtualenv

$ source .venv/bin/activate

Install required Libraries

$ pip install -r requirements.txt

or using Make

$ make init

Run Examples

$ make cnn
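The repository's Makefile contents aren't reproduced in this README, but based on the targets it uses (`init`, `cnn`, `test`) and the `examples/for_cnn.py` script shown in the tree above, a minimal sketch might look like the following. The recipe bodies are assumptions, not the actual Makefile.

```make
init:
	pip install -r requirements.txt

cnn:
	python3 examples/for_cnn.py

test:
	python3 -m unittest -v tests/*.py
```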

Run Tests

Tests are written with Python's built-in unittest module.

$ python3 -m unittest -v tests/*.py

or using Make

$ make test
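As an illustration of the unittest style used in `tests/utility_functions_test.py`, here is a small self-contained sketch. The helper function below is hypothetical; the real names in `src/utility_functions.py` may differ.

```python
import unittest

# Hypothetical helper standing in for something from src/utility_functions.py;
# the actual function names in the repository may differ.
def slug_from_title(title: str) -> str:
    """Turn an article title into a safe, lowercase filename stem."""
    return "-".join(title.lower().split())

class SlugFromTitleTest(unittest.TestCase):
    def test_basic_title(self):
        self.assertEqual(slug_from_title("Breaking News Today"),
                         "breaking-news-today")

    def test_extra_whitespace(self):
        self.assertEqual(slug_from_title("  Two   Words "), "two-words")

# Run the suite in-process instead of via unittest.main(), which calls sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugFromTitleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Dropping test files into `tests/` and following this `*_test.py` naming keeps them discoverable by the `make test` target.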

Why CNN?

CNN was chosen because its pages are straightforward to scrape and do not require rendering content via JavaScript. If they did, and given more time, I might have chosen another approach to handle that.

  • CNN pages have a standard structure my code can rely on, at least until it changes.
  • The canonical URL is used when naming output files, so different links to the same article do not overwrite each other's output.
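The canonical-URL idea in the second bullet can be sketched as follows: derive one stable output filename per article, so the same page fetched under different URL variants (tracking parameters, fragments, trailing slashes) maps to one file. The function and the file-naming scheme below are illustrative assumptions, not the repository's actual API; in practice the canonical URL itself would first be read from the page's `<link rel="canonical">` tag.

```python
from urllib.parse import urlsplit

def output_name_for(url: str) -> str:
    """Map a (canonical) article URL to a deterministic output filename.

    Illustrative only: the real repo's naming scheme may differ.
    """
    parts = urlsplit(url)
    # Drop scheme, query string, and fragment; keep host + path only.
    path = parts.path.strip("/") or "index"
    stem = f"{parts.netloc}/{path}".replace("/", "_").replace(".", "-")
    return stem + ".txt"

# Two URL variants of the same article resolve to the same output file.
a = output_name_for("https://edition.cnn.com/2024/02/01/world/story/index.html?utm_source=x")
b = output_name_for("https://edition.cnn.com/2024/02/01/world/story/index.html")
```

Keying output files on the canonical URL rather than the raw request URL is what prevents duplicate downloads of the same story from clobbering or multiplying output.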

Prospect

  • Reusable code in src/
  • Working versions for other content types: PDF, video, HTML tables, etc.
  • A broken branch, feature/pdf-1, contains PDF extraction code using PyPDF2, but it does not seem fully reliable yet.
  • Lack of time kept me from exploring video content.