This repository collects content-scraping utility functions and example scripts to make scraping workflows easier to implement.
The repository structure is as follows:
.
├── examples
│ └── for_cnn.py
├── Makefile
├── output
├── README.md
├── requirements.txt
├── src
│ ├── __init__.py
│ └── utility_functions.py
└── tests
└── utility_functions_test.py
4 directories, 7 files
- Code is written in Python 3.11.1
- Libraries involved:
beautifulsoup4==4.12.3
lxml==5.1.0
requests==2.31.0
soupsieve==2.5
urllib3==2.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
idna==3.6
$ virtualenv .venv --python=python3.11
$ source .venv/bin/activate
$ pip install -r requirements.txt
Or, using Make:
$ make init
$ make cnn
Python's unittest module is used for the tests.
$ python3 -m unittest -v tests/*.py
Or, using Make:
$ make test
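The tests follow the standard unittest style. As a minimal sketch of that style (note: `extract_title` here is a hypothetical helper for illustration, not necessarily part of the actual `src/utility_functions.py` API):

```python
import unittest

from bs4 import BeautifulSoup


def extract_title(html: str) -> str:
    """Hypothetical helper: return the <title> text of an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""


class ExtractTitleTest(unittest.TestCase):
    """Example of the unittest style used in tests/."""

    def test_extracts_title(self):
        html = "<html><head><title>Sample Article</title></head><body></body></html>"
        self.assertEqual(extract_title(html), "Sample Article")

    def test_missing_title_is_empty(self):
        # A document with no <head>/<title> should yield an empty string.
        self.assertEqual(extract_title("<p>no head</p>"), "")
```

Such a file dropped into `tests/` is picked up by the `python3 -m unittest` invocation above.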
CNN was chosen because it is straightforward and does not render its content via JS. If that had been the case, and given enough time, I might have chosen other ways to handle it.
- CNN pages have a standard structure that the code relies on, until it changes.
- The canonical URL is used so that the same link does not overwrite its output file.
- Reusable code lives in src/
- Working versions for other content types (PDF, video, HTML tables, etc.) are still outstanding.
- A broken branch with PDF-extraction code using PyPDF2 lives in feature/pdf-1, but it does not seem fully reliable yet.
- Time constraints kept me from exploring video content.
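The canonical-URL handling mentioned above can be sketched roughly as follows (an illustrative sketch only; the function names and the `.txt` output convention are assumptions, not the repo's actual API):

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def canonical_url(html: str, fetched_url: str) -> str:
    """Prefer the page's <link rel="canonical"> href; fall back to the fetched URL.

    CNN article pages typically declare a canonical link, so different URLs
    pointing at the same article resolve to one identity.
    """
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    if link and link.get("href"):
        return link["href"]
    return fetched_url


def output_filename(url: str) -> str:
    """Derive a stable filename from the URL path, so the same canonical
    article always maps to the same file in output/ instead of overwriting
    under a different name."""
    path = urlparse(url).path.strip("/")
    return (path.replace("/", "_") or "index") + ".txt"
```

Keying the output file on the canonical URL (rather than whatever link was clicked) is what prevents two aliases of the same article from producing duplicate or clobbered files.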