This project is a parser for the GridDynamics blog. It creates a report containing the following sections:
- Top-5 authors,
- Top-5 newest articles,
- A plot with article counts for the 7 most popular tags.
The crawler is based on Scrapy and uses items to load the scraped data into storage.
Storage is implemented as a JSON file located in <project-dir>/src/parser/resources.
The main script with the report generator is <project-dir>/src/parser/report.py.
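Since the storage is a plain JSON file, loading it for the report boils down to a single `json.load`. Below is a minimal sketch; the exact file name under src/parser/resources is an assumption, not taken from the project:

```python
import json
from pathlib import Path

def load_storage(path):
    """Return the list of crawled records stored in a JSON file.

    The storage file name under src/parser/resources is project-specific;
    pass whichever path the crawler actually writes to.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

Usage would look like `records = load_storage("src/parser/resources/<storage-file>.json")`, with the real file name substituted.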
Example of an article page: https://blog.griddynamics.com/create-image-similarity-function-with-tensorflow-for-retail/
Extract to storage:
- Title
- URL to the full version
- First 160 characters of the text
- Publication date
- Author (full name)
- Tags
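The article fields above can be sketched as a simple record. The field names below are illustrative assumptions; the project's actual Scrapy item may name them differently:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Article:
    # Field names are assumptions; the real Scrapy item may differ.
    title: str
    url: str                   # link to the full version of the article
    text: str                  # first 160 characters of the body text
    publication_date: str
    author: str                # author's full name
    tags: List[str] = field(default_factory=list)

def preview(body: str, limit: int = 160) -> str:
    """Keep only the first `limit` characters of the article text."""
    return body[:limit]
```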
Example of an author page: https://blog.griddynamics.com/author/anton-ovchinnikov/
All authors page: https://blog.griddynamics.com/all-authors/
Extract to storage:
- Full name
- Job Title
- LinkedIn URL to the user profile
- Article count
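The author fields can likewise be sketched as a record; again, the field names are assumptions rather than the project's actual item definition:

```python
from dataclasses import dataclass

@dataclass
class Author:
    # Illustrative field names; the real Scrapy item may differ.
    full_name: str
    job_title: str
    linkedin_url: str
    articles_count: int   # number of articles written by this author
```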
The report must contain:
- Top-5 authors (based on article count)
- Top-5 newest articles (based on publication date)
- A plot with counts of the 7 most popular tags:
  - it must be a bar chart (column plot) with one column per tag
  - each bar must be labeled with its tag name
  - the count axis must show the number of articles with that tag
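The report computations above can be sketched as follows, assuming the storage records are dicts shaped like the fields listed earlier (the real key names may differ):

```python
from collections import Counter

def top_authors(authors, n=5):
    """Authors with the most articles, highest count first."""
    return sorted(authors, key=lambda a: a["articles_count"], reverse=True)[:n]

def top_new_articles(articles, n=5):
    """Most recently published articles; ISO dates sort correctly as strings."""
    return sorted(articles, key=lambda a: a["publication_date"], reverse=True)[:n]

def tag_counts(articles, n=7):
    """The n most popular tags, each paired with its article count."""
    counts = Counter(tag for a in articles for tag in a["tags"])
    return counts.most_common(n)

def plot_tag_counts(articles):
    """Bar chart: one labeled column per tag, article count on the other axis."""
    import matplotlib.pyplot as plt
    tags, counts = zip(*tag_counts(articles))
    plt.bar(tags, counts)                 # one column per tag, labeled by name
    plt.ylabel("Number of articles")
    plt.title("7 most popular tags")
    plt.show()                            # report.py may save to a file instead
```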
You can run the tests with the following command:
cd src/parser
python3 -m unittest tests/test_crawler.py
The project requires python3 and pip3 to be installed. Python 3.9.1 is preferred.
git clone git@github.com:Samarkina/gd-scrapy-parser.git
pip3 install -r requirements.txt
Alternatively, you can use a virtual environment instead of the Python installed on your computer. Steps for using virtualenv:
- Create the virtual-env
python3 -m venv <your-virtual-env-name>
- Activate the virtual-env
source <your-virtual-env-name>/bin/activate
- After that, you can install all the libraries necessary for the project (pip3 install as above).
To generate the report, run:
cd src/parser/
python3 report.py