This project is a parser for the GridDynamics blog. It creates a report containing the following sections:
- Top-5 authors,
- Top-5 newest articles,
- A plot with article counts for the 7 most popular tags.
The crawler is based on Scrapy and uses items to load the scraped data into storage.
Storage is implemented as a JSON file located in <project-dir>/src/parser/resources.
The main script with the report generator is <project-dir>/src/parser/report.py.
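Since the storage is a plain JSON file, loading it for the report boils down to a single `json.load`. Below is a minimal sketch; the exact file name under src/parser/resources is an assumption, not taken from the project:

```python
import json
from pathlib import Path

def load_storage(path):
    """Return the list of crawled records stored in a JSON file.

    The storage file name under src/parser/resources is project-specific;
    pass whichever path the crawler actually writes to.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

Usage would look like `records = load_storage("src/parser/resources/<storage-file>.json")`, with the real file name substituted.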
Example of an article page: https://blog.griddynamics.com/create-image-similarity-function-with-tensorflow-for-retail/
Extract to storage:
- Title
- URL to the full version
- First 160 characters of the text
- Publication date
- Author (full name)
- Tags
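The article fields above can be sketched as a simple record. The field names below are illustrative assumptions; the project's actual Scrapy item may name them differently:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Article:
    # Field names are assumptions; the real Scrapy item may differ.
    title: str
    url: str                   # link to the full version of the article
    text: str                  # first 160 characters of the body text
    publication_date: str
    author: str                # author's full name
    tags: List[str] = field(default_factory=list)

def preview(body: str, limit: int = 160) -> str:
    """Keep only the first `limit` characters of the article text."""
    return body[:limit]
```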
Example of an author page: https://blog.griddynamics.com/author/anton-ovchinnikov/
All authors page: https://blog.griddynamics.com/all-authors/
Extract to storage:
- Full name
- Job Title
- LinkedIn URL to the user profile
- Article count
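The author fields can likewise be sketched as a record; again, the field names are assumptions rather than the project's actual item definition:

```python
from dataclasses import dataclass

@dataclass
class Author:
    # Illustrative field names; the real Scrapy item may differ.
    full_name: str
    job_title: str
    linkedin_url: str
    articles_count: int   # number of articles written by this author
```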
The report must contain:
- Top-5 authors (based on article count)
- Top-5 newest articles (based on publication date)
- A plot with counts of the 7 most popular tags:
  - it must be a bar chart (column plot) with one column per tag
  - each bar must be labeled with its tag name
  - the count axis must show the number of articles with that tag
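The report computations above can be sketched as follows, assuming the storage records are dicts shaped like the fields listed earlier (the real key names may differ):

```python
from collections import Counter

def top_authors(authors, n=5):
    """Authors with the most articles, highest count first."""
    return sorted(authors, key=lambda a: a["articles_count"], reverse=True)[:n]

def top_new_articles(articles, n=5):
    """Most recently published articles; ISO dates sort correctly as strings."""
    return sorted(articles, key=lambda a: a["publication_date"], reverse=True)[:n]

def tag_counts(articles, n=7):
    """The n most popular tags, each paired with its article count."""
    counts = Counter(tag for a in articles for tag in a["tags"])
    return counts.most_common(n)

def plot_tag_counts(articles):
    """Bar chart: one labeled column per tag, article count on the other axis."""
    import matplotlib.pyplot as plt
    tags, counts = zip(*tag_counts(articles))
    plt.bar(tags, counts)                 # one column per tag, labeled by name
    plt.ylabel("Number of articles")
    plt.title("7 most popular tags")
    plt.show()                            # report.py may save to a file instead
```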
You can run the tests with the following command:
cd src/parser
python3 -m unittest tests/test_crawler.py
The project requires python3 and pip3 to be installed. Python 3.9.1 is preferred.
git clone git@github.com:Samarkina/gd-scrapy-parser.git
pip3 install -r requirements.txt
Alternatively, you can use a virtual environment instead of the Python installed on your computer. Steps for using virtualenv:
- Create the virtual-env
python3 -m venv <your-virtual-env-name>
- Activate the virtual-env
source <your-virtual-env-name>/bin/activate
- After that, you can install all the libraries necessary for the project (pip3 install as above).
To generate the report, run:
cd src/parser/
python3 report.py