gratka.pl_research

πŸ‘¨β€πŸ’» Built with

Description of the project

This project was created to crawl metadata from Gratka.pl, an auction service for 🚗 cars. The following metadata fields are collected:

'marka', 'model', 'cena', 'miasto', 'wojewodztwo', 'stan_techniczny', 'przebieg', 'rodzaj_ogłoszenia',
'do_negocjacji', 'typ_nadwozia', 'stan_pojazdu', 'rok_produkcji', 'rodzaj_paliwa', 'pojemność_silnika_cm3',
'moc_silnika', 'skrzynia_biegów', 'zarejestrowany_w_polsce', 'kraj_pierwszej_rejestracji', 'kolor',
'liczba_drzwi', 'liczba_miejsc', 'numer_vin', 'ważny_przegląd', 'link'

When the spider finishes its job, this data is stored in PostgreSQL. The next step is data cleansing and visualisation, which are done in a Jupyter notebook.
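
How the items end up in PostgreSQL is easiest to see as a Scrapy item pipeline. The sketch below is an assumed implementation, not the project's actual pipelines.py; the table name cars, the host, and the user are placeholders tied to the .env variables described later, and the pipeline would still need to be enabled in settings.py.

  # Sketch of a Scrapy item pipeline writing rows to PostgreSQL via psycopg2.
  # Table name "cars" and connection details are assumptions.
  import os

  import psycopg2

  class PostgresPipeline:
      def open_spider(self, spider):
          # Credentials come from the .env variables described below.
          self.conn = psycopg2.connect(
              host="db",  # assumed database service name from docker-compose.yml
              dbname=os.environ["POSTGRES_DB"],
              user="postgres",
              password=os.environ["DATABASE_PASSWORD"],
          )
          self.cur = self.conn.cursor()

      def process_item(self, item, spider):
          # Insert a few of the scraped fields; the real table would hold
          # every field from the list above.
          self.cur.execute(
              "INSERT INTO cars (marka, model, cena, link) VALUES (%s, %s, %s, %s)",
              (item.get("marka"), item.get("model"), item.get("cena"), item.get("link")),
          )
          self.conn.commit()
          return item

      def close_spider(self, spider):
          self.cur.close()
          self.conn.close()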

This project uses 3 Docker containers:

  • Container with Python and Scrapy

    • Contains the gratka spider, which inherits from the scrapy.Spider class. This Scrapy script crawls metadata from every car-sale advertisement and also saves each ad's HTML to a file in a folder (see the spider sketch after this list)
  • Container with PostgreSQL

  • Container with Jupyter Notebook

    • A notebook for cleansing and visualising the data, using these libraries:
      • pandas
      • geopandas
      • numpy
      • matplotlib
      • seaborn
      • pylab
      • psycopg2
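
Below is a minimal sketch of what the gratka spider can look like. It is illustrative, not the project's actual gratka.py: the start URL and the CSS selectors are assumptions, and the real spider extracts the full field list shown above.

  # Simplified sketch of a spider inheriting from scrapy.Spider.
  # Start URL and selectors are illustrative assumptions.
  import uuid

  import scrapy

  class GratkaSpider(scrapy.Spider):
      name = "gratka"
      start_urls = ["https://gratka.pl/motoryzacja/osobowe"]  # assumed listing URL

      def parse(self, response):
          # Follow every ad on the listing page, then move to the next page.
          for href in response.css("a.teaser::attr(href)").getall():  # assumed selector
              yield response.follow(href, callback=self.parse_ad)
          next_page = response.css("a.pagination-next::attr(href)").get()  # assumed selector
          if next_page:
              yield response.follow(next_page, callback=self.parse)

      def parse_ad(self, response):
          # Save the raw HTML of each ad under a unique name in HTML_FILES.
          with open(f"HTML_FILES/{uuid.uuid4()}.html", "wb") as f:
              f.write(response.body)
          # Yield the metadata item (only two of the fields listed above, for brevity).
          yield {
              "marka": response.css("[data-name='marka']::text").get(),  # assumed selector
              "link": response.url,
          }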

🌲 Project tree

├── Database
│   └── create_table.sql
├── docker-compose.yml
├── gratkascrap
│   ├── Dockerfile
│   ├── HTML_FILES
│   │   └── 00a3d318-6fcd-4bb8-9bd7-6c1b7ac4d69c.html
│   ├── __init__.py
│   ├── gratkascrap
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       ├── __pycache__
│   │       │   ├── __init__.cpython-39.pyc
│   │       │   └── gratka.cpython-39.pyc
│   │       └── gratka.py
│   ├── requirements.txt
│   └── scrapy.cfg
├── notebook
│   ├── Dockerfile
│   ├── data_visualisation.ipynb
│   ├── requirements.txt
│   ├── voivodeship.shp
│   └── voivodeship.shx
└── .env-sample

🔑 Set up your local variables

To run this project properly, you need to assign environment variables in a .env file.

This repo contains a .env-sample file with the variables used to run the containers. Assign the variables below in your .env file:

DATABASE_PASSWORD=
JUPYER_TOKEN=
POSTGRES_DB=
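
For example (the values below are placeholders; pick your own):

  DATABASE_PASSWORD=changeme
  JUPYER_TOKEN=my-secret-token
  POSTGRES_DB=gratka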

βš™οΈ Run Locally

  • Clone the project
  • Go to the project directory and type in the CLI:
  $ ls

You should see this:

Database    docker-compose.yml	gratkascrap     notebook

Change directory to the folder with the Scrapy Dockerfile:

  $ cd gratkascrap

Build the Scrapy image: 🚨 Docker must be running on your machine to run this command 🚨

  $ docker build -t scrapy_gratka .     

Go back to the main directory:

  $ cd ..

Change directory to the folder with the notebook Dockerfile:

  $ cd notebook

Build the notebook image:

  $ docker build -t notebook_gratka .     

Go back to the main directory to run Docker Compose:

  $ cd ..

Run Docker Compose:

  $ docker-compose up
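
Equivalently, the whole build-and-run sequence can be chained into one command from the project root:

  $ docker build -t scrapy_gratka ./gratkascrap && \
    docker build -t notebook_gratka ./notebook && \
    docker-compose up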

📊 Data cleansing and visualisation

Now all three containers are running. It will take about 15-20 minutes for Scrapy to crawl all pages; you will see in the terminal when Scrapy finishes its job. Once Scrapy has finished, open JupyterLab via localhost by typing in your browser:

  localhost:8888

🚨 In case the notebook requires a token, pass the value assigned to JUPYER_TOKEN in .env 🚨

Next, choose the file 🗒️ data_visualisation.ipynb and run all cells to see the data analysis.
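
Inside the notebook, the scraped data is loaded from PostgreSQL into pandas before cleansing. A sketch of what the first cells can look like (the table name cars and the connection details are assumptions matching the pipeline sketch earlier):

  # Load the scraped data from PostgreSQL into a pandas DataFrame.
  # Host, user, and table name are assumptions.
  import os

  import pandas as pd
  import psycopg2

  conn = psycopg2.connect(
      host="db",  # assumed docker-compose service name
      dbname=os.environ["POSTGRES_DB"],
      user="postgres",
      password=os.environ["DATABASE_PASSWORD"],
  )
  df = pd.read_sql("SELECT * FROM cars", conn)  # everything the spider collected
  df.head()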