
14D004 Scraping Project: Scrape all the available courses on datacamp.com and scrape all job posts on jobsinbarcelona.es using Scrapy

Primary Language: Python

14D004 Scraping Project

Project Description

The data and code in this repository allow users to scrape all the available courses on datacamp.com and all job posts on jobsinbarcelona.es using Scrapy, an open source and collaborative framework for extracting the data you need from websites.

  • The code was written for Python 3.6 and Scrapy 1.5.1

On the datacamp course page itself, you can search for courses of interest or browse all the courses by technology.


The datacamp.py script extracts all of the course titles within each of the six technologies, along with their course description, author, author's occupation and URL.


Jobs in Barcelona is a platform for tech-oriented jobs in Barcelona.


The jobsinbarcelona.py script scrapes all of the job listings along with the company, location, published date, job source and URL.


On the datacamp instructors page, you can find the details of all of the various course instructors.


The datacamp_instruct.py script extracts each instructor's title along with their subscriber count, occupation and URL. Furthermore, the script extracts their personal descriptions from their "Full Bios".



  • datacamp: Scrapy datacamp project stored here
  • jobsinbarcelona: Scrapy jobsinbarcelona project stored here
  • datacamp_instructors: Scrapy datacamp instructors project stored here

Each of these is a directory with the following contents (datacamp used as the example):

    scrapy.cfg            # deploy configuration file
    datacamp.csv          # scraped data exported as .csv
    datacamp.json         # scraped data exported as .json
    datacamp/             # project's Python module, you'll import your code from here
        items.py          # project items definition file (not used)
        middlewares.py    # project middlewares file (not used)
        pipelines.py      # project pipelines file (not used)
        settings.py       # project settings file (not used)
        spiders/          # a directory with the spiders
            datacamp.py   # the code for our datacamp spider


Installing Scrapy

Install the latest version of Scrapy (I recommend using Anaconda):

  • Anaconda distribution
conda install scrapy
  • PyPI
pip install scrapy

How to run the Spiders

To put the spiders to work, go to the relevant project’s top-level directory (i.e. datacamp, jobsinbarcelona or datacamp_instructors) and run:

scrapy crawl datacamp


scrapy crawl jobsinbarcelona


scrapy crawl datacamp_instructors
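
The three crawl commands above can also be driven from one Python script. The following is a minimal sketch and not part of the repository: it assumes the three project directories sit side by side under one root, that the spider names match the commands above, and that `scrapy` is available on the PATH. The names `PROJECTS`, `build_commands` and `run_all` are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: project directory -> spider name, taken from the commands above.
PROJECTS = {
    "datacamp": "datacamp",
    "jobsinbarcelona": "jobsinbarcelona",
    "datacamp_instructors": "datacamp_instructors",
}


def build_commands(root="."):
    """Return (working_directory, argv) pairs, one per spider."""
    root = Path(root)
    return [
        (root / project, ["scrapy", "crawl", spider])
        for project, spider in PROJECTS.items()
    ]


def run_all(root="."):
    """Run each spider from its own project's top-level directory."""
    for cwd, argv in build_commands(root):
        # check=True raises CalledProcessError if a crawl exits with an error.
        subprocess.run(argv, cwd=str(cwd), check=True)
```

Calling `run_all()` from the repository root would then run the spiders one after another, stopping at the first crawl that fails.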

Storing the scraped data

The simplest way to store the scraped data is to use feed exports, with the following commands:

scrapy crawl datacamp -o datacamp.csv


scrapy crawl jobsinbarcelona -o jobsinbarcelona.csv


scrapy crawl datacamp_instruct -o datacamp_instructors.csv

These commands generate the files datacamp.csv, jobsinbarcelona.csv and datacamp_instructors.csv, each containing all the items scraped by the corresponding spider.
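
Once exported, a CSV file can be inspected with Python's standard library. This is a minimal sketch: the helper name `load_items` is hypothetical, and because the column names depend on the fields each spider yields, none are assumed here.

```python
import csv


def load_items(csv_path):
    """Read a Scrapy CSV export into a list of dicts, one per scraped item."""
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which can occur in long text such as course descriptions.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `items = load_items("datacamp.csv")` followed by `len(items)` gives the number of scraped rows, and `items[0].keys()` lists the exported columns.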

You can also use other formats, like JSON:

scrapy crawl datacamp -o datacamp.json

Note: for historical reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you’ll end up with a broken JSON file.
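
One way to avoid the append problem is to delete any previous export before crawling again. A minimal sketch, using a hypothetical helper name `remove_stale_export`:

```python
from pathlib import Path


def remove_stale_export(path):
    """Delete a previous export so the next crawl writes a fresh file.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    target = Path(path)
    if target.is_file():
        target.unlink()
        return True
    return False
```

Calling `remove_stale_export("datacamp.json")` before `scrapy crawl datacamp -o datacamp.json` ensures the export starts from an empty file.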