scradium

Scraping blog posts from top data science publications on Medium

Source: https://medium.com/data-science-library/amazing-data-science-publications-on-medium-17c5ba862774

List of Top DS Publications

List of Top ML Publications

Bot Description

There are two bots in this project, and they must be run sequentially in the following order:

link_scrap

Collects the links to the articles of the publications listed above.

Running the link_scrap bot will:

Collect the links to all the articles and store them in a local MongoDB database.

  • Database Name: mediumCrawl

  • Collection Name: mediumLinks

  • Document Format:

    {
        "_id" : ObjectId("5d7df79048b3c66d77c86fe3"),
        "articleTitle" : "Guitar-Set, a New Dataset for Music Information Retrieval",
        "articleLink" : "https://medium.com/center-for-data-science/guitar-set-a-new-dataset-for-music-information-retrieval-41b7861a87d7?source=collection_archive---------0-----------------------"
    }
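
For reference, here is a minimal pymongo sketch of how a document in this shape could be written to the mediumLinks collection; it only illustrates the storage format and is not the repository's actual pipeline code:

    # Sketch only: illustrates the mediumLinks document shape with pymongo,
    # not the actual scrapy pipeline used by the bot.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # local mongo server
    links = client["mediumCrawl"]["mediumLinks"]

    # Each scraped article becomes one document with a title and a link.
    links.insert_one({
        "articleTitle": "Guitar-Set, a New Dataset for Music Information Retrieval",
        "articleLink": "https://medium.com/center-for-data-science/"
                       "guitar-set-a-new-dataset-for-music-information-retrieval-41b7861a87d7",
    })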

content_scrap

Collects the content of the articles whose links were aggregated by the link_scrap bot.

Running the content_scrap bot will:

Collect the desired content of the articles from the links (aggregated by the link_scrap bot and stored in the mediumLinks collection) and store it in a local MongoDB database.

  • Database Name: mediumCrawl

  • Collection Name: mediumContents

  • Document Format:

    {
        'articleTile': 'NYU researchers invent new real-time data analysis system for humanitarian agencies',
        'content': ['How can we tame the dragon', 'What is coming in future... '],
        'nameOfAuthor': 'NYU Center for Data Science',
        'nameOfPublication': 'Center for Data Science',
        'postingTime': '2018-01-24T15:19:52.202Z'
    }
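
A minimal pymongo sketch (illustrative only) of reading these documents back, using the field names from the format above:

    # Sketch only: read content documents back out of the local database.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    contents = client["mediumCrawl"]["mediumContents"]

    # Field names follow the document format shown above.
    for doc in contents.find({"nameOfPublication": "Center for Data Science"}):
        print(doc["postingTime"], doc["articleTile"])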

Run the Bots

  • Create a Python virtual environment and install the requirements via

    python3 -m venv venv && source venv/bin/activate
    pip install -r requirements.txt
    

1. link_scrap

To run the first bot,

  • Install MongoDB

  • Run the local MongoDB server

  • cd to the link_scrap directory

  • Run

    $ scrapy crawl link_scrap
    
  • You should see a links.jl file in your current directory and the documents in the mediumLinks collection inside the mediumCrawl database (see the sketch after these steps for reading links.jl in Python)

  • You can count the number of documents saved in your local MongoDB database by running the following command in the mongo shell:

    > db.mediumLinks.countDocuments({})
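
The links.jl file produced by the crawl appears to be a JSON Lines feed export (one JSON object per line); assuming that format and the fields shown in the document format above, a minimal Python sketch for reading it:

    # Sketch only: links.jl is read here as a JSON Lines file (one object per line).
    import json

    with open("links.jl") as fh:
        for line in fh:
            record = json.loads(line)
            print(record.get("articleTitle"), "->", record.get("articleLink"))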
    

2. content_scrap

To run the second bot,

  • Keep your MongoDB server running

  • cd to the content_scrap directory

  • Run

    $ scrapy crawl content_scrap
    
  • You should see a contents.jl file in your current directory and the documents in the mediumContents collection inside the mediumCrawl database

  • You can count the number of documents saved in your local MongoDB database by running the following command in the mongo shell:

    > db.mediumContents.countDocuments({})
    
  • To see a sample document, run

    > db.mediumContents.findOne()
    

Quirks

  • With the default settings, the bots will only collect data from January to March of 2018 and 2019

  • To increase the data collection range or add new sources, add them to the link_scrap/link_scrap/spiders/sources.py file (see the hypothetical sketch below)
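
The structure of sources.py is not reproduced in this README; purely as an illustration, an entry might look something like the sketch below (the names, fields, and URL are assumptions, not the real file):

    # Hypothetical example only -- the real sources.py may use a different structure.
    # Each entry would point the spider at one publication archive and a date range.
    SOURCES = [
        {
            "publication": "Towards Data Science",
            "archive_url": "https://towardsdatascience.com/archive",
            "years": [2018, 2019],
            "months": ["01", "02", "03"],   # January - March, matching the default range
        },
    ]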

Issues

  • Article title scraping by link_scrap is not perfect; there are quite a few null values for the article title here and there. This is not a big deal, since the titles are collected again when running the content_scrap bot.