Source: https://medium.com/data-science-library/amazing-data-science-publications-on-medium-17c5ba862774
List of Top DS Publications
- Towards Data Science: https://towardsdatascience.com
- Learn Data Science: https://blog.exploratory.io
- Applied Data Science: https://medium.com/applied-data-science
- Center for Data Science: https://medium.com/center-for-data-science
List of Top ML Publications
- Learning New Stuff: https://medium.com/learning-new-stuff
- Machine Learnings: https://machinelearnings.co/
- Machine Learning for Humans: https://medium.com/machine-learning-for-humans
- ML Review: https://medium.com/mlreview
- Open Machine Learning Course: https://medium.com/open-machine-learning-course
- metaflow-ai: https://blog.metaflow.fr/
- Inside Machine Learning: https://medium.com/inside-machine-learning
Bot Description
There are two bots here, and they must be run sequentially in the following order:
link_scrap
Collects the links of the articles from the publications listed above.
Running the `link_scrap` bot will collect all the article links and store them in a local mongo database (a sketch of such a storage pipeline follows the format below):
- Database Name: `mediumCrawl`
- Collection Name: `mediumLinks`
- Document Format:
{ "_id" : ObjectId("5d7df79048b3c66d77c86fe3"), "articleTitle" : "Guitar-Set, a New Dataset for Music Information Retrieval", "articleLink" : "https://medium.com/center-for-data-science/guitar-set-a-new-dataset-for-music-information-retrieval-41b7861a87d7?source=collection_archive---------0-----------------------" }
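For orientation, here is a minimal sketch of how documents of this shape could be written from a Scrapy item pipeline using pymongo. The class name and field handling are illustrative assumptions, not the repo's actual pipeline code:

```python
# Hypothetical sketch (not the repo's actual code): a Scrapy item pipeline
# that writes link documents of the shape shown above via pymongo.
import pymongo


class MongoLinksPipeline:
    def open_spider(self, spider):
        # Connect to the local mongo server the bots expect to be running.
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["mediumCrawl"]["mediumLinks"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # MongoDB adds the _id field automatically on insert.
        self.collection.insert_one({
            "articleTitle": item.get("articleTitle"),
            "articleLink": item.get("articleLink"),
        })
        return item
```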
content_scrap
Collects the contents of the links aggregated via the `link_scrap` bot (a sketch of how it might read them appears after the format below).
Running the `content_scrap` bot will collect the desired contents of the articles from the links (aggregated by the `link_scrap` bot and stored inside the `mediumLinks` collection) and store them in a local mongo database:
- Database Name: `mediumCrawl`
- Collection Name: `mediumContents`
- Document Format:
{'articleTile': 'NYU researchers invent new real-time data analysis system for humanitarian agencies', 'content': ['How can we tame the dragon', 'What is coming in future... '], 'nameOfAuthor': 'NYU Center for Data Science', 'nameOfPublication': 'Center for Data Science', 'postingTime': '2018-01-24T15:19:52.202Z'}
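Since `content_scrap` works off the links that `link_scrap` saved, its spider presumably starts by reading the `mediumLinks` collection, roughly like this sketch (illustrative only; the real spider may differ):

```python
# Illustrative only: iterate over the article links saved by link_scrap,
# which is roughly what content_scrap must do before fetching each page.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
links = client["mediumCrawl"]["mediumLinks"]

# Project only the articleLink field; each link becomes a page to scrape.
for doc in links.find({}, {"articleLink": 1}):
    print(doc["articleLink"])
```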
Run the Bots
- Create a Python virtual environment and install the requirements via `pip install -r requirements.txt`
link_scrap
1. To run the first bot:
- Install `mongodb`
- Run the local `mongo` server
- `cd` to the `link_scrap` directory
- Run `$ scrapy crawl link_scrap`
- You should see a `links.jl` file in your current directory and the documents in the `mediumLinks` collection inside the `mediumCrawl` database
- You can count the number of documents saved in your local mongo database by running the following command in your `mongo` shell (a Python equivalent is sketched below): `> db.mediumLinks.countDocuments({})`
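If you would rather check from Python than the mongo shell, a pymongo equivalent of the count above (assuming the default local connection) looks like this:

```python
# Same count as the mongo shell command above, via pymongo.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
print(client["mediumCrawl"]["mediumLinks"].count_documents({}))
```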
content_scrap
2. To run the second bot:
- Keep your `mongo` server running
- `cd` to the `content_scrap` directory
- Run `$ scrapy crawl content_scrap`
- You should see a `contents.jl` file in your current directory and the documents in the `mediumContents` collection inside the `mediumCrawl` database
- You can count the number of documents saved in your local mongo database by running the following command in your `mongo` shell: `> db.mediumContents.countDocuments({})`
- To see a sample document, run `db.mediumContents.findOne()` (a pymongo version of both checks is sketched below)
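The same checks can be done from Python with pymongo (default local connection assumed):

```python
# pymongo equivalents of the two shell commands above.
import pprint

import pymongo

db = pymongo.MongoClient("mongodb://localhost:27017")["mediumCrawl"]
print(db.mediumContents.count_documents({}))
pprint.pprint(db.mediumContents.find_one())
```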
Quirks
- In the default settings, the bots will only collect data from January - March of 2018 and 2019
- To increase the data collection range or add new sources, edit the `link_scrap/link_scrap/spiders/sources.py` file (a rough sketch of its likely layout follows)
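The layout of `sources.py` is not reproduced here; as a rough idea of what adding a source involves, it presumably holds the publication URLs and the date window, along these entirely hypothetical lines:

```python
# Hypothetical sketch of link_scrap/link_scrap/spiders/sources.py;
# the real file's names and structure may differ.
SOURCES = [
    "https://towardsdatascience.com",
    "https://medium.com/center-for-data-science",
    # add new publications here
]

# Default collection window: January - March of 2018 and 2019.
YEARS = [2018, 2019]
MONTHS = [1, 2, 3]
```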
Issues
- Article title scraping (done by `link_scrap`) is not perfect; there are a lot of null values for article titles here and there. This is not a big deal, since the titles are collected again when running the `content_scrap` bot (a quick check for missing titles is sketched below).
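A quick way to gauge how many titles came back null, assuming the default local database (illustrative, not part of the repo):

```python
# Count link documents with a missing or empty articleTitle.
import pymongo

links = pymongo.MongoClient("mongodb://localhost:27017")["mediumCrawl"]["mediumLinks"]
missing = links.count_documents({"articleTitle": {"$in": [None, ""]}})
total = links.count_documents({})
print(f"{missing} of {total} link documents lack an article title")
```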