Web Crawling 101 - Ongoing Project

Wondering what this is all about? Take 2 minutes to read our short Open Code, Open Data Manifesto.

This project is structured to work as a series of classes focused on bootstrapping your data-mining / web-crawling knowledge. Some of the topics that are covered here:

Anatomy of a Crawler (Policies and Behaviors)
Understanding HTTP Requests
Scraping / Parsing data out of HTML pages
Tooling (Frameworks and custom-made libraries)
Finding your public source of data
Modeling your objects
Storing your results
Scaling up your crawler
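
To give a feel for how the first few topics fit together, here is a minimal sketch of a crawl loop. It is not code from this project: the names `LinkExtractor`, `extract_links`, `crawl` and `FAKE_SITE` are made up for illustration, and a small in-memory dictionary stands in for real HTTP requests so the example runs without network access. It only uses the Python standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for all links found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to HTML text, or None."""
    frontier = deque([start_url])   # URLs waiting to be visited
    seen = {start_url}              # URLs already queued (avoids loops)
    visited = []                    # URLs actually fetched, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:            # nothing to parse for this URL
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": "<p>No links here</p>",
}

if __name__ == "__main__":
    print(crawl("http://example.test/", FAKE_SITE.get))
```

Passing `fetch` in as a parameter keeps the loop independent of how pages are downloaded, so the dictionary lookup could later be swapped for a function that performs a real HTTP request. The `max_pages` limit and the `seen` set are the simplest possible crawl policies; a real crawler would also need politeness delays and robots.txt handling.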

How do I Start?

Keep this project's Wiki open at all times: most of the text and references live there, for you to read as you advance through the chapters/classes of this project.

Start each chapter by going to the Wiki first, and only after reading its text, proceed to the code.

Take your time, read the code comments, run it, modify it and run it again to understand the impact of each change.

Happy hacking :)

Setup

Install pip: using a terminal/command prompt, navigate to the "Setup" directory and run: python get-pip.py
Reload your terminal/command prompt (close it and open it again)
Make sure pip is installed by running: pip freeze
If it is, you can now install the needed dependencies by running from the root of the project: pip install -U -r Setup/requirements.txt
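
As an optional extra check, the availability of pip can also be tested from inside Python. This is only a sketch and not part of the project's Setup scripts; the helper name `is_installed` is made up for illustration. Note that it reports on the interpreter that runs the script, which may differ from the one your `pip` command points to.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("pip available:", is_installed("pip"))
```

The same helper can be reused after installing the dependencies to confirm that a given top-level package is importable.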

About Me

Marcello Lins is passionate about technology and crunching data for fun. Feel free to connect with me through LinkedIn and find out more about what I'm working on via my AboutMe profile. Visit https://techflow.me/ for more awesomeness!

Version

0.0.5
