Mission to Mars

Web scraping with Celery tasks and delivering data in a Flask app.

[Image: Mars. Photo by Nicolas Lobos on Unsplash]

Web Scraping

Web scraping simplifies the process of extracting data from sources where APIs are not available. Scraping speeds up data gathering by automating the steps and producing easily accessible scraped data in many formats, including CSV, JSON, and raw text.

Basically, web scraping saves you the trouble of manually downloading or copying data and automates the whole process with programmatic tools.
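
To make that concrete, here is a tiny, hedged illustration using requests and BeautifulSoup (both appear in the tools lists below); the URL and tags are placeholders, not part of this project:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse the HTML (assumes you are allowed to scrape it).
response = requests.get('https://example.com/')
soup = BeautifulSoup(response.text, 'html.parser')

# Pull out whatever you need; from here it can be saved as CSV, JSON, or raw text.
headlines = [h.get_text(strip=True) for h in soup.find_all('h1')]
print(headlines)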

Web Scraping Tools:

  • ChromeDriver
  • Splinter
  • BeautifulSoup
  • Pandas (forms)
  • Mongo
  • Celery
  • Flask
  • Bootstrap 4
  • jQuery

Web Scraping Tools I'm interested in:

  • Selenium (still requires ChromeDriver)
  • requests
  • regex

Scrape Mars News
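
A minimal sketch of how the news scrape could look with Splinter and BeautifulSoup; the URL and CSS selectors are assumptions and may differ from what the notebooks actually target:

from splinter import Browser
from bs4 import BeautifulSoup

browser = Browser('chrome')  # requires ChromeDriver on your PATH
browser.visit('https://redplanetscience.com/')  # hypothetical news URL

soup = BeautifulSoup(browser.html, 'html.parser')
slide = soup.select_one('div.list_text')
news_title = slide.find('div', class_='content_title').get_text()
news_paragraph = slide.find('div', class_='article_teaser_body').get_text()

browser.quit()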

Featured Image
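
A similar hedged sketch for grabbing the featured image; the site URL, button index, and element class are assumptions:

from splinter import Browser
from bs4 import BeautifulSoup

browser = Browser('chrome')
browser.visit('https://spaceimages-mars.com/')  # hypothetical image site
browser.find_by_tag('button')[1].click()        # assumed "full image" button

soup = BeautifulSoup(browser.html, 'html.parser')
rel_url = soup.find('img', class_='fancybox-image').get('src')
featured_image_url = f'https://spaceimages-mars.com/{rel_url}'

browser.quit()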

Mars Facts

  • Visit the Mars Facts webpage and use Pandas' read_html method to grab the page's table
  • Use Pandas' to_html method to convert the table data to an HTML table with classes (sketched below)
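
A short sketch of those two steps; the URL, column names, and Bootstrap classes are assumptions:

import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page.
df = pd.read_html('https://galaxyfacts-mars.com/')[0]  # hypothetical facts URL
df.columns = ['Description', 'Mars', 'Earth']
df = df.set_index('Description')

# Convert the table back to HTML, adding classes the template can style.
html_table = df.to_html(classes='table table-striped')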

Mars Hemispheres

  • Well-organized directories with comments
    • /.vscode # Settings for testing and source-code editing
    • /notebooks # Holds practice.ipynb, mission_to_mars.ipynb & mission_to_mars_challenge.ipynb
    • /resources # Screenshots showing that the app works
    • /static # CSS, JS, fonts, etc.
    • /templates # Jinja2 HTML template files with Bootstrap 4
    • .editorconfig # EditorConfig settings to power some workflows in VS Code
    • .gitignore # .gitignore
    • app.py # The module's Flask app, run with Celery
    • scraping.py # scrape_all function decorated with @celery.task

Storing Data

We developed the scraping script in a Jupyter Notebook and then exported the code into a Python script called scraping.py, with a function called scrape_all decorated with @celery.task. MongoDB is used for persistence and as a broker for Celery. I really wanted to use Celery for this module because many of the scripts we run are computationally taxing, and a task queue is a great way to break up the work.
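
A rough sketch of how scraping.py could be wired up under those assumptions (the database, collection, and field names are placeholders):

import datetime as dt

from celery import Celery
from pymongo import MongoClient

celery = Celery('scraping', broker='mongodb://localhost:27017/celery_broker')

@celery.task
def scrape_all():
    """Run every scrape and persist one combined document in Mongo."""
    data = {
        'news_title': '...',      # result of the news scrape
        'featured_image': '...',  # result of the featured-image scrape
        'facts': '...',           # HTML table from the facts scrape
        'last_modified': dt.datetime.now(),
    }
    client = MongoClient('mongodb://localhost:27017')
    client.mars_app.mars.replace_one({}, data, upsert=True)
    return 'scrape complete'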

Serving Data

Usage

Running this app requires you to have MongoDB, a Celery worker, and the Flask server running at the same time.

PS ~/mission-to-mars/> mongod
PS ~/mission-to-mars/> celery -A app.celery worker --pool=solo -l info
PS ~/mission-to-mars/> $env:FLASK_APP = 'app'
PS ~/mission-to-mars/> $env:FLASK_ENV = 'development'
PS ~/mission-to-mars/> python app.py

Routes

/                 # index route
/longtask         # jQuery route that starts the long-running Celery task
/status/<task_id> # jQuery route that checks on the status of the task
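
A condensed sketch of how those routes might look in app.py, following the common Flask + Celery long-task pattern (task, template, and database names are assumptions):

from flask import Flask, jsonify, render_template, url_for
from celery import Celery

app = Flask(__name__)
celery = Celery(app.name,
                broker='mongodb://localhost:27017/celery_broker',
                backend='mongodb://localhost:27017/celery_results')

@celery.task(bind=True)
def long_task(self):
    # placeholder for the scrape_all work
    return 'done'

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/longtask', methods=['POST'])
def longtask():
    task = long_task.delay()
    # 202 Accepted, with a Location header jQuery can poll
    return jsonify({}), 202, {'Location': url_for('taskstatus', task_id=task.id)}

@app.route('/status/<task_id>')
def taskstatus(task_id):
    task = long_task.AsyncResult(task_id)
    return jsonify({'state': task.state})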

Data Powering the Web app

  • MongoDB is used for persistence and as a broker for Celery. With Flask and HTML we display all of the information that was scraped from the URLs and stored in Mongo.

Code and structure for deploying scripts and tasks within a task-queue paradigm and a data pipeline for ETL purposes.

I needed an application setup that uses Celery + Mongo for task distribution and worker management. The web application also uses triggers to kick off Celery tasks that insert data into a database. These scripts could even use Pandas and run as Python modules fully integrated with Celery as individual task workers.

Celery tasks carry a query payload. After performing certain operations, these tasks submit jobs to another server where a DB-inserting Celery worker is running, then wait for that server's tasks to complete and for a response to be received (see the sketch below).
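
A hedged sketch of that hand-off; queue and task names are assumptions, and Celery generally discourages blocking on a result inside another task, so treat this as an illustration of the flow rather than a recommended pattern:

from celery import Celery

celery = Celery('pipeline',
                broker='mongodb://localhost:27017/celery_broker',
                backend='mongodb://localhost:27017/celery_results')

@celery.task
def insert_into_db(payload):
    """Runs on the worker dedicated to database inserts."""
    ...  # e.g. a pymongo insert_one(payload)

@celery.task
def scrape_and_forward(query):
    data = {'query': query, 'rows': []}  # ... perform the scrape here ...
    # Submit the job to the DB-inserting worker's queue, then wait for its response.
    result = insert_into_db.apply_async(args=[data], queue='db_inserts')
    return result.get(timeout=60, disable_sync_subtasks=False)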

Features

  • Celery is a simple, flexible, and reliable distributed system to process vast amounts of messages, while providing operations with the tools required to maintain such a system.

Workings

Broker

The Broker (Python/Mongo) is responsible for creating task queues, dispatching tasks to those queues according to routing rules, and then delivering tasks from the queues to workers.

Consumer (Celery Workers)

The Consumer is one or more Celery workers executing the tasks. You can start as many workers as your use case requires.

Result Backend

The Result Backend stores the results of your tasks; in this case, the API response is written to a database.
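
The broker and result backend are set when the Celery app is created (see the sketches above), and the consumer is the worker started in the Usage section. As a hedged example, a stored result could later be fetched from the backend by task id (assuming app.py exposes the celery object as sketched earlier):

from app import celery
from celery.result import AsyncResult

res = AsyncResult('some-task-id', app=celery)
print(res.state, res.result)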

Todo Checklist

A helpful checklist of what I would still like to finish:

  • PYTHON REQUIREMENTS FILE! pipenv?!?
  • Update the UI/UX.
  • jQuery needs work.
  • State management and logic with the Fetch button

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT