Full web-scraping and data analysis project using Mars news and Mars weather data.

Primary language: Jupyter Notebook

Data-Collection-&-Web-Scraping

Mars News and Mars Weather Data

Background

This project covers a full web-scraping and data analysis workflow: identifying HTML elements and their id and class attributes, then extracting information through automated browsing with Splinter and HTML parsing with Beautiful Soup. Several types of information were scraped, including HTML tables and recurring elements such as multiple news articles on a single webpage.

The challenge encompasses collecting, organizing, storing, analyzing, and visualizing data.

Method

The submission consists of two deliverables:

  • Deliverable 1: Scrape titles and preview text from Mars news articles.

  • Deliverable 2: Scrape and analyze Mars weather data, which exists in a table.

Part 1: Scrape Titles and Preview Text from Mars News

In a Jupyter Notebook, scrape the Mars News website by completing the following steps:

1. Use automated browsing to visit the Mars news site. Inspect the page to identify which elements to scrape.

2. Create a Beautiful Soup object and use it to extract text elements from the website.

3. Extract the scraped titles and preview text of the news articles. Store the scraping results in Python data structures as follows:
        * Store each title-and-preview pair in a Python dictionary, and give each dictionary two keys: title and preview.
        * Store all the dictionaries in a Python list.
        * Print the list in the notebook.

4. Optionally, store the scraped data in a file by exporting the scraped data to a JSON file.
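
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the class names (`list_text`, `content_title`, `article_teaser_body`), the `fetch_html` and `parse_articles` helpers, and the sample HTML are all assumptions made so the example runs offline. Inspect the real page to confirm which elements to target.

```python
import json

from bs4 import BeautifulSoup

# Stand-in HTML so the sketch runs without a browser.
# The class names below are assumptions; verify them against the real page.
SAMPLE_HTML = """
<div class="list_text">
  <div class="content_title">First headline</div>
  <div class="article_teaser_body">First preview.</div>
</div>
<div class="list_text">
  <div class="content_title">Second headline</div>
  <div class="article_teaser_body">Second preview.</div>
</div>
"""


def fetch_html(url, browser_name="chrome"):
    """Hypothetical helper: visit url with Splinter and return the page HTML.

    Requires Splinter and a matching browser driver to be installed.
    """
    from splinter import Browser  # imported lazily; only needed for live browsing

    browser = Browser(browser_name)
    try:
        browser.visit(url)
        return browser.html
    finally:
        browser.quit()


def parse_articles(html):
    """Return a list of dictionaries with the keys 'title' and 'preview'."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for block in soup.find_all("div", class_="list_text"):
        title = block.find("div", class_="content_title")
        preview = block.find("div", class_="article_teaser_body")
        if title is None or preview is None:
            continue  # skip blocks that lack either element
        articles.append({
            "title": title.get_text(strip=True),
            "preview": preview.get_text(strip=True),
        })
    return articles


articles = parse_articles(SAMPLE_HTML)
print(articles)

# Optional step 4: serialize the list to JSON text (write it to a file if desired).
articles_json = json.dumps(articles, indent=2)
```

Against the live site, `parse_articles(fetch_html(url))` would replace the call on `SAMPLE_HTML`, with `url` pointing at the Mars news page.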

Part 2: Scrape and Analyze Mars Weather Data

In a Jupyter Notebook, scrape and analyze Mars weather data by completing the following steps:

1. Use automated browsing to visit the Mars Temperature Data Site. Inspect the page to identify which elements to scrape. Note that the URL is https://static.bc-edx.com/data/web/mars_facts/temperature.html.

2. Create a Beautiful Soup object and use it to scrape the data in the HTML table. Alternatively, the Pandas read_html function can read the table directly.

3. Assemble the scraped data into a Pandas DataFrame. The columns should have the same headings as the table on the website:
        * id: the identification number of a single transmission from the Curiosity rover
        * terrestrial_date: the date on Earth
        * sol: the number of elapsed sols (Martian days) since Curiosity landed on Mars
        * ls: the solar longitude
        * month: the Martian month
        * min_temp: the minimum temperature, in Celsius, of a single Martian day (sol)
        * pressure: the atmospheric pressure at Curiosity's location

4. Examine the data types that are currently associated with each column and convert the data to the appropriate datetime, int, or float data types.

5. Analyze the dataset by using Pandas functions to answer the following questions:
        * How many months exist on Mars?
        * How many Martian (and not Earth) days' worth of data exist in the scraped dataset?
        * What are the coldest and the warmest months on Mars (at the location of Curiosity)? To answer this question:
                * Find the average minimum daily temperature for each month.
                * Plot the results as a bar chart.
        * Which months have the lowest and the highest atmospheric pressure on Mars? To answer this question:
                * Find the average daily atmospheric pressure for each month.
                * Plot the results as a bar chart.
        * About how many terrestrial (Earth) days exist in a Martian year? To answer this question:
                * Consider how many days elapse on Earth in the time that Mars circles the Sun once.
                * Visually estimate the result by plotting the daily minimum temperature.

6. Export the DataFrame to a CSV file.
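
As a rough sketch of steps 2 through 6, the example below parses a small stand-in table with Beautiful Soup, assembles a DataFrame, converts the column types, and answers the first few analysis questions. The three sample rows are illustrative placeholders, not the scraped dataset, and the table markup is an assumption about the page structure. The variable names are chosen for this example only.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder table so the sketch runs offline; the values are illustrative only.
SAMPLE_HTML = """
<table>
  <tr><th>id</th><th>terrestrial_date</th><th>sol</th><th>ls</th>
      <th>month</th><th>min_temp</th><th>pressure</th></tr>
  <tr><td>1</td><td>2012-08-16</td><td>10</td><td>155</td>
      <td>6</td><td>-75.0</td><td>739.0</td></tr>
  <tr><td>2</td><td>2012-08-17</td><td>11</td><td>156</td>
      <td>6</td><td>-77.0</td><td>741.0</td></tr>
  <tr><td>3</td><td>2012-09-30</td><td>54</td><td>181</td>
      <td>7</td><td>-72.0</td><td>770.0</td></tr>
</table>
"""

# Step 2: scrape the header row and the data rows from the table.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find("table").find_all("tr")
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in row.find_all("td")]
           for row in rows[1:]]

# Step 3: assemble the DataFrame; every value is still a string here.
df = pd.DataFrame(records, columns=headers)

# Step 4: convert each column to a datetime, int, or float type.
df["terrestrial_date"] = pd.to_datetime(df["terrestrial_date"])
df = df.astype({
    "id": "int64", "sol": "int64", "ls": "int64", "month": "int64",
    "min_temp": "float64", "pressure": "float64",
})

# Step 5: answer the analysis questions with Pandas functions.
month_count = df["month"].nunique()                   # distinct Martian months
sol_count = df["sol"].nunique()                       # distinct Martian days of data
avg_min_temp = df.groupby("month")["min_temp"].mean()
avg_pressure = df.groupby("month")["pressure"].mean()
coldest_month = avg_min_temp.idxmin()
warmest_month = avg_min_temp.idxmax()

# Step 6: export to CSV; passing a file path instead would write to disk.
csv_text = df.to_csv(index=False)
```

With matplotlib installed, `avg_min_temp.plot(kind="bar")` and `avg_pressure.plot(kind="bar")` would produce the bar charts, and `df["min_temp"].plot()` gives the daily series used to estimate the length of a Martian year visually.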

References

The Mars News website is operated by edX Boot Camps LLC for educational purposes only. The news article titles, summaries, dates, and images were scraped from NASA's Mars News website in November 2022. Images are used according to the JPL Image Use Policy, courtesy NASA/JPL-Caltech.

  • Dataset provided by edX UofT Data Analytics and generated by Trilogy Education Services, LLC.