
Fundamentals of Data Science (Technical)

All notebooks and material from the Southampton Data Science Academy's Fundamentals of Data Science (Technical) course (July to August 2018). The assignments folder contains my completed work, which received an overall score of 100.

Recent update: bokeh.charts is now deprecated, so notebooks that use it may need porting to bokeh.plotting.

Course information

The course runs over 6 weeks and is broken down into manageable weekly topics:

Week 1: Welcome and course information

  • Welcome and introduction to the course
  • What data science is and why it’s important
  • A ‘hands-on’ Jupyter familiarisation activity
  • Python Primer
  • Glossary of terminology

Week 2: Introduction to core concepts and technologies

  • The data science process
  • A data science toolkit
  • Types of data and example applications

Week 3: Data collection and management

  • Sources of data
  • Data collection and APIs
  • Exploring and fixing data
  • Data storage and management
  • Using multiple data sources

Week 4: Data analysis

  • Introduction to statistics
  • Basic machine-learning algorithms

Week 5: Data visualisation

  • Types of data visualisation
  • Data for visualisation
  • Technologies for visualisation

Week 6: Future of data science

An exploration of the future of data science. After successfully completing the course, you’ll be able to:

  • Understand key concepts in data science and their real-world applications
  • Explain how data is collected, managed and stored in the context of data science
  • Implement data collection and management scripts using MongoDB
  • Demonstrate an understanding of statistics and machine-learning concepts vital for data science
  • Produce Python code to statistically analyse a dataset
  • Plan and generate visualisations from data using tools such as Python and Bokeh
  • Work effectively with live data and utilise the opportunities presented by cloud services

Assignments

Assignment 1: Data management

This week, you will be required to implement the first stages of a data processing pipeline, which involves the transformation, storage and querying of data.

To help you prepare for this week's assignment, we have provided two guided exercises with worked answers, each accompanied by a video walkthrough: 'HTML & page scraping' and 'Using MongoDB to retrieve information'.
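
The snippet below is a minimal sketch of the kind of page scraping covered in the 'HTML & page scraping' exercise, assuming the requests and BeautifulSoup libraries; the URL and the tags being extracted are illustrative only, not the page used in the exercise.

```python
import requests
from bs4 import BeautifulSoup

# Illustrative URL only -- the guided exercise supplies its own page to scrape.
url = "https://example.com/businesses"
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every cell in each table row; the actual tags and
# classes depend on the page being scraped.
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows[:5])
```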

In this week's assignment you will go through the process of obtaining data, cleaning it, and then querying it from a database. You will be provided with an open dataset containing food hygiene inspection results for businesses across London. Each record contains information about the business's location, inspection date, and inspection results. You must construct a data processing pipeline using a MongoDB datastore, then run a set of queries that will be given to you.

All code should be written in Python. Again, you will be using the Jupyter working environment to do this.
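
As a rough sketch of such a pipeline, assuming a local MongoDB instance and the pymongo driver (the database, collection, and field names below are illustrative rather than the assignment's exact schema):

```python
from pymongo import MongoClient

# Assumes MongoDB is running locally on the default port.
client = MongoClient("mongodb://localhost:27017/")
db = client["food_hygiene"]          # database name is illustrative
inspections = db["inspections"]      # collection name is illustrative

# Store cleaned records (field names are illustrative of a hygiene dataset).
records = [
    {"BusinessName": "Example Cafe", "LocalAuthority": "Wandsworth",
     "RatingValue": 5, "RatingDate": "2018-03-01"},
    {"BusinessName": "Sample Diner", "LocalAuthority": "Camden",
     "RatingValue": 2, "RatingDate": "2018-02-14"},
]
inspections.insert_many(records)

# Query: count businesses per rating value, highest-rated first.
pipeline = [
    {"$group": {"_id": "$RatingValue", "count": {"$sum": 1}}},
    {"$sort": {"_id": -1}},
]
for doc in inspections.aggregate(pipeline):
    print(doc)
```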

By the end of this assignment, you should be able to demonstrate that you can:

  • Apply data collection techniques to store and manage data in MongoDB
  • Write queries to extract data from MongoDB

Assignment 2: Statistics and machine learning

Instructions

As in week 3, access the Statistics and Machine Learning assessment via Jupyter. The assessment is inside the 4. Statistics and Machine Learning directory.

You may need to refer to the instructions in 'Jupyter: your working environment' to access this assignment.

If you have not done so in the previous step, you will need to 'Fetch' Statistics and Machine Learning to create a copy of it which will be stored in your allocated 'home directory'. The file 4. Statistics and Machine Learning Assessment.ipynb is the assessed notebook.

Task

In this week's assignment, you will use the Python libraries Pandas and NumPy to perform basic operations on the data relating to food hygiene in Wandsworth. In addition, you will perform a linear regression on a dataset of vehicle mileage and price (similar to the examples given by Sergej). You will not be assessed on classification.
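
A minimal sketch of the regression step, assuming Pandas and scikit-learn; the mileage and price values below are made up for illustration and are not the assignment dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative mileage/price values only -- the assignment supplies its own data.
df = pd.DataFrame({
    "mileage": [12000, 35000, 58000, 80000, 102000],
    "price":   [15500, 13200, 10900, 9100, 7400],
})

X = df[["mileage"]].values   # 2-D feature matrix expected by scikit-learn
y = df["price"].values

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predict the price of a car with 60,000 miles on the clock.
print("predicted price:", model.predict(np.array([[60000]]))[0])
```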

By the end of this assignment, you should be able to demonstrate that you can:

  • Create simple charts using Bokeh
  • Manipulate data for processing with Python mathematical libraries such as Pandas and NumPy
  • Use the scikit-learn library for linear regression
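
Since bokeh.charts is now deprecated (see the note above), a simple bar chart of the kind expected here can be sketched with bokeh.plotting instead; the rating counts below are made up for illustration:

```python
from bokeh.plotting import figure, show

# Made-up rating counts, purely for illustration.
ratings = ["1", "2", "3", "4", "5"]
counts = [12, 30, 55, 140, 260]

# A categorical x-axis is created by passing the category labels as x_range.
p = figure(x_range=ratings, title="Food hygiene ratings (illustrative counts)",
           x_axis_label="Rating", y_axis_label="Number of businesses")
p.vbar(x=ratings, top=counts, width=0.8)
show(p)
```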

Assignment 3: Data visualisation

Instructions

As before, access the Visualisation assessment (Assignment 3: Data visualisation) via Jupyter. The assessment is inside the 5. Visualisation directory.

If you have not done so in the previous step, you will need to 'Fetch' Visualisation to create a copy of it which will be stored in your allocated 'home directory'. The file 3. Visualisation assignment.ipynb is the assessed notebook.

Task

In this week's assignment, you are required to produce a visualisation dashboard of the food hygiene ratings across London. You will use Pandas, MongoDB, and Bokeh to create your dashboard.

Your dashboard should contain a map-based display of the ratings, placed according to their geolocation data. A user should be able to intuitively see which businesses are ‘safe’ to eat at, and those which have not scored so well.

You may supplement the map with additional charts and graphics that you deem appropriate to tell a coherent story from the data available.
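
As a hedged sketch of the map-based part, the example below converts longitude/latitude to Web Mercator coordinates and plots the points over a tile background with bokeh.plotting; the businesses and ratings are made up, and the add_tile call shown takes a provider name string, an API detail that varies between Bokeh versions:

```python
import math
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

def to_web_mercator(lon, lat):
    """Convert WGS84 longitude/latitude to Web Mercator x/y for map tiles."""
    k = 6378137.0
    x = lon * math.pi / 180.0 * k
    y = math.log(math.tan((90.0 + lat) * math.pi / 360.0)) * k
    return x, y

# Made-up business locations and ratings, purely for illustration.
businesses = [
    ("Example Cafe", -0.1278, 51.5074, 5),
    ("Sample Diner", -0.1910, 51.4571, 2),
]
xs, ys = zip(*[to_web_mercator(lon, lat) for _, lon, lat, _ in businesses])
source = ColumnDataSource(data={
    "x": list(xs),
    "y": list(ys),
    "name": [b[0] for b in businesses],
    "rating": [b[3] for b in businesses],
})

p = figure(x_axis_type="mercator", y_axis_type="mercator",
           title="Food hygiene ratings (illustrative data)",
           tooltips=[("Business", "@name"), ("Rating", "@rating")])
# Recent Bokeh versions accept a provider name string here; older versions
# use objects from bokeh.tile_providers instead.
p.add_tile("CartoDB Positron")
p.circle(x="x", y="y", size=10, source=source)
show(p)
```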

By the end of this assignment, you should be able to demonstrate that you can:

  • Plot geographical data using Bokeh
  • Create interactive visualisations using Bokeh
  • Identify strengths and weaknesses of a visualisation