All notebooks and material from the Southampton data science fundamentals (technical) course (July - August 2018). The assignments folder contains my completed work that resulted in an overall score of 100.
Recent update: Bokeh.charts is now deprecated
The course runs over 6 weeks and is broken down into manageable weekly topics:
- Welcome and introduction to the course
- What data science is and why it’s important
- A ‘hands-on’ Jupyter familiarisation activity
- Python Primer
- Glossary of terminology
- The data science process
- A data science toolkit
- Types of data and example applications
- Sources of data
- Data collection and APIs
- Exploring and fixing data
- Data storage and management
- Using multiple data sources
- Introduction to statistics
- Basic machine-learning algorithms
- Types of data visualisation
- Data for visualisation
- Technologies for visualisation
- An exploration of the future of data science

After successfully completing the course, you’ll be able to:
- Understand key concepts in data science and their real-world applications
- Explain how data is collected, managed and stored in the context of data science
- Implement data collection and management scripts using MongoDB
- Demonstrate an understanding of statistics and machine-learning concepts vital for data science
- Produce Python code to statistically analyse a dataset
- Plan and generate visualisations from data using tools such as Python and Bokeh
- Work effectively with live data and utilise the opportunities presented by cloud services
This week, you will be required to implement the first stages of a data processing pipeline, which will involve the transformation, storage and querying of data.
To help you prepare for this week's assignment, we have provided two guided exercises with worked answers, again accompanied by video walkthroughs: 'HTML & page scraping' and 'Using MongoDB to retrieve information'.
In this week's assignment you will go through the process of obtaining data, cleaning it, and then querying it from a database. You will be provided with an open dataset containing food hygiene inspection results for businesses across London. Each record will contain information about the business’s location, date of inspection, and inspection results. You must construct a data processing pipeline using a MongoDB datastore, and run and compute a set of queries which will be given to you.
All code should be written in Python. Again, you will be using the Jupyter working environment to do this.
By the end of this assignment, you should be able to demonstrate that you can:
- Apply data collection techniques to store and manage data in MongoDB
- Write queries to extract data from MongoDB
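As a rough illustration of the clean, store and query stages described above, here is a minimal sketch. The field names (`BusinessName`, `RatingValue`, `LocalAuthorityName`), the database and collection names (`hygiene`, `inspections`) and the connection URI are assumptions made for this example, not details taken from the course dataset. The MongoDB step needs `pymongo` and a running server.

```python
def clean_record(raw):
    """Return a cleaned copy of one inspection record, or None if it has no name."""
    name = (raw.get("BusinessName") or "").strip()
    if not name:
        return None
    try:
        rating = int(raw.get("RatingValue"))
    except (TypeError, ValueError):
        rating = None  # non-numeric or missing ratings are stored as null
    return {
        "BusinessName": name,
        "LocalAuthorityName": (raw.get("LocalAuthorityName") or "").strip(),
        "RatingValue": rating,
    }


def low_rating_query(authority, max_rating=2):
    """Build a MongoDB filter for businesses rated at or below max_rating."""
    return {"LocalAuthorityName": authority, "RatingValue": {"$lte": max_rating}}


def store_and_query(raw_records, authority, uri="mongodb://localhost:27017"):
    """Insert cleaned records, then return the low-rated ones for one authority."""
    # Imported here so the cleaning helpers can be used without pymongo installed.
    from pymongo import MongoClient

    cleaned = [c for c in map(clean_record, raw_records) if c is not None]
    client = MongoClient(uri)
    collection = client["hygiene"]["inspections"]  # hypothetical names
    if cleaned:  # insert_many rejects an empty list
        collection.insert_many(cleaned)
    # The projection drops MongoDB's internal _id from the results.
    return list(collection.find(low_rating_query(authority), {"_id": 0}))
```

Keeping the cleaning and query-building steps as plain functions means they can be checked in a notebook cell before any data is written to the database.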
Instructions
As in week 3, access the Statistics and Machine Learning assessment via Jupyter. The assessment is inside the 4. Statistics and Machine Learning directory.
You may need to refer to the instructions from Jupyter: your working environment to access this assignment.
If you have not done so in the previous step, you will need to 'Fetch' Statistics and Machine Learning to create a copy of it which will be stored in your allocated 'home directory'. The file 4. Statistics and Machine Learning Assessment.ipynb is the assessed notebook.
Task
In this week's assignment, you will use the Python libraries Pandas and NumPy to perform basic operations on the data relating to food hygiene in Wandsworth. You will also perform a linear regression on a dataset about vehicle mileage and price (similar to the examples given by Sergej). You will not be assessed on classification.
By the end of this assignment, you should be able to demonstrate that you can:
- Create simple charts using Bokeh
- Manipulate data for processing with Python mathematical libraries such as Pandas and NumPy
- Use the scikit-learn library for linear regression
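To make the regression task above concrete, here is a small sketch of fitting a straight line of price against mileage. The mileage and price figures are invented for illustration and are not the assignment data. `fit_line` uses the closed-form least-squares formulas in plain Python so that the arithmetic is visible; `fit_line_sklearn` shows what the equivalent scikit-learn call might look like and needs `scikit-learn` installed.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_line_sklearn(xs, ys):
    """The same fit using scikit-learn's LinearRegression."""
    # Imported here so fit_line works without scikit-learn installed.
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit([[x] for x in xs], ys)  # scikit-learn expects a 2-D feature array
    return float(model.coef_[0]), float(model.intercept_)


# Made-up example values: mileage in miles, price in pounds.
mileage = [10000, 20000, 30000, 40000]
price = [9000, 8000, 7000, 6000]

slope, intercept = fit_line(mileage, price)
predicted_price = slope * 25000 + intercept  # estimate for a 25,000-mile vehicle
```

For a single feature, both functions should return the same slope and intercept up to floating-point rounding.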
Instructions
As before, access the Visualisation assessment via Jupyter. The assessment is inside the 5. Visualisation directory.
Access Assignment 3: data visualisation via Jupyter.
If you have not done so in the previous step, you will need to 'Fetch' Visualisation to create a copy of it which will be stored in your allocated 'home directory'. The file 3. Visualisation assignment.ipynb is the assessed notebook.
Task
In this week's assignment, you are required to produce a visualisation dashboard of the food hygiene ratings across London. You will use Pandas, MongoDB, and Bokeh to create your dashboard.
Your dashboard should contain a map-based display of the ratings, placed according to their geolocation data. A user should be able to intuitively see which businesses are ‘safe’ to eat at, and those which have not scored so well.
You may supplement the map with additional charts and graphics that you deem appropriate to tell a coherent story from the data available.
By the end of this assignment, you should be able to demonstrate that you can:
- Plot geographical data using Bokeh
- Create interactive visualisations using Bokeh
- Identify strengths and weaknesses of a visualisation
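One possible starting point for the map-based display is sketched below. Bokeh's mercator axes expect Web Mercator coordinates, so latitude and longitude are projected first. The record keys (`name`, `lat`, `lon`, `rating`) and the colour thresholds are assumptions chosen for this example, not part of the assignment specification. `build_map` needs `bokeh` installed; a background tile layer could be added with `add_tile`, whose arguments vary between Bokeh versions.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by the Web Mercator projection


def to_web_mercator(lon, lat):
    """Project longitude/latitude in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def rating_colour(rating):
    """Map a hygiene rating to a marker colour (thresholds are illustrative)."""
    if rating is None:
        return "grey"
    if rating >= 4:
        return "green"
    if rating == 3:
        return "orange"
    return "red"


def build_map(businesses):
    """Return a Bokeh figure with one colour-coded marker per business."""
    # Imported here so the helpers above work without bokeh installed.
    from bokeh.models import ColumnDataSource, HoverTool
    from bokeh.plotting import figure

    points = [to_web_mercator(b["lon"], b["lat"]) for b in businesses]
    source = ColumnDataSource(data={
        "x": [p[0] for p in points],
        "y": [p[1] for p in points],
        "name": [b["name"] for b in businesses],
        "rating": ["n/a" if b["rating"] is None else str(b["rating"]) for b in businesses],
        "marker_colour": [rating_colour(b["rating"]) for b in businesses],
    })
    plot = figure(x_axis_type="mercator", y_axis_type="mercator",
                  title="Food hygiene ratings")
    plot.scatter("x", "y", size=8, alpha=0.7, color="marker_colour", source=source)
    # Hovering over a marker shows the business name and its rating.
    plot.add_tools(HoverTool(tooltips=[("Business", "@name"), ("Rating", "@rating")]))
    return plot
```

In a notebook, the returned figure can be displayed with `bokeh.io.show` after calling `bokeh.io.output_notebook`.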