Identify 2-3 public datasets that would be interesting for a business to answer questions about
Closed this issue · 4 comments
lfunderburk commented
Explore this source:
https://archive.ics.uci.edu/ml/index.php
Goal: comment on this issue with the top three most interesting datasets that we can use throughout the course.
lfunderburk commented
These look great!
For each of those, repurpose the code I used in the absenteeism notebook (reproduced here for reference) to extract the data from the links:
import os
from urllib.request import urlretrieve
from zipfile import ZipFile
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"
# Make sure the target folder exists, then download the file
os.makedirs("./raw-data", exist_ok=True)
urlretrieve(url, "./raw-data/Absenteeism_at_work_AAA.zip")
# Extract the CSV file
with ZipFile("./raw-data/Absenteeism_at_work_AAA.zip", 'r') as zf:
    zf.extractall("./raw-data/")
# Check the extracted CSV file name (in this case, it's "Absenteeism_at_work.csv")
csv_file_name = "./raw-data/Absenteeism_at_work.csv"
# Data clean up
df = pd.read_csv(csv_file_name, sep=",")
df.columns = df.columns.str.replace(' ', '_')
# Save the cleaned up CSV file
df.to_csv("Absenteeism_at_work_cleaned.csv", index=False)
lfunderburk commented
We can then use the code snippet in the notebooks.
It might be worth creating a util or helper function that takes the .zip link as input and returns a clean CSV file as output.
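A rough sketch of what such a helper could look like, generalizing the absenteeism snippet. The function name `zip_url_to_clean_csv` and its parameters (`out_dir`, `csv_name`, `sep`) are made up for illustration. It assumes the archive holds exactly one CSV unless `csv_name` is given, and it reads the archive into memory rather than saving the .zip to disk, which is a small departure from the notebook code.

```python
from io import BytesIO
from pathlib import Path
from urllib.request import urlopen
from zipfile import ZipFile

import pandas as pd


def zip_url_to_clean_csv(url, out_dir=".", csv_name=None, sep=","):
    """Download a .zip, clean the CSV inside it, and return the cleaned file's path.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read the whole archive into memory (assumes it is reasonably small).
    with urlopen(url) as response:
        archive = ZipFile(BytesIO(response.read()))

    with archive:
        if csv_name is None:
            # Assumption: the archive contains exactly one CSV file.
            csv_members = [
                name for name in archive.namelist()
                if name.lower().endswith(".csv")
            ]
            if len(csv_members) != 1:
                raise ValueError(
                    f"Expected one CSV in the archive, found: {csv_members}"
                )
            csv_name = csv_members[0]
        with archive.open(csv_name) as csv_file:
            df = pd.read_csv(csv_file, sep=sep)

    # Same clean-up as the notebook: spaces in column names become underscores.
    df.columns = df.columns.str.replace(" ", "_")

    out_path = out_dir / f"{Path(csv_name).stem}_cleaned.csv"
    df.to_csv(out_path, index=False)
    return out_path
```

For the absenteeism archive, the call would be something like `zip_url_to_clean_csv(url, out_dir="./raw-data")`, with `url` as defined in the snippet above. The `sep` parameter is exposed because the delimiter may differ from one dataset to the next.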
lfunderburk commented
This has been resolved. @jpjon keep the datasets you picked handy.