ploomber/sql

Identify 2-3 public datasets that would be interesting for a business to answer questions about

Closed this issue · 4 comments

Explore this source:

https://archive.ics.uci.edu/ml/index.php

Goal: comment on this issue with the three most interesting datasets that we can use throughout the course

These look great!

For each of those, repurpose the code I used in the absenteeism notebook (also included here for reference) to extract the data from the links:

from pathlib import Path
from urllib.request import urlretrieve
from zipfile import ZipFile
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"

# make sure the target folder exists, then download the file
Path("./raw-data").mkdir(parents=True, exist_ok=True)
urlretrieve(url, "./raw-data/Absenteeism_at_work_AAA.zip")

# Extract the CSV file
with ZipFile("./raw-data/Absenteeism_at_work_AAA.zip", 'r') as zf:
    zf.extractall("./raw-data/")

# Check the extracted CSV file name (in this case, it's "Absenteeism_at_work.csv")
csv_file_name = "./raw-data/Absenteeism_at_work.csv"

# Data clean up (set sep to match the delimiter the file actually uses)
df = pd.read_csv(csv_file_name, sep=",")
df.columns = df.columns.str.replace(' ', '_')

# Save the cleaned up CSV file
df.to_csv("Absenteeism_at_work_cleaned.csv", index=False)

We can then use the code snippet in the notebooks.

It might be worth creating a utility or helper function that takes the .zip link as input and returns a clean CSV file.
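A rough sketch of what such a helper could look like, following the same steps as the absenteeism snippet (download, extract, replace spaces in column names, save). The function name `zip_url_to_clean_csv` and its parameters (`csv_name`, `raw_dir`, `out_dir`, `sep`) are hypothetical, and it rewrites the header with the standard-library `csv` module instead of pandas so it has no extra dependencies. Treat it as a starting point, not the final util.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve
from zipfile import ZipFile


def zip_url_to_clean_csv(url, csv_name, raw_dir="raw-data", out_dir=".", sep=","):
    """Download the .zip at `url`, extract `csv_name` from it, and write a
    copy whose column names use underscores instead of spaces.

    Hypothetical helper: names and defaults are assumptions, not an
    existing API. Returns the path of the cleaned CSV file.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    # download the archive, keeping the file name from the URL
    zip_path = raw_dir / Path(urlparse(url).path).name
    urlretrieve(url, zip_path)

    # extract only the CSV file we were asked for
    with ZipFile(zip_path) as zf:
        zf.extract(csv_name, raw_dir)

    # copy the rows, replacing spaces in the header with underscores
    src = raw_dir / csv_name
    dst = out_dir / f"{Path(csv_name).stem}_cleaned.csv"
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter=sep)
        writer = csv.writer(fout)
        header = next(reader)
        writer.writerow([col.replace(" ", "_") for col in header])
        writer.writerows(reader)

    return dst
```

For the absenteeism data this would be called as `zip_url_to_clean_csv(url, "Absenteeism_at_work.csv")`, with `sep` set to whatever delimiter the extracted file uses. The cleaned file is always written comma-separated, so the notebooks can load it with a plain `pd.read_csv`.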

This has been resolved. @jpjon keep the datasets you picked handy.