Identify 2-3 public datasets that would be interesting for a business to answer questions about

Closed this issue a year ago · 4 comments

lfunderburk commented a year ago

Explore this source:

https://archive.ics.uci.edu/ml/index.php

Goal: comment in this issue the top three interesting datasets that we can use throughout the course

jpjon commented a year ago

Answer 1 · 2023-06-06T02:42:23.000Z

These look great!

For each of those, repurpose the code I used in the absenteeism notebook (included here for reference) to extract the data from the links:

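from pathlib import Path
from urllib.request import urlretrieve
from zipfile import ZipFile
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"

# Create the target folder if needed (urlretrieve fails when it is missing), then download the file
Path("./raw-data").mkdir(parents=True, exist_ok=True)
urlretrieve(url, "./raw-data/Absenteeism_at_work_AAA.zip")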

# Extract the CSV file
with ZipFile("./raw-data/Absenteeism_at_work_AAA.zip", 'r') as zf:
    zf.extractall("./raw-data/")

# Check the extracted CSV file name (in this case, it's "Absenteeism_at_work.csv")
csv_file_name = "./raw-data/Absenteeism_at_work.csv"

# Data clean up
df = pd.read_csv(csv_file_name, sep=",")
df.columns = df.columns.str.replace(' ', '_')

# Save the cleaned up CSV file
df.to_csv("Absenteeism_at_work_cleaned.csv", index=False)

Answer 2 · 2023-06-06T02:43:26.000Z

We can then use the code snippet in the notebooks.

It might be worth creating a util or helper function that takes the .zip link as input and returns a clean CSV file.
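
A rough sketch of what such a helper could look like, reusing the steps from the absenteeism snippet. The function name, its parameters, and the `_cleaned.csv` output naming are all made up for illustration. It also assumes the caller knows which CSV inside the archive to read and which separator it uses:

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve
from zipfile import ZipFile

import pandas as pd


def zip_url_to_clean_csv(url, csv_name, raw_dir="./raw-data", out_dir=".", sep=","):
    """Download a .zip, extract it, and write a cleaned copy of one CSV inside it.

    Hypothetical helper: the name, parameters, and output naming are assumptions.
    Returns the path of the cleaned CSV file.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Download the archive, naming the local copy after the last URL segment
    zip_path = raw_dir / Path(urlparse(url).path).name
    urlretrieve(url, zip_path)

    # Extract every member of the archive into the raw-data folder
    with ZipFile(zip_path, "r") as zf:
        zf.extractall(raw_dir)

    # Same clean-up as the original snippet: spaces in column names -> underscores
    df = pd.read_csv(raw_dir / csv_name, sep=sep)
    df.columns = df.columns.str.replace(" ", "_")

    # Save the cleaned CSV and hand its path back to the caller
    clean_path = out_dir / f"{Path(csv_name).stem}_cleaned.csv"
    df.to_csv(clean_path, index=False)
    return clean_path
```

For the absenteeism data this would be called as `zip_url_to_clean_csv(url, "Absenteeism_at_work.csv")`, with `url` being the link from the snippet above. Returning the path rather than the DataFrame keeps it close to the "clean csv file" idea, and the notebooks can load the result however they need.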

Answer 3 · 2023-06-07T18:21:47.000Z

This has been resolved. @jpjon keep the datasets you picked handy.