This repository provides tools and analysis for understanding and evaluating the carbon footprint of machine learning models, primarily focusing on models from Hugging Face. The project is divided into two main parts: an initial data analysis and a subsequent web-application.
The data analysis seeks to answer two main research questions:
- How do ML model creators measure and report carbon emissions on Hugging Face?
- What aspects impact the carbon emissions of training ML models?
The web-application is a user-friendly tool that allows users to estimate the energy efficiency of their models through an energy label, visualize carbon emissions data from Hugging Face models, and add their own models to the dataset.
You can access the deployed app at [energy-label.streamlit.app](https://energy-label.streamlit.app).
The repository is organized as follows:

- `code/`: Contains the Jupyter notebooks for data extraction, preprocessing, and analysis.
- `app/`: Contains the Streamlit web-application.
- `datasets/`: Contains the raw, processed, and manually curated datasets used for the analysis.
- `metadata/`: Contains the `tags_metadata.yaml` file used during preprocessing.
- `requirements.txt`: Lists the required Python packages to set up the environment and run the code.
Inside `app/`:

- `Home.py`: The homepage of the web-application.
- `pages/`: Contains the individual page scripts for the web-application.
  - `1_Efficency_Label.py`: The energy label generation page.
  - `2_Data_Visualization.py`: The data visualization page.
- `energy_label.py`: Script for generating energy labels.
- `label_generation.py`: Script for creating the image/PDF of the energy labels.
- `plots.py`: Contains the plots for the data visualization page.
- `data.py`: Contains functions to read data from Google Sheets (see the sketch after this list).
- `HFStreamlitPreprocessing.ipynb`: Jupyter notebook that applies the necessary transformations to `HFTotalProcessed.csv` for the Streamlit app.
- `label_design/parts/`: Contains the images needed to create the energy labels.
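For illustration only (the actual implementation in `data.py` may differ), a publicly shared Google Sheet can be loaded into pandas through the CSV export endpoint. The sheet ID and sheet name below are hypothetical placeholders, not the project's real values:

```python
import pandas as pd

# Hypothetical sketch: load a publicly shared Google Sheet as a DataFrame.
# SHEET_ID and SHEET_NAME are placeholders, not the project's real values.
SHEET_ID = "your-sheet-id"
SHEET_NAME = "models"

def read_sheet(sheet_id: str, sheet_name: str) -> pd.DataFrame:
    # Google Sheets exposes a CSV export endpoint for publicly shared sheets.
    url = (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={sheet_name}"
    )
    return pd.read_csv(url)

# df = read_sheet(SHEET_ID, SHEET_NAME)
```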
- Set up a Python virtual environment (optional, but recommended). We used Python 3.10.11 for this project.
- Install the required Python packages with `pip install -r requirements.txt`. Don't forget to install Streamlit with `pip install streamlit` if you want to run the web app locally.
- If you plan to use the data analysis part, you need to fetch the datasets, which we manage with DVC:
  - Install DVC with `pip install dvc`.
  - Set up the DVC remote storage following the instructions on the DVC remote storage page.
  - Pull the data from the remote storage with `dvc pull` (for programmatic access, see the sketch after this list).
- For the data analysis part, open the Jupyter notebooks in the `code/` folder and follow the instructions in each notebook.
- To run the web-application locally, navigate to the root folder and run `streamlit run app/Home.py`.
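If you only need a single tracked file, DVC's Python API can fetch it without a full `dvc pull`. A minimal sketch, assuming the processed dataset lives at `datasets/HFTotalProcessed.csv` (the exact path is an assumption):

```python
import dvc.api

# Read a DVC-tracked file as text; DVC fetches it from the configured
# remote if it is not already in the local cache.
# NOTE: the path is an assumption, not a confirmed location in this repo.
content = dvc.api.read("datasets/HFTotalProcessed.csv", repo=".")
print(content[:200])  # peek at the first few lines of the CSV
```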
Remember to cite the original project when using this code for your own research!
This project uses several datasets, which are managed with DVC due to their size. The datasets can be found in the `datasets/` directory after running `dvc pull`.
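Once pulled, the files behave like ordinary local data. A minimal sketch loading the processed dataset used by the app (the exact path inside `datasets/` is an assumption):

```python
import pandas as pd

# Assumes `dvc pull` has already materialized the dataset locally;
# the exact path is an assumption based on the filename mentioned above.
df = pd.read_csv("datasets/HFTotalProcessed.csv")
print(df.head())
```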