This Repository is dedicated to the development and completion of MSCI 446's (Intro to Machine Learning) course project.
- Dataset
- Quick setup
- Git commands
- Running ETL Script
- Preprocessing
- Models
- Predictions
The dataset was created by us and can be found uploaded to Kaggle here.
Setup is fairly simple, follow the below steps to download all dependencies and to configure anything needed onto your local machine
-.env
Configuration
- Install Python and pip (if you have not done so)
python -m venv venv
orpython3 -m venv venv
(creates virtual environment, this is done one time not every time)- Mac:
source venv/bin/activate
(activates virtual environment) - Windows:
.\venv\Scripts\activate
(activates virtual environment)
Make sure you have activated venv before doing any of the following steps or working on the project
pip install -r requirements.txt
(installs dependencies from the requirements file to the venv)pip freeze
(outputs list of all dependencies in the venv)pip freeze > requirements.txt
(overrides old requirements.txt with new dependencies)- If for what ever reason you have to deactivate the venv run:
deactivate
Before doing anything within the project, make sure you have activated the venv folder
For Windows: After cloning, Windows users should run the setup.bat
file to configure the environment:
Right-click on setup.bat
and select "Run as administrator" or in the terminal, type:
.\setup.bat
and press enter.
Install Xcode if you haven't already (Mac users): xcode-select --install
in the terminal if you want to use this approach
For Linux/Unix (Mac): After cloning, Mac/Linux users should open a terminal in the project directory and run:
make setup
Either way make sure the venv file in activated
This project requires certain environment variables to function correctly. Use the provided .env.template
file as a reference:
- Copy the
.env.template
file and rename it to.env
. - Fill in the required values as indicated in the template.
- Writing tests
- Code review
- Other guidelines
- Repo owner or admin
- Other community or team contact
All the commands are run in the terminal
git checkout -b "your-branch-name"
git add .
git commit -m "Your meaningful commit message here"
git push -u origin "your-branch-name" #publishes branch (this is optional but should do)
git push origin "your-branch-name" #publishes changes
- Go to the pull request tab on the left side of the UI and press create pull request.
- Make sure you are merging from and to the correct branches
- Create pull request
- Approve the pull request
- Merge the 2 branches
- Optional: Before merging the 2 branches there is a check box to delete the branch, you can if you want NEVER DELETE MASTER BRANCH
- If you deleted the branch: go to your local machine and switch from the branch you deleted in the remote repository to the master branch
- Delete the branch locally (command provided below)
git checkout "branch-name"
git checkout -b "new-branch-name"
git checkout "master" # switch to the main branch
git pull origin "master" # pull the latest changes
Before deleting a branch first switch to the master branch:
git checkout "master"
Then delete the branch, either from the Source control tab or from the terminal:
git branch -D "branch_name"
If you want to delete a branch from the remote repository:
git push --delete "remote_name branch_name"
The ETL script can be interacted with via the command line to perform various operations such as listing all available sub-folders within the specified root directory or processing data within a specific folder or across all folders. Below are the ways you can run the script depending on your requirements:
To list all available sub-folders in the root directory, use the --list
option:
python ETL/src/etl.py --list
This command will output all sub-folders found in the root directory, for example:
Available sub-folders: Gen_Outages Historical_DA_Prices Load_Forecast Solar_Forecast Wind_Forecast
If you wish to process CSV files within a specific folder, use the -f
or --folder
option followed by the name of the folder:
python ETL/src/etl.py --folder <Folder_Name>
For example:
python ETL/src/etl.py --folder Load_Forecast
This will execute the ETL process on CSV files located within the Load_Forecast
subfolder, extracting data, transforming it, and loading it into the MongoDB collection named after the folder.
To process all folders within the root directory, simply run the script without any options:
python ETL/src/etl.py --all
This triggers the script to process all sub-folders and their CSV files by default. Ensure the script's default behavior is set to process all folders if no specific folder is provided or no options are used.
To remove all documents from every collection in the MongoDB database without deleting the collections themselves, you can use the --clear
option:
python ETL/src/etl.py --clear
This command will clear all documents from each collection within your MongoDB database. It's a powerful operation that makes your collections empty, so use it with caution. The script will output the name of each collection being cleared and confirm once all collections have been processed.
For example, the output might look like this:
Clearing collection: Gen_Outages Clearing collection: Historical_DA_Prices Clearing collection: Load_Forecast Clearing collection: Solar_Forecast Clearing collection: Wind_Forecast All collections have been cleared.
Important Note: This operation will remove all existing data in the collections. Ensure you have backups or do not need the data before executing this command.
To list all available collections within the MongoDB database, you can use the --list-collections
option:
python ETL/src/etl.py --list-collections
This command will output all collections available in the database, allowing you to see the collections that you can clear or process individually.
If you want to clear all documents from a specific collection within your MongoDB database, you can use the --clear-collection
option followed by the name of the collection:
python ETL/src/etl.py --clear-collection <Collection_Name>
Replace <Collection_Name>
with the actual name of the collection you wish to clear. This operation will remove all documents from the specified collection but will not delete the collection itself. Use this command with caution to avoid unintentional data loss.
For example, to clear the Load_Forecast collection:
python ETL/src/etl.py --clear-collection Load_Forecast
This will clear all documents from the Load_Forecast
collection and output a confirmation message:
Clearing collection: Load_Forecast Collection Load_Forecast has been cleared.
Important Note: Ensure you have backups or do not need the data before executing this command to avoid unintended data loss.
The preprocessing folder holds the notebook used to merge and normalize all datasets together into one dataset
Holds Neural Networks
Holds notebooks for each type of ML model used (regressions, trees, and a neural network), it's train and test and scores.