This is a template repository for data science projects done in Python.
To run this code out of the box, install the following packages:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- great expectations
- h2o
- fastai
The easiest route is to install an Anaconda distribution first, which provides the first five packages and all of their dependencies. Installation instructions for the remaining packages can be found on their documentation pages.
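To confirm that the environment is ready, a quick import check can help. The following is a minimal sketch, not part of the template itself: the `REQUIRED_MODULES` mapping and the `missing_packages` helper are hypothetical names, and the mapping assumes the usual import names (`sklearn` for scikit-learn, `great_expectations` for Great Expectations).

```python
from importlib.util import find_spec

# Package name -> top-level import name (assumed; scikit-learn is
# imported as `sklearn`, Great Expectations as `great_expectations`).
REQUIRED_MODULES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "great expectations": "great_expectations",
    "h2o": "h2o",
    "fastai": "fastai",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the package names whose import module cannot be located."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks up whether each module can be found; it does not import the packages or verify their versions.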
Background
Data
Models
Timeline
Logistics
Provide an overview of the goals and deliverables of the project. Mention any relevant details or issues.
Provide a broad overview of the purpose of the project.
Describe the data: what kind of data is it? Describe its general format and any potential quirks.
If there are any security concerns or requirements regarding the data, they should be described here.
Describe the overall size of the dataset and the relative ratio of positive/negative examples for each of the response variables.
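The dataset size and class ratios can be computed directly with pandas. This is a minimal sketch under stated assumptions: the `class_balance` helper and the toy columns `churned` and `upgraded` are hypothetical placeholders for the project's real response variables.

```python
import pandas as pd


def class_balance(df, response_columns):
    """Return the row count and the share of each class per response column."""
    summary = {}
    for col in response_columns:
        # normalize=True turns raw counts into fractions of all non-null rows
        summary[col] = df[col].value_counts(normalize=True).to_dict()
    return len(df), summary


# Toy data standing in for a real dataset (hypothetical column names).
toy = pd.DataFrame({"churned": [1, 0, 0, 0], "upgraded": [1, 1, 0, 0]})
n_rows, balance = class_balance(toy, ["churned", "upgraded"])
print(n_rows, balance)
```

Reporting fractions rather than raw counts makes it easier to compare the balance across response variables.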
Clearly identify each of the response variables of interest. Any additional desired analysis should also be described here.
Outline the desired timeline of the project and any explicit deadlines.
Give a description of how the repository is structured. Example structure description below:
The repo is structured as follows: All *0- (e.g., 10-, 20-, 30-) files contain finalized work for the purpose described (e.g., "process-data"). Subfiles related to the task (e.g., 11-, 12-) should be created in order to explore and document relevant or interesting subtasks.
All files in the repo should run and should contain no errors or blank cells, even if they are midway through development of the proposed task. All notebooks relating to the analysis should have a numerical prefix (e.g., 31-) followed by the exploration (e.g., 31-text-labeling). Utility notebooks should not be numbered, but should be named according to their purpose. All notebooks should have lowercase, hyphenated titles (e.g., 10-process-data, not 10-Process-Data). All notebooks should adhere to literate programming practices (i.e., markdown writing that describes problems, assumptions, and conclusions) and provide adequate but not superfluous code comments.
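The naming convention above can be checked mechanically. The sketch below is one possible interpretation, not a rule set defined by this template: it assumes two-digit prefixes, treats names ending the prefix in 0 as finalized work, and uses the hypothetical helper name `classify_notebook`.

```python
import re

# Assumed patterns: a two-digit prefix followed by lowercase, hyphenated
# words for analysis notebooks; lowercase, hyphenated words with no
# prefix for utility notebooks.
NUMBERED = re.compile(r"^\d{2}-[a-z0-9]+(-[a-z0-9]+)*$")
UTILITY = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def classify_notebook(stem):
    """Classify a notebook name (without its file extension)."""
    if NUMBERED.match(stem):
        # Prefixes ending in 0 (10-, 20-, 30-) hold finalized work;
        # other prefixes (11-, 12-, 31-) hold exploratory subtasks.
        return "finalized" if stem[1] == "0" else "exploration"
    if UTILITY.match(stem):
        return "utility"
    return "invalid"


for name in ["10-process-data", "31-text-labeling", "plot-helpers", "10-Process-Data"]:
    print(name, "->", classify_notebook(name))
```

A check like this could run over the notebook names in the repo before a commit to catch capitalized or unhyphenated titles early.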
Sprint planning:
Demo:
Data location:
Slack channel:
Zoom link:
Provide links to any resources that may be useful in running the repo (e.g., Python, Git, or ACCRE tutorials).
Add contact information for any project stakeholders, including name, email, and title.