This repository is an example of what a Python package or project should include. It is also a self-guided tour of how to build maintainable, professional Python code. To that end, each file includes comments that explain what it is for and how to modify it, and that point to further resources to improve the reader's understanding.
We created this project as a launch platform for data science training. There are many notebooks in the wild to teach you how to use boosted trees or how to display pretty outputs. But so much of bad data science involves bad coding, and so much of good data science requires good code. If we're going to teach data science, we'll have to do it with good code.
We decided to create a skeleton Python distribution with all the bells and whistles. If we want to in the future, we can add data science lessons as branches of the repository, with packages included in this distribution. This format would force aspiring data scientists to use git and interact with Python's packaging structure. Moreover, it would force trainers to use git and Python packaging at a higher level of competency.
You should be able to use the master branch of this repo as a starting point for any Python project.
Sometimes, you want to build a stand-alone application that only you care about. So you start with some code, but before long, you want to reuse the part that parses some input. So you copy and paste it wherever you need it, but later on your inputs change format. Now you need to modify both the original parsing code and every copy. But you used it all over the place. Can you be sure that you fixed each copy?
All of a sudden, you wish that an earlier incarnation of you had the foresight to structure your code better. And what the hell did you mean when, in a bout of frustration, you commented "need to unfuck this variable"? It's completely unclear how the code that follows manages to unfuck anything. Since the output still looks slightly fucked, you change `unfuck(this)` to `str(unfuck(this))` and hope it doesn't break anything.
Finally you want to share your wonderful project. You find and replace 'unfuck' with 'fix', because you're a professional and you own a few ties and have a nice pair of shoes. Then you ship your code to someone and they say, "I don't get how it works, but it does cool stuff. Can you do it with this data?" The answer is no, not for cheap.
Whenever a project takes more than one session of butt in chair, we forget important stuff about how it works. Things like variable names. Column names in a table. What the point of this or that function is. The reality is, we start each session of code as a new person. When you write code, you are ALWAYS writing for the most important client - a future incarnation of yourself. That client's time is valuable, even if they don't own a tie or a pair of dress shoes.
You've learned the first rule of programming. If you want to write good code, don't write a script or a notebook. Write libraries. Write packages.
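As a minimal sketch of the copy-and-paste problem described above, assume a hypothetical input format of `name,value` lines. The names `parse_record`, `total`, and `names` are invented for illustration and are not part of this repository. The point is that only one function knows the input format, so a format change means one fix instead of a hunt for every pasted copy.

```python
# Hypothetical example: in a real package this function would live in its own
# module and callers would import it instead of copying it.
def parse_record(line: str, sep: str = ",") -> dict:
    """Split a 'name,value' line into a dict with a clean name and a float value."""
    name, value = line.split(sep, maxsplit=1)
    return {"name": name.strip(), "value": float(value)}


# Every caller goes through parse_record rather than carrying its own parser.
def total(lines):
    """Sum the values of all records."""
    return sum(parse_record(line)["value"] for line in lines)


def names(lines):
    """List the names of all records, in order."""
    return [parse_record(line)["name"] for line in lines]
```

If the separator later changes from a comma to a semicolon, only `parse_record` (or its `sep` default) needs to change, and both callers pick up the fix.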
- Set up git from the command line (or GitHub desktop) with your credentials
- Clone this repository
- Open Anaconda Prompt (it sometimes helps to use PowerBroker Admin)
- Create, activate, and set up a virtual environment (read requirements.txt or environment.yml)
- Run `pip install -e .` from the project directory (don't forget the final period, which stands for the current directory)
- Test that everything is copacetic. Use `cd test`, then `python -m unittest test.py`
- A basic knowledge of Python (how to import modules, assign variables)
- A basic knowledge of programming (what a loop is, what a function is)
- Familiarity with an IDE (Jupyter doesn't count!)
- This file
- .gitignore
- requirements.txt, environment.yml, and setup.py
- __init__.py
- notebooks/example.ipynb
- Documentation
- seattle/utils.py
- seattle/needle.py
- test/test.py
- docs/README.md
- LICENSE.md
- CONTRIBUTING.md
- vis.py
Except for example files, the contents of data/ and notebooks/ are gitignored.
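
For readers who have not seen a `unittest` file before, here is a rough sketch of the kind of test that `python -m unittest test.py` picks up. It is not the actual content of test/test.py. The helper `clean_column_name` and the test class are hypothetical, and the helper is defined inline only to keep the example self-contained. In a package, it would be imported from the installed library.

```python
import unittest


def clean_column_name(name: str) -> str:
    """Hypothetical helper: trim, lower-case, and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


class TestCleanColumnName(unittest.TestCase):
    """unittest collects every method whose name starts with 'test'."""

    def test_spaces_become_underscores(self):
        self.assertEqual(clean_column_name("Unit Price"), "unit_price")

    def test_surrounding_whitespace_is_dropped(self):
        self.assertEqual(clean_column_name("  Total "), "total")
```

Because the package is installed with `pip install -e .`, a test file like this can import the library by name from any directory, without path manipulation.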