This guide provides instructions for using Python on research projects. It is meant to be shared with collaborators and research assistants so that code is consistent, easier to read, transparent, and reproducible.
For coding style practices, follow the PEP 8 style guide.
- While you should read the style guide and do your best to follow it, there are packages to help you.
- In Jupyter notebooks, before you write your script you can install three packages: `flake8`, `pycodestyle`, and `pycodestyle_magic`.
  - If you are in a Jupyter notebook, after your imports run `%load_ext pycodestyle_magic` and `%flake8_on` in two blank cells, and each cell afterwards will be checked for styling errors upon running (see the sketch below).
- In Spyder, go to Tools > Preferences > Editor > Code Introspection/Analysis and activate the option called `Real-time code style analysis`. This will show bad formatting warnings directly in the editor.
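A minimal sketch of those first notebook cells (assuming the three packages above are already installed, e.g. via pip):

```python
# First two cells of the notebook (each magic in its own cell):
%load_ext pycodestyle_magic
%flake8_on
# Every cell run after this point is checked against PEP 8,
# and any style warnings are printed below the cell.
```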
- Use `pandas` for wrangling data.
- Use `datetime` for working with dates.
- Never use `os.chdir()` or absolute file paths. Instead use relative file paths with the `pyprojroot` package.
  - If you have private information on something like Boxcryptor, this would be the only exception to the rule; in that case, note in your file that this line must be changed.
  - `pyprojroot` looks for the following files to determine which folder is your root folder for the project: .git, .here, *.Rproj, requirements.txt, setup.py, .dvc, .spyproject, pyproject.toml, .idea, .vscode. If you don't have any of them, create a blank file with one of these names in your project root directory.
- Use `assert` frequently to add programmatic sanity checks in the code (see the sketch after this list).
- The pandas `describe()` method can be useful to print a "codebook" of the data, i.e. some summary stats about each variable in a data set.
- Use `pipconflictchecker` to make sure there are no dependency conflicts after mass installing packages through pip.
- Use `fastreg` for fast sparse regressions, particularly good for high-dimensional fixed effects.
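A minimal sketch of the `pyprojroot` and `assert` conventions above (the raw-data path follows the project template described below; the `id` column is hypothetical):

```python
import pandas as pd
from pyprojroot import here  # resolves paths relative to the project root

# Read raw data with a relative path; this works regardless of the current
# working directory, as long as a root marker (.here, .git, setup.py, ...) exists.
df = pd.read_csv(here("./data/example.csv"))

# Quick "codebook": summary stats for each variable
print(df.describe())

# Programmatic sanity checks
assert len(df) > 0, "data set is empty"
assert df["id"].is_unique, "duplicate ids found"  # 'id' is a hypothetical column name
```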
Generally, within the folder where we are doing data analysis (the project's "root folder"), we have the following files and folders.
- .here, .git, or setup.py
  - If you always open the project from the project's root folder (e.g., by navigating to that folder in the terminal with `cd` before running the command `jupyter-lab` to open Jupyter in your browser), then the `pyprojroot` package will work for relative filepaths.
- data - only raw data go in this folder
- documentation - documentation about the data go in this folder
- proc - processed data sets go in this folder
- results - results go in this folder
- figures - subfolder for figures
- tables - subfolder for tables
- scripts - code goes in this folder
- Number scripts in the order in which they should be run
- programs - a subfolder containing functions called by the analysis scripts (if applicable)
- old - a subfolder where, for cleanliness, old scripts from previous versions are stored if there are major changes to the structure of the project (the full layout is sketched below)
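Putting the pieces together, a typical layout looks roughly like this:

```
project/
├── .here          (or .git, setup.py, etc.)
├── data/
├── documentation/
├── proc/
├── results/
│   ├── figures/
│   └── tables/
└── scripts/
    ├── 00_run.py
    ├── programs/
    └── old/
```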
Because we often work with large data sets and efficiency is important, I advocate (nearly) always separating the following three actions into different scripts:
- Data preparation (cleaning and wrangling)
- Analysis (e.g. regressions)
- Production of figures and tables
The analysis and figure/table scripts should not change the data sets at all (no pivoting from wide to long or adding new variables); all changes to the data should be made in the data cleaning scripts. The figure/table scripts should not run the regressions or perform other analysis; that should be done in the analysis scripts. This way, if you need to add a robustness check, you don't necessarily have to rerun all the data cleaning code (unless the robustness check requires defining a new variable). If you need to make a formatting change to a figure, you don't have to rerun all the analysis code (which can take a while to run on large data sets).
- Include a 00_run.py script (described below).
- Number scripts in the order in which they should be run, starting with 01.
- Because a project often uses multiple data sources, I usually include a brief description of the data source being used as the first part of the script name (in the example below, `ex` describes the data source), followed by a description of the action being done (e.g. `dataprep`, `reg`, etc.), with each component of the script name separated by an underscore (`_`).
Keep a script that lists each script that should be run to go from raw data to final results. Under the name of each script should be a brief description of its purpose, as well as all the input data sets and output data sets that it uses. Ideally, a user could run the master script to run the entire analysis from raw data to final results (although this may be infeasible for some projects, e.g. one with multiple confidential data sets that can only be accessed on separate servers).
```python
# Run script for example project

# PACKAGES ------------------------------------------------------------------
import subprocess
from pyprojroot import here

# PRELIMINARIES -------------------------------------------------------------
# Control which scripts run
run_01_ex_dataprep = 1
run_02_ex_reg = 1
run_03_ex_table = 1
run_04_ex_graph = 1

program_list = []

# RUN SCRIPTS ---------------------------------------------------------------
if run_01_ex_dataprep:
    program_list.append(here("./scripts/01_ex_dataprep.py"))
    # Clean the raw data
    # INPUTS
    #   here("./data/example.csv")           # raw data from XYZ source
    # OUTPUTS
    #   here("./proc/example_cleaned.csv")   # cleaned data

if run_02_ex_reg:
    program_list.append(here("./scripts/02_ex_reg.py"))
    # Run regressions
    # INPUTS
    #   here("./proc/example_cleaned.csv")   # from 01_ex_dataprep.py
    # OUTPUTS
    #   here("./proc/ex_fixest.csv")         # fixest object from feols regression

if run_03_ex_table:
    program_list.append(here("./scripts/03_ex_table.py"))
    # Create table of regression results
    # INPUTS
    #   here("./proc/ex_fixest.csv")         # from 02_ex_reg.py
    # OUTPUTS
    #   here("./results/tables/ex_fixest_table.tex")  # tex of table for paper

if run_04_ex_graph:
    program_list.append(here("./scripts/04_ex_graph.py"))
    # Create scatterplot of Y and X with local polynomial fit
    # INPUTS
    #   here("./proc/example_cleaned.csv")   # from 01_ex_dataprep.py
    # OUTPUTS
    #   here("./results/figures/ex_scatter.eps")  # figure

for program in program_list:
    subprocess.call(["python", str(program)])
    print("Finished: " + str(program))
```
If your scripts are .ipynb rather than .py files, instead of using `subprocess.call()` to run the list of programs in `program_list`, replace the `subprocess.call()` loop with the following:
```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

for program in program_list:
    with open(program) as f:
        nb = nbformat.read(f, as_version=4)  # read as a version 4 notebook
    ep = ExecutePreprocessor(timeout=-1, kernel_name='python3')
    ep.preprocess(nb, {'metadata': {'path': str(here('./scripts'))}})
    print("Finished: " + str(program))
```
- Use `matplotlib` for graphing. For graphs with colors, use `cubehelix` for a colorblind friendly palette.
- For reproducible graphs, always set the figure size explicitly (e.g., the `figsize` argument when creating the figure, or `fig.set_size_inches()`) before saving with `savefig` (see the sketch after this list).
- To see what the final graph looks like, open the file that you save, since its appearance will differ from what you see in the Jupyter notebook.
- For high resolution, save graphs as .pdf or .eps files.
  - I've written a Python function `crop_eps` to crop .eps files for the times when you can't get the cropping just right; `crop_pdf` is coming soon.
- For maps (and working with geospatial data more broadly), use `GeoPandas`.
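A minimal sketch of a reproducible `matplotlib` figure along these lines (the plotted lines and output filename are illustrative):

```python
import matplotlib.pyplot as plt
from pyprojroot import here

fig, ax = plt.subplots(figsize=(6, 4))       # set the figure size explicitly
colors = plt.cm.cubehelix([0.2, 0.5, 0.8])   # sample the colorblind-friendly cubehelix palette
for slope, color in zip([1, 2, 3], colors):
    ax.plot(range(10), [slope * x for x in range(10)], color=color)
fig.savefig(here("./results/figures/example_lines.pdf"))  # open this file to check the final appearance
```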
- For small data sets, save as .csv with `pandas.to_csv()` and read with `pandas.read_csv()` (see the sketch after this list).
- For larger data sets, save with `pandas.to_pickle()` using a .pkl file extension, and read with `pandas.read_pickle()`.
- For truly big data sets (hundreds of millions or billions of observations), use `write.parquet()` and `read.parquet()` from `pyspark.sql`.
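A minimal sketch of the csv and pickle options (filenames are illustrative; the pyspark option additionally requires a running SparkSession, so it is omitted here):

```python
import pandas as pd
from pyprojroot import here

df = pd.read_csv(here("./data/example.csv"))

# Small data sets: csv
df.to_csv(here("./proc/example_cleaned.csv"), index=False)
df = pd.read_csv(here("./proc/example_cleaned.csv"))

# Larger data sets: pickle (faster to read/write and preserves dtypes)
df.to_pickle(here("./proc/example_cleaned.pkl"))
df = pd.read_pickle(here("./proc/example_cleaned.pkl"))
```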
When randomizing assignment in a randomized control trial (RCT):
- Seed: Use a seed from https://www.random.org/: put Min 1 and Max 100000000, then click Generate, and copy the result into your script at the appropriate place. Towards the top of the script, assign the seed with the lines `seed = ...  # from random.org` and `random.seed(seed)`, where `...` is replaced with the number that you got from random.org.
- Use the `stochatreat` package to assign treatment and control groups.
- Build a randomization check: create the random assignment variable a second time with a new name, repeating `random.seed(seed)` immediately before creating the second variable. Then check that the randomization is identical using `assert (df.var1 == df.var2).all()` (see the sketch after this list).
- It is also good to do a more manual check where you run the full script once, save the resulting data with a different name, then restart Python (see instructions below) and run it a second time. Then read in both data sets with the random assignment and assert that they are identical.
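A minimal sketch of the seed and the programmatic randomization check (the data frame and the simple coin-flip assignment are illustrative; in an actual RCT the assignment would come from `stochatreat`):

```python
import random
import pandas as pd

df = pd.DataFrame({"id": range(100)})  # illustrative data

seed = 12345  # replace with the number you got from random.org
random.seed(seed)
df["treat"] = [random.randint(0, 1) for _ in range(len(df))]

# Randomization check: re-seed and repeat the assignment under a new name,
# then assert that the two assignments are identical
random.seed(seed)
df["treat_check"] = [random.randint(0, 1) for _ in range(len(df))]
assert (df.treat == df.treat_check).all()
```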
Above I described how data preparation scripts should be separate from analysis scripts. Randomization scripts should also be separate from data preparation scripts, i.e. any data preparation needed as an input to the randomization should be done in one script and the randomization script itself should read in the input data, create a variable with random assignments, and save a data set with the random assignments.
Once you complete a Jupyter notebook, which you might be running line by line, make sure it runs on a fresh Python session. To do this, use the menus and select `Kernel` > `Restart and run all` to ensure that the script runs in its entirety.
Create a virtual environment to run your project. Use a virtual environment through `venv` (instead of `pyenv`) to manage the packages in a project and avoid conflicts related to package versioning.
- If you are using Anaconda, navigate to the directory of the project in the command line and type `conda create -n yourenvname python=x.x anaconda`. Activate the environment using `conda activate yourenvname`; `conda deactivate` will exit the environment.
  - First run `conda install pip` to install pip to your environment.
  - Final step in Anaconda: to install packages, find your anaconda directory; it should be something like `/anaconda/envs/venv_name/`. Install new packages using `/anaconda/envs/venv_name/bin/pip install package_name`; this can also be used to install from the requirements.txt file. To create a `requirements.txt` file, use `pip freeze -l > requirements.txt`.
- If you are only using Python 3, `python3 -m venv yourenvname` will create your environment. Activate the environment using `source yourenvname/bin/activate`; `deactivate` will exit the environment.
  - In the command line, after activating your virtual environment, `pip freeze > requirements.txt` will create a text document of the packages in the environment to include in your project directory. `pip install -r requirements.txt` in a virtual environment will install all the required packages for the project.