This repository contains my workspace for doing data science in Python.

Prerequisites:
- Anaconda or Miniconda
- Apache Spark (with Hadoop)

Installation:
- Create the conda environment if it does not exist yet:
    conda create -n data_science python=3.7
- Activate the environment:
    source activate data_science
- Set up the workspace:
    pip install -U pip numpy
    pip install -r requirements.txt
    python -m ipykernel install --user
- Set up Jupyter notebooks:
    jupyter contrib nbextension install --user
    jupyter nbextensions_configurator enable --user
    jupyter nbextension install https://github.com/drillan/jupyter-black/archive/master.zip --user
    jupyter nbextension enable jupyter-black-master/jupyter-black
- Set up JupyterLab:
    jupyter labextension install jupyter-leaflet
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @krassowski/jupyterlab_go_to_definition
    jupyter labextension install jupyterlab_bokeh
    jupyter labextension install ipysheet
    jupyter labextension install jupyterlab-drawio
    jupyter labextension install @jupyterlab/toc
    jupyter labextension install jupyterlab_vim
    jupyter labextension install @jupyterlab/git
    pip install jupyterlab-git
    jupyter serverextension enable --py jupyterlab_git
    jupyter labextension install @ryantam626/jupyterlab_code_formatter
    pip install jupyterlab_code_formatter
    jupyter serverextension enable --py jupyterlab_code_formatter
- Reactivate the environment:
    source deactivate data_science
    source activate data_science
- Load the submodules:
    git submodule init
    git submodule update
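
Whether the conda environment already exists can be checked before creating it. The following is a minimal sketch, assuming that `conda env list --json` prints a JSON object with an `envs` array of environment paths. The helper name `env_exists` and the sample paths are hypothetical and not part of this repository.

```python
import json
import os


def env_exists(conda_env_list_json: str, name: str) -> bool:
    """Return True if an environment called `name` appears in the JSON
    printed by `conda env list --json` (assumed shape: {"envs": [paths]})."""
    env_paths = json.loads(conda_env_list_json).get("envs", [])
    # A named environment lives in a directory whose last component is its name.
    return any(os.path.basename(os.path.normpath(p)) == name for p in env_paths)


# Illustrative input only; real output comes from `conda env list --json`.
sample_output = '{"envs": ["/opt/conda", "/opt/conda/envs/data_science"]}'
print(env_exists(sample_output, "data_science"))
```

In practice the JSON string would be captured from the conda command itself, for example with `subprocess.run(["conda", "env", "list", "--json"], capture_output=True, text=True)`.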

Usage:
- Activate the environment (if not already activated in this session):
    source activate data_science
- Set the Spark environment variables:
    export SPARK_HOME=/opt/spark
    export PATH=$SPARK_HOME/bin:$PATH
- Start Jupyter Notebook:
    jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000
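
If Spark does not start from a notebook, it can help to verify that the environment variables are visible to Python. This is a minimal sketch, not part of the repository; the function name `check_spark_env` is hypothetical, and it only checks that `SPARK_HOME` is set and that its `bin` folder is on `PATH`.

```python
import os


def check_spark_env(environ=None):
    """Return a list of human-readable problems with the Spark variables."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    problems = []
    spark_bin = os.path.join(spark_home, "bin")
    # PATH entries are separated by os.pathsep (":" on Linux and macOS).
    if spark_bin not in environ.get("PATH", "").split(os.pathsep):
        problems.append(f"{spark_bin} is not on PATH")
    return problems


# Check the current process environment; an empty list means no problem found.
print(check_spark_env())
```

Running this in the first cell of a notebook shows whether the kernel inherited the variables exported in the shell.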

Update:
- Get the latest changes from upstream:
    git pull
- Activate the environment (if not already activated in this session):
    source activate data_science
- Update the dependencies:
    pip install -r requirements.txt
- Reactivate the environment:
    source deactivate data_science
    source activate data_science
- Update the submodules:
    git submodule init
    git submodule update

Upgrade:
- Activate the environment (if not already activated in this session):
    source activate data_science
- Upgrade the dependencies:
    pip-compile --upgrade
    pip install -r requirements.txt
- Reactivate the environment:
    source deactivate data_science
    source activate data_science

Facets is a tool for the visual exploration of datasets. It can be installed as follows:
    jupyter nbextension install facets/facets-dist/ --user
Jupyter Notebook must then be started with an additional command-line option:
    --NotebookApp.iopub_data_rate_limit=10000000
The visualization can then be loaded as explained in the demo notebook.

On Linux computers with NVIDIA Optimus, the kernel must be launched through "optirun" to be able to use GPU acceleration. To set this up, go to the following folder:
    cd ~/.local/share/jupyter/kernels/
Then edit the file python3/kernel.json and add "optirun" as the first entry of the argv array:
    {
      "language": "python",
      "display_name": "Python 3",
      "argv": [
        "optirun",
        "/home/fabien/.conda/envs/data_science/bin/python",
        "-m",
        "ipykernel",
        "-f",
        "{connection_file}"
      ]
    }
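
The same edit can be scripted instead of done by hand. The sketch below is only an illustration under the assumption that kernel.json has the layout shown above; the helper names `add_optirun` and `patch_kernel_file` are hypothetical. Applying it twice does not add a second "optirun" entry.

```python
import json
from pathlib import Path


def add_optirun(spec: dict) -> dict:
    """Return a copy of a kernel spec whose argv starts with "optirun"."""
    argv = list(spec.get("argv", []))
    # Prepend only when missing, so the function is safe to run repeatedly.
    if not argv or argv[0] != "optirun":
        argv.insert(0, "optirun")
    return {**spec, "argv": argv}


def patch_kernel_file(path: Path) -> None:
    """Rewrite a kernel.json file in place with "optirun" prepended to argv."""
    spec = json.loads(path.read_text())
    path.write_text(json.dumps(add_optirun(spec), indent=1) + "\n")


# Usage, with the path taken from the instructions above:
# patch_kernel_file(Path.home() / ".local/share/jupyter/kernels/python3/kernel.json")
```

Keeping a backup copy of kernel.json before patching is advisable, since a malformed file prevents the kernel from starting.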

I recommend installing the following notebook extensions:
- Code prettify
- Codefolding
- Collapsible Headings
- contrib_nbextensions_help_item
- Execute time
- Initialization cells
- Jupyter Black
- Nbextensions dashboard tab
- Nbextensions edit menu item
- Notify
- Python Markdown
- Runtools
- ScrollDown
- Skip-Traceback
- spellchecker
- table_beautifier
- Table of Contents (2)
- Tree Filter
- VIM binding
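
The extensions can be enabled from the Nbextensions dashboard tab, or from the command line with `jupyter nbextension enable`. The sketch below only builds the command lines for a few of the extensions listed above. The require paths in `ASSUMED_REQUIRE_PATHS` are assumptions based on the usual layout of jupyter_contrib_nbextensions and should be verified against the installed version; the function name `enable_commands` is hypothetical.

```python
import shlex

# Assumed require paths for some of the extensions listed above; verify them
# in the Nbextensions dashboard tab before relying on them.
ASSUMED_REQUIRE_PATHS = [
    "codefolding/main",
    "collapsible_headings/main",
    "execute_time/ExecuteTime",
    "toc2/main",
]


def enable_commands(require_paths, user=True):
    """Build one `jupyter nbextension enable` command line per extension."""
    commands = []
    for path in require_paths:
        argv = ["jupyter", "nbextension", "enable", path]
        if user:
            argv.append("--user")
        # shlex.join quotes arguments so the line is safe to paste in a shell.
        commands.append(shlex.join(argv))
    return commands


for command in enable_commands(ASSUMED_REQUIRE_PATHS):
    print(command)
```

The printed lines can be reviewed and then pasted into a shell where the data_science environment is active.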