Getting dask up and running at the local T2, based on this.
Clone the repository and work inside it:
git clone https://github.com/aminnj/daskucsd
cd daskucsd
Install conda and get all the dependencies:
curl -O -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b
# this adds conda setup to the end of ~/.bashrc, so re-login (or source ~/.bashrc) after executing this line
~/miniconda3/bin/conda init
# stop conda from activating the base environment on login
conda config --set auto_activate_base false
conda config --add channels conda-forge
# install conda-pack (used to tarball environments) into the base environment
conda install --name base conda-pack -y
# create the environments, pulling as much as possible from conda
conda create --name workerenv uproot dask -y
conda create --name analysisenv uproot dask matplotlib pandas jupyter hdfs3 -y
# and then install residual packages with pip
conda run --name workerenv pip install dask-jobqueue
conda run --name analysisenv pip install jupyter-server-proxy dask-jobqueue
# make the tarball for the worker nodes
conda pack -n workerenv --arcroot workerenv -f --format tar.gz \
--compress-level 9 -j 8 --exclude "*.pyc" --exclude "*.js.map" --exclude "*.a"
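The tarball is what each worker job unpacks to get a self-contained Python environment. A minimal sketch of the worker-side steps (the standard conda-pack workflow), assuming the job ships workerenv.tar.gz as an input file and is handed the scheduler address in SCHEDULER (the repository's condor_utils.py automates the submission itself):
# unpack and activate the relocated environment on the worker node
mkdir -p workerenv
tar -xzf workerenv.tar.gz -C workerenv
source workerenv/bin/activate
conda-unpack  # fix up the hard-coded prefixes left by conda-pack
# connect a worker to the scheduler
dask-worker $SCHEDULER --nthreads 1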
You need a scheduler and a set of workers. You can either set this up manually with some bash processes, or automatically from within a jupyter notebook.
Start dask scheduler in a GNU screen/separate terminal:
( conda activate analysisenv && dask-scheduler --port 50123 )
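For example, to keep it running detached (assuming miniconda was installed to ~/miniconda3 as above):
# run the scheduler in a detached GNU screen session; reattach later with `screen -r daskscheduler`
screen -S daskscheduler -dm bash -c \
  "source ~/miniconda3/etc/profile.d/conda.sh && conda activate analysisenv && dask-scheduler --port 50123"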
Submit some workers:
python condor_utils.py -r <hostname:port of scheduler> -n 10
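For example (the hostname below is a placeholder; use whatever the machine running dask-scheduler reports as its address):
# hypothetical scheduler host/port; substitute the machine running dask-scheduler
python condor_utils.py -r uaf-10.t2.ucsd.edu:50123 -n 10
# confirm the worker jobs are queued/running
condor_q $USER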
Start analysis jupyter notebook:
( conda activate analysisenv && jupyter notebook --no-browser )
and then run cluster.ipynb.
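The notebook prints a URL with a port and token; if you need to recover it later (e.g. for the SSH tunnel below), list the running servers:
# list running notebook servers to get the port and token
( conda activate analysisenv && jupyter notebook list )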
To forward the notebook port to your local machine (run these on your local machine; PORT is the notebook's port, HOST is the machine running it):
# kill any stale forward already using this port
ps aux | grep "localhost:$PORT" | grep -v "grep" | awk '{print $2}' | xargs kill -9
# forward the port over ssh in the background
ssh -N -f -L localhost:$PORT:localhost:$PORT $HOST
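For example, with hypothetical values (jupyter defaults to port 8888; the host is wherever the notebook is running):
# placeholder values; substitute your own port and login host
PORT=8888
HOST=uaf-10.t2.ucsd.edu
ssh -N -f -L localhost:$PORT:localhost:$PORT $HOST
# then open http://localhost:8888 in a local browser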