Notebooks for Master of Data Science Rennes
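You can run these notebooks with Docker. The following commands clone the repository and start a container with the Notebook server listening for HTTP connections on ports 8888 and 4040, without authentication configured. Run the clone from your home directory, since the container mounts $HOME/big-data.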
git clone https://github.com/pnavaro/big-data.git
docker run --rm -d -v $HOME/big-data:/home/jovyan/ -p 8888:8888 -p 4040:4040 pnavaro/big-data
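Once the container is up, it can be handy to confirm that the published ports actually answer HTTP requests. The snippet below is a minimal sketch using only the Python standard library; the helper name `http_status` is hypothetical and not part of this repository. It assumes the container runs on the local machine. Port 4040 may return nothing until a service is actually running on it inside the container.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if nothing answers.

    Hypothetical helper for illustration; not part of the repository.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 404).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, or timed out.
        return None


if __name__ == "__main__":
    # Ports published by the docker run command above (assumed local host).
    for port in (8888, 4040):
        status = http_status(f"http://localhost:{port}/")
        if status is None:
            print(f"port {port}: no HTTP response")
        else:
            print(f"port {port}: HTTP {status}")
```

A plain TCP connection test is deliberately avoided here: Docker may accept connections on a published port even when no process inside the container is listening, so an HTTP request is a more reliable signal.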
- Python for Data Analysis by Wes McKinney.
- Python Data Science Handbook by Jake VanderPlas.
- Python
- Analyzing and Manipulating Data with Pandas, Beginner SciPy 2016 Tutorial by Jonathan Rocher.
- Dask
- Parallel Data Analysis with Dask, Dask tutorial at PyCon 2018 by Tom Augspurger.
- Parallelizing Scientific Python with Dask, SciPy 2018 Tutorial by James Crist and Martin Durant.
- Parallelizing Scientific Python with Dask, SciPy 2017 Tutorial by James Crist.
- Parallel Python: Analyzing Large Datasets, Intermediate SciPy 2016 Tutorial by Matthew Rocklin.
- Parallel Data Analysis in Python, SciPy 2017 Tutorial by Matthew Rocklin, Ben Zaitlen & Aron Ahmadia.
- Hadoop
- Writing an Hadoop MapReduce Program in Python by Michael G. Noll.
- Spark
- Don't use Hadoop - your data isn't that big
- Format Wars: From VHS and Beta to Avro and Parquet, an overview of Hadoop file formats.
- Should you replace Hadoop with your laptop? by Vicki Boykis.
- Implementing MapReduce with multiprocessing by Doug Hellmann.
- Deploying Dask on YARN by Jim Crist.
- Native Hadoop file system (HDFS) connectivity in Python by Wes McKinney.
- Working Notes from Matthew Rocklin (must read)
- DataCamp Cheat Sheets
- Outils pour le Big Data by Pierre Nerzic. 🇫🇷
- wikistat - Ateliers Big Data by Philippe Besse. 🇫🇷
- Data Science and Big Data with Python by Steve Phelps.
Pierre
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.