Open source resources for data engineering, machine learning, and data science, inspired by http://datasciencemasters.org/.
The Jupyter Notebook is part of the Anaconda Distribution.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
The pandas package is part of the Anaconda Distribution.
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
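As a brief illustration of the data structures mentioned above, here is a minimal sketch of a pandas DataFrame and a grouped aggregation. The column names and values are invented for the example and do not come from any dataset referenced here.

```python
import pandas as pd

# Hypothetical sample data; the column names and values are invented for illustration.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima"],
        "temp_c": [3.0, 19.0, 5.0, 21.0],
    }
)

# A DataFrame is a table of labeled columns; selecting one column yields a Series.
# Group the rows by city and compute the mean temperature for each group.
mean_by_city = df.groupby("city")["temp_c"].mean()

print(mean_by_city)
```

The result is a Series indexed by city, so individual values can be looked up by label, for example `mean_by_city["Oslo"]`.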
- Python for Data Analysis, written by Wes McKinney, the main author of the pandas package, is a good book to start with. The second edition is planned for release in August 2017.
- NumPy is the fundamental package for scientific computing with Python. http://www.numpy.org/
- Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. http://matplotlib.org/
- Seaborn is a Python visualization library based on Matplotlib. https://seaborn.pydata.org/
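To show what NumPy's arrays look like in practice, here is a small sketch of element-wise arithmetic and axis-wise reductions. The values are illustrative only; arrays like these are what the plotting libraries listed above typically take as input.

```python
import numpy as np

# Illustrative values only.
a = np.array([1.0, 2.0, 3.0, 4.0])

# Arithmetic applies element-wise, without an explicit Python loop.
squared = a ** 2

# Reshape the same data into a 2x2 matrix and sum along each axis.
m = a.reshape(2, 2)
col_sums = m.sum(axis=0)  # sum down the rows, one value per column
row_sums = m.sum(axis=1)  # sum across the columns, one value per row

print(squared, col_sums, row_sums)
```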
Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.
Data Wrangling Stages
- Formulating a question
- Data Acquisition
- Data Cleaning
- Data Exploration
- Communicating the data findings
- Scaling with larger datasets
- Automating the process
Source: Data Wrangling with Python by Jacqueline Kazil and Katharine Jarmul.
Data Wrangling with Python notebook
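The acquisition, cleaning, and exploration stages can be sketched with pandas as follows. This is a minimal example under stated assumptions, not the method from the book: the CSV content, column names, and cleaning rules are all hypothetical, and the data is inlined so the snippet runs on its own.

```python
import io

import pandas as pd

# Data acquisition: a hypothetical messy CSV, inlined so the sketch is self-contained.
raw_csv = io.StringIO(
    "name,age,city\n"
    "Ann,34,Oslo\n"
    "bob,,lima\n"
    "Ann,34,Oslo\n"
    "Cy,29,Lima\n"
)
df = pd.read_csv(raw_csv)

# Data cleaning: drop exact duplicate rows, drop rows with a missing age,
# and normalize the capitalization of city names.
clean = (
    df.drop_duplicates()
    .dropna(subset=["age"])
    .assign(city=lambda d: d["city"].str.title())
)

# Data exploration: simple summaries of the cleaned table.
counts = clean["city"].value_counts()
mean_age = clean["age"].mean()

print(clean)
print(counts)
print(mean_age)
```

Chaining the cleaning steps keeps the raw `df` untouched, which makes it easier to rerun or automate the process later.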
Introduction to Machine Learning with Python by Andreas C. Mueller and Sarah Guido
Supervised Learning: Classification and Regression
Introduction to Scikit-Learn notebook
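As a rough illustration of the two supervised tasks named above, the sketch below fits a classifier and a regressor with scikit-learn. The choice of the iris dataset, k-nearest neighbors, and the synthetic regression data are assumptions made for this example; they are not taken from the book or the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classification: predict a discrete label (iris species) from measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

# Regression: predict a continuous value.
# Synthetic data following y = 2x + 1, invented for illustration.
X_reg = np.arange(10, dtype=float).reshape(-1, 1)
y_reg = 2.0 * X_reg.ravel() + 1.0
reg = LinearRegression().fit(X_reg, y_reg)

print(f"Test accuracy: {accuracy:.2f}")
print(f"Slope: {reg.coef_[0]:.2f}, intercept: {reg.intercept_:.2f}")
```

Holding out a test set with `train_test_split` gives an estimate of how the classifier behaves on data it has not seen during fitting.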
Unsupervised Learning: Clustering, Dimensionality Reduction
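Both unsupervised tasks can be sketched with scikit-learn on unlabeled data. The synthetic blobs, the choice of k-means and PCA, and all parameter values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two hypothetical, well-separated groups of points in 3 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each point to one of two groups without using labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 3-D points onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels)
print(X_2d.shape)
```

The cluster numbering (0 or 1) is arbitrary; only the grouping of points is meaningful.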