DublinDataEngineering

Open-source resources for Data Engineering, Machine Learning, and Data Science, inspired by http://datasciencemasters.org/

Toolset

The Jupyter Notebook is part of the Anaconda Distribution.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Introduction to Python (Pandas library)

Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open.

The pandas package is part of the Anaconda Distribution.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
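As a minimal sketch of the two core pandas data structures, the snippet below builds a Series and a DataFrame and runs a simple aggregation. The city names and numbers are made-up illustration values, not data from any of the linked resources.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
# (Values are invented for illustration.)
temps = pd.Series([12.5, 14.0, 9.5], index=["Mon", "Tue", "Wed"], name="temp_c")

# A DataFrame is a two-dimensional table of labeled columns.
df = pd.DataFrame(
    {
        "city": ["Dublin", "Cork", "Dublin", "Galway"],
        "rainfall_mm": [3.2, 5.1, 0.0, 7.4],
    }
)

# Label-based lookup on the Series.
tuesday = temps["Tue"]

# Group rows by city and average the rainfall within each group.
mean_rain = df.groupby("city")["rainfall_mm"].mean()

print(tuesday)
print(mean_rain)
```

The groupby result is itself a Series indexed by city, so the same label-based lookup works on it.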

Introduction to Pandas
Intro to pandas data structures by Greg Reda
Python for Data Analysis is a good book to start with, written by Wes McKinney, the main author of the pandas package. The second edition is planned for release in August 2017.

Introduction to Python (NumPy library)

NumPy is the fundamental package for scientific computing with Python. http://www.numpy.org/
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. http://matplotlib.org/
Seaborn is a Python visualization library based on matplotlib. https://seaborn.pydata.org/
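To give a feel for what "scientific computing with NumPy" looks like in practice, here is a small sketch of array creation, elementwise arithmetic, and broadcasting. The arrays are arbitrary illustration values; plotting with Matplotlib or Seaborn is left out to keep the example dependency-light.

```python
import numpy as np

# Elementwise arithmetic works on whole arrays without explicit loops.
values = np.array([1.0, 2.0, 3.0, 4.0])
squared = values ** 2

# A 2x3 matrix holding the integers 0..5, filled row by row.
matrix = np.arange(6).reshape(2, 3)

# Reduce along axis 0 to get one sum per column.
col_sums = matrix.sum(axis=0)

# Broadcasting: the per-column mean (shape (3,)) is subtracted
# from every row of the (2, 3) matrix.
centered = matrix - matrix.mean(axis=0)

print(squared)
print(col_sums)
print(centered)
```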

Data Wrangling with Python

Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.

Data Wrangling Stages

Formulating a question
Data Acquisition
Data Cleaning
Data Exploration
Communicating the data findings
Scaling with larger datasets
Automating the process

Source: Data Wrangling with Python by Jacqueline Kazil and Katharine Jarmul.
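The acquisition, cleaning, and exploration stages listed above can be sketched with pandas as follows. This is an illustrative example under stated assumptions, not code from the book: the inline CSV, its column names, and the cleaning rules (drop duplicates, normalize capitalization, fill missing ages with the median) are all invented for demonstration.

```python
import io

import pandas as pd

# Hypothetical messy input: a duplicate row, inconsistent
# capitalization, a stray leading space, and a missing age.
raw_csv = """name,age,city
Alice,34,Dublin
bob,,cork
Alice,34,Dublin
Carol,29, Galway
"""

# Data acquisition: load the raw text into a DataFrame.
df = pd.read_csv(io.StringIO(raw_csv))

# Data cleaning: remove exact duplicate rows.
df = df.drop_duplicates()

# Data cleaning: trim whitespace and normalize capitalization.
df["name"] = df["name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()

# Data cleaning: fill the missing age with the median of the known ages.
df["age"] = df["age"].fillna(df["age"].median())

# Data exploration: average age per city.
summary = df.groupby("city")["age"].mean()

print(df)
print(summary)
```

Filling with the median is only one possible policy; depending on the question being asked, dropping incomplete rows may be the better choice.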

Data Wrangling with Python notebook

Machine Learning with Python

Introduction to Machine Learning with Python by Andreas C. Mueller and Sarah Guido

Supervised Learning: Classification and Regression
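A minimal scikit-learn sketch of both supervised tasks: a k-nearest-neighbors classifier on the bundled iris dataset, and a linear regression on synthetic data. The choice of models, the 75/25 split, and the synthetic line y = 3x + 2 are assumptions made for illustration, not taken from the book or the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# --- Classification: predict a discrete label ---
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct predictions
print("classification accuracy:", accuracy)

# --- Regression: predict a continuous value ---
# Synthetic, noise-free data on the line y = 3x + 2 (illustrative only).
x = np.arange(10, dtype=float).reshape(-1, 1)
target = 3.0 * x.ravel() + 2.0

reg = LinearRegression().fit(x, target)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
```

Both estimators share the same fit/predict/score interface, which is the main convention to learn when starting with scikit-learn.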

Introduction to Scikit-Learn notebook

Unsupervised Learning: Clustering, Dimensionality Reduction
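The two unsupervised tasks named here can be sketched with scikit-learn as below: k-means for clustering and PCA for dimensionality reduction. The data is synthetic (two well-separated Gaussian blobs in three dimensions), and the choice of k-means and PCA is an assumption for illustration; other algorithms fit the same headings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: two blobs of 50 points each in 3-D,
# centered at (0, 0, 0) and (5, 5, 5).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: assign each point to one of two clusters, without labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 3-D points onto the
# two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels)
print(X_2d.shape)
print(pca.explained_variance_ratio_)
```

The cluster numbering (0 or 1) is arbitrary; only the grouping of points is meaningful.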