DublinDataEngineering

Open-source resources for data engineering, machine learning, and data science, inspired by The Open-Source Data Science Masters (http://datasciencemasters.org/).

Toolset

The Jupyter Notebook is included in the Anaconda Distribution.

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.

Introduction to Python (Pandas library)

The pandas package is included in the Anaconda Distribution.

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
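As a minimal sketch of the data structures described above, the snippet below builds a small DataFrame, filters it with a boolean mask, and aggregates it by group. The column names and values are invented for illustration and do not come from the notebooks in this repository.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {
        "city": ["Dublin", "Cork", "Dublin", "Galway"],
        "sales": [120, 80, 150, 60],
    }
)

# Select rows with a boolean mask.
large = df[df["sales"] > 100]

# Aggregate: total sales per city.
totals = df.groupby("city")["sales"].sum()

print(large)
print(totals)
```

A DataFrame holds labelled columns, so filtering and grouping are expressed by column name rather than by position.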

Introduction to Python (NumPy library)
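A minimal sketch of what working with NumPy arrays might look like, assuming NumPy is installed (it ships with Anaconda). The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Arbitrary illustrative values.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Vectorized arithmetic: the operation applies to every element at once.
scaled = values * 2.0

# Reshape the flat array into a 2x2 matrix and average down each column.
matrix = values.reshape(2, 2)
column_means = matrix.mean(axis=0)

print(scaled)
print(column_means)
```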

Data Wrangling with Python

Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.

Data Wrangling Stages

  • Formulating a question

  • Data Acquisition

  • Data Cleaning

  • Data Exploration

  • Communicating the data findings

  • Scaling with larger datasets

  • Automating the process

Source: Data Wrangling with Python by Jacqueline Kazil and Katharine Jarmul.
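The acquisition, cleaning, and exploration stages listed above can be sketched with pandas roughly as follows. This is an illustrative example under stated assumptions, not code from the book or the notebook: the CSV text, the column names, and the `wrangle` helper are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical messy input: inconsistent casing, a stray space,
# a missing age, and an exact duplicate row.
RAW_CSV = """name,age,city
Alice,34,Dublin
bob,,cork
Alice,34,Dublin
Carol,29, Galway
"""


def wrangle(raw_text):
    """Hypothetical helper covering acquisition and cleaning."""
    # Data acquisition: here the "source" is just in-memory text.
    df = pd.read_csv(io.StringIO(raw_text))

    # Data cleaning: drop exact duplicates, normalize strings,
    # and discard rows with no age.
    df = df.drop_duplicates()
    df["name"] = df["name"].str.strip().str.title()
    df["city"] = df["city"].str.strip().str.title()
    df = df.dropna(subset=["age"])
    df["age"] = df["age"].astype(int)
    return df.reset_index(drop=True)


clean = wrangle(RAW_CSV)

# Data exploration: a simple summary statistic.
mean_age = clean["age"].mean()
print(clean)
print(mean_age)
```

Wrapping the steps in a function is one way to approach the "automating the process" stage: the same cleaning logic can then be rerun on new input.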

Data Wrangling with Python notebook

Machine Learning with Python

Introduction to Machine Learning with Python by Andreas C. Mueller and Sarah Guido

Supervised Learning: Classification and Regression

Introduction to Scikit-Learn notebook
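A minimal sketch of the two supervised tasks named above, assuming scikit-learn is installed. The choice of the bundled iris dataset, a k-nearest-neighbors classifier, and a synthetic straight-line regression target are assumptions made for illustration and may differ from what the notebook uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classification: predict a discrete label (iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

# Regression: predict a continuous value.
# Synthetic, noise-free data following y = 2x + 1 (illustrative only).
x = np.arange(10, dtype=float).reshape(-1, 1)
target = 2.0 * x.ravel() + 1.0
reg = LinearRegression().fit(x, target)

print(accuracy)
print(reg.coef_[0], reg.intercept_)
```

Both estimators share the same `fit` / `predict` / `score` interface, which is the main pattern to take away from the example.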

Unsupervised Learning: Clustering, Dimensionality Reduction

Introduction to Scikit-Learn notebook2
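A minimal sketch of clustering and dimensionality reduction, assuming scikit-learn is installed. Using k-means with three clusters and PCA with two components on the iris measurements is an illustrative choice, not necessarily what the notebook does.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load features only; the labels are ignored because the task is unsupervised.
X, _ = load_iris(return_X_y=True)

# Put all features on a comparable scale before clustering and PCA.
X_scaled = StandardScaler().fit_transform(X)

# Clustering: assign each sample to one of three groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Dimensionality reduction: project four features onto two components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(labels[:10])
print(X_2d.shape)
```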