This repository contains notebooks for understanding some concepts of statistics and econometrics that can be helpful in data science.

Possible project for Kharagpur Winter of Code 2020

Statistics and Econometrics for Data Science

Table of Contents

  1. How are the topics even related to ML?
  2. What will the project entail?
  3. How to start with the project?
  4. What are the prerequisites for the project?
  5. What can you contribute to the project?
  6. Expectations from the project
  7. How much is ML and how much is statistics/econometrics?
  8. Who to contact?

How are the topics even related to ML?

Often, while building ML models, we become too concerned with accuracy and forget to ask whether the model actually does what we set out to do. Statistics and econometrics help in building better models and understanding the data: they support better feature engineering and a clearer understanding of the assumptions behind a model. Running a linear regression sounds easy, but what if someone asks you which assumptions you made while running the model? If your answer is "Umm...", then you are on track to understanding what these topics can contribute to ML (if you didn't already know).

Due to certain limitations, the project is concerned only with linear regression for the time being. This is a very small subset of ML, but let's progress in small steps.

What will the project entail?

The project aims to build a series of notebooks that help in understanding the basic topics. The notebooks can be used to get a broad overview of a topic or to revise it quickly. They can be helpful in the following situations:

  • You are participating in a competition and want to run some quick checks on the data/model
  • You are sitting for an internship/placement and need to revise some topics fast
  • You want a code snippet for a certain test and guidance on how to interpret the test results

How to start with the project?

  1. Install Jupyter Notebook; installing it with Anaconda is recommended
  2. Learn how to use Jupyter Notebook and the Python libraries NumPy, pandas and Matplotlib
  3. Clone this repo and create a new branch
  4. Each .ipynb file should stand on its own, so you should be able to open any of them directly in Jupyter Notebook

What are the prerequisites for the project?

  • Basic knowledge of at least one programming language (preferably Python)
  • Basic knowledge of probability (class 12 level)
  • Desire to learn statistics

What can you contribute to the project?

Easy: Make some changes to the existing graphs or explanations to make them look better, add new ideas to 'ideas.md', or check whether the existing notebooks make sense

Intermediate: Start off with a new notebook of your own

Advanced: Make a series of notebooks or explain a complicated/advanced topic

Expectations from the project

There will be a variety of issues: some easy, to get you started, and some harder, to let you make a significant contribution. Here is the minimum work expected for you to pass: by the mid-evaluations you should have at least one new notebook ready, and by the end-evaluations at least three. Each notebook should include an introduction to the topic, mathematical proofs if required, code that implements the topic from scratch, and any ready-made library code, if available.

The notebooks referred to here are Jupyter Notebooks.

How much is ML and how much is statistics/econometrics?

Well, what you learn from this project will lean less towards ML. These topics support ML; they do not replace a course or project based purely on machine learning.

Who to contact?

The project was started by PetalsOnWind (Pankhuri Saxena, a fourth-year Economics student at IIT KGP). She can be reached at pankhurisaxena[dot]iitkgp[at]gmail[dot]com.