Awesome-Machine-Learning-DataScience_Resources: A repository from rohan-paul

If you are new to Data Science
Natural Language Processing (NLP)
Statistics
Blogs and other Community based resources
ML Math
Inspirational Stories of people breaking into Machine Learning and Data-Science
General Datasets
Finance Related Datasets
Super Large Kaggle Datasets
Numpy
TensorFlow
Machine Learning & Deep Learning Tutorials
Interview-Related-Links
Genetic Algorithms
Kaggle Competitions WriteUp
List of Most Starred Github Projects related to Deep Learning
Linear Regression Tutorials
Logistic Regression Tutorials
K Nearest Neighbors Tutorials
Best Courses
Top Machine Learning Podcasts
Most Important Deep Learning Papers
Great Deep Learning Paper Year-wise
Deep Learning Resources
Best Deep Learning Courses
Awesome Deep Learning Projects
Project Ideas for deep learning and general machine learning
Natural Language Project Ideas
Forecasting Project Ideas
Recommendation systems Project Ideas
Some of the Best Kaggle Competitions for Beginners
Kaggle Strategies and skills

If you are new to Data Science

Preview	Description
	Key differences of a data scientist vs. data engineer
	A visual guide to Becoming a Data Scientist in 8 Steps by DataCamp (img)
	Mindmap on required skills (img)
	Swami Chandrasekaran made a Curriculum via Metro map.
	by @kzawadz via twitter
	By Data Science Central
	From this article by Berkeley Science Review.
	Data Science Wars: R vs Python
	By Data Science Central
	From this article by Berkeley Science Review.
	Data Science Wars: R vs Python

[↑] Back to top

Natural Language Processing (NLP)

Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP) and more.
The-NLP-Pandect-was created to help you find almost anything related to NLP that is available online.

[↑] Back to top

Statistics

Jupyter notebooks for the book "The Elements of Statistical Learning"

[↑] Back to top

Blogs and other Community based resources

[↑] Back to top

ML Math

[↑] Back to top

Inspirational Stories of people breaking into Machine Learning and Data-Science

[↑] Back to top

General Datasets

MNIST Handwritten digits
Google House Numbers from street view
CIFAR-10 and CIFAR-100
Academic Torrents
hadoopilluminated.com
data.gov - The home of the U.S. Government's open data
IMAGENET
Tiny Images 80 Million tiny images6.
Flickr Data 100 Million Yahoo dataset
Berkeley Segmentation Dataset 500
UC Irvine Machine Learning Repository
Flickr 8k
Flickr 30k
Microsoft COCO
VQA
enigma.com - Navigate the world of public data - Quickly search and analyze billions of public records published by governments, companies and organizations.
datahub.io
United States Census Bureau
Image QA
AT&T Laboratories Cambridge face database
AVHRR Pathfinder
usgovxml.com
aws.amazon.com/datasets
Air Freight - The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160x120 pixels). (Formats: PNG)
Amsterdam Library of Object Images - ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png)
Annotated face, hand, cardiac & meat images - Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf)
Image Analysis and Computer Graphics
Brown University Stimuli - A variety of datasets including geons, objects, and "greebles". Good for testing recognition algorithms. (Formats: pict)
CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG)
Machine Vision Unit
CCITT Fax standard images - 8 images (Formats: gif)
CMU CIL's Stereo Data with Ground Truth - 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff)
CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illuminations conditions, and with 4 different expressions.
CMU VASC Image Database - Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage)
Caltech Image Database - about 20 images - mostly top-down views of small objects and toys. (Formats: GIF)
Columbia-Utrecht Reflectance and Texture Database - Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp)
Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff)
Computational Vision Lab
Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg)
Efficient Content-based Retrieval Group
Densely Sampled View Spheres - Densely sampled view spheres - upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff)
Computer Science VII (Graphical Systems)
Digital Embryos - Digital embryos are novel objects which may be used to develop and test object recognition systems. They have an organic appearance. (Formats: various formats are available on request)
Univerity of Minnesota Vision Lab
El Salvador Atlas of Gastrointestinal VideoEndoscopy - Images and Videos of his-res of studies taken from Gastrointestinal Video endoscopy. (Formats: jpg, mpg, gif)
FG-NET Facial Aging Database - Database contains 1002 face images showing subjects at different ages. (Formats: jpg)
FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all).
Biometric Systems Lab - University of Bologna
Face and Gesture images and image sequences - Several image datasets of faces and gestures that are ground truth annotated for benchmarking
German Fingerspelling Database - The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. (Formats: mpg,jpg)
Language Processing and Pattern Recognition
Groningen Natural Image Database - 4000+ 1536x1024 (16 bit) calibrated outdoor images (Formats: homebrew)
ICG Testhouse sequence - 2 turntable sequences from ifferent viewing heights, 36 images each, resolution 1000x750, color (Formats: PPM)
Institute of Computer Graphics and Vision
IEN Image Library - 1000+ images, mostly outdoor sequences (Formats: raw, ppm)
INRIA's Syntim images database - 15 color image of simple objects (Formats: gif)
INRIA
INRIA's Syntim stereo databases - 34 calibrated color stereo pairs (Formats: gif)
Image Analysis Laboratory - Images obtained from a variety of imaging modalities -- raw CFA images, range images and a host of "medical images". (Formats: homebrew)
Image Analysis Laboratory
Image Database - An image database including some textures
JAFFE Facial Expression Image Database - The JAFFE database consists of 213 images of Japanese female subjects posing 6 basic facial expressions as well as a neutral pose. Ratings on emotion adjectives are also available, free of charge, for research purposes. (Formats: TIFF Grayscale images.)
ATR Research, Kyoto, Japan
JISCT Stereo Evaluation - 44 image pairs. These data have been used in an evaluation of stereo analysis, as described in the April 1993 ARPA Image Understanding Workshop paper ``The JISCT Stereo Evaluation'' by R.C.Bolles, H.H.Baker, and M.J.Hannah, 263--274 (Formats: SSI)
MIT Vision Texture - Image archive (100+ images) (Formats: ppm)
MIT face images and more - hundreds of images (Formats: homebrew)
Machine Vision - Images from the textbook by Jain, Kasturi, Schunck (20+ images) (Formats: GIF TIFF)
Mammography Image Databases - 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
ftp://ftp.cps.msu.edu/pub/prip - many images (Formats: unknown)
Middlebury Stereo Data Sets with Ground Truth - Six multi-frame stereo data sets of scenes containing planar regions. Each data set contains 9 color images and subpixel-accuracy ground-truth data. (Formats: ppm)
Middlebury Stereo Vision Research Page - Middlebury College
Modis Airborne simulator, Gallery and data set - High Altitude Imagery from around the world for environmental modeling in support of NASA EOS program (Formats: JPG and HDF)
NIST Fingerprint and handwriting - datasets - thousands of images (Formats: unknown)
NIST Fingerprint data - compressed multipart uuencoded tar file
NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg)
National Design Repository - Over 55,000 3D CAD and solid models of (mostly) mechanical/machined engineering designs. (Formats: gif,vrml,wrl,stp,sat)
Geometric & Intelligent Computing Laboratory
OSU (MSU) 3D Object Model Database - several sets of 3D object models collected over several years to use in object recognition research (Formats: homebrew, vrml)
OSU (MSU/WSU) Range Image Database - Hundreds of real and synthetic images (Formats: gif, homebrew)
OSU/SAMPL Database: Range Images, 3D Models, Stills, Motion Sequences - Over 1000 range images, 3D object models, still images and motion sequences (Formats: gif, ppm, vrml, homebrew)
Signal Analysis and Machine Perception Laboratory
Otago Optical Flow Evaluation Sequences - Synthetic and real sequences with machine-readable ground truth optical flow fields, plus tools to generate ground truth for new sequences. (Formats: ppm,tif,homebrew)
Vision Research Group
ftp://ftp.limsi.fr/pub/quenot/opflow/testdata/piv/ - Real and synthetic image sequences used for testing a Particle Image Velocimetry application. These images may be used for the test of optical flow and image matching algorithms. (Formats: pgm (raw))
LIMSI-CNRS/CHM/IMM/vision
LIMSI-CNRS
Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF)
SEQUENCES FOR OPTICAL FLOW ANALYSIS (SOFA) - 9 synthetic sequences designed for testing motion analysis applications, including full ground truth of motion and camera parameters. (Formats: gif)
Computer Vision Group
Sequences for Flow Based Reconstruction - synthetic sequence for testing structure from motion algorithms (Formats: pgm)
Stereo Images with Ground Truth Disparity and Occlusion - a small set of synthetic images of a hallway with varying amounts of noise added. Use these images to benchmark your stereo algorithm. (Formats: raw, viff (khoros), or tiff)
Stuttgart Range Image Database - A collection of synthetic range images taken from high-resolution polygonal models available on the web (Formats: homebrew)
Department Image Understanding
The AR Face Database - Contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Frontal views with variations in facial expressions, illumination, and occlusions. (Formats: RAW (RGB 24-bit))
Purdue Robot Vision Lab
The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg)
The RVL SPEC-DB (SPECularity DataBase) - A collection of over 300 real images of 100 objects taken under three different illuminaiton conditions (Diffuse/Ambient/Directed). -- Use these images to test algorithms for detecting and compensating specular highlights in color images. (Formats: TIFF )
Robot Vision Laboratory
The Xm2vts database - The XM2VTSDB contains four digital recordings of 295 people taken over a period of four months. This database contains both image and video data of faces.
Centre for Vision, Speech and Signal Processing
Traffic Image Sequences and 'Marbled Block' Sequence - thousands of frames of digitized traffic image sequences as well as the 'Marbled Block' sequence (grayscale images) (Formats: GIF)
IAKS/KOGS
U Bern Face images - hundreds of images (Formats: Sun rasterfile)
U Michigan textures (Formats: compressed raw)
U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm)
UCID - an Uncompressed Colour Image Database - a benchmark database for image retrieval with predefined ground truth. (Formats: tiff)
UMass Vision Image Archive - Large image database with aerial, space, stereo, medical images and more. (Formats: homebrew)
UNC's 3D image database - many images (Formats: GIF)
USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage)
University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person.
Machine Vision and Media Processing Unit
University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv)
Machine Vision Group
Usenix face database - Thousands of face images from many different sites (circa 994)
View Sphere Database - Images of 8 objects seen from many different view points. The view sphere is sampled using a geodesic with 172 images/sphere. Two sets for training and testing are available. (Formats: ppm)
PRIMA, GRAVIR
Vision-list Imagery Archive - Many images, many formats
Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions. (Formats: jpg)
3D Vision Group
Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.
Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM)
Center for Computational Vision and Control
DeepMind QA Corpus - Textual QA corpus from CNN and DailyMail. More than 300K documents in total. Paper for reference.
YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
Open Images dataset - Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
Visual Object Classes Challenge 2012 (VOC2012) - VOC2012 dataset containing 12k images with 20 annotated classes for object detection and segmentation.
Fashion-MNIST - MNIST like fashion product dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
Large-scale Fashion (DeepFashion) Database - Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks
FakeNewsCorpus - Contains about 10 million news articles classified using opensources.co types
databib.org
datacite.org
quandl.com - Get the data you need in the form you want; instant download, API or direct to your app.
figshare.com
GeoLite Legacy Downloadable Databases
Quora's Big Datasets Answer
Public Big Data Sets
Houston Data Portal
Kaggle Data Sources
A Deep Catalog of Human Genetic Variation
A community-curated database of well-known people, places, and things
Google Public Data
World Bank Data
NYC Taxi data
Open Data Philly Connecting people with data for Philadelphia
A list of useful sources A blog post includes many data set databases
grouplens.org Sample movie (with ratings), book and wiki datasets
UC Irvine Machine Learning Repository - contains data sets good for machine learning

[↑] Back to top

Finance Related Datasets

[↑] Back to top

Super Large Kaggle Datasets

[↑] Back to top

Numpy

IPython Notebook(s) demonstrating NumPy functionality.

Notebook	Description
numpy	Adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Introduction-to-NumPy	Introduction to NumPy.
Understanding-Data-Types	Learn about data types in Python.
The-Basics-Of-NumPy-Arrays	Learn about the basics of NumPy arrays.
Computation-on-arrays-ufuncs	Learn about computations on NumPy arrays: universal functions.
Computation-on-arrays-aggregates	Learn about aggregations: min, max, and everything in between in NumPy.
Computation-on-arrays-broadcasting	Learn about computation on arrays: broadcasting in NumPy.
Boolean-Arrays-and-Masks	Learn about comparisons, masks, and boolean logic in NumPy.
Fancy-Indexing	Learn about fancy indexing in NumPy.
Sorting	Learn about sorting arrays in NumPy.
Structured-Data-NumPy	Learn about structured data: NumPy's structured arrays.

[↑] Back to top

TensorFlow

[↑] Back to top

Machine Learning & Deep Learning Tutorials

Awesome Artificial Intelligence (GitHub Repo)

[↑] Back to top

Interview-Related-Links

[↑] Back to top

Genetic Algorithms

[↑] Back to top

Kaggle Competitions WriteUp

[↑] Back to top

List of Most Starred Github Projects related to Deep Learning

Project Name	Stars	Description
tensorflow	146k	An Open Source Machine Learning Framework for Everyone
keras	48.9k	Deep Learning for humans
opencv	46.1k	Open Source Computer Vision Library
pytorch	40k	Tensors and Dynamic neural networks in Python with strong GPU acceleration
TensorFlow-Examples	38.1k	TensorFlow Tutorial and Examples for Beginners (support TF v1 & v2)
tesseract	35.3k	Tesseract Open Source OCR Engine (main repository)
face_recognition	35.2k	The world's simplest facial recognition api for Python and the command line
faceswap	31.4k	Deepfakes Software For All
transformers	30.4k	🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
100-Days-Of-ML-Code	29.1k	100 Days of ML Coding
julia	28.1k	The Julia Language: A fresh approach to technical computing.
awesome-scalability	26.6k	The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
basics	24.5k	📚 Learn ML with clean code, simplified math and illustrative visuals.
bert	23.9k	TensorFlow code and pre-trained models for BERT
xgboost	19.4k	Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow
Real-Time-Voice-Cloning	18.4k	Clone a voice in 5 seconds to generate arbitrary speech in real-time
openpose	17.8k	OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation
Qix	13.3k	Machine Learning、Deep Learning、PostgreSQL、Distributed System、Node.Js、Golang
spleeter	12.7k	Deezer source separation library including pretrained models.
Virgilio	12.7k	Your new Mentor for Data Science E-Learning.

[↑] Back to top

Linear Regression Tutorials

General

[↑] Back to top

Logistic Regression Tutorials

[↑] Back to top

K Nearest Neighbors Tutorials

[↑] Back to top

Best Courses

Stanford CS 231N - CNNs
Stanford CS 224D - Deep Learning for NLP
Hugo Larochelle's Neural Networks Course
David Silver's Reinforcement Learning Course
Andrew Ng Machine Learning Course
Stanford CS 229 - Pretty much the same as the Coursera course
UC Berkeley Kaggle Decal
Short MIT Intro to DL Course
Udacity Deep Learning
Deep Learning School Montreal 2016 and 2017
Intro to Neural Nets and ML (Univ of Toronto)
Deep RL Bootcamp
CMU Neural Networks for NLP
Bay Area Deep Learning School Day 1 2016 and Day 2
Introduction to Deep Learning MIT Course
Caltech CS 156 - Machine Learning
Berkeley EE 227C - Convex Optimization

Top Machine Learning Podcasts

Most Important Deep Learning Papers

Some of papers thought to be influential in getting deep learning ecosystem. I further added a couple important general ML papers to these list.

Great Deep Learning Paper Year-wise

2020 year

Deep Learning Resources

Best Deep Learning Courses

Awesome Deep Learning Projects

Project Ideas for deep learning and general machine learning

List of great many projects with accomppanyting code
Bing Coronavirus
- Classify Bing Queries as either specific (e.g. about a specific location) or generic. You might have to figure out a more exact definition of specific or generic though
- Dataset: BingCoronavirusQuerySet
Covid Clinical Data
- Rank and sort high risk patients using clinical data. Pick an interpretable approach if you can.
- Dataset: CovidClinicalData

If you haven't already, checkout Kaggle's Covid19 Section as well. It has datasets and ideas both.

Autonomous Tagging of StackOverflow Questions
- Make a multi-label classification system that automatically assigns tags for questions posted on a forum such as StackOverflow or Quora.
- Dataset: StackLite or 10% sample
Keyword/Concept identification
- Identify keywords from millions of questions
- Dataset: StackOverflow question samples by Facebook
Topic identification
- Multi-label classification of printed media articles to topics
- Dataset: Greek Media monitoring multi-label classification

Natural Language Project Ideas

Automated essay grading
- The purpose of this project is to implement and train machine learning algorithms to automatically assess and grade essay responses.
- Dataset: Essays with human graded scores

Sentence to Sentence semantic similarity
- Can you identify question pairs that have the same intent or meaning?
- Dataset: Quora question pairs with similar questions marked

Fight online abuse
- Can you confidently and accurately tell whether a particular comment is abusive?
- Dataset: Toxic comments on Kaggle
Open Domain question answering
- Can you build a bot which answers questions according to the student's age or her curriculum?
- Facebook's FAIR is built in a similar way for Wikipedia.
- Dataset: NCERT books for K-12/school students in India, NarrativeQA by Google DeepMind and SQuAD by Stanford

Social Chat/Conversational Bots
- Can you build a bot which talks to you just like people talk on social networking sites?
- Reference: Chat-bot architecture
- Dataset: Reddit Dataset

Automatic text summarization
- Can you create a summary with the major points of the original document?
- Abstractive (write your own summary) and Extractive (select pieces of text from original) are two popular approaches
- Dataset: CNN and DailyMail News Pieces by Google DeepMind

Copy-cat Bot
- Generate plausible new text which looks like some other text
- Obama Speeches? For instance, you can create a bot which writes some new speeches in Obama's style
- Trump Bot? Or a Twitter bot which mimics @realDonaldTrump
- Narendra Modi bot saying "doston"? Start by scrapping off his Hindi speeches from his personal website
- Example Dataset: English Transcript of Modi speeches

Check mlm/blog for some hints.

Sentiment Analysis
- Do Twitter Sentiment Analysis on tweets sorted by geography and timestamp.
- Dataset: Tweets sentiment tagged by humans

De-anonymization
- Can you classify the text of an e-mail message to decide who sent it?
- Dataset: 150,000 Enron emails

Forecasting Project Ideas

Univariate Time Series Forecasting
- How much will it rain this year?
- Dataset: 45 years of rainfall data

Multi-variate Time Series Forecasting
- How polluted will your town's air be? Pollution Level Forecasting
- Dataset: Air Quality dataset

Demand/load forecasting
- Find a short term forecast on electricity consumption of a single home
- Dataset: Electricity consumption of a household

Predict Blood Donation
- We're interested in predicting if a blood donor will donate within a given time window.
- More on the problem statement at Driven Data.
- Dataset: UCI ML Datasets Repo

Recommendation systems Project Ideas

Movie Recommender
- Can you predict the rating a user will give on a movie?
- Do this using the movies that user has rated in the past, as well as the ratings similar users have given similar movies.
- Dataset: Netflix Prize and MovieLens Datasets

Search + Recommendation System
- Predict which Xbox game a visitor will be most interested in based on their search query
- Dataset: BestBuy

Can you predict Influencers in the Social Network?
- How can you predict social influencers?
- Dataset: PeerIndex

Some of the Best Kaggle Competitions for Beginners

Classification :

Titanic (Can't miss it :) ) : https://www.kaggle.com/c/titanic
Forest Cover Type Prediction (Late Submission): https://www.kaggle.com/c/forest-cover-type-prediction
Don't Overfit 2 (Late Submission): https://www.kaggle.com/c/dont-overfit-ii
CareerCon 2019 (Late Submission)(Multiclass): https://www.kaggle.com/c/career-con-2019
IEEE-CIS Fraud Detection (Late Submission): https://www.kaggle.com/c/ieee-fraud-detection
Instant Gratification (Late Submission): https://www.kaggle.com/c/instant-gratification
Categorical Feature Encoding Challenge (Late Submission): https://www.kaggle.com/c/cat-in-the-dat
University of Liverpool - Ion Switching (Late Submission): https://www.kaggle.com/c/liverpool-ion-switching

Binary Classification Tips and tricks

House Prices (Ongoing): https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Bike Sharing Demand (Late Submission): https://www.kaggle.com/c/bike-sharing-demand/data
Predict Future Sales (Ongoing)(Time Series): https://www.kaggle.com/c/competitive-data-science-predict-future-sales
TMDB Box Office Prediction (Late Submission): https://www.kaggle.com/c/tmdb-box-office-prediction
ASHRAE - Great Energy Predictor III (Late Submission): https://www.kaggle.com/c/ashrae-energy-prediction/

Computer Vision :

Jigsaw Toxic Comment Classification Challenge (Late Submission): https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
Jigsaw Unintended Bias in Toxicity Classification (Late Submission): https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification
Jigsaw Multilingual Toxic Comment Classification (Late Submission): https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
Real or Not? NLP with Disaster Tweets (Ongoing): https://www.kaggle.com/c/nlp-getting-started
Word2Vec for movie reviews (Late Submission): https://www.kaggle.com/c/word2vec-nlp-tutorial
Random Acts of Pizza (Late Submission): https://www.kaggle.com/c/random-acts-of-pizza
Sentiment Analysis movies review (Late Submission): https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
Quora questions classification (Late Submission): https://www.kaggle.com/c/quora-insincere-questions-classification
Top Computer Vision Google Colab Notebooks

Kaggle Strategies and skills

Data preparation

After data exploration, the first thing to do is to use those insights to prepare the data. To tackle issues like class imbalance, encoding categorical data, etc. Let’s see the methods used to do it.

Methods to tackle class imbalance.
Data augmentation by Synthetic Minority Oversampling Technique.
Fast inplace shuffle for augmentation.
Finding synthetic samples in the dataset.
Signal denoising used in signal processing competitions.
Finding patterns of missing data.
Methods to handle missing data.
An overview of various encoding techniques for categorical data.
Building model to predict missing values.
Random shuffling of data to create new synthetic training set.

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your data set. If the size of your data is large, that is 3GB + for kaggle kernels and more basic laptops you could find it difficult to load and process with limited resources. Here is the link to some of the articles and kernels that may be very useful in such situations.

Faster data loading with pandas.
Data compression techniques to reduce the size of data by 70%.
Optimize the memory by reducing the size of some attributes.
Use open-source libraries such as Dask to read and manipulate the data, it performs parallel computing and saves up memory space.
Use cudf.
Convert data to parquet format.
Converting data to feather format.
Reducing memory usage for optimizing RAM.

Data exploration

Data exploration always helps to better understand the data and gain insights from it. Before starting to develop machine learning models, top competitors always read/do a lot of exploratory data analysis for the data. This helps in feature engineering and cleaning of the data.

EDA for microsoft malware detection.
Time Series EDA for malware detection.
Complete EDA for home credit loan prediction.
Complete EDA for Santader prediction.
EDA for VSB Power Line Fault Detection.

Feature engineering

Next, you can check the most popular feature and feature engineering techniques used in these top kaggle competitions.

Target encoding cross validation for better encoding.
Entity embedding to handle categories.
Encoding cyclic features for deep learning.
Manual feature engineering methods.
Automated feature engineering techniques using featuretools.
Top hard crafted features used in microsoft malware detection.
Denoising NN for feature extraction.
Feature engineering using RAPIDS framework.
Things to remember while processing features using LGBM.
Lag features and moving averages.
Principal component analysis for dimensionality reduction.
LDA for dimensionality reduction.
Best hand crafted LGBM features for microsoft malware detection.
Generating frequency features.
Dropping variables with different train and test distribution.
Aggregate time series features for home credit competition.
Time Series features used in home credit default risk.
Scale, Standardize and normalize with sklearn.
Handcrafted features for Home default risk competition.
Handcrafted features used in Santander Transaction Prediction.

Feature selection

After generating many features from your data, you need to decide which all features to use in your model to get the maximum performance out of your model. This step also includes identifying the impact each feature is having on your model. Let’s see some of the most popular feature selection methods.

Six ways to do features selection using sklearn.
Permutation feature importance.
Adversarial feature validation.
Feature selection using null importances.
Tree explainer using SHAP.
DeepNN explainer using SHAP.

Modeling

After handcrafting and selecting your features, you should choose the right Machine learning algorithm to make your prediction. These are the collection of some of the most used ML models in structured data classification challenges.

rohan-paul/Awesome-Machine-Learning-DataScience_Resources

Table of Contents

If you are new to Data Science

Natural Language Processing (NLP)

Statistics

Blogs and other Community based resources

ML Math

Inspirational Stories of people breaking into Machine Learning and Data-Science

General Datasets

Finance Related Datasets

Super Large Kaggle Datasets

Numpy

TensorFlow

Machine Learning & Deep Learning Tutorials

Interview-Related-Links

Genetic Algorithms

Kaggle Competitions WriteUp

List of Most Starred Github Projects related to Deep Learning

Linear Regression Tutorials

Logistic Regression Tutorials

K Nearest Neighbors Tutorials

Best Courses

Top Machine Learning Podcasts

Most Important Deep Learning Papers

Great Deep Learning Paper Year-wise

Deep Learning Resources

Best Deep Learning Courses

Awesome Deep Learning Projects

Project Ideas for deep learning and general machine learning

Natural Language Project Ideas

Forecasting Project Ideas

Recommendation systems Project Ideas

Some of the Best Kaggle Competitions for Beginners

Kaggle Strategies and skills