/Awesome-Machine-Learning-DataScience_Resources

A curated collection of free online resources for serious learning of Machine Learning and Data Science.

Table of Contents

  1. If you are new to Data Science
  2. Natural Language Processing (NLP)
  3. Statistics
  4. Blogs and other Community based resources
  5. ML Math
  6. Inspirational Stories of people breaking into Machine Learning and Data-Science
  7. General Datasets
  8. Finance Related Datasets
  9. Super Large Kaggle Datasets
  10. Numpy
  11. TensorFlow
  12. Machine Learning & Deep Learning Tutorials
  13. Interview-Related-Links
  14. Genetic Algorithms
  15. Kaggle Competitions WriteUp
  16. List of Most Starred Github Projects related to Deep Learning
  17. Linear Regression Tutorials
  18. Logistic Regression Tutorials
  19. K Nearest Neighbors Tutorials
  20. Best Courses
  21. Top Machine Learning Podcasts
  22. Most Important Deep Learning Papers
  23. Great Deep Learning Paper Year-wise
  24. Deep Learning Resources
  25. Best Deep Learning Courses
  26. Awesome Deep Learning Projects
  27. Project Ideas for deep learning and general machine learning
  28. Natural Language Project Ideas
  29. Forecasting Project Ideas
  30. Recommendation systems Project Ideas
  31. Some of the Best Kaggle Competitions for Beginners
  32. Kaggle Strategies and skills

[↑] Back to top

If you are new to Data Science

Preview Description
Key differences of a data scientist vs. data engineer
A visual guide to Becoming a Data Scientist in 8 Steps by DataCamp (img)
Mindmap on required skills (img)
Swami Chandrasekaran made a Curriculum via Metro map.
by @kzawadz via twitter
By Data Science Central
From this article by Berkeley Science Review.
Data Science Wars: R vs Python

[↑] Back to top

Natural Language Processing (NLP)

[↑] Back to top

Statistics

[↑] Back to top

Blogs and other Community based resources

[↑] Back to top

ML Math

[↑] Back to top

Inspirational Stories of people breaking into Machine Learning and Data-Science

[↑] Back to top

General Datasets

  1. MNIST Handwritten digits
  2. Google House Numbers from street view
  3. CIFAR-10 and CIFAR-100
  4. Academic Torrents
  5. hadoopilluminated.com
  6. data.gov - The home of the U.S. Government's open data
  7. IMAGENET
  8. Tiny Images - 80 million tiny images
  9. Flickr Data - 100 million Yahoo dataset
  10. Berkeley Segmentation Dataset 500
  11. UC Irvine Machine Learning Repository
  12. Flickr 8k
  13. Flickr 30k
  14. Microsoft COCO
  15. VQA
  16. enigma.com - Navigate the world of public data - Quickly search and analyze billions of public records published by governments, companies and organizations.
  17. datahub.io
  18. United States Census Bureau
  19. Image QA
  20. AT&T Laboratories Cambridge face database
  21. AVHRR Pathfinder
  22. usgovxml.com
  23. aws.amazon.com/datasets
  24. Air Freight - The Air Freight data set is a ray-traced image sequence along with ground truth segmentation based on textural characteristics. (455 images + GT, each 160x120 pixels). (Formats: PNG)
  25. Amsterdam Library of Object Images - ALOI is a color image collection of one-thousand small objects, recorded for scientific purposes. In order to capture the sensory variation in object recordings, we systematically varied viewing angle, illumination angle, and illumination color for each object, and additionally captured wide-baseline stereo images. We recorded over a hundred images of each object, yielding a total of 110,250 images for the collection. (Formats: png)
  26. Annotated face, hand, cardiac & meat images - Most images & annotations are supplemented by various ASM/AAM analyses using the AAM-API. (Formats: bmp,asf)
  27. Image Analysis and Computer Graphics
  28. Brown University Stimuli - A variety of datasets including geons, objects, and "greebles". Good for testing recognition algorithms. (Formats: pict)
  29. CAVIAR video sequences of mall and public space behavior - 90K video frames in 90 sequences of various human activities, with XML ground truth of detection and behavior classification (Formats: MPEG2 & JPEG)
  30. Machine Vision Unit
  31. CCITT Fax standard images - 8 images (Formats: gif)
  32. CMU CIL's Stereo Data with Ground Truth - 3 sets of 11 images, including color tiff images with spectroradiometry (Formats: gif, tiff)
  33. CMU PIE Database - A database of 41,368 face images of 68 people captured under 13 poses, 43 illumination conditions, and with 4 different expressions.
  34. CMU VASC Image Database - Images, sequences, stereo pairs (thousands of images) (Formats: Sun Rasterimage)
  35. Caltech Image Database - about 20 images - mostly top-down views of small objects and toys. (Formats: GIF)
  36. Columbia-Utrecht Reflectance and Texture Database - Texture and reflectance measurements for over 60 samples of 3D texture, observed with over 200 different combinations of viewing and illumination directions. (Formats: bmp)
  37. Computational Colour Constancy Data - A dataset oriented towards computational color constancy, but useful for computer vision in general. It includes synthetic data, camera sensor data, and over 700 images. (Formats: tiff)
  38. Computational Vision Lab
  39. Content-based image retrieval database - 11 sets of color images for testing algorithms for content-based retrieval. Most sets have a description file with names of objects in each image. (Formats: jpg)
  40. Efficient Content-based Retrieval Group
  41. Densely Sampled View Spheres - Densely sampled view spheres - upper half of the view sphere of two toy objects with 2500 images each. (Formats: tiff)
  42. Computer Science VII (Graphical Systems)
  43. Digital Embryos - Digital embryos are novel objects which may be used to develop and test object recognition systems. They have an organic appearance. (Formats: various formats are available on request)
  44. University of Minnesota Vision Lab
  45. El Salvador Atlas of Gastrointestinal VideoEndoscopy - Hi-res images and videos of studies taken from gastrointestinal video endoscopy. (Formats: jpg, mpg, gif)
  46. FG-NET Facial Aging Database - Database contains 1002 face images showing subjects at different ages. (Formats: jpg)
  47. FVC2000 Fingerprint Databases - FVC2000 is the First International Competition for Fingerprint Verification Algorithms. Four fingerprint databases constitute the FVC2000 benchmark (3520 fingerprints in all).
  48. Biometric Systems Lab - University of Bologna
  49. Face and Gesture images and image sequences - Several image datasets of faces and gestures that are ground truth annotated for benchmarking
  50. German Fingerspelling Database - The database contains 35 gestures and consists of 1400 image sequences that contain gestures of 20 different persons recorded under non-uniform daylight lighting conditions. (Formats: mpg,jpg)
  51. Language Processing and Pattern Recognition
  52. Groningen Natural Image Database - 4000+ 1536x1024 (16 bit) calibrated outdoor images (Formats: homebrew)
  53. ICG Testhouse sequence - 2 turntable sequences from different viewing heights, 36 images each, resolution 1000x750, color (Formats: PPM)
  54. Institute of Computer Graphics and Vision
  55. IEN Image Library - 1000+ images, mostly outdoor sequences (Formats: raw, ppm)
  56. INRIA's Syntim images database - 15 color images of simple objects (Formats: gif)
  57. INRIA
  58. INRIA's Syntim stereo databases - 34 calibrated color stereo pairs (Formats: gif)
  59. Image Analysis Laboratory - Images obtained from a variety of imaging modalities -- raw CFA images, range images and a host of "medical images". (Formats: homebrew)
  60. Image Analysis Laboratory
  61. Image Database - An image database including some textures
  62. JAFFE Facial Expression Image Database - The JAFFE database consists of 213 images of Japanese female subjects posing 6 basic facial expressions as well as a neutral pose. Ratings on emotion adjectives are also available, free of charge, for research purposes. (Formats: TIFF Grayscale images.)
  63. ATR Research, Kyoto, Japan
  64. JISCT Stereo Evaluation - 44 image pairs. These data have been used in an evaluation of stereo analysis, as described in the April 1993 ARPA Image Understanding Workshop paper "The JISCT Stereo Evaluation" by R.C. Bolles, H.H. Baker, and M.J. Hannah, pp. 263-274 (Formats: SSI)
  65. MIT Vision Texture - Image archive (100+ images) (Formats: ppm)
  66. MIT face images and more - hundreds of images (Formats: homebrew)
  67. Machine Vision - Images from the textbook by Jain, Kasturi, Schunck (20+ images) (Formats: GIF TIFF)
  68. Mammography Image Databases - 100 or more images of mammograms with ground truth. Additional images available by request, and links to several other mammography databases are provided. (Formats: homebrew)
  69. ftp://ftp.cps.msu.edu/pub/prip - many images (Formats: unknown)
  70. Middlebury Stereo Data Sets with Ground Truth - Six multi-frame stereo data sets of scenes containing planar regions. Each data set contains 9 color images and subpixel-accuracy ground-truth data. (Formats: ppm)
  71. Middlebury Stereo Vision Research Page - Middlebury College
  72. Modis Airborne simulator, Gallery and data set - High Altitude Imagery from around the world for environmental modeling in support of NASA EOS program (Formats: JPG and HDF)
  73. NIST Fingerprint and handwriting - datasets - thousands of images (Formats: unknown)
  74. NIST Fingerprint data - compressed multipart uuencoded tar file
  75. NLM HyperDoc Visible Human Project - Color, CAT and MRI image samples - over 30 images (Formats: jpeg)
  76. National Design Repository - Over 55,000 3D CAD and solid models of (mostly) mechanical/machined engineering designs. (Formats: gif,vrml,wrl,stp,sat)
  77. Geometric & Intelligent Computing Laboratory
  78. OSU (MSU) 3D Object Model Database - several sets of 3D object models collected over several years to use in object recognition research (Formats: homebrew, vrml)
  79. OSU (MSU/WSU) Range Image Database - Hundreds of real and synthetic images (Formats: gif, homebrew)
  80. OSU/SAMPL Database: Range Images, 3D Models, Stills, Motion Sequences - Over 1000 range images, 3D object models, still images and motion sequences (Formats: gif, ppm, vrml, homebrew)
  81. Signal Analysis and Machine Perception Laboratory
  82. Otago Optical Flow Evaluation Sequences - Synthetic and real sequences with machine-readable ground truth optical flow fields, plus tools to generate ground truth for new sequences. (Formats: ppm,tif,homebrew)
  83. Vision Research Group
  84. ftp://ftp.limsi.fr/pub/quenot/opflow/testdata/piv/ - Real and synthetic image sequences used for testing a Particle Image Velocimetry application. These images may be used for the test of optical flow and image matching algorithms. (Formats: pgm (raw))
  85. LIMSI-CNRS/CHM/IMM/vision
  86. LIMSI-CNRS
  87. Photometric 3D Surface Texture Database - This is the first 3D texture database which provides both full real surface rotations and registered photometric stereo data (30 textures, 1680 images). (Formats: TIFF)
  88. SEQUENCES FOR OPTICAL FLOW ANALYSIS (SOFA) - 9 synthetic sequences designed for testing motion analysis applications, including full ground truth of motion and camera parameters. (Formats: gif)
  89. Computer Vision Group
  90. Sequences for Flow Based Reconstruction - synthetic sequence for testing structure from motion algorithms (Formats: pgm)
  91. Stereo Images with Ground Truth Disparity and Occlusion - a small set of synthetic images of a hallway with varying amounts of noise added. Use these images to benchmark your stereo algorithm. (Formats: raw, viff (khoros), or tiff)
  92. Stuttgart Range Image Database - A collection of synthetic range images taken from high-resolution polygonal models available on the web (Formats: homebrew)
  93. Department Image Understanding
  94. The AR Face Database - Contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Frontal views with variations in facial expressions, illumination, and occlusions. (Formats: RAW (RGB 24-bit))
  95. Purdue Robot Vision Lab
  96. The MIT-CSAIL Database of Objects and Scenes - Database for testing multiclass object detection and scene recognition algorithms. Over 72,000 images with 2873 annotated frames. More than 50 annotated object classes. (Formats: jpg)
  97. The RVL SPEC-DB (SPECularity DataBase) - A collection of over 300 real images of 100 objects taken under three different illumination conditions (Diffuse/Ambient/Directed). Use these images to test algorithms for detecting and compensating specular highlights in color images. (Formats: TIFF)
  98. Robot Vision Laboratory
  99. The Xm2vts database - The XM2VTSDB contains four digital recordings of 295 people taken over a period of four months. This database contains both image and video data of faces.
  100. Centre for Vision, Speech and Signal Processing
  101. Traffic Image Sequences and 'Marbled Block' Sequence - thousands of frames of digitized traffic image sequences as well as the 'Marbled Block' sequence (grayscale images) (Formats: GIF)
  102. IAKS/KOGS
  103. U Bern Face images - hundreds of images (Formats: Sun rasterfile)
  104. U Michigan textures (Formats: compressed raw)
  105. U Oulu wood and knots database - Includes classifications - 1000+ color images (Formats: ppm)
  106. UCID - an Uncompressed Colour Image Database - a benchmark database for image retrieval with predefined ground truth. (Formats: tiff)
  107. UMass Vision Image Archive - Large image database with aerial, space, stereo, medical images and more. (Formats: homebrew)
  108. UNC's 3D image database - many images (Formats: GIF)
  109. USF Range Image Data with Segmentation Ground Truth - 80 image sets (Formats: Sun rasterimage)
  110. University of Oulu Physics-based Face Database - contains color images of faces under different illuminants and camera calibration conditions as well as skin spectral reflectance measurements of each person.
  111. Machine Vision and Media Processing Unit
  112. University of Oulu Texture Database - Database of 320 surface textures, each captured under three illuminants, six spatial resolutions and nine rotation angles. A set of test suites is also provided so that texture segmentation, classification, and retrieval algorithms can be tested in a standard manner. (Formats: bmp, ras, xv)
  113. Machine Vision Group
  114. Usenix face database - Thousands of face images from many different sites (circa 1994)
  115. View Sphere Database - Images of 8 objects seen from many different view points. The view sphere is sampled using a geodesic with 172 images/sphere. Two sets for training and testing are available. (Formats: ppm)
  116. PRIMA, GRAVIR
  117. Vision-list Imagery Archive - Many images, many formats
  118. Wiry Object Recognition Database - Thousands of images of a cart, ladder, stool, bicycle, chairs, and cluttered scenes with ground truth labelings of edges and regions. (Formats: jpg)
  119. 3D Vision Group
  120. Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.
  121. Yale Face Database B - 5760 single light source images of 10 subjects each seen under 576 viewing conditions (9 poses x 64 illumination conditions). (Formats: PGM)
  122. Center for Computational Vision and Control
  123. DeepMind QA Corpus - Textual QA corpus from CNN and DailyMail. More than 300K documents in total. Paper for reference.
  124. YouTube-8M Dataset - YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
  125. Open Images dataset - Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
  126. Visual Object Classes Challenge 2012 (VOC2012) - VOC2012 dataset containing 12k images with 20 annotated classes for object detection and segmentation.
  127. Fashion-MNIST - MNIST like fashion product dataset consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
  128. Large-scale Fashion (DeepFashion) Database - Contains over 800,000 diverse fashion images. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks
  129. FakeNewsCorpus - Contains about 10 million news articles classified using opensources.co types
  130. databib.org
  131. datacite.org
  132. quandl.com - Get the data you need in the form you want; instant download, API or direct to your app.
  133. figshare.com
  134. GeoLite Legacy Downloadable Databases
  135. Quora's Big Datasets Answer
  136. Public Big Data Sets
  137. Houston Data Portal
  138. Kaggle Data Sources
  139. A Deep Catalog of Human Genetic Variation
  140. A community-curated database of well-known people, places, and things
  141. Google Public Data
  142. World Bank Data
  143. NYC Taxi data
  144. Open Data Philly - Connecting people with data for Philadelphia
  145. A list of useful sources - A blog post that includes many data set databases
  146. grouplens.org - Sample movie (with ratings), book and wiki datasets
  147. UC Irvine Machine Learning Repository - contains data sets good for machine learning

[↑] Back to top

Finance Related Datasets

[↑] Back to top

Super Large Kaggle Datasets

[↑] Back to top

Numpy

IPython Notebook(s) demonstrating NumPy functionality.

Notebook | Description
numpy | Adds Python support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Introduction-to-NumPy | Introduction to NumPy.
Understanding-Data-Types | Learn about data types in Python.
The-Basics-Of-NumPy-Arrays | Learn about the basics of NumPy arrays.
Computation-on-arrays-ufuncs | Learn about computations on NumPy arrays: universal functions.
Computation-on-arrays-aggregates | Learn about aggregations: min, max, and everything in between in NumPy.
Computation-on-arrays-broadcasting | Learn about computation on arrays: broadcasting in NumPy.
Boolean-Arrays-and-Masks | Learn about comparisons, masks, and boolean logic in NumPy.
Fancy-Indexing | Learn about fancy indexing in NumPy.
Sorting | Learn about sorting arrays in NumPy.
Structured-Data-NumPy | Learn about structured data: NumPy's structured arrays.
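
As a quick taste of the topics these notebooks cover, here is a minimal sketch that touches each one in turn. The arrays and the names in it are made up for illustration and are not taken from the notebooks themselves.

```python
import numpy as np

# A small 2-D array to work with (arbitrary illustration data).
a = np.arange(12).reshape(3, 4)          # rows: [0..3], [4..7], [8..11]

# Universal function (ufunc): element-wise arithmetic without a Python loop.
squared = np.square(a)

# Aggregates: reduce along an axis.
col_max = a.max(axis=0)                  # maximum of each column
row_sum = a.sum(axis=1)                  # sum of each row

# Broadcasting: the per-column mean has shape (4,) and is
# subtracted from every row of the (3, 4) array.
centered = a - a.mean(axis=0)

# Boolean mask: keep only the even entries.
evens = a[a % 2 == 0]

# Fancy indexing: pick rows 2 and 0, in that order.
picked = a[[2, 0]]

# Sorting: argsort returns the indices that would sort the array.
values = np.array([30, 10, 20])
order = np.argsort(values)

# Structured array: named, typed fields in a single array.
people = np.array([("Ann", 31), ("Bob", 25)],
                  dtype=[("name", "U10"), ("age", "i4")])
mean_age = people["age"].mean()
```

Each operation works on whole arrays at once, which is the main habit the notebooks try to build.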

[↑] Back to top

TensorFlow

[↑] Back to top

Machine Learning & Deep Learning Tutorials

[↑] Back to top

Interview-Related-Links

[↑] Back to top

Genetic Algorithms

[↑] Back to top

Kaggle Competitions WriteUp

[↑] Back to top

List of Most Starred Github Projects related to Deep Learning

Project Name | Stars | Description
tensorflow | 146k | An Open Source Machine Learning Framework for Everyone
keras | 48.9k | Deep Learning for humans
opencv | 46.1k | Open Source Computer Vision Library
pytorch | 40k | Tensors and Dynamic neural networks in Python with strong GPU acceleration
TensorFlow-Examples | 38.1k | TensorFlow Tutorial and Examples for Beginners (support TF v1 & v2)
tesseract | 35.3k | Tesseract Open Source OCR Engine (main repository)
face_recognition | 35.2k | The world's simplest facial recognition api for Python and the command line
faceswap | 31.4k | Deepfakes Software For All
transformers | 30.4k | 🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
100-Days-Of-ML-Code | 29.1k | 100 Days of ML Coding
julia | 28.1k | The Julia Language: A fresh approach to technical computing.
awesome-scalability | 26.6k | The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
basics | 24.5k | 📚 Learn ML with clean code, simplified math and illustrative visuals.
bert | 23.9k | TensorFlow code and pre-trained models for BERT
xgboost | 19.4k | Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow
Real-Time-Voice-Cloning | 18.4k | Clone a voice in 5 seconds to generate arbitrary speech in real-time
openpose | 17.8k | OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation
Qix | 13.3k | Machine Learning、Deep Learning、PostgreSQL、Distributed System、Node.Js、Golang
spleeter | 12.7k | Deezer source separation library including pretrained models.
Virgilio | 12.7k | Your new Mentor for Data Science E-Learning.

[↑] Back to top

Linear Regression Tutorials

[↑] Back to top

Logistic Regression Tutorials

[↑] Back to top

K Nearest Neighbors Tutorials

[↑] Back to top

Best Courses

Top Machine Learning Podcasts

Most Important Deep Learning Papers

Some of the papers thought to be influential in shaping the deep learning ecosystem. I have also added a couple of important general ML papers to this list.

Great Deep Learning Paper Year-wise

Deep Learning Resources

Best Deep Learning Courses

Awesome Deep Learning Projects

Project Ideas for deep learning and general machine learning

If you haven't already, check out Kaggle's COVID-19 section as well. It has both datasets and ideas.

Natural Language Project Ideas

  • Automated essay grading
    • The purpose of this project is to implement and train machine learning algorithms to automatically assess and grade essay responses.
    • Dataset: Essays with human graded scores
  • Sentence to Sentence semantic similarity
    • Can you identify question pairs that have the same intent or meaning?
    • Dataset: Quora question pairs with similar questions marked
  • Social Chat/Conversational Bots
  • Automatic text summarization
    • Can you create a summary with the major points of the original document?
    • Abstractive (write your own summary) and Extractive (select pieces of text from original) are two popular approaches
    • Dataset: CNN and DailyMail News Pieces by Google DeepMind

Check mlm/blog for some hints.

  • De-anonymization
    • Can you classify the text of an e-mail message to decide who sent it?
    • Dataset: 150,000 Enron emails

Forecasting Project Ideas

  • Multi-variate Time Series Forecasting
    • How polluted will your town's air be? Pollution Level Forecasting
    • Dataset: Air Quality dataset
  • Predict Blood Donation
    • We're interested in predicting if a blood donor will donate within a given time window.
    • More on the problem statement at Driven Data.
    • Dataset: UCI ML Datasets Repo

Recommendation systems Project Ideas

  • Movie Recommender
    • Can you predict the rating a user will give to a movie?
    • Do this using the movies that user has rated in the past, as well as the ratings similar users have given similar movies.
    • Dataset: Netflix Prize and MovieLens Datasets
  • Search + Recommendation System
    • Predict which Xbox game a visitor will be most interested in based on their search query
    • Dataset: BestBuy
  • Can you predict Influencers in the Social Network?
    • How can you predict social influencers?
    • Dataset: PeerIndex
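
For the Movie Recommender idea above, one simple starting point is user-based collaborative filtering: predict a rating as the similarity-weighted average of what other users gave the same movie. The sketch below is only an illustration under stated assumptions. The ratings matrix `R` is invented, `predict_rating` is a hypothetical helper, and unrated movies are stored as 0 and treated as zeros in the cosine similarity, which is a simplification. It also assumes every user has rated at least one movie.

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are movies, 0 means "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def predict_rating(R, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    rated = R[:, movie] > 0            # users who rated this movie
    rated[user] = False                # never use the target user's own entry
    if not rated.any():
        return float("nan")
    # Cosine similarity between the target user and every user.
    norms = np.linalg.norm(R, axis=1)
    sims = (R @ R[user]) / (norms * norms[user])
    weights = sims[rated]
    if weights.sum() == 0:
        return float("nan")
    return float(weights @ R[rated, movie] / weights.sum())

# User 0 has not rated movie 2; estimate the rating from the other users.
pred = predict_rating(R, user=0, movie=2)
```

Here user 0 rates like user 1, who disliked movie 2, so the prediction lands at the low end of the scale. On the Netflix Prize or MovieLens data you would likely want a sparse representation and mean-centered ratings rather than this dense toy version.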

Some of the Best Kaggle Competitions for Beginners

Classification:

Binary Classification Tips and tricks

Computer Vision:

Kaggle Strategies and skills

Data preparation

After data exploration, the first thing to do is to use those insights to prepare the data and tackle issues such as class imbalance and encoding categorical data. Let's see the methods used to do it.
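
A minimal sketch of two common preparation steps, assuming pandas is available: one-hot encoding a categorical column and computing "balanced" class weights for an imbalanced target. The toy DataFrame and its column names are hypothetical. The weight formula n_samples / (n_classes * class_count) is one widely used heuristic, not the only option.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and an imbalanced binary target.
df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "LA", "SF", "NY", "LA", "NY"],
    "target": [0, 0, 0, 0, 0, 0, 1, 1],
})

# One-hot encode the categorical column into city_LA, city_NY, city_SF.
encoded = pd.get_dummies(df, columns=["city"])

# Balanced class weights: n_samples / (n_classes * count_of_class),
# so the rare class gets a proportionally larger weight.
counts = df["target"].value_counts()
class_weight = (len(df) / (len(counts) * counts)).to_dict()

# Per-row weights, in the form accepted by estimators that take
# a sample_weight argument when fitting.
sample_weight = df["target"].map(class_weight).to_numpy()
```

With six negatives and two positives, the positive class ends up weighted three times as heavily as the negative class, so both classes contribute equally to a weighted loss.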

Dealing with larger datasets

One issue you might face in any machine learning competition is the size of your data set. If your data is large (3 GB or more for Kaggle kernels and more basic laptops), you may find it difficult to load and process with limited resources. Here are links to some articles and kernels that may be very useful in such situations.
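
Two tricks that often help, sketched here under the assumption that the data is a CSV read with pandas: read the file in chunks, and shrink each column to a smaller dtype as it arrives. The in-memory `csv_text` stands in for a real file and `downcast` is a hypothetical helper, not a pandas function.

```python
import io
import pandas as pd

# Stand-in for a large CSV file; in practice you would pass a file path instead.
csv_text = "id,price,category\n" + "\n".join(
    f"{i},{i * 0.5},{'a' if i % 2 else 'b'}" for i in range(1000)
)

def downcast(frame: pd.DataFrame) -> pd.DataFrame:
    """Shrink numeric columns to the smallest dtype that holds their values
    and store every other column as a categorical (this only pays off when
    the column has few distinct values)."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        else:
            out[col] = out[col].astype("category")
    return out

# Read 250 rows at a time and shrink each chunk before keeping it.
chunks = [downcast(chunk)
          for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250)]
small = pd.concat(chunks, ignore_index=True)

# Compare against a plain read of the same data.
full = pd.read_csv(io.StringIO(csv_text))
before = full.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

Comparing `before` and `after` shows how much memory the smaller dtypes save. Downcasting floats to 32 bits loses precision, so check that this is acceptable for your features first.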

Data exploration

Data exploration helps you understand the data and gain insights from it. Before developing machine learning models, top competitors read and perform a lot of exploratory data analysis. This helps with feature engineering and data cleaning.

Feature engineering

Next, you can check the most popular features and feature engineering techniques used in top Kaggle competitions.

Feature selection

After generating many features from your data, you need to decide which features to use to get the maximum performance out of your model. This step also includes identifying the impact each feature has on your model. Let's see some of the most popular feature selection methods.
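
As one simple illustration, a filter method scores each feature on its own and keeps the top k. The sketch below ranks features by absolute Pearson correlation with the target, using synthetic data invented for the example. It only captures linear, one-feature-at-a-time relationships, so treat it as a first pass rather than a full selection procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: features 0 and 2 drive the target, features 1 and 3 are pure noise.
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

def correlation_scores(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

scores = correlation_scores(X, y)

# Keep the indices of the k highest-scoring features, in ascending index order.
k = 2
selected = np.sort(np.argsort(scores)[::-1][:k])
X_selected = X[:, selected]
```

The scores also give a rough view of each feature's impact, though model-based measures such as permutation importance are usually more faithful to what the final model actually uses.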

Modeling

After handcrafting and selecting your features, you should choose the right machine learning algorithm to make your predictions. This is a collection of some of the most used ML models in structured-data classification challenges.