MIT License

awesome-datasets

Curated datasets for machine learning tasks, organized by use case. Adapted from a now-defunct article on Kaggle.

For each type of analysis think about:

  • What problem does it solve and for whom?
  • How is it being solved today?
  • What are the data inputs and where do they come from?
  • What are the outputs and how are they consumed: online models, static reports, or dynamic reports?
  • Is it a revenue leakage (“saves us money”) or a revenue growth (“makes us money”) problem?

Use Cases By Functions and Verticals

Marketing

Demand Forecasting

Forecast volumes of sales, inventory needed, etc.

Predicting Lifetime Value / Recency-Frequency Matrix

Identify the most lucrative and loyal segments of your customers

  • Lifetimes - Synthetic data and library for calculating CLV
  • CDNow - CDNow transaction records
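
As a rough illustration of the recency-frequency idea, the sketch below summarizes raw transaction records per customer. The `(customer_id, purchase_date, amount)` tuple layout, the `rfm_summary` name, and the sample rows are assumptions made for this example and are not taken from CDNow or Lifetimes. Libraries such as Lifetimes may define these quantities differently (for example, counting only repeat purchases), so treat the definitions here as one plain RFM convention.

```python
from collections import defaultdict
from datetime import date


def rfm_summary(transactions, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    transactions: iterable of (customer_id, purchase_date, amount) tuples.
    as_of: the date on which the analysis is run.
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, purchase_date, amount in transactions:
        frequency[customer_id] += 1        # number of purchases
        monetary[customer_id] += amount    # total amount spent
        previous = last_purchase.get(customer_id)
        if previous is None or purchase_date > previous:
            last_purchase[customer_id] = purchase_date
    return {
        cid: ((as_of - last_purchase[cid]).days, frequency[cid], monetary[cid])
        for cid in last_purchase
    }


# Hypothetical transactions, for illustration only.
transactions = [
    ("alice", date(2024, 1, 5), 20.0),
    ("alice", date(2024, 3, 1), 35.0),
    ("bob", date(2024, 2, 10), 15.0),
]
summary = rfm_summary(transactions, as_of=date(2024, 3, 31))
```

Customers with low recency and high frequency are the usual candidates for the "loyal" segment; the thresholds are a business decision rather than something this sketch decides.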

Churn / Up-sell

Identify the characteristics and timing of customer churn and upgrades in order to prevent or encourage them

Customer Segmentation

Identify main customer clusters and their characteristics

Product Grouping / Category Tree

Group products into the most sensible category tree

Cross-selling / Recommendation / Market Basket Analysis

Identify which products a customer is going to buy based on past purchases
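
One common starting point for market basket analysis is to score item pairs by support, confidence, and lift. The sketch below does this for pairs only; the `pair_rules` name and the sample baskets are hypothetical, and a real project would more likely use a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets):
    """Return {(a, b): (support, confidence, lift)} for each rule a -> b."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]       # P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            rules[(x, y)] = (support, confidence, lift)
    return rules


# Hypothetical baskets, for illustration only.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread"],
    ["jam"],
]
rules = pair_rules(baskets)
```

A lift above 1 suggests the two items appear together more often than independence would predict, which is the usual signal for a cross-sell candidate.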

Explicit Ratings

Implicit Ratings

Channel Attribution and Optimization

Allocate credit fairly across all ad channels and build a portfolio for your ad spending
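
There are many ways to define a "fair" split. The sketch below uses linear attribution, one of the simplest heuristics, in which every touchpoint on a converting path receives an equal share of that conversion. The `linear_attribution` name and the sample paths are assumptions for illustration; other schemes (last-touch, Shapley-value, or Markov-chain attribution) can give quite different answers.

```python
from collections import defaultdict


def linear_attribution(paths):
    """Split each conversion's credit equally across the touchpoints in its path."""
    credit = defaultdict(float)
    for path in paths:
        if not path:
            continue  # a conversion with no recorded touchpoints gets no credit
        share = 1.0 / len(path)
        for channel in path:
            credit[channel] += share
    return dict(credit)


# Hypothetical converting paths, one list of channels per conversion.
paths = [
    ["search", "email"],
    ["social", "search", "email", "search"],
    ["email"],
]
credit = linear_attribution(paths)
```

The credits sum to the number of conversions, so they can be compared directly against per-channel spend.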

Ad Optimization

Predict and price impressions, clicks, conversions, or any other performance metric for ads

Ad Fraud

Detect ad click and install fraud

Dynamic Pricing

Set optimal prices for growth, profit, customer retention, etc.

Store Layout Optimization

Find the optimal store or website layout for growth, profit, customer retention, etc.

Customer Feedback

Text classification to determine customer sentiment about your products from their feedback

Customer Support

Question Answering

Generate natural language answers based on given context and questions

  • SQuAD - Stanford Question Answering Dataset

Wait Time Prediction

Predict wait time based on customer history, time of day, call volumes, products owned, churn risk, LTV, etc.

Human Resources

Resume Screening

Score candidates based on resumes and internal records

Employee Churn

Predict which employees are most likely to leave

Healthcare

Medical Image Classification

Classify medical images according to conditions

Readmission Risk

Predict risk of readmission based on patient attributes, medical history, diagnosis, and treatment

Patient Report Summary

Generate natural language reports based on tabular data

Automated Triage

Classify patients according to their initial complaints

Hospital Operations Management

Predict and optimize operating theatre and bed occupancy based on initial patient visits

Real-time Patient Monitoring

Activity monitoring of patients

  • OPPORTUNITY - Dataset for Human Activity Recognition from Wearable, Object, and Ambient Sensors
  • PAMAP2 - Physical Activity Monitoring Data Set

Survival Analysis

Predict survival rates of patients
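
A standard non-parametric way to estimate survival rates from follow-up data is the Kaplan-Meier estimator. The sketch below is a minimal pure-Python version; the `kaplan_meier` name and the sample durations are hypothetical, and it leaves out confidence intervals and other features that a dedicated survival analysis library would provide.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient.
    observed: True if the event occurred at that time, False if censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) that happen exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data: five patients, two of them censored.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Censored patients still count toward the at-risk set until their last follow-up, which is what distinguishes this estimate from a naive "fraction still alive" calculation.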

Dosage Effectiveness

Analyse the effects of administering different types and dosages of medication for a disease

Media

News Summary

Generate short summaries of news articles

Insurance

Claim Prediction

Predict timing and size of claims

Claim Fraud

Outlier detection for insurance claim fraud
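
As a very small example of outlier detection on claim amounts, the sketch below flags values with a large modified z-score, computed from the median and the median absolute deviation (MAD). The `mad_outliers` name and the sample claims are hypothetical; the 0.6745 constant and the 3.5 cutoff are the conventional choices for this score, not values from any dataset listed here. Real fraud detection would use many more features than the claim amount.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical claim amounts, for illustration only.
claims = [1200, 1350, 1100, 1280, 1320, 1250, 9800]
flagged = mad_outliers(claims)
```

Using the median rather than the mean keeps a single extreme claim from masking itself by inflating the scale estimate.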

Policy Prediction

Predict type of insurance

Finance

Credit Scoring / Loan Approval / Debt Recovery

Predict which customers are going to default

Portfolio Optimization

Optimize portfolio of assets according to risks and returns

  • quantmod - library for financial modeling in R; APIs for downloading fundamental and technical data
  • Stanford EE103 - Popular ETFs from 2006 to 2016
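
To make the risk side of this concrete, the sketch below computes the minimum-variance weights for a two-asset portfolio from the closed-form solution. The function names and the variance and covariance figures are made up for illustration and do not come from the datasets above. It ignores expected returns and allows short positions, so it is a simplification of full mean-variance optimization.

```python
def min_variance_weights(var_a, var_b, cov_ab):
    """Return weights (w_a, w_b), summing to 1, that minimize portfolio variance."""
    denominator = var_a + var_b - 2.0 * cov_ab
    if denominator == 0:
        raise ValueError("every weighting gives the same variance")
    w_a = (var_b - cov_ab) / denominator
    return w_a, 1.0 - w_a


def portfolio_variance(w_a, w_b, var_a, var_b, cov_ab):
    """Variance of the return of a two-asset portfolio."""
    return w_a**2 * var_a + w_b**2 * var_b + 2.0 * w_a * w_b * cov_ab


# Hypothetical annualized variances and covariance of two assets.
var_a, var_b, cov_ab = 0.04, 0.09, 0.006
w_a, w_b = min_variance_weights(var_a, var_b, cov_ab)
best = portfolio_variance(w_a, w_b, var_a, var_b, cov_ab)
```

With more than two assets the same idea requires the full covariance matrix and a numerical solver.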

Automated Trading

Trade financial assets using automated models

Fraud Detection

Identify fraudulent transactions and parties with outlier detection and network analysis

Manufacturing

Quality Control

Detect defective parts with computer vision

Process Optimization

Find bottlenecks in manufacturing processes

Warranty Analytics

Predict the rate and timing of your products' failures

Design

Design new products

Agriculture, Geography and Environment

Yield Forecasting

Forecast agricultural yields

Satellite Image Classification and Extraction

Air Quality

Wildlife Classification

Classify wild animals

Real Estate

Pricing

Predict real estate values based on their characteristics

Education

Automated Essay Scoring

Score essays based on previously scored essays

Utilities

Distribution Network Optimization

Optimize distribution networks of electricity, water, etc.