- UCI Machine Learning Repository - The classic repository of datasets
- Extreme Classification Repository
- Relational Data Repository
- Kaggle Datasets
- Registry of Open Data on AWS
- CrowdANALYTIX dataX
- The Dataverse Project
- mldata.org
- OpenML
- ACM RecSys Datasets
- WADAM Dataset Repository
- data.world - Something like GitHub for data.
- Mendeley Data - A platform for sharing research data, with browsable datasets.
- datahub
- Keras
- Yahoo Webscope
- Harvard Dataverse
- TensorFlow Datasets
- Google Research Datasets
- Quandl - Finance-related datasets
- CryptoDataDownload - Free cryptocurrency data
- Instacart - 3 Million Instacart Orders, Open Sourced
- Supermarket data
- tera-PROMISE Repository - A research dataset repository specializing in software engineering
- Bug Prediction Dataset
- Eclipse Bug Data
- FLOSSMetrics
- FLOSSMole
- International Software Benchmarking Standards Group (ISBSG)
- PROMISE
- Qualitas Corpus
- Software Artifact Repository
- SourceForge Research Data
- Sourcerer Project
- Tukutuku
- Ultimate Debian Database
- Learning from Big Code datasets - Includes datasets with Abstract Syntax Trees, Java source code, binaries, etc.
- awesome-msr - Datasets gathered from the Mining Software Repositories (MSR) community
- ABB Dev Interaction Data - Over 30,000 hours of developer interaction data in Visual Studio captured with the Blaze tool
- Django dataset - Django Dataset for Code Translation Tasks
- WikiSQL - A large annotated semantic parsing corpus for developing natural language interfaces
- text2sql-data - Data and code for building and evaluating systems that map sentences to SQL
- Neural-Code-Search-Evaluation-Dataset - An evaluation dataset by Facebook consisting of natural language query and code snippet pairs
- CodeSearchNet - CodeSearchNet by GitHub is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language.
- awesome machine learning on source code datasets - The datasets part of the MLonCode repository.
- source{d} datasets - Datasets for source code analysis and machine learning on source code (ML on Code).
- Blog post with a collection of NER datasets
- NER datasets
- Named Entity Recognition for Chinese social media (Weibo)
- Yahoo - Computing Systems Data for Anomaly Detection
- Alibaba cluster data
- Azure public dataset
- Google cluster data
- Loghub dataset from LogPAI - A collection of system logs