The following is a list of publicly available datasets for various machine learning tasks. Reviews, fixes, dead-link reports, and updates are appreciated.
Please give due credit by adding links to the corresponding sources
in the Acknowledgments section below.
- Data-artikelen | Sargasso
- Data journalism and data visualization from the Datablog | News | The Guardian
- Knoema – Home
- Public Data Sets : Amazon Web Services
- Socrata
- Data Publica | Les données pour votre business
- Archive-It – Web Archiving Services for Libraries and Archives
- Freebase
- Google Public Data Explorer
- Welcome – the Data Hub
- Data Sets | AggData
- Find & Purchase Data Subscriptions | Windows Azure Marketplace
- Factual | Home
- IMF Data and Statistics
- Data | The World Bank
- OECD.Stat
- UNdata
- Data and maps — European Environment Agency (EEA)
- Eurostat Home
- Inicio Misiones
- Open Government Data Wien (OGD)
- Open data – City of Brussels
- Open Data – Brisbane City Council
- Open data – Salford City Council
- Sunderland City Council : Local Public Data
- Welcome to the London Datastore | London DataStore
- Leeds City Council – Open Data
- Home – DataGM – Data Greater Manchester
- Open Data | Derby City Council
- Council data – Brighton & Hove City Council
- Open Data – Birmingham City Council
- Aberdeen City Council Open Data
- Open Data – City of Waterloo
- Open Data catalogue | City of Vancouver
- Open Data Home – Open Data – Home | City of Toronto
- City of Prince George – Open Data Catalogue
- Open Data Ottawa | City of Ottawa
- Open Data Catalogue – City of Red Deer
- Open Data | City of Niagara Falls, Canada
- Open Data Catalogue | City of Nanaimo
- Mississauga.ca – Residents – Publications and Open Data Catalogue
- City of Medicine Hat Open Data Catalogue
- Kamloops open data
- Open Data Catalogue Kelowna
- City of Hamilton – Open Data
- City of Fredericton – Open Data Home
- City of Edmonton Open Data Catalogue
- City of Somerville, MA
- Data.Seattle.Gov | Seattle’s Data Site
- City of Scottsdale
- Welcome – Santa Cruz Open Data
- Data | San Francisco
- Open Raleigh – The Official City of Raleigh Portal
- Datasets | CivicApps.org Portland OR
- OpenDataPhilly – Connecting People With Data
- NYC Open Data
- Greater New Orleans Community Data Center
- City of Madison | Open Data
- City and County of Honolulu
- US/Data Catalog District of Columbia
- Denver Open Data Catalog
- data.cookcountyil.gov | The Cook County Government Open Data Website
- City of Chicago | Data Portal
- Open Government | City of Boston
- OpenBaltimore / City of Baltimore’s Open Data Catalog
- Data.AustinTexas.gov | Open Austin
- OpenDataAsheville – Connecting People With Data
- US/Arvada
- GovHK: About Data.One
- data.gov.sg Singapore
- ACM KDD CUP
- Competitions – Kaggle
- Data – Repository – Causality Workbench
- TunedIT – Data mining & machine learning data sets, algorithms, challenges
- IHME | Institute for Health Metrics and Evaluation
- Gapminder: Unveiling the beauty of statistics for a fact based world view.
- Doing Research in New York City Public Schools and Requesting Data – NYC Data – New York City Department of Education
- RITA | BTS
- Oregon Climate Data
- Quantnet :: Start
- Data Tools – Locators
- My Data | Measured Me
- Webscope from Yahoo! Labs
- SourceForge.net Research Data
- Online Data – Robert Shiller
- Obtaining Data From the NSSDC
- Cancer Program Data Sets
- Million Song Dataset | scaling MIR research
- Google Ngram Viewer
- Data | GeoDa Center
- Home – GEO DataSets – NCBI
- The Financial Data Finder A – G
- Frequent Itemset Mining Dataset Repository
- Europeana Professional – Linked Open Data
- Inforum – EconData
- Summary of Data Sets by Application Area
- Data Sets | Pew Research Center’s Internet & American Life Project
- Cosm – Explore
- Advanced NFL Stats: Play-by-Play Data
- Portal de Obligaciones de Transparencia
- Junta de Andalucía – Datos abiertos
- Reutilización de la Información del Sector Público | Reutilización de la Información de los Servicios Públicos
- Portal de Datos Abiertos de JCCM
- Ayuntamiento de Zaragoza. Datos de Zaragoza Reutilización
- Dades obertes Lleida – Ajuntament de Lleida
- ISTAC | El ISTAC
- Dades Obertes. Generalitat de Catalunya
- Dades Obertes CAIB
- Reutilización de la Información del Sector Público en Gijón
- Open Data Euskadi ataria, Eusko Jaurlaritzaren datu publikoen irekitzea
- Data for Hawaii | data.hawaii.gov
- Florida Has A Right To Know
- Open.Georgia.gov
- Commonwealth Data Point
- Open Data | data.maryland.gov
- Connecticut Transparency Website
- RI.gov: Open Data
- NYS Data Center
- Maine.gov DataShare
- State of Alabama – Open.alabama.gov
- Open Government for the State of Tennessee
- Ohio.gov | Government | State Facts and History
- OpenDoor – Kentucky
- Data.Illinois.gov | Open Illinois
- SOM – Michigan Data Store
- Louisiana Transparency and Accountability Portal
- data.mo.gov | State of Missouri Data Portal
- DATAshare | data.iowa.gov
- Minnesota open data // your portal for Minnesota data transparency
- Open Data Texas
- Welcome to Oklahoma’s Official Web Site
- KanView: Kansas Transparency Taxpayer Act – Kansas Revenues and Expenditures Search
- OPEN SD :: South Dakota Government Information
- North Dakota GIS (Geographic Information Systems)
- State Government Data New Mexico
- Colorado.gov: The Official State Web Portal
- Arizona OpenBooks | – Arizona Transparency Finances in Detail
- Utah Data – Utah.gov
- Data.CA.gov | Data Transparency for the State of California
- Oregon Data | Opening Oregon’s Data
- Data.Washington | Washington State’s Data Site
- Home | Data.gov
- Portal de Datos Públicos – Inicio
- datos.gub.uy | Portal del Estado Uruguayo
- Bem vindo – Portal Brasileiro de Dados Abertos
- Directorio de Empresas, Marcas registradas, Normas legales y Teléfonos en Perú
- StatCentral.ie – The Portal to Ireland’s Official Statistics
- data.gov.be | The Belgian open data initiative
- Data.overheid.nl: het open dataportaal van de Nederlandse overheid
- PortalU – German Environmental Information Portal
- Statistical database
- Date.gov.md | Portalul datelor guvernamentale deschise al Republicii Moldova
- Offene Daten Österreich | data.gv.at
- Vitajte – data.gov.sk
- dati.gov.it | I dati aperti della PA
- Δημοσια, Ανοικτά Δεδομένα
- Open Kenya | Transparent Africa
- SAUDI | National e-Government Portal – Home
- data.govt.nz – New Zealand government data online » Data.govt.nz
- data.gov.au
- 국가공유자원포털
- **政府公开信息整合服务平台
- Open Data Canada
- OpenGovData.ru
- OpenAid – Start
- data.norge.no | Åpne offentlige data i Norge – Difi
- Portada | datos.gob.es
- Open Data Colombia
- home | data.gov.uk
- Programming Challenges: What are some good “toy problems” in data science? – Quora
- Data: Where can I find large datasets open to the public? – Quora
- Data Analysis: What’s your favorite free data source? – Quora
- What are some publicly available market data feeds? – Quora
- Is there a reliable free source for per country LinkedIn statistics? – Quora
- @pskomoroch #dataset – Delicious
- Free, Public Data Sets | Hacker News
- List of European Open Data Catalogues at lod2.okfn.org
- Open Data
- Datasets Archive
- Some Datasets Available on the Web » Data Wrangling Blog
- Lending Club Loan Data
- SMS Spam Collection
- Flickr personal taxonomies
- Yahoo Data for Researchers
- ICWSM Spinnr Challenge 2011 dataset
- Quantum Chaotic Thoughts: Facebook100 Data Set
- Public Data Sets on Amazon Web Services (AWS)
- The ClueWeb09 Dataset
- Census Bureau Home Page
- ImageNet
- What is Twitter, a Social Network or a News Media? – WWW’10
- dotbot | DotNetDotCom.org
- arXiv.org help – arXiv Bulk Data Access – Amazon S3
- YouTube Dataset
- Face Recognition Homepage – Databases
- Pajek datasets
- UCI Network Data Repository
- Datasets for “The Elements of Statistical Learning”
- Enron Email Dataset
- MovieLens Data Sets | GroupLens Research
- Translation Task – EMNLP 2011 Sixth Workshop on Statistical Machine Translation
- Project Gutenberg
- WordNet – About WordNet
- Aligned Hansards of the 36th Parliament of Canada
- CRCNS – Collaborative Research in Computational Neuroscience – Data sharing
- USENET corpus
- UniGene
- ChEMBLdb
- UCI Machine Learning Repository
- Gene Expression Omnibus (GEO) Main page
- Social Science Data
- IMDB dataset
- Stanford Large Network Dataset Collection
- Google Books n-gram dataset
- Belly Button Biodiversity 2.0
- Sharing PyPi/Maven dependency data « RTFB
- Click Dataset | Center for Complex Networks and Systems Research
- The Electric Rice Cooker — One year of deleted weibos archive
- Registered meteorites that has impacted on Earth visualized – AnalyticBridge
- GeoJSON files for real-time Virginia transportation data.
- NYPD Crash Data Band-Aid
- 11 Billion Clues in 800 Million Documents: A Web Research Corpus Annotated with Freebase Concepts | Research Blog
- Big data set – 3.5 billion web pages – made available for all of us – Big Data News
- New Crawl Data Available! | CommonCrawl
- Detailed data on pass rates, race, and gender for 2013
- Data Download