This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in awesome-awesomeness and in another awesome list.
- 1000 Genomes
- Collaborative Research in Computational Neuroscience (CRCNS)
- Gene Expression Omnibus (GEO)
- Human Microbiome Project (HMP)
- ICOS PSP Benchmark
- MIT Cancer Genomics Data
- NIH Microarray data (FTP)
- Protein Data Bank
- PubChem Project
- PubGene (now Coremine Medical)
- Stanford Microarray Data
- The Personal Genome Project or PGP
- UCSC Public Data
- UniGene
- Australian Weather
- Canadian Meteorological Centre
- Climate Data from UEA (updated at roughly monthly intervals)
- Global Climate Data Since 1929
- NOAA Bering Sea Climate
- NOAA Climate Datasets
- NOAA Realtime Weather Models
- WU Historical Weather Worldwide
- CrossRef DOI URLs
- DBLP Citation dataset
- NBER Patent Citations
- NIST complex networks data collection
- Protein-protein interaction network
- PyPI and Maven Dependency Network
- Scopus Citation Database
- Stanford GraphBase (Steven Skiena)
- Stanford Large Network Dataset Collection
- The Koblenz Network Collection
- The Laboratory for Web Algorithmics (UNIMI)
- UCI Network Data Repository
- UFL sparse matrix collection
- WSU Graph Database
- 3.5B Web Pages - Web graph extracted from the CommonCrawl 2012 web corpus.
- 53.5B Web clicks - Anonymized HTTP records from 100K users in Indiana Univ.
- CAIDA Internet Datasets - Network traces and topologies at geographically diverse locations.
- ClueWeb09 - About 1B web pages in ten languages that were collected in Jan. and Feb. 2009.
- ClueWeb12 - About 733M web pages collected between Feb. and May 2012.
- CommonCrawl Web Data - Petabytes of data collected over 7 years of web crawling.
- CRAWDAD Wireless datasets (Dartmouth) - A wireless network data resource for research communities.
- OpenMobileData (MobiPerf) - Mobile performance measurement data collected with active tests.
- UCSD Network Telescope - A passive traffic monitoring system covering IPv4 /8 net.
- Challenges in Machine Learning
- DrivenData Competitions for Social Good
- ICWSM Data Challenge (since 2009)
- Kaggle Competition Data
- KDD Cup by Tencent 2012
- Localytics Data Visualization Challenge
- Netflix Prize
- Yelp Dataset Challenge
- CBOE Futures Exchange
- Google Finance
- Google Trends
- NASDAQ
- OANDA
- OSU Financial data
- Quandl
- St Louis Federal
- Yahoo Finance
- BODC - Marine data of nearly 22,000 oceanographic vars.
- EOSDIS - A data collection of NASA's earth observing system data and information system.
- Factual Global Location Data - 65M POIs with extended attributes in 50 countries.
- Global Administrative Areas Database (GADM) - For countries and low-level subdivisions.
- Geo Spatial Data from ASU - Several small spatial or GIS datasets.
- GeoNames - Over eight million placenames (countries, cities, states, etc.) of the world.
- Natural Earth - Vectors and rasters of the world in multiple scales.
- OpenStreetMap - A free map worldwide maintained by the communities.
- TIGER/Line - Official United States boundaries and roads.
- TwoFishes - Foursquare's coarse geocoder.
- TZ Timezones - A shapefile of the TZ timezones of the world.
- Australia (abs.gov.au)
- Australia (data.gov.au)
- Canada
- Chicago
- EuroStat
- FedStats
- Germany
- Glasgow, Scotland, UK
- Guardian world governments
- London Datastore, U.K.
- Netherlands
- New Zealand
- NYC betanyc
- NYC Open Data
- OECD
- Open Government Data (OGD) Platform India
- San Francisco Data sets
- South Africa
- The World Bank
- U.K. Government Data
- U.S. American Community Survey
- U.S. CDC Public Health datasets
- U.S. Census Bureau
- U.S. Department of Housing and Urban Development (HUD)
- U.S. Federal Government Agencies
- U.S. Federal Government Data Catalog
- U.S. Food and Drug Administration (FDA)
- U.S. Open Government
- UK 2011 Census Open Atlas Project
- United Nations
- EHDP Large Health Data Sets - A collection of health datasets across domains and countries.
- Gapminder World - A collection of multi-domain, demographic databases for our world.
- Medicare Coverage Database (MCD) - Containing national and local Coverage Determinations.
- Medicare Data Engine - Download, explore, and visualize Medicare.gov Data.
- Medicare Data File
- 2GB of Photos of Cats - 10K cat images with basic annotations.
- Face Recognition Benchmark - A collection of face datasets for benchmarking algorithms.
- ImageNet - An image database organized according to the WordNet hierarchy.
- Delve Datasets (Univ. of Toronto) - Evaluating datasets for classification and regression.
- eBay Online Auctions (2012) - Seller-auction-bidder data with closing prices.
- IMDb Database - An online database of films, TV programs, and video games.
- Keel Repository - Multiple datasets for classification, regression, time series.
- Lending Club Loan Data - Loan status (Current, Late, Fully Paid, etc.) and latest payment info.
- Machine Learning Data Set Repository - A data search engine for machine learning tasks.
- Million Song Dataset - Audio features and metadata for a million popular music tracks.
- More Song Datasets - Complementary data of cover songs, lyrics, user listening data.
- MovieLens Data Sets - Online movie recommendation including movie tags, user ratings.
- RDataMining - "R and Data Mining" ebook data
- Registered Meteorites on Earth - 34,513 meteorites updated to 2012.
- Restaurants Health Score Data - Health status of restaurants in San Francisco.
- UCI Machine Learning Repository - One of the most famous ML data repositories.
- Yahoo Ratings and Classification Data - About music, movies, user clicks, images etc.
- Cooper-Hewitt's Collection Database
- Minneapolis Institute of Arts metadata
- Tate Collection metadata
- The Getty vocabularies
- ClueWeb09 FACC - Annotated English-language Web pages from the ClueWeb09 corpora.
- ClueWeb12 FACC - Annotated English-language Web pages from the ClueWeb12 corpora.
- DBpedia - Multi-domain ontology describing 4.58M “things” with 583M “facts”.
- Flickr Personal Taxonomies - Personalized tagging pictures with descriptive labels.
- Google Books Ngrams (2.2TB) - N-gram corpuses extracted from Google Books.
- Google Web 5gram (1TB, 2006) - 5-gram corpuses extracted from Web pages.
- Gutenberg eBooks List - Basic information about each eBook from Project Gutenberg.
- Hansards - 1.3M aligned text chunks from official records of Canadian Parliament.
- Machine Translation - The recurring translation task focusing on European languages.
- SMS Spam Collection - 5,574 real English messages, labeled as being ham or spam.
- USENET corpus - A collection of public USENET postings between Oct 2005 and Jan 2011.
- Wikidata - Wikipedia databases available in JSON and XML formats.
- Wikipedia Links data - 40 Million Entities in Context.
- WordNet - Databases, associated packages and tools.
- CERN Open Data Portal - Experimental data of CMS experiment, ALICE, ATLAS and LHCb
- NSSDC (NASA) - More than 230 TB of data from about 550 space science spacecraft
- Amazon
- Archive.org Datasets
- CMU JASA data archive
- CMU StatLab collections
- Data360
- Datamob.org
- Infochimps
- KDNuggets Data Collections
- Numbrary
- Reddit Datasets
- RevolutionAnalytics Collection
- Sample R data sets
- Stats4Stem R data sets
- StatSci.org
- The Washington Post List
- UCLA SOCR data collection
- UFO Reports
- Wikileaks 911 pager intercepts
- Yahoo Webscope
- Academic Torrents (UMB) - Sharing enormous datasets, for researchers, by researchers.
- Archive-it - Web archiving service built at the Internet Archive
- Datahub.io - The easy way to get, use and share data
- DataMarket (Qlik)
- Freebase.com - A community-curated database of well-known people, places, and things
- Harvard Dataverse Network - Scientific data for reproducible research
- ICPSR (UMICH) - Find and analyze data
- Statista.com - Statistics and Studies from more than 18,000 Sources
- Ancestry.com Forum Dataset - Forum users and messages over ten years
- CMU Enron Email - 150 users, mostly senior management of Enron
- Facebook Data Scrape (2005) - 100 American colleges and univ.
- Facebook Social Networks from LAW (since 2007)
- Foursquare (2010, 2011) - Social networks, check-in locations and categories
- Foursquare from UMN/Sarwat (2013) - Users, venues, check-ins, ratings etc.
- General Social Survey (GSS, since 1972) - Demographic and attitudinal questions, topics etc.
- GetGlue - Users rating TV shows
- GitHub Archive - Programmers collaboration, projects progress etc.
- Mobile Social Networks (UMASS) - Timestamped mote-to-mote (up to 27 subjects) connections
- PewResearch Internet Project - A wide range of surveys about library usage, online dating etc.
- SourceForge.net Research Data - Historic and status statistics of projects and users' activities
- Stack Exchange Data Explorer - User-contributed content on the Stack Exchange network
- Titanic Survival Data Set - Demographic information of Titanic passengers
- Twitter Graph - Crawled entire Twitter site including tweets, user profiles, relations
- UCB's Archive of Social Science Data (D-Lab) - Holdings of political, social and health areas
- UCLA Social Sciences Data Archive - A collection of social science data on the Web
- UNIMI/LAW Social Network Datasets - Social networks like amazon, LiveJournal, dblp and more
- Universities Worldwide - Links to 9307 Universities in 205 countries
- UPJOHN for Employment Research - Labor surveys, unemployment spells and more
- Yahoo Graph and Social Data - Web page graph, user-group membership, IM friends etc.
- YouTube Video Graph (2007, 2008) - Video relations, uploaders, views, ratings and more
- Betfair Event Results - Fully time-stamped historical Betfair exchange data
- Cricsheet (cricket) - Thousands of cricket matches
- Ergast Formula 1, from 1950 up to date (API available)
- Football/Soccer resources (data and APIs)
- Lahman's Baseball Database - Batting and pitching statistics, team stats etc.
- Retrosheet (baseball) - Play-by-Play files, game logs and schedules
- Time Series Data Library (TSDL), created by Rob Hyndman, MU
- UC Riverside Time Series, for classification and clustering.
- Airlines OD Data 1987-2008, used by ASA Challenge 2009
- Bike Share Data Systems - Trip histories, site maps etc.
- Edge data for US domestic flights 1990 to 2009
- Half a million Hubway rides in MA
- Marine Traffic - Ship tracks, port calls and more
- NYC Taxi Trip Data 2013 - FOIA/FOILed by Chris Whong
- OpenFlights - Airport, airline and route data
- RITA Airline On-Time Performance data of major air carriers in US
- RITA/BTS transport data collection (TranStats)
- Transport for London (TFL) - Trip histories and networking statistics
- Travel Tracker Survey (TTS), Chicago, 1990, 2007-2008
- U.S. Bureau of Transportation Statistics (BTS)
- U.S. Freight Analysis Framework - Freight movement among states since 2007
- DataWrangling: Some Datasets Available on the Web
- Inside-r: Finding Data on the Internet
- Quora: Where can I find large datasets open to the public?
- RS.io: 100+ Interesting Data Sets for Statistics
- StaTrek: Leveraging open data to understand urban lives