- Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more.
- HathiTrust is a collaborative of academic and research libraries preserving 17+ million digitized items.
- Open EU Data Portal is the European Union's portal for open data.
- Social Feed Manager is open source software that harvests social media data and web resources from Twitter, Tumblr, Flickr, and Sina Weibo.
- Transkribus is a platform for transcribing, collaborating on, and sharing documents, built on cutting-edge research in Handwritten Text Recognition.
- Textgrid is a set of open source tools and services that support humanities scholars throughout the research process, especially in digital scholarly editing.
- webrecorder.io is a web archiving service anyone can use for free to save web pages.
- COVID-19 Twitter Pandemic Archive is a catalog of datasets containing billions of Tweet IDs for COVID-19 related tweets and a set of data visualizations that display high-level monthly stats about the COVID-19 conversations on Twitter.
- Russia-Ukraine ConflictMisinfo Research Portal is a curated list of publicly available datasets for studying dis- & misinformation campaigns on social media in the context of the Russia-Ukraine war.
- Bechdel Test film dataset used for a 2014 FiveThirtyEight article
- Every Doctor Who Villain since 1963
- Age Gap in Hollywood Films
- Places that Anthony Bourdain Traveled
- Broadway in NYC
- FMA: a Dataset for Music Analysis – contains track metadata along with genre information and features. Data is available as zip files; the GitHub repository also contains scripts for analysis
- Million Song Dataset
- Arts and Museums Salary dataset (crowdsourced)
- Museum of Modern Art Exhibitions dataset and MoMA Department Heads dataset
- Edvard Munch’s Drawings
- 650 Years of European Grape Harvests
- London Lives Coroners Inquests (data around deaths in London 1690-1800)
- Digital Atlas of Roman and Medieval Civilizations – includes data about climate, economy, shipwrecks, and more
- Survey of Scottish Witchcraft
- Documenting the American South – The Church in the Southern Black Community
- Documenting the American South – North American Slave Narratives
- National Prisoner Statistics 1978-2011
- Thomas Pettigrew Papers – Good for network analysis and/or text analysis
- Nixon White House Recordings
- NYC Dog Names
- NCAA student athlete graduation success data
- The National UFO Reporting Center Online Database (not downloadable, but could be converted to a spreadsheet – ask Kate for help)
- Association of Religion Data Archives (ARDA)
- Awesome Public Datasets – index of datasets
- Big 10 Academic Alliance GeoPortal – maps and geodata held by Big 10 universities
- [Datacite](https://search.datacite.org/)
- Data Is Plural – spreadsheet of datasets shared through a curated email newsletter
- Documenting the Now – collection of Twitter datasets
- Digital Humanities Resources for Project Building – Data Collections & Datasets
- Google Dataset Search
- Library of Congress – Labs guide for using data
- MSU Library Datasets
- Project Gutenberg – books available as plain text
- Social networks : online social networks, edges represent interactions between people
- Networks with ground-truth communities : ground-truth network communities in social and information networks
- Communication networks : email communication networks with edges representing communication
- Citation networks : nodes represent papers, edges represent citations
- Collaboration networks : nodes represent scientists, edges represent collaborations (co-authoring a paper)
- Web graphs : nodes represent webpages and edges are hyperlinks
- Amazon networks : nodes represent products and edges link commonly co-purchased products
- Internet networks : nodes represent computers and edges communication
- Road networks : nodes represent intersections and edges roads connecting the intersections
- Autonomous systems : graphs of the internet
- Signed networks : networks with positive and negative edges (friend/foe, trust/distrust)
- Location-based online social networks : social networks with geographic check-ins
- Wikipedia networks, articles, and metadata : talk, editing, voting, and article data from Wikipedia
- Temporal networks : networks where edges have timestamps
- Twitter and Memetracker : memetracker phrases, links and 467 million Tweets
- Online communities : data from online communities such as Reddit and Flickr
- Online reviews : data from online review systems such as BeerAdvocate and Amazon
- User actions : actions of users on social platforms
- Face-to-face communication networks : networks of face-to-face (non-online) interactions
- Graph classification datasets : disjoint graphs from different classes
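
Most of the network datasets above describe graphs as nodes joined by edges. As a rough illustration of what working with such data can look like, here is a minimal Python sketch. It assumes a plain-text edge list with one whitespace-separated pair of node IDs per line and `#` for comment lines; individual datasets may use a different layout, so check each download's documentation. The `parse_edge_list` function and the sample rows are hypothetical and only for illustration.

```python
from collections import defaultdict


def parse_edge_list(lines, directed=False):
    """Build a node -> set-of-neighbors map from edge-list lines.

    Assumed format (not guaranteed for any particular dataset):
    one edge per line as "<source> <target>", with '#' comment lines.
    """
    adjacency = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # ignore malformed rows
        source, target = parts[0], parts[1]
        adjacency[source].add(target)
        if directed:
            # Register the target so nodes with no outgoing edges still appear.
            adjacency.setdefault(target, set())
        else:
            # Undirected: store the edge in both directions.
            adjacency[target].add(source)
    return dict(adjacency)


# Hypothetical sample rows standing in for a downloaded file.
sample = [
    "# FromNodeId\tToNodeId",
    "1\t2",
    "1\t3",
    "2\t3",
    "3\t4",
]

graph = parse_edge_list(sample)
# Degree = number of distinct neighbors of each node.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees)
```

To read a real file, pass an open file handle instead of the sample list, for example `parse_edge_list(open("edges.txt"))`, where `edges.txt` is a placeholder name. For larger graphs, a dedicated library is likely a better fit than this hand-rolled map.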