Bespin is a library that contains reference implementations of "big data" algorithms in MapReduce and Spark. This repo contains the datasets used in Bespin demos:
- The file `Shakespeare.txt` contains The Complete Works of William Shakespeare from Project Gutenberg.
- The file `p2p-Gnutella08-adj.txt` contains a snapshot of the Gnutella peer-to-peer file sharing network from August 2002, where nodes represent hosts in the Gnutella network topology and edges represent connections between those hosts. This dataset is available from the Stanford Network Analysis Project.
- The tarball `taxi-data.tar.gz` contains a one-day slice of New York City taxi data, chopped into one file per minute. See the analyses in Todd Schneider's blog post "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance".
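
As a quick sanity check after cloning, the text dataset can be explored with a local word count. The following is a minimal sketch in plain Python, not Bespin's own MapReduce or Spark code. The function names `count_words` and `top_words` are hypothetical, and the tokenization rule (lowercase runs of letters and apostrophes) is an assumption chosen for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def count_words(lines: Iterable[str]) -> Counter:
    """Count lowercase tokens (letters and apostrophes) across text lines."""
    counts: Counter = Counter()
    for line in lines:
        # Assumed tokenization: runs of a-z and apostrophes, case-folded.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts


def top_words(path: str, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent tokens in the text file at `path`."""
    # Assumption: the file is plain text; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_words(f).most_common(k)
```

Assuming `Shakespeare.txt` sits in the current directory, `top_words("Shakespeare.txt")` would return the ten most frequent tokens with their counts.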