Code to accompany Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills.
The source to accompany the 1st edition may be found in the `1st-edition` branch.
The source to accompany the 2nd edition is found in this, the default `master` branch.
Apache Maven 3.2.5+ and Java 8+ are required to build. From the root level of the project, run `mvn package` to compile artifacts into `target/` subdirectories beneath each chapter's directory.
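As a minimal sketch of the build step, the snippet below checks the Maven version before running `mvn package`. The helpers `version_at_least` and `build_all` are hypothetical names introduced here for illustration and are not part of the project. The sketch assumes GNU `sort -V` is available and that the first line of `mvn -version` has the form `Apache Maven <version> ...`. It does not check the Java version.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Assumes GNU sort with -V (version sort) is available.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical wrapper: verify Maven 3.2.5+ and then build.
# Run it from the root level of the project.
build_all() {
  # Assumes the first line looks like "Apache Maven 3.6.3 (...)".
  mvn_version="$(mvn -version 2>/dev/null | awk 'NR==1 {print $3}')"
  if ! version_at_least "$mvn_version" "3.2.5"; then
    echo "Apache Maven 3.2.5+ is required to build" >&2
    return 1
  fi
  mvn package
}
```

Calling `build_all` from the project root should leave the compiled artifacts in the `target/` subdirectories described above.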
- Chapter 2: https://archive.ics.uci.edu/ml/machine-learning-databases/00210/
- Chapter 3: http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html
- Chapter 4: https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
- Chapter 5: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (do not use http://www.sigkdd.org/kdd-cup-1999-computer-network-intrusion-detection as the copy has a corrupted line)
- Chapter 6: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Chapter 7: ftp://ftp.nlm.nih.gov/nlmdata/sample/medline/ (`*.gz`)
- Chapter 8: http://www.andresmh.com/nyctaxitrips/
- Chapter 9: (see `ch09-risk/data/download-all-symbols.sh` script)
- Chapter 10: ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase3/data/HG00103/alignment/HG00103.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
- Chapter 11: https://github.com/thunder-project/thunder/tree/v0.4.1/python/thunder/utils/data/fish/tif-stack