links

Just a bunch of useful links. BTW see rust links as well.

Scala

Scala Design Patterns - great stuff, how you do (or don't) traditional Java / OOP patterns in Scala
The Human Side of Scala - great post on styling Scala for readability
Sneaking Scala Through the Back Door - how to promote Scala in an organization
Effective Scala - Twitter's guide to writing good Scala code
SBT - a declarative DSL - an excellent guide to SBT tasks and settings
Between Zero & Hero - tips and tricks for the intermediate Scala developer
Better Type Classes - also see one of the first links for a good intro to type classes
Type classes and generic derivation - How to avoid common boilerplate for type classes and case classes using Shapeless HLists
Type of Types - an unfinished tutorial on the Scala type system
Monads are not Metaphors - a great explanation of monads
Selfless Trait Pattern - allow users to either mix in a Trait or import an Object.
Scalacaster - classic data structures in Scala
Important compiler flags
Recursive Types - signatures like class Foo[T <: Foo[T]], useful for inheritance and proper return types. Though if you hit this, there are probably better ways of solving the problem, i.e. via composition.
Preprocessor - combines different Scala type techniques, like Phantom Types, Recursive Types, and Self Types, to build a computation pipeline in a typesafe manner
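
As a rough illustration of the Recursive Types entry above, here is a minimal sketch of the class Foo[T <: Foo[T]] style of signature. The names Shape, Circle, Square and RecursiveTypesDemo are made up for this example and do not come from the linked post.

```scala
// Hypothetical example types; only the shape of the signature matters here.
// T is bounded by Shape[T], so methods can return the concrete subtype.
trait Shape[T <: Shape[T]] { self: T =>
  def scale(factor: Double): T // returns T, not the base trait
}

final case class Circle(radius: Double) extends Shape[Circle] {
  def scale(factor: Double): Circle = copy(radius = radius * factor)
}

final case class Square(side: Double) extends Shape[Square] {
  def scale(factor: Double): Square = copy(side = side * factor)
}

object RecursiveTypesDemo {
  // Generic code keeps the precise type: doubling a Circle yields a Circle.
  def doubled[T <: Shape[T]](shape: T): T = shape.scale(2.0)
}
```

Without the recursive bound, scale would have to return the base trait and callers would need a cast to get a Circle back. As the entry says, composition is often the simpler route.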

Serialization / Off-heap Data Structures / Unsafe

Simple Binary Encoding - supposedly 20-50x faster than Google Protobuf !!
Comparison of Cap'n Proto, SBE, FlatBuffers from the Cap'n Proto people
- Cap'n Proto native layout uses 64-bit words, relies on separate packing/unpacking to achieve efficient wire representation. Has RPC (but not for Java). Bitset support. Java is third party support.
- Flatbuffers is from Google. 32-bit word size, more compact native representation, native Java support.
- Both Cap'n Proto and Flatbuffers allow random access of lists, whereas SBE is really only for streaming access
Using Unsafe for C-like memory access speeds - a great guide. Many Unsafe operations turn into Java intrinsics - which translate to direct machine code
- Also see Which Memory is Faster - Heap ByteBuffer or Direct
Scala-offheap - fast, safe off heap objects
FastTuple - a dynamic (runtime-defined) C-style struct library, with support for off-heap storage. Only works for primitives right now :(
- and the excellent blog covers all of the on- and off-heap access and allocation patterns on the JVM very thoroughly.
ObjectLayout - efficient struct-within-array data structures
jvm-unsafe-utils - a library for working with Unsafe, by @rxin of Spark/Shark fame.
Agrona and blog post - a ByteBuffer wrapper, off-heap, with atomic / thread-safe update operations. Good for building off heap data structures.
Sidney - an experimental columnar nested struct serializer, with Parquet-like repetition counts
OHC - Java off-heap cache
Boon ByteBuf and the JavaDoc - a very easy to use, auto-growable ByteBuffer replacement, good for efficient IO
Jawn - @d6's new fast JSON parser, parses to multiple ASTs including rojoma-json, spray-json, argonaut
Grisu-scala - much faster double to string conversion
Extracting case class param names using Macros
Fast-Serialization - a drop in replacement for Java Serialization but much faster
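
To make the heap vs. direct ByteBuffer entries above a bit more concrete, here is a minimal sketch of an off-heap Int array backed by a direct buffer. OffHeapIntArray is a hypothetical helper written for this note; it is not taken from any of the linked libraries, and it uses only the standard java.nio API rather than Unsafe.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper: a fixed-size array of Ints stored outside the JVM heap.
final class OffHeapIntArray(val length: Int) {
  // allocateDirect reserves native memory; contents start out zeroed.
  private val buf: ByteBuffer =
    ByteBuffer.allocateDirect(length * 4).order(ByteOrder.nativeOrder())

  // Absolute put/get: the byte offset is the element index times 4.
  def update(i: Int, value: Int): Unit = buf.putInt(i * 4, value)
  def apply(i: Int): Int = buf.getInt(i * 4)
}
```

Usage looks like a normal array: after val arr = new OffHeapIntArray(4), arr(2) = 7 stores a value and arr(2) reads it back. The buffer still bounds-checks every access, which is part of what the Unsafe-based approaches trade away for speed.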

Concurrency, Actors

Retry for futures. Also, SafeFuture CancellableFuture etc - very useful
Throttling Scala Futures - using a custom executor
Futiles - really useful set of utilities for working with and sequencing Futures, converting between Try, timeouts, etc.
Scala.Rx - "Reactive variables" - smart variables that auto-update themselves when the values they depend on change
Monifu - a nice set of wrappers around j.u.c.Atomic, as well as super-lightweight cancellable tasks and futures utilities. Accompanying blog post.
Colossus - an extremely fast, NIO and Akka-based microservice framework. Read their blog post.
Socko and Xitrum - Two very fast web frameworks built on Akka and Netty
Kamon - great-looking Actor monitoring, apparently using bytecode weaving, so no code change required.
akka-tracing - A distributed tracing Akka extension based on Twitter's Zipkin, which can be used as performance diagnostics and debugging tool. Supports Spray!
DI in Akka - great guide to using MacWire with Akka for DI
Akka Cluster Inventory extension - very useful. All the other blog posts in the series are also excellent reads.
Akka ZK cluster seed - another Akka extension to automatically register seed nodes with ZK
Akka Data Replication - replicated low-latency in memory datastore built using Akka cluster and CRDTs
Actor Provisioning pattern - if you have a long, failure-prone initialization procedure for an actor, this trait splits out the work to, say, another actor and dispatcher
Reactive Visualization for Akka streams!!
Akka cluster ordered provisioning and shutdown
Running an Akka cluster with Docker Containers
Why Async - An excellent overview of async architecture from Async I/O all the way up to application layer.
Ask, Tell, and Per-Request Actors - why one company moved from Ask/Futures to per-request
Dos and Don'ts deploying Akka in Production - an excellent read, full of advice even for non-Akka JVM apps
CKite - Raft Scala implementation, Finagle, MapDB etc.
Dirigiste - dynamic scalable / smarter Threadpools
Scala-gopher - a #golang-style CSP / channels implementation for Scala. Other niceties: defer()
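
For the "Retry for futures" entry above, a minimal sketch of the idea using only the standard library. FutureRetry and its retry method are hypothetical names for this note, not the API of the linked code, and this version retries immediately with no backoff.

```scala
import scala.concurrent.{ExecutionContext, Future}

object FutureRetry {
  // Runs `op`, and on failure runs it again while retries remain.
  // `op` is a function so that every attempt starts a fresh Future.
  // Assumption: `op` reports errors through a failed Future; an exception
  // thrown synchronously by `op` itself is not caught by this sketch.
  def retry[A](retriesLeft: Int)(op: () => Future[A])(
      implicit ec: ExecutionContext): Future[A] =
    op().recoverWith {
      case _ if retriesLeft > 0 => retry(retriesLeft - 1)(op)
    }
}
```

With retriesLeft = 3 the operation runs at most four times. Once retries are exhausted the partial function no longer matches, so the last failure is returned to the caller unchanged.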

Reactive Streams

Akka Streams Extensions - helpers, connectors with PostGres, and more.
Reactive Kafka
Zoom - reactive programming with ZK, in Scala using ReactiveX

Database Libs

Asyncpools - Akka-based async connection pool for Slick. Akka 2.2 / Scala 2.10.
Postgresql-Async - Netty-based async drivers for PostgreSQL and MySQL
Relate - a very lightweight, fast Scala wrapper on top of JDBC

Caching

Cacheable - a clever memoization / caching library (with Guava, Redis, Memcached or EHCache backends) using Scala 2.10 macros to remember function parameters

Big Data Processing

Great list of Big Data Projects
List of Database Papers
List of free big data sources - includes some Socrata datasets, climate data, data from Google, tweets, etc.
Debasish G's list of streaming papers and algorithms - esp stuff on CountMinSketch and HyperLogLog
Cubert - CUBE operator + fast "cost-based" block storage on Hadoop / Tez/ Spark
Kylin - OLAP CUBEs from HIVE tables, includes query layer
Aesop - a scalable pub-sub / change propagation system, esp between different datastores, with reliability. Based on LinkedIn DataBus, supports pull or push producers.
Making Zookeeper Resilient, an excellent blog post from Pinterest
ImpalaToGo - run Cloudera Impala directly on S3 files without HDFS!
Calcite - new Apache project, offers ANSI SQL syntax over regular files and other input sources
redash.io - data visualization / collaboration. TODO: integrate this with Spark SQL / Hive...
Fast SQL Query Parser in Scala - based on the Scala-LMS project, compiles a query down to C!
Probability Monad - super useful for stats or random data generation
stringmetric - Approximate string matching and phonetic algorithms
Factorie - a Scala library for Natural Language Processing based on factor graphs

Spark

spark-jobserver - REST Job Server for Spark jobs; low-latency query server
docker-spark to easily deploy a Spark cluster
Andy's Spark Notebook
Magellan - Geospatial analytics on Spark
Kafka Spark Consumer - a low-level consumer which avoids the data loss issues with the high level consumer built into Spark Streaming
Tuning Spark Streaming for throughput
Supplemental Spark Projects - lots of other interesting projects, including IPython notebooks, dataframe stuff, stream + historical data processing, and more.

Geospatial and Graph

GeoTrellis - distributed raster processing on Spark. Also see GeoMesa - distributed vector database + feature filtering
ApertureTiles - system using Spark to generate a tile pyramid for interactive analytical geo exploration
Twofishes - Foursquare's Scala-based coarse forward and reverse geocoder
trails - parser combinators for graph traversal. Supports Tinker/Blueprints/Neo4j APIs.
scala-graph - in-memory graph API based on scala collections. Work in progress.

Collections, Numeric Processing, Fast Loops

Breeze, Spire, and Saddle - Scala numeric libraries
- spire-ops - a set of macros for no-overhead implicit operator enrichment
Framian - a new data frame implementation from the authors of Spire
Scala DataTable - An immutable, updatable table with heterogeneous types of columns. Easily add columns or rows, and have easy Scala collection APIs for iteration.
ScalaXY - collection of macros for performant for loops, extension methods etc
Squants - The Scala API for Quantities, Units of Measure and Dimensional Analysis
An immutable priority map for Scala
Unboxing, Runtime Specialization - a cool post on how to do really fast aggregations using unboxed integers
product-collections - useful library for working with collections of tuples. Also, great strongly-typed CSV parser.
SuperFastHash - also see Murmur3

Big Data Storage

Phantom - Scala DSL for Cassandra, supports CQL3 collections, CQL generation from data models, async API based on Datastax driver
Athena - Asynchronous Cassandra client built on Akka-IO
CCM - easily build local Cassandra clusters for testing!
SSTableAttachedSecondaryIndex - Improved Cassandra 2i, OR and many other enhancements. Requires modified C* build.
Stubbed Cassandra - super useful for testing C* apps
Pithos - an S3-API-compatible object store for Cassandra
Doradus - A Graph / OLAP store on top of Cassandra
Khronus - Time series DB built on Cassandra + Akka Cluster
Stratio-Cassandra - a fork with Lucene full-text search and CQL support (see the blog). Also see Stargate.
How CQL maps to Cassandra Internal Storage
Sirius - Akka-based in-memory fast key-value store for JVM objects, with Paxos consistency, persistence/txn logs, HA recovery
CurioDB - distributed persistent Redis built on Akka cluster, etc. :)
Ivory - An immutable, versioned, RDF-triple / fact store for feature extraction / machine learning
Hibari - ordered key-value store using chain replication for strong consistency
Storehaus - Twitter's key-value wrapper around Redis, MySql, and other stores. Has a neat merge() functionality for aggregation of values, lists, etc.
ArDB - like Redis, but with spatial indexes, and pluggable storage engines
MapDB - Not a database, but rather a database engine with tunable consistency / ACIDness; support for off-heap memory; fast performance; indexing and other features.
HPaste - a nice Scala client for HBase
OctopusDB paper - interesting idea of using a WAL of RDF triples as the primary storage, with secondary views of row or column orientation

Distributed Systems

An excellent talk on Akka Cluster and distributed systems from Jonas Boner, including summary of lots of distributed systems theory

Web / REST / General

Scalaj-http - really simple REST API. Although, the latest Spray-client has been vastly simplified as well.
Quick Start to Twitter Finagle - though one should really look into Finatra
REPL as a service - would be kick ass if integrated into Spark
Enumeratum - a Scala Enum library, much better than built in Enumeration
Ammonite - Scala DSL for easy BASH-like filesystem operations
IScala - Scala backend for IPython. Looks promising. There is also Scala Notebook but it's more of a research project.
Scaposer - i18n / .po file library
Adding Reflection to Scala Macros - example of using reflection in an annotation macro to add automatic ByteBuffer serialization to case classes :)
Scaldi - A lightweight dependency injection library, with Akka integration
Knobs - Scala config library with reactive change detection, env var substitution, can read from Typesafe Config/HOCON, ZK, AWS
How to use Typesafe Config across multiple environments
lamma.io - the easiest date generation library
Pimpathon - a set of useful pimp-my-library extensions
Scala-rainbow - super simple terminal color output, easier than Console.XXX
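
On the Enumeratum entry above: one common reason people avoid the built-in Enumeration is that a sealed trait with case objects gets exhaustiveness checking from the compiler. Below is a hand-rolled sketch of that style in plain Scala. Color and its members are hypothetical, and the values list is maintained by hand here, which is the kind of boilerplate an enum library is meant to remove.

```scala
// Hypothetical enum written as a sealed trait plus case objects.
sealed trait Color { def name: String }

object Color {
  case object Red extends Color { val name = "red" }
  case object Green extends Color { val name = "green" }
  case object Blue extends Color { val name = "blue" }

  // Kept in sync by hand in this sketch.
  val values: Vector[Color] = Vector(Red, Green, Blue)

  // Safe lookup: unknown names give None instead of throwing.
  def withName(s: String): Option[Color] = values.find(_.name == s)

  // Because Color is sealed, the compiler can warn if a case is missing.
  def hex(c: Color): String = c match {
    case Red   => "#ff0000"
    case Green => "#00ff00"
    case Blue  => "#0000ff"
  }
}
```

For example, Color.withName("green") returns Some(Color.Green), while Color.withName("pink") returns None.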

Build, Tooling

Run Scala scripts with dependencies - ie you don't need a project file
sbt-assembly 0.10.2 supports adding a shell script to your jar to make it executable! No more "java ...." to start your Scala program, and no more ps ax | grep java | grep ....
acyclic - a Compiler plugin to detect cyclical dependencies between source files. Eliminate them for faster builds!
Other useful SBT plugins - sbt-sonatype, sbt-pom-reader, sbt-sound, plugins page
SCoverage - statement coverage tool, much more useful than line-based or branch-based tools. Has SBT plugin. Blog post on why it's an improvement.
sbt-jmh - Plugin for running SBT projects with the JMH microbench profiling tool
Comcast - a tool to inject network latency, and less-severe issues
Adaptive microbenchmarking of big data - really neat JVM agent which allows turning benchmarking code on and off for better benchmarking
SBT updates - Tool for discovering updated versions of SBT dependencies
Twitter Iago - Perf load test tool based on replaying logs. Compare vs Gatling for example.
Thyme and Parsley - microbenchmarking and profiling tools, seems useful
ScalaStyle - Scala style checker / linter
Towards a Safer Scala - great talk/slides on tools for Scala linting and static analysis
utest - a small micro test framework
lions share - a neat JVM heap and GC analysis tool, with charts and SBT integration.

SBuild seems like a promising replacement for SBT. Still Scala, but much much simpler, more like Scala version of Make. With MVN dependency and ScalaTest support.

JVM Other

Swiss Java Knife - super handy collection of JVM tools. Try java -jar sjk.jar ttop -p PID -o CPU -n 10 for regular reporting of the top 10 threads by CPU usage!
-XX:+PerfDisableSharedMem
Al's Guide to Cassandra 2.1 Ops - awesome, not just for C* but tools in general
Al Tobey's flags for running JDK8 apps. Note: G1GC! Also no need for MaxPermSize anymore: -Xmx8G -Xms8G -Xss256k -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=0
Tuning Spark apps for GC - excellent write-up from Intel
Perils of writing isolating classloaders - Good read, tips on how to write a classloader that can isolate and load different versions of the same classes
Quick dumping your JVM heap using GDB -- too bad it doesn't work on OSX.
Start a JMX agent in running JVM: jcmd <pid> ManagementAgent.start jmxremote.port=26010 jmxremote.ssl=false jmxremote.authenticate=false
HeapAudit - A Java agent for lightweight production heap profiling
Lion's Share - tools for memory analysis, outputs Google Charts compatible output
jHiccup -- "Hiccup" or GC pause analysis tool
Bintray - friendlier alternative to Sonatype OSS / Maven central. Also see bintray-sbt plugin.
Changing JVM flags live - such as enabling GC logging without restarting JVM. Cool!

Monitoring / Infrastructure

Keywhiz - a store for infrastructure secrets
Ranwhen - Visualize when your system was running, graphs in Terminal
HTrace - distributed tracing library, can dump data to Zipkin or HBase
cass_top - simple top utility for cass clusters
Grafana and Graphene - great replacement UIs for the clunky default Graphite UI
Elastic Mesos - create Mesos clusters on AWS with ZK, HDFS
Clustering Graphite - in depth look at how to scale out Graphite clusters

Databases

Indexing and OLAP

Adaptive Radix Trees - cache friendly indexing for in-memory databases
Nanocubes - Fast visualization of large spatiotemporal datasets. Amazing stuff. Paper and Github repo.
Quotient Cubes - semantic grouping and rollup algorithm for OLAP cubes. Ruby implementation.
Top K queries and cubes
Scalable In-memory Aggregation - column-oriented, in memory with bitmap indexing and memoization

ML and Data Science

LearnDS - A set of IPython notebooks for learning data science
Machine Learning for developers

Distributed Systems

Achieving Great Response Times in Distributed Systems - an excellent talk on how the 99%-tile latency can kill, and techniques to tame it
Raft Visualization - great 5-min visualization of the distributed consensus protocol

Sublime Text

I love Sublime and use it for everything, even Scala! Going to put my Sublime stuff in a separate page.

Best Practices and Design

Semver - Semantic versioning, how to deal with dev workflows and corner cases -- a must read
Pragmatic RESTful API Design - really good stuff
Blameless Post-Mortems - why they are crucial to good culture
How to Pair with Jr Devs - really good advice. Make them type. Listen and be on the same level.
GitHub Flow - how github.com does continuous deploys, uses pull requests for an automated, process-free development workflow. Some gems include naming branches descriptively and using github.com to browse the work currently in progress by looking at active branches.
Pull Requests and other good Github Practices

Other Random Stuff

A list of great docs
Awesome public datasets - no doubt some are Socrata sites!
Mermaid - think of it as Markdown for diagrams. Would be awesome to integrate this into reveal.js!
Markdeep - Markdown++ with diagrams, add single line at bottom to convert to HTML!
How To Be a Great Developer - a reminder to be empathetic, humble, and make lives around us better. Awesome list.
JQ - JSON processor for the shell. Super useful with RESTful servers.
Underscore-CLI - a Node-JS based command line JSON parser
MacroPy - Scala-like macros, case classes, pattern matching, parser combos for Python (!!)
Scala 2.11 vs Swift - Apple's new iOS language is often compared to Scala.
Real World OCaml
Gherkin - a Lisp implemented in bash !!
Nimrod - a neat, compile-straight-to-binary, static systems language with beautiful Python-like syntax, union types, generics, macros, first-class functions. What Go should have been.
Bret Victor - A set of excellent essays and talks from a great visual designer

Tips on installing Ruby

because it's so darn painful.

On OSX: make sure setUID bit is not set on dtrace: sudo chmod -s /usr/sbin/dtrace (see this Homebrew issue)
Try chruby and ruby-install instead of rbenv. They install rubies into /opt/rubies and are lighter weight. There is also chruby-fish for the fish shell.
