links

NOTE: Data Engineering / ML / DS / Data Structures is now a separate section.

Just a bunch of useful links. BTW see rust links as well.

Table of Contents generated with DocToc

Scala
Scala and JVM Concurrency
HTTP / REST
Small Data
Big Data Processing
JVM Other
Monitoring / Infrastructure
Sublime Text
Best Practices and Design
Swift
Other Random Stuff
- Tips on installing Ruby

Scala

Two fast ways to get started playing with JDK/Java/Scala:

Scala-CLI - can easily launch and run single Scala files, get things set up
sdkman - easily install, set up, and switch between multiple JDKs, Scala versions, and other JVM SDKs like Groovy, etc.

Good resources for starting out:

Scala for the Impatient - Amazon link - overall my favorite book for starting out and ramping up quickly
UnderscoreIO's free Scala books
Effective Scala - Twitter's guide to writing good Scala code
Building a Minecraft Mod with Scala and Kojo, a graphical discovery/play environment for kids to learn Scala
ScalaQuest - a Kickstarter game for learning Scala
Scala Design Patterns - great stuff, how you do (or don't) traditional Java / OOP patterns in Scala

Great general Scala knowledge:

Visual Scala Reference - all the Scala collections methods with pictures diagramming their usage
Implicit Design Patterns in Scala - great overview from LiHaoYi on all the different uses of implicits
The Human Side of Scala - great post on styling Scala for readability
Sneaking Scala Through the Back Door - how to promote Scala in an organization
Strategic Scala Style: Principle of Least Power - an excellent read about how to choose the path of less complexity and higher readability in the huge landscape that is Scala programming
- Designing Data Types
SBT - a declarative DSL - an excellent guide to SBT tasks and settings
SBT Tips and Tricks
- To test only some tests: testOnly a.b.c.TestName -- -z "word1 word2" - where word1 word2 is a phrase from test description. Use double quotes.
Between Zero & Hero - tips and tricks for the intermediate Scala developer
Functors, Applicatives, and Monads in Pictures - great guide to three potentially confusing terms in FP
Better Type Classes - also see one of first links for good intro to type classes
Type classes and generic derivation - How to avoid common boilerplate for type classes and case classes using Shapeless HLists
Type of Types - an unfinished tutorial on the Scala type system
Monads are not Metaphors - a great explanation of monads
Selfless Trait Pattern - allow users to either mix in a Trait or import an Object.
Tagged Types - great post from Scalac blog
Scalacaster - classic data structures in Scala
Reftree - Automatic object tree diagrams for immutable data
Important compiler flags
Recursive Types - signatures like class Foo[T <: Foo[T]], useful for inheritance and proper return types. Though if you hit this, there are probably better ways of solving the problem, i.e. via composition.
Preprocessor - combination of different Scala Types like Phantom Types, Recursive Types, Self Types to make pipeline of computation in typesafe manner
ScalaFix - a tool to rewrite 2.x Scala for the new Dotty compiler
MDoc - type checking for ScalaDoc
ScalaMeta and Macro Annotations - a much more compact way of defining macros
- Macro annotations need the Macro Paradise compiler plugin.
Scalamacros - this is the future for Scala Macros, 2.12 and later, support for Dotty, etc. Coming end of 2017.
Scala Native - compile Scala to LLVM native code! :)
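To make the Recursive Types entry above concrete, here is a minimal sketch of the class Foo[T <: Foo[T]] shape. The names Shape, Circle, and Shapes are hypothetical and chosen only for illustration; the last object shows the composition-style alternative the entry hints at.

```scala
// Recursive (F-bounded) type: T must itself be a subtype of Shape[T],
// so methods declared in the trait can return the concrete subtype.
trait Shape[T <: Shape[T]] {
  def scale(factor: Double): T

  // Written once in the trait, yet typed as the concrete T rather than Shape.
  def doubled: T = scale(2.0)
}

// Hypothetical concrete type used only for this illustration.
final case class Circle(radius: Double) extends Shape[Circle] {
  def scale(factor: Double): Circle = Circle(radius * factor)
}

// Composition-style alternative: a plain function over the data type,
// with no recursive bound needed.
object Shapes {
  def scaleCircle(c: Circle, factor: Double): Circle =
    c.copy(radius = c.radius * factor)
}
```

With this shape, Circle(1.5).doubled is statically a Circle, so no cast is needed at the call site.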

A separate section with notes on Specialization, Boxing, Inlining, Hi-Perf Scala

Serialization

Filo - my own library for extremely fast, serialized Scala sequences and columnar encoding
PBDirect - automatic serialization to/from Protobufs from Scala case classes with no need for writing .proto's. Perfect for Akka.
Grisu-scala - much faster double to string conversion
Extracting case class param names using Macros

Java, not Scala

Simple Binary Encoding - supposedly 20-50x faster than Google Protobuf !!
Comparison of Cap'n Proto, SBE, FlatBuffers from the Cap'n Proto people
- Cap'n Proto native layout uses 64-bit words, relies on separate packing/unpacking to achieve efficient wire representation. Has RPC (but not for Java). Bitset support. Java is third party support.
- Flatbuffers is from Google. 32-bit word size, more compact native representation, native Java support.
- Both Cap'n Proto and Flatbuffers allow random access of lists, whereas SBE is really only for streaming access
Sidney - an experimental columnar nested struct serializer, with Parquet-like repetition counts
Boon ByteBuf and the JavaDoc - a very easy to use, auto-growable ByteBuffer replacement, good for efficient IO
Fast-Serialization - a drop in replacement for Java Serialization but much faster
Reed-Solomon Erasure Coding Library from Backblaze. Recover or repair from missing chunks of data; a potential alternative to replication
- Great paper on Erasure Coding vs Replication

Configuration

Knobs - Scala config library with reactive change detection, env var substitution, can read from Typesafe Config/HOCON, ZK, AWS
How to use Typesafe Config across multiple environments

General

REPL as a service - would be kick ass if integrated into Spark
Enumeratum - a Scala Enum library, much better than built in Enumeration
Quicklens - modify deeply nested fields in case classes
Ammonite - Scala DSL for easy BASH-like filesystem operations
Better-Files - Great library for easy file operations / NIO utils
IScala - Scala backend for IPython. Looks promising. There is also Scala Notebook but it's more of a research project.
Scaposer - i18n / .po file library
Adding Reflection to Scala Macros - example of using reflection in an annotation macro to add automatic ByteBuffer serialization to case classes :)
Scaldi - A lightweight dependency injection library, with Akka integration
lamma.io - the easiest date generation library
Pimpathon - a set of useful pimp-my-library extensions
Scala-rainbow - super simple terminal color output, easier than Console.XXX

Build, Tooling

Run Scala scripts with dependencies - ie you don't need a project file
sbt-assembly 0.10.2 supports adding a shell script to your jar to make it executable! No more "java ...." to start your Scala program, and no more ps ax | grep java | grep ....
acyclic - a Compiler plugin to detect cyclical dependencies between source files. Eliminate them for faster builds!
Splain - a compiler plugin for better, more descriptive error messages!
sbt-view - an SBT plugin to make it easy to view JavaDoc/ScalaDoc of dependencies or your own project
sbt-alldocs - SBT plugin (1.0.0+) to generate docs for ALL dependent jars!
Other useful SBT plugins - sbt-sonatype, sbt-pom-reader, sbt-sound, plugins page
Jitpack - jar packaging for Github repos with no jar publishing (or for non-released versions)
sbt-big-project - a plugin to speed up compilation when there are hundreds of projects
Coursier - a much improved jar dependency fetcher, written in pure Scala. Has SBT plugin, programmatic API
SCoverage - statement coverage tool, much more useful than line-based or branch-based tools. Has SBT plugin. Blog post on why it's an improvement.
sbt-jmh - Plugin for running SBT projects with the JMH microbench profiling tool. Also see jmh-profilers project.
- A list of JMH Resources
- A great JMH Tutorial - the rest of this writeup is also an excellent resource on Java and JVM performance
- JMH Scala vs Java - Shipilev analyzes Java vs Scala tail recursion
Profiling JVM Applications - a great guide to FlameGraphs and other tools
jmh-visualizer for visualizing JMH result runs
sbt-jol - inspect Scala/Java object memory layout
Airframe Surface - a great small library to determine type and class param info
Clouseau - SBT and runtime plugin to get size of object graphs
Comcast - a tool to inject network latency, and less-severe issues
CharybdeFS - FUSE layer to inject filesystem faults for testing
Blindsight - structured logging library for Scala
Adaptive microbenchmarking of big data - really neat JVM agent which allows turning benchmarking code on and off for better benchmarking
SBT updates - Tool for discovering updated versions of SBT dependencies
Twitter Iago - Perf load test tool based on replaying logs. Compare vs Gatling for example.
Thyme and Parsley - microbenchmarking and profiling tools, seems useful
ScalaStyle - Scala style checker / linter
Towards a Safer Scala - great talk/slides on tools for Scala linting and static analysis
utest - a small micro test framework
lions share - a neat JVM heap and GC analysis tool, with charts and SBT integration.

SBT Alternatives

Pants - Twitter's production build system, supports Scala and many other languages
CBT - Much simpler build language using methods and classes instead of a DSL.
Fury - Jon Pretty's new experimental build tool
SBuild - really old, not updated since 2014.

Scala and JVM Concurrency

There are multiple paradigms for concurrency in Scala and it is vital to be familiar with all of them. Java-style low-level shared-mutable-state concurrency won't be covered here much -- we focus on the Scala paradigms that produce safe, easy-to-use concurrency.

Why Async - An excellent overview of async architecture from Async I/O all the way up to application layer.

Futures and Tasks

Scala has built-in Futures which are good for eager, memoized, asynchronous, one-shot computations. They run in a thread pool and their results are stored in the Future object for multiple consumers. The really nice thing about Futures is that they are strongly typed and composable - you can use map, flatMap, filter etc. to easily chain together asynchronous computation. Many libraries supplement the built-in functionality, and most Scala database libs rely on Future to deliver non-blocking, async I/O.
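As a rough sketch of the chaining described above, the following uses only the standard library. fetchUserId and fetchScore are hypothetical stand-ins for real async calls (a database lookup, say), and the blocking Await is there only so the example yields a value.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureChaining {
  // Hypothetical async steps; each runs on the global thread pool.
  def fetchUserId(name: String): Future[Int] = Future(name.length)
  def fetchScore(userId: Int): Future[Int] = Future(userId * 10)

  // The for-comprehension desugars to flatMap, withFilter, and map on Future.
  def describe(name: String): Future[String] =
    for {
      id    <- fetchUserId(name)
      score <- fetchScore(id) if score > 0
    } yield s"$name scored $score"

  // Blocking is for demonstration only; real code would keep composing.
  def run(): String = Await.result(describe("alice"), 5.seconds)
}
```

If the guard fails, the resulting Future fails with a NoSuchElementException instead of producing a value.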

Daniel Westheide's Guide to Futures
Retry for futures. Also, SafeFuture CancellableFuture etc - very useful
Throttling Scala Futures - using a custom executor
Futiles - really useful set of utilities for working with and sequencing Futures, retries, converting between Try, timeouts, etc.
Demystifying the blocking construct in Scala Futures - great blog explaining not only about the default global ExecutionContext, but choice of thread pool types, and more

There are alternatives which offer lazy, non-memoized versions.

Monix Task - not only is it lazy but you can control execution - it can run synchronously or async
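To illustrate the eager and memoized versus lazy and non-memoized distinction without pulling in a library, here is a small standard-library sketch. LazyTask is a hypothetical stand-in written for this example; it is not the Monix Task API.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EagerVsLazy {
  // Hypothetical lazy wrapper: only a description of work until it is run.
  final class LazyTask[A](body: () => A) {
    def map[B](f: A => B): LazyTask[B] = new LazyTask(() => f(body()))
    def runToFuture(): Future[A] = Future(body())
  }

  // Returns the side-effect count at three points in time.
  def demo(): (Int, Int, Int) = {
    val runs = new AtomicInteger(0)

    val task = new LazyTask(() => runs.incrementAndGet())
    val beforeRun = runs.get()                  // nothing has executed yet
    Await.result(task.runToFuture(), 5.seconds)
    Await.result(task.runToFuture(), 5.seconds) // each run repeats the work
    val afterTwoRuns = runs.get()

    val eager = Future(runs.incrementAndGet())  // starts as soon as it is created
    Await.result(eager, 5.seconds)
    Await.result(eager, 5.seconds)              // second read reuses the stored result
    val afterFuture = runs.get()

    (beforeRun, afterTwoRuns, afterFuture)
  }
}
```

The lazy wrapper runs its body once per execution, while the Future runs its body exactly once no matter how many consumers read it.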

Actors

Akka is one of the most famous Scala libraries and where Scala Futures came from. It is known for Actors, a paradigm for always-running, distributed, resilient code that dates back to the 1970s and was later popularized by the Erlang language. Actors are great for safe stateful and distributed computation, based on a shared-nothing, message passing paradigm.
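The shared-nothing, message passing idea can be sketched with nothing but a queue and a thread. This is a toy illustration and deliberately not the Akka API: CounterActor, its messages, and the ! method are all hypothetical names, and there is no supervision, remoting, or scheduling here.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical message protocol for the toy actor.
sealed trait CounterMsg
final case class Add(n: Int) extends CounterMsg
final case class Get(reply: Int => Unit) extends CounterMsg
case object Stop extends CounterMsg

// Toy actor: private state, a mailbox, and one thread draining it in order.
final class CounterActor {
  private val mailbox = new LinkedBlockingQueue[CounterMsg]()
  private var total = 0 // only ever touched by the worker thread

  private def loop(): Unit = {
    var running = true
    while (running) {
      mailbox.take() match {
        case Add(n)     => total += n
        case Get(reply) => reply(total)
        case Stop       => running = false
      }
    }
  }

  private val worker = new Thread(new Runnable { def run(): Unit = loop() })
  worker.setDaemon(true)
  worker.start()

  // Senders never see the state; they can only enqueue messages.
  def !(msg: CounterMsg): Unit = mailbox.put(msg)
}

object ActorDemo {
  def run(): Int = {
    val actor = new CounterActor
    val answer = Promise[Int]()
    actor ! Add(1)
    actor ! Add(2)
    actor ! Add(3)
    actor ! Get(n => { answer.success(n); () })
    val result = Await.result(answer.future, 5.seconds)
    actor ! Stop
    result
  }
}
```

Because the state is mutated by a single thread and reached only through messages, no locks are needed around total.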

One of the coolest things built on top of Akka actors, which support remote messaging, is Akka Cluster. There is a separate section for that.

Kamon - great looking Actor monitoring using bytecode weaving? no code change required.
akka-tracing - A distributed tracing Akka extension based on Twitter's Zipkin, which can be used as performance diagnostics and debugging tool. Supports Spray!
DI in Akka - great guide to using MacWire with Akka for DI
Actor Provisioning pattern - if you have a long, failure-prone initialization procedure for an actor, this trait splits out the work, to say another actor and dispatcher
Akka mock scheduler - great for testing!
Akka VisualMailbox - Akka traffic patterns visualized in D3
Reactive Visualization for Akka streams!!
Chymyst - a framework for "chemical reactions" built on top of Akka, for distributed, concurrent, functional, declarative computing at a much higher level than Akka itself. Kind of a neat paradigm, check it out.
Ask, Tell, and Per-Request Actors - why one company moved from Ask/Futures to per-request
Dos and Donts deploying Akka in Production - an excellent read, full of advice even for non-Akka JVM apps
Understanding Akka's Quarantine State - another great blog post on how quarantine happens and how to avoid it
Akka Anti-Patterns: Hardware - and many other related good posts on anti-patterns
Akka Typed Protocols - Patrik's blog series on Akka typed - a more functional, type safe actors API :)

Reactive Streams

What if you want to stream multiple values and use up multiple threads or async I/O? Futures are great only for one-shot or single value. This is where reactive streams comes in - a standard for asynchronous streaming computation.

Here is a great intro to reactive streams covering why, why backpressure, how it compares to other paradigms.
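For a feel of what backpressure means in practice, here is a minimal sketch using the JDK's java.util.concurrent.Flow interfaces (Java 9+), which mirror the Reactive Streams contract. It does not use Monix or Akka Streams, and OneAtATimeSubscriber and BackpressureDemo are hypothetical names. The subscriber signals demand for one element at a time, so the publisher cannot outrun it.

```scala
import java.util.concurrent.{CountDownLatch, Flow, SubmissionPublisher, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Subscriber that asks for exactly one element at a time.
final class OneAtATimeSubscriber(done: CountDownLatch) extends Flow.Subscriber[Int] {
  private var subscription: Flow.Subscription = null
  val received = ArrayBuffer.empty[Int]

  override def onSubscribe(s: Flow.Subscription): Unit = {
    subscription = s
    s.request(1) // initial demand
  }

  override def onNext(item: Int): Unit = {
    received += item
    subscription.request(1) // signal readiness for the next element
  }

  override def onError(t: Throwable): Unit = done.countDown()
  override def onComplete(): Unit = done.countDown()
}

object BackpressureDemo {
  def run(): List[Int] = {
    val done = new CountDownLatch(1)
    val subscriber = new OneAtATimeSubscriber(done)
    val publisher = new SubmissionPublisher[Int]()
    publisher.subscribe(subscriber)
    // submit blocks the producer if the subscriber's buffer is full.
    (1 to 5).foreach(i => publisher.submit(i))
    publisher.close()
    done.await(5, TimeUnit.SECONDS)
    subscriber.received.toList
  }
}
```

Libraries such as Monix and Akka Streams hide this request bookkeeping behind higher-level operators.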

The best Scala API for pure reactive streams I have found is Monix Observables. It is lightweight, designed for performance, and offers a high degree of control over concurrency. There is also Akka Streams which is built on actors.

Parallelism in Monix - a great tutorial
Akka Streams Extensions - helpers, connectors with PostGres, and more.
Reactive Kafka
Zoom - reactive programming with ZK, in Scala using ReactiveX
Akka Streams vs Scalaz Stream

Other possibilities:

RxScala - a Scala API on top of RxJava
Reactor - Java only, from the SpringSource guys, but has IPC/networking
RSocket - Cross-platform network protocol providing Rx semantics, works on top of Aeron, TCP, WebSockets, HTTP/2

Other Concurrency Libs

Scala.Rx - "Reactive variables" - smart variables that auto-update themselves when the values they depend on change
Scala Coroutines - really neat, coroutines with yield. They are more general than reactive streams, but if streaming data is your focus you are probably better off with one of the reactive streams libs.
Scala-gopher - a #golang-style CSP / channels implementation for Scala. Other niceties: defer()
LChannels - sessions/protocol programming using continuations, both local and distributed. Kind of neat.

Non-Scala:

Dirigiste - dynamic scalable / smarter Threadpools
(JAVA) JCTools - very high performance concurrent queues, used by Netty and other projects
(JAVA) Windmill - a library for efficient IO/Network processing, Futures based. Has per-CPU network/IO sockets.

Akka Cluster and Distributed Systems

If you are building a distributed system, you should seriously consider using Akka Cluster.

Intro to Akka Distributed Data - definitely one of the coolest Akka cluster modules, has great potential for distributed system state.
Akka Data Replication - replicated low-latency in memory datastore built using Akka cluster and CRDTs
Akka Management & Discovery - Lightbend's Cluster Extension for auto cluster self discovery using DNS, replaces:
- Constructr - coordinated cluster construction / bootstrapping using etcd/consul as discovery service, for Akka, Cassandra (takes care of registration/CAS/discovery protocol)
Akka Cluster Inventory extension - very useful. All the other blog posts in the series are also excellent reads.
Akka ZK cluster seed - another Akka extension to automatically register seed nodes with ZK
Akka cluster ordered provisioning and shutdown
Running an Akka cluster with Docker Containers
New Adaptive Failure Detector for Akka Cluster. Awesome research and hints too about massive clusters.

Other non-Akka (and some non-Scala) distribution libs:

Suuchi is a toolkit for distributed synchronous replication and data partitioning/routing, used in production in a really neat company in India.
CKite - Raft Scala implementation, Finagle, MapDB etc.
CASPaxos - Replicated State Machines without logs - simpler than RAFT since it doesn't use leader election or log replication
An excellent talk on Akka Cluster and distributed systems from Jonas Boner, including summary of lots of distributed systems theory
Strong Eventual Consistency and CRDTs - a must watch by Mark Shapiro on what eventual consistency really means and the role CRDTs play
Achieving Great Response Times in Distributed Systems - an excellent talk on how the 99%-tile latency can kill, and techniques to tame it
Raft Visualization - great 5-min visualization of the distributed consensus protocol
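Several entries above mention CRDTs. As a rough illustration of why they converge, here is a sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is written from the general definition, not taken from Akka Distributed Data or any linked project, and the node names are arbitrary.

```scala
// Grow-only counter: each node increments only its own slot, and merge
// takes the per-node maximum, so merging is commutative, associative,
// and idempotent.
final case class GCounter(counts: Map[String, Long] = Map.empty) {

  def increment(node: String, by: Long = 1L): GCounter = {
    require(by >= 0, "a G-Counter can only grow")
    copy(counts = counts.updated(node, counts.getOrElse(node, 0L) + by))
  }

  // The counter's value is the sum over all nodes.
  def value: Long = counts.values.sum

  def merge(other: GCounter): GCounter =
    GCounter(
      (counts.keySet ++ other.counts.keySet).map { node =>
        node -> math.max(counts.getOrElse(node, 0L), other.counts.getOrElse(node, 0L))
      }.toMap
    )
}
```

Replicas can exchange state in any order, any number of times, and still agree on the same value, which is the property behind strong eventual consistency.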

HTTP / REST

Colossus - an extremely fast, NIO and Akka-based microservice framework. Read their blog post.
Socko and Xitrum - Two very fast web frameworks built on Akka and Netty
Quick Start to Twitter Finagle - though one should really look into Finatra
Airframe - whole bunch of building blocks for Scala apps including server/HTTP/metrics/logging
Scalaj-http - really simple REST client API. Although, the latest Spray-client has been vastly simplified as well.

Small Data

Sort of orthogonal to small vs big, but more query language related:

http://sangria-graphql.org/getting-started/ - a Scala GraphQL library

High performance code / Unboxed processing / Macros

Inliner - macros to inline collections, Option, Try, for comprehensions
SIMD in Scala blog post and the LMS Intrinsics library - access to Intel SIMD/SSE/etc instructions!!
scala-newtype - Newtypes for Scala. Wrap other types and even primitives with no runtime overhead
Unboxing, Runtime Specialization - a cool post on how to do really fast aggregations using unboxed integers
Scalaxy-streams - collection of macros for performant for loops, foreach etc. The old project scalaxy.
Metal - fast unboxed Scala data structures. Includes a fast no-allocation Pointer type that replaces Iterator.
OptionVal - no-allocation but type safe replacement for Option

Off-heap Data Structures / Unsafe

Scala-offheap - fast, safe off heap objects
Using Unsafe for C-like memory access speeds - a great guide. Many Unsafe operations turn into Java intrinsics - which translate to direct machine code
- Also see Which Memory is Faster - Heap ByteBuffer or Direct
FastTuple - a dynamic (runtime-defined) C-style struct library, with support for off-heap storage. Only works for primitives right now :(
- and the excellent blog covers all of the on- and off-heap access and allocation patterns on the JVM very thoroughly.

mysafe - Unsafe memory access/leak checker

ObjectLayout - efficient struct-within-array data structures
jvm-unsafe-utils - a library from @rxin of Spark/Shark fame for working with Unsafe.
Agrona and blog post - a ByteBuffer wrapper, off-heap, with atomic / thread-safe update operations. Good for building off heap data structures.
Byte-buddy, a Java class generation library
OHC - Java off-heap cache
LWJGL - Potentially useful: very fast off heap memory allocators without limitations of allocateDirect; OpenCL library
jnr-ffi - Java Foreign Function Interface, used by JRuby to provide MUCH simpler interface to C code than JNI. Has native memory allocators and utilities for Struct types.
- Also see jnr-ffi-examples and jnr-posix
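As a small taste of the off-heap style these links cover, here is a sketch that stores fixed-size records in a direct ByteBuffer using only java.nio. The record layout (an 8-byte id followed by an 8-byte price) and all names are assumptions made for this example; it does not use Unsafe or any of the libraries listed.

```scala
import java.nio.ByteBuffer

object OffHeapSketch {
  // Assumed layout: 8-byte Long id, then 8-byte Double price.
  val RecordSize = 16

  // Absolute puts and gets address the buffer like a C struct array.
  def writeRecord(buf: ByteBuffer, index: Int, id: Long, price: Double): Unit = {
    val base = index * RecordSize
    buf.putLong(base, id)
    buf.putDouble(base + 8, price)
  }

  def readId(buf: ByteBuffer, index: Int): Long =
    buf.getLong(index * RecordSize)

  def readPrice(buf: ByteBuffer, index: Int): Double =
    buf.getDouble(index * RecordSize + 8)

  def demo(): (Boolean, Long, Double) = {
    // allocateDirect places the bytes outside the garbage-collected heap.
    val direct = ByteBuffer.allocateDirect(RecordSize * 4)
    writeRecord(direct, 2, 42L, 9.5)
    (direct.isDirect, readId(direct, 2), readPrice(direct, 2))
  }
}
```

Swapping allocateDirect for allocate gives the heap-backed variant with the same access code, which makes the two easy to compare in a benchmark.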

Better collections, Numeric Processing

Breeze, Spire, and Saddle - Scala numeric libraries
- spire-ops - a set of macros for no-overhead implicit operator enrichment
Framian - a new data frame implementation from the authors of Spire
Scala DataTable - An immutable, updatable table with heterogeneous types of columns. Easily add columns or rows, and have easy Scala collection APIs for iteration.
Squants - The Scala API for Quantities, Units of Measure and Dimensional Analysis
Airframe Metrics - human-readable representations of time, data byte size, etc.
An immutable priority map for Scala
product-collections - useful library for working with collections of tuples. Also, great strongly-typed CSV parser.
SuperFastHash - also see Murmur3
LZ4-Java - very fast compression, but also has version of XXHash - much faster than even Murmur3
Squash compression benchmarks
SmoothieMap2 - a low-memory implementation of Google SwissTable for the JVM
bloom-filter-scala - and accompanying blog post explaining why it's the fastest bloom filter in the JVM
Moment Sketches - moment-based quantile sketches for summarizing quantile/histogram/latency data
Histogrammar - library for creating and plotting histograms

Relevant: Histogram - the Ultimate Guide of Binning

Database Libs

Asyncpools - Akka-based async connection pool for Slick. Akka 2.2 / Scala 2.10.
Postgresql-Async - Netty-based async drivers for PostgreSQL and MySQL
Relate - a very lightweight, fast Scala wrapper on top of JDBC

Caching

Cacheable - a clever memoization / caching library (with Guava, Redis, Memcached or EHCache backends) using Scala 2.10 macros to remember function parameters
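The core idea of memoization, remembering results keyed by the function's arguments, can be sketched in a few lines. This is a plain in-memory illustration and not Cacheable's API: there are no macros or external backends, and MemoSketch and memoize are hypothetical names.

```scala
import scala.collection.concurrent.TrieMap

object MemoSketch {
  // Wraps f so that each distinct argument is looked up in a cache
  // before falling back to computing f.
  def memoize[A, B](f: A => B): A => B = {
    val cache = TrieMap.empty[A, B]
    a => cache.getOrElseUpdate(a, f(a))
  }
}
```

A call such as MemoSketch.memoize(expensiveLookup), where expensiveLookup is whatever function you want to cache, returns a function with the same signature that skips recomputation for arguments it has already seen.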

Big Data Processing

Fast SQL Query Parser in Scala - based on the Scala-LMS project, compiles a query down to C!
Probability Monad - super useful for stats or random data generation
stringmetric - Approximate string matching and phonetic algorithms
Factorie - a Scala library for Natural Language Processing based on factor graphs

Big Data Projects - not necessarily Scala

Great list of Big Data Projects
List of Database Papers
List of free big data sources - includes some Socrata datasets, climate data, data from Google, tweets, etc.
Debasish G's list of streaming papers and algorithms - esp stuff on CountMinSketch and HyperLogLog
Cubert - CUBE operator + fast "cost-based" block storage on Hadoop / Tez/ Spark
Kylin - OLAP CUBEs from HIVE tables, includes query layer
MacroBase - a Stanford / Peter Bailis project to find anomalies in real time/over streaming data
Aesop - a scalable pub-sub / change propagation system, esp between different datastores, with reliability. Based on LinkedIn DataBus, supports pull or push producers.
Making Zookeeper Resilient, an excellent blog post from Pinterest
ImpalaToGo - run Cloudera Impala directly on S3 files without HDFS!
Calcite - new Apache project, offers ANSI SQL syntax over regular files and other input sources
redash.io - data visualization / collaboration. TODO: integrate this with Spark SQL / Hive...

Spark

spark-jobserver - REST Job Server for Spark jobs; low-latency query server
docker-spark to easily deploy a Spark cluster
Andy's Spark Notebook
Magellan - Geospatial analytics on Spark. Also see GeoSpark and Spatial Spark.
Kafka Spark Consumer - a low-level consumer which avoids the data loss issues with the high level consumer built into Spark Streaming
Tuning Spark Streaming for throughput
SparkInternals - extremely detailed description of how Spark internals work
Supplemental Spark Projects - lots of other interesting projects, including IPython notebooks, dataframe stuff, stream + historical data processing, and more.
Salt - Scala/Spark tile generation/visualization for big datasets. Cool!
Flintrock - scripts to deploy Spark clusters on AWS

Geospatial and Graph

GeoTrellis - distributed raster processing on Spark. Also see GeoMesa - distributed vector database + feature filtering
ApertureTiles - system using Spark to generate a tile pyramid for interactive analytical geo exploration
Twofishes - Foursquare's Scala-based coarse forward and reverse geocoder
SFCurve - a Scala spatial curve library from the excellent folks at LocationTech
trails - parser combinators for graph traversal. Supports Tinker/Blueprints/Neo4j APIs.
scala-graph - in-memory graph API based on scala collections. Work in progress.

Big Data Storage

Phantom - Scala DSL for Cassandra, supports CQL3 collections, CQL generation from data models, async API based on Datastax driver. A bit heavyweight though.
Troy - A lightweight type safe wrapper around CQL/Cassandra client. Focused on CQL type safety.
Athena - Asynchronous Cassandra client built on Akka-IO
CCM - easily build local Cassandra clusters for testing!
Stubbed Cassandra - super useful for testing C* apps
Pithos - an S3-API-compatible object store for Cassandra
Cassandra Compaction and Tombstoning
Sirius - Akka-based in-memory fast key-value store for JVM objects, with Paxos consistency, persistence/txn logs, HA recovery
Ivory - An immutable, versioned, RDF-triple / fact store for feature extraction / machine learning
Hibari - ordered key-value store using chain replication for strong consistency
Storehaus - Twitter's key-value wrapper around Redis, MySQL, and other stores. Has a neat merge() functionality for aggregation of values, lists, etc.
ArDB - like Redis, but with spatial indexes, and pluggable storage engines
MapDB - Not a database, but rather a database engine with tunable consistency / ACIDness; support for off-heap memory; fast performance; indexing and other features.
HPaste - a nice Scala client for HBase
OctopusDB paper - interesting idea of using a WAL of RDF triples as the primary storage, with secondary views of row or column orientation

JVM Other