links

NOTE: Data Engineering / ML / DS / Data Structures is now a separate section.

Just a bunch of useful links. BTW see the Rust links as well.

Scala

Two fast ways to get started playing with JDK/Java/Scala:

  • Scala-CLI - can easily launch and run single Scala files, get things set up
  • sdkman - easily install, set up, and switch between multiple JDKs, Scala versions, and other JVM SDKs like Groovy, etc.
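
To give a feel for the single-file workflow, here is a minimal sketch of a file you could run with Scala-CLI. The file name `hello.scala`, the `Hello` object, and the `greeting` helper are made up for illustration; the `//> using` line is a Scala-CLI directive and is just a comment to any other tool.

```scala
//> using scala 3

// hello.scala (hypothetical file name): run with `scala-cli run hello.scala`
object Hello {
  // Pure helper so the logic can be checked separately from printing.
  def greeting(name: String): String = s"Hello, $name!"

  def main(args: Array[String]): Unit = {
    // Use the first command-line argument if present, else a default.
    val name = args.headOption.getOrElse("Scala")
    println(greeting(name))
  }
}
```

Arguments go after `--`, for example `scala-cli run hello.scala -- World`.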

Good resources for starting out:

Great general Scala knowledge:

A separate section with notes on Specialization, Boxing, Inlining, Hi-Perf Scala

Serialization

  • Filo - my own library for extremely fast, serialized Scala sequences and columnar encoding
  • PBDirect - automatic serialization to/from Protobufs from Scala case classes with no need for writing .proto's. Perfect for Akka.
  • Grisu-scala - much faster double to string conversion
  • Extracting case class param names using Macros

Java, not Scala

  • Simple Binary Encoding - supposedly 20-50x faster than Google Protobuf !!

  • Comparison of Cap'n Proto, SBE, FlatBuffers from the Cap'n Proto people

    • Cap'n Proto native layout uses 64-bit words and relies on separate packing/unpacking to achieve an efficient wire representation. Has RPC (but not for Java). Bitset support. Java support is third-party.
    • Flatbuffers is from Google. 32-bit word size, more compact native representation, native Java support.
    • Both Cap'n Proto and Flatbuffers allow random access of lists, whereas SBE is really only for streaming access
  • Sidney - an experimental columnar nested struct serializer, with Parquet-like repetition counts

  • Boon ByteBuf and the JavaDoc - a very easy to use, auto-growable ByteBuffer replacement, good for efficient IO

  • Fast-Serialization - a drop in replacement for Java Serialization but much faster

  • Reed-Solomon Erasure Coding Library from Backblaze. Recover or repair from missing chunks of data; a potential alternative to replication

Configuration

  • Knobs - Scala config library with reactive change detection, env var substitution, can read from Typesafe Config/HOCON, ZK, AWS
  • How to use Typesafe Config across multiple environments

General

  • REPL as a service - would be kick ass if integrated into Spark

  • Enumeratum - a Scala Enum library, much better than built in Enumeration

  • Quicklens - modify deeply nested fields in case classes

  • Ammonite - Scala DSL for easy BASH-like filesystem operations

  • Better-Files - Great library for easy file operations / NIO utils

  • IScala - Scala backend for IPython. Looks promising. There is also Scala Notebook but it's more of a research project.

  • Scaposer - i18n / .po file library

  • Adding Reflection to Scala Macros - example of using reflection in an annotation macro to add automatic ByteBuffer serialization to case classes :)

  • Scaldi - A lightweight dependency injection library, with Akka integration

  • lamma.io - the easiest date generation library

  • Pimpathon - a set of useful pimp-my-library extensions

  • Scala-rainbow - super simple terminal color output, easier than Console.XXX

Build, Tooling

  • Run Scala scripts with dependencies - ie you don't need a project file

  • sbt-assembly 0.10.2 supports adding a shell script to your jar to make it executable! No more "java ...." to start your Scala program, and no more ps ax | grep java | grep ....

  • acyclic - a compiler plugin to detect cyclical dependencies between source files. Eliminate them for faster builds!

  • Splain - a compiler plugin for better, more descriptive error messages!

  • sbt-view - an SBT plugin to make it easy to view JavaDoc/ScalaDoc of dependencies or your own project

  • sbt-alldocs - SBT plugin (1.0.0+) to generate docs for ALL dependent jars!

  • Other useful SBT plugins - sbt-sonatype, sbt-pom-reader, sbt-sound, plugins page

  • Jitpack - jar packaging for Github repos with no jar publishing (or for non-released versions)

  • sbt-big-project - a plugin to speed up compilation when there are hundreds of projects

  • Coursier - a much improved jar dependency fetcher, written in pure Scala. Has SBT plugin, programmatic API

  • SCoverage - statement coverage tool, much more useful than line-based or branch-based tools. Has SBT plugin. Blog post on why it's an improvement.

  • sbt-jmh - Plugin for running SBT projects with the JMH microbench profiling tool. Also see jmh-profilers project.

  • Profiling JVM Applications - a great guide to FlameGraphs and other tools

  • jmh-visualizer for visualizing JMH result runs

  • sbt-jol - inspect Scala/Java object memory layout

  • Airframe Surface - a great small library to determine type and class param info

  • Clouseau - SBT and runtime plugin to get size of object graphs

  • Comcast - a tool to inject network latency, and less-severe issues

  • CharybdeFS - FUSE layer to inject filesystem faults for testing

  • Blindsight - structured logging library for Scala

  • Adaptive microbenchmarking of big data - really neat JVM agent which allows turning benchmarking code on and off for better benchmarking

  • SBT updates - Tool for discovering updated versions of SBT dependencies

  • Twitter Iago - Perf load test tool based on replaying logs. Compare vs Gatling for example.

  • Thyme and Parsley - microbenchmarking and profiling tools, seems useful

  • ScalaStyle - Scala style checker / linter

  • Towards a Safer Scala - great talk/slides on tools for Scala linting and static analysis

  • utest - a small micro test framework

  • lions share - a neat JVM heap and GC analysis tool, with charts and SBT integration.

SBT Alternatives

  • Pants - Twitter's production build system, supports Scala and many other languages
  • CBT - Much simpler build language using methods and classes instead of a DSL.
  • Fury - Jon Pretty's new experimental build tool
  • SBuild - really old, not updated since 2014.

Scala and JVM Concurrency

There are multiple paradigms for concurrency in Scala and it is vital to be familiar with all of them. Java-style low-level shared-mutable-state concurrency won't be covered here much -- we focus on the Scala paradigms that produce safe, easy-to-use concurrency.

  • Why Async - An excellent overview of async architecture from Async I/O all the way up to application layer.

Futures and Tasks

Scala has built-in Futures which are good for eager, memoized, asynchronous, one-shot computations. They run in a thread pool and their results are stored in the Future object for multiple consumers. The really nice thing about Futures is that they are strongly typed and composable - you can use map, flatMap, filter etc. to easily chain together asynchronous computation. Many libraries supplement the built-in functionality, and most Scala database libs rely on Future to deliver non-blocking, async I/O.
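
As a rough illustration of that composability, here is a small sketch using only the standard library. `fetchUserId` and `fetchScore` are hypothetical stand-ins for real async calls (a database lookup, say), and the global execution context is used purely for convenience.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  // Hypothetical stand-ins for asynchronous calls such as database lookups.
  def fetchUserId(name: String): Future[Int] = Future(name.length)
  def fetchScore(userId: Int): Future[Double] = Future(userId * 1.5)

  // The for comprehension desugars to flatMap/map, chaining the two steps
  // without blocking a thread in between.
  def scoreFor(name: String): Future[String] =
    for {
      id    <- fetchUserId(name)
      score <- fetchScore(id)
    } yield s"$name -> $score"

  def main(args: Array[String]): Unit = {
    val result = scoreFor("alice")
    // Blocking with Await is only for the demo; real code would keep composing.
    println(Await.result(result, 5.seconds))
  }
}
```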

There are alternatives which offer lazy, non-memoized versions.

  • Monix Task - not only is it lazy but you can control execution - it can run synchronously or async
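
To make the eager/memoized vs lazy/non-memoized distinction concrete, here is a toy sketch. `LazyTask` is a made-up class for illustration only, not Monix's `Task` API; it just shows the idea of deferring work until it is run and choosing sync or async execution at that point.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LazyVsEager {
  // Hypothetical minimal lazy task: holds a thunk and runs it only on demand.
  final class LazyTask[A](thunk: () => A) {
    def map[B](f: A => B): LazyTask[B] = new LazyTask(() => f(thunk()))
    // Runs on the calling thread; every call re-evaluates (no memoization).
    def runSync(): A = thunk()
    // Runs on the given pool; again a fresh evaluation per call.
    def runAsync()(implicit ec: ExecutionContext): Future[A] = Future(thunk())
  }

  // Returns how many times the Future body and the task body were evaluated.
  def demo(): (Int, Int) = {
    import ExecutionContext.Implicits.global
    val futureRuns = new AtomicInteger(0)
    val taskRuns = new AtomicInteger(0)

    // A Future starts immediately and caches its result for every consumer.
    val f = Future { futureRuns.incrementAndGet() }
    Await.result(f, 5.seconds)
    Await.result(f, 5.seconds)

    // The task does nothing until run, and runs again each time it is run.
    val t = new LazyTask(() => taskRuns.incrementAndGet()).map(_ * 10)
    t.runSync()
    Await.result(t.runAsync(), 5.seconds)

    (futureRuns.get(), taskRuns.get())
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The Future body runs once no matter how many times the result is read, while the task body runs once per `runSync`/`runAsync` call.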

Actors

Akka is one of the most famous Scala libraries and where Scala Futures came from. It is known for Actors, a paradigm for always-running, distributed, resilient code that dates back to the 1970s and was popularized by the Erlang language. Actors are great for safe stateful and distributed computation, based on a shared-nothing, message-passing paradigm.
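
The shared-nothing, message-passing idea can be sketched without Akka at all. The toy below is not Akka's API: `Counter`, `Add`, and `Report` are hypothetical names, and a single-threaded executor stands in for the actor's mailbox so that state is only ever touched by one thread.

```scala
import java.util.concurrent.{CompletableFuture, Executors, TimeUnit}

object MiniActorDemo {
  sealed trait Msg
  final case class Add(n: Int) extends Msg
  final case class Report(reply: Int => Unit) extends Msg

  // Hypothetical toy actor: the single-threaded executor plays the role of the
  // mailbox, so messages are handled one at a time and `total` needs no locks.
  final class Counter {
    private val mailbox = Executors.newSingleThreadExecutor()
    private var total = 0 // private state, touched only by the mailbox thread

    // Fire-and-forget send: enqueue the message and return immediately.
    def !(msg: Msg): Unit = mailbox.execute(() => msg match {
      case Add(n)        => total += n
      case Report(reply) => reply(total)
    })

    def stop(): Unit = {
      mailbox.shutdown()
      mailbox.awaitTermination(5, TimeUnit.SECONDS)
    }
  }

  def run(): Int = {
    val counter = new Counter
    val result = new CompletableFuture[Int]()
    (1 to 10).foreach(i => counter ! Add(i))
    // Messages are processed in send order, so Report sees all ten Adds.
    counter ! Report(n => { result.complete(n); () })
    val total = result.get(5, TimeUnit.SECONDS)
    counter.stop()
    total
  }

  def main(args: Array[String]): Unit = println(run())
}
```

A real actor library adds supervision, remote messaging, and much cheaper scheduling than one thread per actor; this only shows the programming model.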

One of the coolest things built on top of Akka actors, which support remote messaging, is Akka Cluster. There is a separate section for that.

Reactive Streams

What if you want to stream multiple values and use up multiple threads or async I/O? Futures are great only for one-shot, single-value results. This is where reactive streams come in - a standard for asynchronous streaming computation.

Here is a great intro to reactive streams covering why they exist, why backpressure matters, and how they compare to other paradigms.

The best Scala API for pure reactive streams I have found is Monix Observables. It is lightweight, designed for performance, and offers a high degree of control over concurrency. There is also Akka Streams, which is built on actors.
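
For a taste of what backpressure means in practice, here is a minimal sketch using the JDK's own `java.util.concurrent.Flow` interfaces (JDK 9+), which mirror the reactive streams interfaces. This is neither Monix nor Akka Streams; the `OneAtATime` subscriber is a made-up name for illustration.

```scala
import java.util.concurrent.{CountDownLatch, Flow, SubmissionPublisher, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object BackpressureDemo {
  // Subscriber that asks for one element at a time, so the publisher can never
  // push faster than this consumer is ready to handle.
  final class OneAtATime(done: CountDownLatch) extends Flow.Subscriber[Int] {
    val received = ArrayBuffer.empty[Int]
    private var subscription: Flow.Subscription = null

    override def onSubscribe(s: Flow.Subscription): Unit = {
      subscription = s
      s.request(1) // initial demand
    }
    override def onNext(item: Int): Unit = {
      received += item
      subscription.request(1) // signal readiness for the next element
    }
    override def onError(t: Throwable): Unit = done.countDown()
    override def onComplete(): Unit = done.countDown()
  }

  def run(): List[Int] = {
    val done = new CountDownLatch(1)
    val subscriber = new OneAtATime(done)
    val publisher = new SubmissionPublisher[Int]()
    publisher.subscribe(subscriber)
    // submit blocks the producer if the subscriber's buffer fills up.
    (1 to 5).foreach(i => publisher.submit(i))
    publisher.close() // onComplete fires once pending items are delivered
    done.await(5, TimeUnit.SECONDS)
    subscriber.received.toList
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The key point is that demand flows upstream via `request(n)`: the consumer, not the producer, sets the pace.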

Other possibilities:

  • RxScala - a Scala API on top of RxJava
  • Reactor - Java only, from the SpringSource guys, but has IPC/networking
  • RSocket - Cross-platform network protocol providing Rx semantics, works on top of Aeron, TCP, WebSockets, HTTP/2

Other Concurrency Libs

  • Scala.Rx - "Reactive variables" - smart variables that auto-update themselves when the values they depend on change
  • Scala Coroutines - really neat, coroutines with yield. They are more general than reactive streams, but if streaming data is your focus you are probably better off with one of the reactive streams libs.
  • Scala-gopher - a #golang-style CSP / channels implementation for Scala. Other niceties: defer()
  • LChannels - sessions/protocol programming using continuations, both local and distributed. Kind of neat.

Non-Scala:

  • Dirigiste - dynamic scalable / smarter Threadpools
  • (JAVA) JCTools - very high performance concurrent queues, used by Netty and other projects
  • (JAVA) Windmill - a library for efficient IO/Network processing, Futures based. Has per-CPU network/IO sockets.

Akka Cluster and Distributed Systems

If you are building a distributed system, you should seriously consider using Akka Cluster.

Other non-Akka (and some non-Scala) distribution libs:

HTTP / REST

  • Colossus - an extremely fast, NIO and Akka-based microservice framework. Read their blog post.

  • Socko and Xitrum - Two very fast web frameworks built on Akka and Netty

  • Quick Start to Twitter Finagle - though one should really look into Finatra

  • Airframe - whole bunch of building blocks for Scala apps including server/HTTP/metrics/logging

  • Scalaj-http - really simple REST client API. Although, the latest Spray-client has been vastly simplified as well.

Small Data

Sort of orthogonal to small vs big, but more query language related:

High performance code / Unboxed processing / Macros

  • Inliner - macros to inline collections, Option, Try, for comprehensions
  • SIMD in Scala blog post and the LMS Intrinsics library - access to Intel SIMD/SSE/etc instructions!!
  • scala-newtype - Newtypes for Scala. Wrap other types and even primitives with no runtime overhead
  • Unboxing, Runtime Specialization - a cool post on how to do really fast aggregations using unboxed integers
  • Scalaxy-streams - collection of macros for performant for loops, foreach etc. The old project scalaxy.
  • Metal - fast unboxed Scala data structures. Includes a fast no-allocation Pointer type that replaces Iterator.
  • OptionVal - no-allocation but type safe replacement for Option
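
As a rough picture of the "wrap a primitive without overhead" idea, here is a sketch using plain Scala value classes rather than any of the libraries above. `UserId`, `OrderId`, and `describe` are hypothetical names. Note the assumption: value classes avoid allocation in most direct uses but still box in generic contexts and arrays, which is part of why libraries like scala-newtype exist.

```scala
object NewtypeDemo {
  // Value classes wrap a single field; in most direct uses the compiler
  // represents UserId as a plain Long at runtime.
  final class UserId(val value: Long) extends AnyVal
  final class OrderId(val value: Long) extends AnyVal

  // Only a UserId is accepted here, even though both wrappers hold a Long.
  def describe(id: UserId): String = s"user#${id.value}"

  def main(args: Array[String]): Unit = {
    println(describe(new UserId(42L)))
    // describe(new OrderId(42L)) // does not compile: OrderId is not a UserId
  }
}
```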

Off-heap Data Structures / Unsafe

  • mysafe - Unsafe memory access/leak checker
  • ObjectLayout - efficient struct-within-array data structures

  • jvm-unsafe-utils - a library for working with Unsafe, from @rxin of Spark/Shark fame.

  • Agrona and blog post - a ByteBuffer wrapper, off-heap, with atomic / thread-safe update operations. Good for building off heap data structures.

  • Byte-buddy, a Java class generation library

  • OHC - Java off-heap cache

  • LWJGL - Potentially useful: very fast off heap memory allocators without limitations of allocateDirect; OpenCL library

  • jnr-ffi - Java Foreign Function Interface, used by JRuby to provide MUCH simpler interface to C code than JNI. Has native memory allocators and utilities for Struct types.
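
To show the general shape of an off-heap, struct-within-array layout, here is a minimal sketch using only `java.nio.ByteBuffer.allocateDirect`. It is not the API of any library listed above; `PointArray` and its 12-byte record layout are assumptions made up for the example.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object OffHeapPoints {
  // Hypothetical fixed-width record: x (Int, 4 bytes) followed by y (Double, 8 bytes).
  private val RecordSize = 12

  final class PointArray(capacity: Int) {
    // allocateDirect places the bytes outside the Java heap, zero-initialized.
    private val buf =
      ByteBuffer.allocateDirect(capacity * RecordSize).order(ByteOrder.nativeOrder())

    // Absolute puts/gets address fields by byte offset; no objects per record.
    def set(i: Int, x: Int, y: Double): Unit = {
      val base = i * RecordSize
      buf.putInt(base, x)
      buf.putDouble(base + 4, y)
    }
    def x(i: Int): Int = buf.getInt(i * RecordSize)
    def y(i: Int): Double = buf.getDouble(i * RecordSize + 4)
  }

  def main(args: Array[String]): Unit = {
    val pts = new PointArray(4)
    pts.set(2, 7, 1.5)
    println(s"${pts.x(2)}, ${pts.y(2)}")
  }
}
```

All records live in one contiguous buffer, so the GC sees a single small object regardless of how many points are stored.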

Better collections, Numeric Processing

  • Breeze, Spire, and Saddle - Scala numeric libraries

    • spire-ops - a set of macros for no-overhead implicit operator enrichment
  • Framian - a new data frame implementation from the authors of Spire

  • Scala DataTable - An immutable, updatable table with heterogeneous types of columns. Easily add columns or rows, and have easy Scala collection APIs for iteration.

  • Squants - The Scala API for Quantities, Units of Measure and Dimensional Analysis

  • Airframe Metrics - human-readable representations of time, data byte size, etc.

  • An immutable priority map for Scala

  • product-collections - useful library for working with collections of tuples. Also, great strongly-typed CSV parser.

  • SuperFastHash - also see Murmur3

  • LZ4-Java - very fast compression, but also has version of XXHash - much faster than even Murmur3

  • Squash compression benchmarks

  • SmoothieMap2 - a low-memory implementation of Google SwissTable for the JVM

  • bloom-filter-scala - and accompanying blog post explaining why it's the fastest bloom filter in the JVM

  • Moment Sketches - moment-based quantile sketches for summarizing quantile/histogram/latency data

  • Histogrammar - library for creating and plotting histograms

Relevant: Histogram - the Ultimate Guide of Binning

Database Libs

  • Asyncpools - Akka-based async connection pool for Slick. Akka 2.2 / Scala 2.10.

  • Postgresql-Async - Netty-based async drivers for PostgreSQL and MySQL

  • Relate - a very lightweight, fast Scala wrapper on top of JDBC

Caching

  • Cacheable - a clever memoization / caching library (with Guava, Redis, Memcached or EHCache backends) using Scala 2.10 macros to remember function parameters

Big Data Processing

Big Data Projects - not necessarily Scala

  • Great list of Big Data Projects
  • List of Database Papers
  • List of free big data sources - includes some Socrata datasets, climate data, data from Google, tweets, etc.
  • Debasish G's list of streaming papers and algorithms - esp stuff on CountMinSketch and HyperLogLog
  • Cubert - CUBE operator + fast "cost-based" block storage on Hadoop / Tez / Spark
  • Kylin - OLAP CUBEs from HIVE tables, includes query layer
  • MacroBase - a Stanford / Peter Bailis project to find anomalies in real time/over streaming data
  • Aesop - a scalable pub-sub / change propagation system, esp between different datastores, with reliability. Based on LinkedIn DataBus, supports pull or push producers.
  • Making Zookeeper Resilient, an excellent blog post from Pinterest
  • ImpalaToGo - run Cloudera Impala directly on S3 files without HDFS!
  • Calcite - new Apache project, offers ANSI SQL syntax over regular files and other input sources
  • redash.io - data visualization / collaboration. TODO: integrate this with Spark SQL / Hive...

Spark

Geospatial and Graph

  • GeoTrellis - distributed raster processing on Spark. Also see GeoMesa - distributed vector database + feature filtering

  • ApertureTiles - system using Spark to generate a tile pyramid for interactive analytical geo exploration

  • Twofishes - Foursquare's Scala-based coarse forward and reverse geocoder

  • SFCurve - a Scala spatial curve library from the excellent folks at LocationTech

  • trails - parser combinators for graph traversal. Supports Tinker/Blueprints/Neo4j APIs.

  • scala-graph - in-memory graph API based on scala collections. Work in progress.

Big Data Storage

  • Phantom - Scala DSL for Cassandra, supports CQL3 collections, CQL generation from data models, async API based on Datastax driver. A bit heavyweight though.

  • Troy - A lightweight type safe wrapper around CQL/Cassandra client. Focused on CQL type safety.

  • Athena - Asynchronous Cassandra client built on Akka-IO

  • CCM - easily build local Cassandra clusters for testing!

  • Stubbed Cassandra - super useful for testing C* apps

  • Pithos - an S3-API-compatible object store for Cassandra

  • Cassandra Compaction and Tombstoning

  • Sirius - Akka-based in-memory fast key-value store for JVM objects, with Paxos consistency, persistence/txn logs, HA recovery

  • Ivory - An immutable, versioned, RDF-triple / fact store for feature extraction / machine learning

  • Hibari - ordered key-value store using chain replication for strong consistency

  • Storehaus - Twitter's key-value wrapper around Redis, MySql, and other stores. Has a neat merge() functionality for aggregation of values, lists, etc.

  • ArDB - like Redis, but with spatial indexes, and pluggable storage engines

  • MapDB - Not a database, but rather a database engine with tunable consistency / ACIDness; support for off-heap memory; fast performance; indexing and other features.

  • HPaste - a nice Scala client for HBase

  • OctopusDB paper - interesting idea of using a WAL of RDF triples as the primary storage, with secondary views of row or column orientation

JVM Other

A separate section with notes on Specialization, Boxing, Inlining, Hi-Perf Scala

Monitoring / Infrastructure

  • DLite - Easier way to run Docker on OSX
  • Keywhiz - a store for infrastructure secrets
  • Ranwhen - Visualize when your system was running, graphs in Terminal
  • HTrace - distributed tracing library, can dump data to Zipkin or HBase
  • Chaos Mesh - Cloud native (ie Kubernetes-integrated) chaos engineering/injection system!
  • cass_top - simple top utility for cass clusters
  • Grafana and Graphene - great replacement UIs for the clunky default Graphite UI
  • Elastic Mesos - create Mesos clusters on AWS with ZK, HDFS
  • Clustering Graphite - in depth look at how to scale out Graphite clusters

Sublime Text

I love Sublime and use it for everything, even Scala! Going to put my Sublime stuff in a separate page.

Best Practices and Design

Swift

Other Random Stuff

  • A list of great docs

  • Awesome public datasets - no doubt some are Socrata sites!

  • Mermaid - think of it as Markdown for diagrams. Now integrated with remark.js AND GITHUB!!

  • Asciinema - record your terminal sessions!

  • Monodraw - diagrams as ASCII. Not free.

  • Markdeep - Markdown++ with diagrams, add single line at bottom to convert to HTML!

  • How To Be a Great Developer - a reminder to be empathetic, humble, and make lives around us better. Awesome list.

  • Choosing a Model for your Open Source Business

  • JQ - JSON processor for the shell. Super useful with RESTful servers.

  • Hub - CLI for Github :)

  • Git WebUI - easy to install web UI for visualizing git history, different branches.

  • Underscore-CLI - a Node-JS based command line JSON parser

  • MacroPy - Scala-like macros, case classes, pattern matching, parser combos for Python (!!)

  • Scala 2.11 vs Swift - Apple's new iOS language is often compared to Scala.

  • Real World OCaml

  • Gherkin - a Lisp implemented in bash !!

  • Futhark - "High-performance purely functional data-parallel array programming on the GPU" - a language for efficient GPU computation

  • Nimrod (since renamed Nim) - a neat, compile-straight-to-binary, static systems language with beautiful Python-like syntax, union types, generics, macros, first-class functions. What Go should have been.

  • Pony - A capabilities-based Actor-centric static language, deadlock-free, null-free, data-race-free!

  • ROC - A statically-typed functional language, with automatic memory management, which aims to be faster than Go and Swift

  • Bret Victor - A set of excellent essays and talks from a great visual designer

  • George Mack's Razors - a compilation of razors to make decisions by

Tips on installing Ruby

because it's so darn painful.

  • On OSX: make sure setUID bit is not set on dtrace: sudo chmod -s /usr/sbin/dtrace (see this Homebrew issue)
  • Try chruby and ruby-install instead of rbenv. They install rubies into /opt/rubies and are lighter weight. There is also chruby-fish for the fish shell.