links

Just a bunch of useful links. BTW see rust links as well.

Scala

Serialization / Off-heap Data Structures / Unsafe

  • Simple Binary Encoding - supposedly 20-50x faster than Google Protobuf !!
  • Comparison of Cap'n Proto, SBE, FlatBuffers from the Cap'n Proto people
    • Cap'n Proto's native layout uses 64-bit words and relies on separate packing/unpacking to achieve an efficient wire representation. Has RPC (but not for Java). Bitset support. Java support is third-party.
    • FlatBuffers is from Google. 32-bit word size, more compact native representation, native Java support.
    • Both Cap'n Proto and FlatBuffers allow random access of lists, whereas SBE is really only for streaming access
  • Using Unsafe for C-like memory access speeds - a great guide. Many Unsafe operations turn into Java intrinsics - which translate to direct machine code
  • Scala-offheap - fast, safe off heap objects
  • FastTuple - a dynamic (runtime-defined) C-style struct library, with support for off-heap storage. Only works for primitives right now :(
    • and the excellent blog covers all of the on- and off-heap access and allocation patterns on the JVM very thoroughly.
  • ObjectLayout - efficient struct-within-array data structures
  • jvm-unsafe-utils - a library for working with Unsafe, from @rxin of Spark/Shark fame.
  • Agrona and blog post - a ByteBuffer wrapper, off-heap, with atomic / thread-safe update operations. Good for building off heap data structures.
  • Sidney - an experimental columnar nested struct serializer, with Parquet-like repetition counts
  • OHC - Java off-heap cache
  • Boon ByteBuf and the JavaDoc - a very easy to use, auto-growable ByteBuffer replacement, good for efficient IO
  • Jawn - @d6's new fast JSON parser, parses to multiple ASTs including rojoma-json, spray-json, argonaut
  • Grisu-scala - much faster double to string conversion
  • Extracting case class param names using Macros
  • Fast-Serialization - a drop in replacement for Java Serialization but much faster
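
For a feel of what the Unsafe-style off-heap access mentioned above looks like, here is a minimal sketch. It is not taken from any of the linked projects: the class name OffHeapLongArray is made up for illustration, and it assumes a HotSpot-style JVM where sun.misc.Unsafe can still be reached by reflection (recent JDKs deprecate these memory-access methods and may print warnings).

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical helper: a fixed-size array of longs stored outside the Java heap.
final class OffHeapLongArray implements AutoCloseable {
    private static final Unsafe UNSAFE = loadUnsafe();

    private final long base;    // raw address of the off-heap block
    private final long length;  // number of 8-byte slots

    OffHeapLongArray(long length) {
        this.length = length;
        this.base = UNSAFE.allocateMemory(length * Long.BYTES);
        // allocateMemory returns uninitialized memory, so zero it explicitly.
        UNSAFE.setMemory(base, length * Long.BYTES, (byte) 0);
    }

    // The singleton is hidden in a private static field; reflection is the usual way in.
    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible on this JVM", e);
        }
    }

    long get(long index) {
        checkIndex(index);
        return UNSAFE.getLong(base + index * Long.BYTES);
    }

    void set(long index, long value) {
        checkIndex(index);
        UNSAFE.putLong(base + index * Long.BYTES, value);
    }

    // Unsafe does no bounds checking; a bad address can crash the JVM, so check here.
    private void checkIndex(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
    }

    // Off-heap memory is invisible to the GC and must be freed by hand.
    @Override
    public void close() {
        UNSAFE.freeMemory(base);
    }

    public static void main(String[] args) {
        try (OffHeapLongArray arr = new OffHeapLongArray(4)) {
            arr.set(2, 42L);
            System.out.println(arr.get(2));
        }
    }
}
```

The bounds check and the explicit close() are the price of leaving the heap: nothing here is tracked by the garbage collector, which is also why libraries like the ones listed above wrap this in safer APIs.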

Concurrency, Actors

Reactive Streams

Database Libs

  • Asyncpools - Akka-based async connection pool for Slick. Akka 2.2 / Scala 2.10.

  • Postgresql-Async - Netty-based async drivers for PostgreSQL and MySQL

  • Relate - a very lightweight, fast Scala wrapper on top of JDBC

Caching

  • Cacheable - a clever memoization / caching library (with Guava, Redis, Memcached or EHCache backends) using Scala 2.10 macros to remember function parameters
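
The core idea behind memoization, stripped of macros and pluggable backends, can be sketched in a few lines. This is only an illustration of the technique, not Cacheable's API: the Memo class and memoize method are hypothetical names, and the cache here is a plain in-memory ConcurrentHashMap keyed on the function argument.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: wraps a function so each distinct argument is computed once.
final class Memo {
    static <A, R> Function<A, R> memoize(Function<A, R> f) {
        Map<A, R> cache = new ConcurrentHashMap<>();
        // computeIfAbsent calls f only when the key is missing.
        // Caveats: null results are not cached, and f must not call back
        // into the same memoized function (recursive updates are not allowed).
        return a -> cache.computeIfAbsent(a, f);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Function<Integer, Integer> square = Memo.<Integer, Integer>memoize(n -> {
            calls.incrementAndGet();   // count how often the real work runs
            return n * n;
        });
        square.apply(12);
        square.apply(12);              // served from the cache
        System.out.println(calls.get()); // prints 1
    }
}
```

A real caching library adds what this leaves out: eviction, TTLs, and remote backends such as Redis or Memcached.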

Big Data Processing

  • Great list of Big Data Projects

  • List of Database Papers

  • List of free big data sources - includes some Socrata datasets, climate data, data from Google, tweets, etc.

  • Debasish G's list of streaming papers and algorithms - esp stuff on CountMinSketch and HyperLogLog

  • Cubert - CUBE operator + fast "cost-based" block storage on Hadoop / Tez / Spark

  • Kylin - OLAP CUBEs from HIVE tables, includes query layer

  • Aesop - a scalable pub-sub / change propagation system, esp between different datastores, with reliability. Based on LinkedIn DataBus, supports pull or push producers.

  • Making Zookeeper Resilient, an excellent blog post from Pinterest

  • ImpalaToGo - run Cloudera Impala directly on S3 files without HDFS!

  • Calcite - new Apache project, offers ANSI SQL syntax over regular files and other input sources

  • redash.io - data visualization / collaboration. TODO: integrate this with Spark SQL / Hive...

  • Fast SQL Query Parser in Scala - based on the Scala-LMS project, compiles a query down to C!

  • Probability Monad - super useful for stats or random data generation

  • stringmetric - Approximate string matching and phonetic algorithms

  • Factorie - a Scala library for Natural Language Processing based on factor graphs
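
Since CountMinSketch comes up in the streaming-papers entry above, here is a minimal sketch of the data structure for orientation. It is an illustration under stated assumptions, not code from any linked paper or library: the class layout, the seeded bit-mixing hash, and the parameter names depth and width are all choices made for this example.

```java
import java.util.Random;

// Hypothetical minimal Count-Min Sketch: approximate frequency counts in fixed memory.
final class CountMinSketch {
    private final int depth;       // number of hash rows
    private final int width;       // counters per row
    private final long[][] table;
    private final int[] seeds;     // one hash seed per row

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Mix hashCode with the row seed so each row behaves like a different hash function.
    private int bucket(Object item, int row) {
        int h = item.hashCode() ^ seeds[row];
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, width);
    }

    // Increment one counter per row.
    void add(Object item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    // Collisions only ever inflate counters, so the minimum across rows
    // is the tightest estimate and never undercounts (for non-negative adds).
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch cms = new CountMinSketch(4, 1024, 7L);
        for (int i = 0; i < 3; i++) {
            cms.add("spark", 1);
        }
        cms.add("kylin", 1);
        System.out.println(cms.estimate("spark")); // at least 3
    }
}
```

Wider rows reduce the overcount from collisions and more rows reduce the chance that every row collides; the linked papers cover how to pick both from a target error bound.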

Spark

Geospatial and Graph

  • GeoTrellis - distributed raster processing on Spark. Also see GeoMesa - distributed vector database + feature filtering

  • ApertureTiles - system using Spark to generate a tile pyramid for interactive analytical geo exploration

  • Twofishes - Foursquare's Scala-based coarse forward and reverse geocoder

  • trails - parser combinators for graph traversal. Supports Tinker/Blueprints/Neo4j APIs.

  • scala-graph - in-memory graph API based on scala collections. Work in progress.

Collections, Numeric Processing, Fast Loops

  • Breeze, Spire, and Saddle - Scala numeric libraries
    • spire-ops - a set of macros for no-overhead implicit operator enrichment
  • Framian - a new data frame implementation from the authors of Spire
  • Scala DataTable - An immutable, updatable table with heterogeneous types of columns. Easily add columns or rows, and have easy Scala collection APIs for iteration.
  • ScalaXY - collection of macros for performant for loops, extension methods etc
  • Squants - The Scala API for Quantities, Units of Measure and Dimensional Analysis
  • An immutable priority map for Scala
  • Unboxing, Runtime Specialization - a cool post on how to do really fast aggregations using unboxed integers
  • product-collections - useful library for working with collections of tuples. Also, great strongly-typed CSV parser.
  • SuperFastHash - also see Murmur3

Big Data Storage

  • Phantom - Scala DSL for Cassandra, supports CQL3 collections, CQL generation from data models, async API based on Datastax driver

  • Athena - Asynchronous Cassandra client built on Akka-IO

  • CCM - easily build local Cassandra clusters for testing!

  • SSTableAttachedSecondaryIndex - Improved Cassandra 2i, OR and many other enhancements. Requires modified C* build.

  • Stubbed Cassandra - super useful for testing C* apps

  • Pithos - an S3-API-compatible object store for Cassandra

  • Doradus - A Graph / OLAP store on top of Cassandra

  • Khronus - Time series DB built on Cassandra + Akka Cluster

  • Stratio-Cassandra - a fork with Lucene full-text search and CQL support (see the blog). Also see Stargate.

  • How CQL maps to Cassandra Internal Storage

  • Sirius - Akka-based in-memory fast key-value store for JVM objects, with Paxos consistency, persistence/txn logs, HA recovery

  • CurioDB - distributed persistent Redis built on Akka cluster, etc. :)

  • Ivory - An immutable, versioned, RDF-triple / fact store for feature extraction / machine learning

  • Hibari - ordered key-value store using chain replication for strong consistency

  • Storehaus - Twitter's key-value wrapper around Redis, MySQL, and other stores. Has a neat merge() functionality for aggregation of values, lists, etc.

  • ArDB - like Redis, but with spatial indexes, and pluggable storage engines

  • MapDB - Not a database, but rather a database engine with tunable consistency / ACIDness; support for off-heap memory; fast performance; indexing and other features.

  • HPaste - a nice Scala client for HBase

  • OctopusDB paper - interesting idea of using a WAL of RDF triples as the primary storage, with secondary views of row or column orientation

Distributed Systems

Web / REST / General

  • Scalaj-http - really simple REST API, although the latest Spray-client has been vastly simplified as well.

  • Quick Start to Twitter Finagle - though one should really look into Finatra

  • REPL as a service - would be kick ass if integrated into Spark

  • Enumeratum - a Scala Enum library, much better than built in Enumeration

  • Ammonite - Scala DSL for easy BASH-like filesystem operations

  • IScala - Scala backend for IPython. Looks promising. There is also Scala Notebook but it's more of a research project.

  • Scaposer - i18n / .po file library

  • Adding Reflection to Scala Macros - example of using reflection in an annotation macro to add automatic ByteBuffer serialization to case classes :)

  • Scaldi - A lightweight dependency injection library, with Akka integration

  • Knobs - Scala config library with reactive change detection, env var substitution, can read from Typesafe Config/HOCON, ZK, AWS

  • How to use Typesafe Config across multiple environments

  • lamma.io - the easiest date generation library

  • Pimpathon - a set of useful pimp-my-library extensions

  • Scala-rainbow - super simple terminal color output, easier than Console.XXX

Build, Tooling

  • Run Scala scripts with dependencies - ie you don't need a project file

  • sbt-assembly 0.10.2 supports adding a shell script to your jar to make it executable! No more "java ...." to start your Scala program, and no more ps ax | grep java | grep ....

  • acyclic - a compiler plugin to detect cyclical dependencies between source files. Eliminate them for faster builds!

  • Other useful SBT plugins - sbt-sonatype, sbt-pom-reader, sbt-sound, plugins page

  • SCoverage - statement coverage tool, much more useful than line-based or branch-based tools. Has SBT plugin. Blog post on why it's an improvement.

  • sbt-jmh - Plugin for running SBT projects with the JMH microbench profiling tool

  • Comcast - a tool to inject network latency, and less-severe issues

  • Adaptive microbenchmarking of big data - really neat JVM agent which allows turning benchmarking code on and off for better benchmarking

  • SBT updates - Tool for discovering updated versions of SBT dependencies

  • Twitter Iago - Perf load test tool based on replaying logs. Compare vs Gatling for example.

  • Thyme and Parsley - microbenchmarking and profiling tools, seems useful

  • ScalaStyle - Scala style checker / linter

  • Towards a Safer Scala - great talk/slides on tools for Scala linting and static analysis

  • utest - a small micro test framework

  • lions share - a neat JVM heap and GC analysis tool, with charts and SBT integration.

SBuild seems like a promising replacement for SBT. Still Scala, but much, much simpler, more like a Scala version of Make, with Maven dependency and ScalaTest support.

JVM Other

  • Swiss Java Knife - super handy collection of JVM tools. Try java -jar sjk.jar ttop -p PID -o CPU -n 10 for regular reporting of the top 10 threads by CPU usage!
  • -XX:+PerfDisableSharedMem
  • Al's Guide to Cassandra 2.1 Ops - awesome, not just for C* but tools in general
  • Al Tobey's flags for running JDK8 apps. Note: G1GC! Also no need for MaxPermSize anymore: -Xmx8G -Xms8G -Xss256k -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=0
  • Tuning Spark apps for GC - excellent write-up from Intel
  • Perils of writing isolating classloaders - Good read, tips on how to write a classloader that can isolate and load different versions of the same classes
  • Quick dumping your JVM heap using GDB -- too bad it doesn't work on OSX.
  • Start a JMX agent in running JVM: jcmd <pid> ManagementAgent.start jmxremote.port=26010 jmxremote.ssl=false jmxremote.authenticate=false
  • HeapAudit - A Java agent for lightweight production heap profiling
  • Lion's Share - tools for memory analysis, outputs Google Charts compatible output
  • jHiccup -- "Hiccup" or GC pause analysis tool
  • Bintray - friendlier alternative to Sonatype OSS / Maven central. Also see bintray-sbt plugin.
  • Changing JVM flags live - such as enabling GC logging without restarting JVM. Cool!
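
One way to flip a flag on a live JVM is from inside the process, through the HotSpot diagnostic MXBean. This is a sketch of that route only, and may not be the approach the linked post takes. It assumes a HotSpot JVM; the class and method names LiveFlags and setFlag are hypothetical, and HeapDumpOnOutOfMemoryError is used as the example because it is a "manageable" flag, while which GC-logging flags are manageable differs between JDK versions.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: change a manageable HotSpot flag without restarting the JVM.
final class LiveFlags {
    // Sets the flag and returns the value the JVM reports afterwards.
    static String setFlag(String name, boolean on) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if the flag is unknown or not writeable.
        bean.setVMOption(name, Boolean.toString(on));
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println(setFlag("HeapDumpOnOutOfMemoryError", true));
    }
}
```

The same MXBean is reachable remotely over JMX, so this pairs with the jcmd ManagementAgent.start trick above for processes you cannot redeploy.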

Monitoring / Infrastructure

  • Keywhiz - a store for infrastructure secrets
  • Ranwhen - Visualize when your system was running, graphs in Terminal
  • HTrace - distributed tracing library, can dump data to Zipkin or HBase
  • cass_top - simple top utility for cass clusters
  • Grafana and Graphene - great replacement UIs for the clunky default Graphite UI
  • Elastic Mesos - create Mesos clusters on AWS with ZK, HDFS
  • Clustering Graphite - in depth look at how to scale out Graphite clusters

Databases

Indexing and OLAP

ML and Data Science

Distributed Systems

Sublime Text

I love Sublime and use it for everything, even Scala! Going to put my Sublime stuff in a separate page.

Best Practices and Design

Other Random Stuff

  • A list of great docs

  • Awesome public datasets - no doubt some are Socrata sites!

  • Mermaid - think of it as Markdown for diagrams. Would be awesome to integrate this into reveal.js!

  • Markdeep - Markdown++ with diagrams, add single line at bottom to convert to HTML!

  • How To Be a Great Developer - a reminder to be empathetic, humble, and make lives around us better. Awesome list.

  • JQ - JSON processor for the shell. Super useful with RESTful servers.

  • Underscore-CLI - a Node-JS based command line JSON parser

  • MacroPy - Scala-like macros, case classes, pattern matching, parser combos for Python (!!)

  • Scala 2.11 vs Swift - Apple's new iOS language is often compared to Scala.

  • Real World OCaml

  • Gherkin - a Lisp implemented in bash !!

  • Nimrod - a neat, compile-straight-to-binary, static systems language with beautiful Python-like syntax, union types, generics, macros, first-class functions. What Go should have been.

  • Bret Victor - A set of excellent essays and talks from a great visual designer

Tips on installing Ruby

becoz it's so darn painful.

  • On OSX: make sure setUID bit is not set on dtrace: sudo chmod -s /usr/sbin/dtrace (see this Homebrew issue)
  • Try chruby and ruby-install instead of rbenv. They install rubies into /opt/rubies and are lighter weight; there is also chruby-fish for the fish shell.