Data engineering, programming, and other news and articles I read, related to my profession and interests.
-
GitHub's post-mortem on their recent (2018-10-21) incident. It's still not clear to me how it was possible to get unreplicated writes in one of the datacenters. I'm also curious how databases like Cassandra would behave under such a failure. Plus, when you run a service that large, chaos engineering is a must: parts of the system have to fail continuously and intentionally! (A tiny fault-injection sketch follows after the link.)
https://blog.github.com/2018-10-30-oct21-post-incident-analysis/
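A minimal sketch of what the "intentionally fail" idea could look like at the request level; all names here are hypothetical and not from the post-mortem:

```typescript
// Hypothetical fault-injection wrapper: make a small fraction of calls to a
// downstream dependency fail on purpose, so fallback paths get exercised
// continuously instead of only during a real outage.
async function withChaos<T>(call: () => Promise<T>, failureRate = 0.01): Promise<T> {
  if (Math.random() < failureRate) {
    throw new Error("chaos: injected dependency failure");
  }
  return call();
}

// Example: wrap a (hypothetical) replica read and fall back to the primary.
async function readUser(id: string): Promise<string> {
  try {
    return await withChaos(() => readFromReplica(id));
  } catch {
    return readFromPrimary(id);
  }
}

// Stubs standing in for real datastore clients.
async function readFromReplica(id: string): Promise<string> { return `replica:${id}`; }
async function readFromPrimary(id: string): Promise<string> { return `primary:${id}`; }
```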
-
How to handle failing messages and retries when consuming from Kafka. Until now, I was doing the DLQ without retries. Count-based retry queues are a neat idea, although they look a bit operationally heavy (a minimal sketch below).
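A minimal sketch of the count-based retry pattern, assuming kafkajs and hypothetical topic names (`orders`, `orders-retry-1`, `orders-retry-2`, `orders-dlq`); the note doesn't prescribe a specific client:

```typescript
import { Kafka } from "kafkajs";

// Hypothetical topic layout: main topic, two retry topics, then the DLQ.
const RETRY_TOPICS = ["orders-retry-1", "orders-retry-2"];
const DLQ_TOPIC = "orders-dlq";

const kafka = new Kafka({ clientId: "retry-demo", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "orders-processor" });
const producer = kafka.producer();

async function run(): Promise<void> {
  await Promise.all([consumer.connect(), producer.connect()]);
  // One consumer group reads the main topic and every retry topic.
  await consumer.subscribe({ topics: ["orders", ...RETRY_TOPICS] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const attempt = Number(message.headers?.attempt?.toString() ?? "0");
      try {
        await processOrder(message.value?.toString() ?? "");
      } catch (err) {
        // Route to the next retry topic based on the attempt count,
        // or park the message on the DLQ once retries are exhausted.
        const nextTopic = attempt < RETRY_TOPICS.length ? RETRY_TOPICS[attempt] : DLQ_TOPIC;
        await producer.send({
          topic: nextTopic,
          messages: [{
            key: message.key,
            value: message.value,
            headers: { attempt: String(attempt + 1) },
          }],
        });
      }
    },
  });
}

// Stub business logic standing in for the real handler.
async function processOrder(payload: string): Promise<void> {
  if (!payload) throw new Error("empty payload");
}

run().catch(console.error);
```

In practice each retry topic would usually also add a delay before reprocessing, which is part of the operational weight mentioned above.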
-
Pretty simple tutorial on how to build a home-grown smart-home alerting system using a Pi, KSQL, H2O, Pushbullet and Hass.io.
-
Conda libraries come preconfigured and optimized, in contrast to pip libraries. Wow, I didn't know that.
-
I've always thought it would be very useful to have a node.js-based big data processing system. Many people are JS-only and could leverage it. I still haven't seen one, but this tutorial at least shows how to quickly stream files into a node.js processor (sketch after the link).
https://itnext.io/using-node-js-to-read-really-really-large-files-pt-1-d2057fe76b33
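A quick sketch of the core idea, assuming Node's built-in fs/readline modules rather than the exact code from the article:

```typescript
import { createReadStream } from "fs";
import { createInterface } from "readline";

// Stream a huge file line by line instead of loading it into memory at once.
async function countLines(path: string): Promise<number> {
  const rl = createInterface({
    input: createReadStream(path),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let lines = 0;
  for await (const line of rl) {
    // per-line processing would go here; this sketch only counts lines
    lines++;
  }
  return lines;
}

countLines("big-file.txt").then((n) => console.log(`lines: ${n}`));
```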
-
Mm, nice. Offloading HBase deep/cold storage to S3 (or any cloud storage?). Doesn't look straightforward, as the manual has 50 pages.
-
Pulsar: another system that can offload deep/cold storage to S3 (or any cloud storage?). I wonder when Kafka will finally support such an approach.
https://streaml.io/blog/configuring-apache-pulsar-tiered-storage-with-amazon-s3
-
How to run Zookeeper at a really large scale? Twitter shares their approach. I learned about Zookeeper local sessions, and more about Observers and operations. I haven't had such a use case yet. I'm curious how embedded approaches like https://atomix.io/ (based on Raft) would perform in such use cases, because it's always a pain to run an additional service like ZK.
https://blog.twitter.com/engineering/en_us/topics/infrastructure/2018/zookeeper-at-twitter.html
-
Why-across-time provenance. On the subject of debuggability of distributed systems. This kind of feature could be implemented platform-wide in things like Atlas.
http://delivery.acm.org/10.1145/3270000/3267839/p333-Whittaker.pdf
-
Pretty good comparison of the performance and concurrency of Hive, Impala and Google BigQuery. I'm curious how Presto would perform here. Unfortunately, the benchmark isn't open-sourced to make it repeatable by others (even without the company's private queries).
https://medium.com/@TechKing/benchmarking-google-bigquery-at-scale-13e1e85f3bec
This is just an experiment. I'm not sure if I will systematically put stuff on this list. And of course I just started the list today (2018-10-15), so I'm not even thinking about adding here what I've already read in my life :)
Kudos:
- for the idea, to https://github.com/sderosiaux and his list https://github.com/sderosiaux/every-single-day-i-tldr.
- for a great weekly newsletter https://dataengweekly.com by [Joe Crobak](https://www.linkedin.com/in/joecrobak)