LFOD/LFOD.github.io

posts/2019-06-04-using-awk-and-r-to-parse-25tb/index

Opened this issue · 1 comment

- Using AWK and R to parse 25tb

Recently I was tasked with parsing 25tb of raw genotype data. This is the story of how I brought the query time and cost down from 8 minutes and $20 to a tenth of a second and less than a penny, plus the lessons learned along the way.

https://livefreeordichotomize.com/posts/2019-06-04-using-awk-and-r-to-parse-25tb/index.html

ercbk commented

Very much enjoyed reading about your journey, especially the AWK stuff. Thanks for writing it.

I've never used Spark on a job remotely like this, but the unbalanced-partition problem sounds like a classic data skew problem. I thought this was handled automatically by Adaptive Query Execution (AQE), but AQE only arrived in Spark 3.0 and isn't enabled by default until 3.2, so I wonder whether you were on an older version of Spark (given the post's 2019 date, presumably 2.x, which has no AQE at all). There's a small config sketch at the bottom of this comment.

I was also surprised that dplyr::filter beat index-based filtering for your use case. You might want to try vctrs::vec_slice, as it's supposed to be the fastest of the three; a rough benchmark sketch is below as well.

I'm sure you aren't interested in refactoring given how well your solution performs now and the amount of work you put into it, but I have some notes in my notebook on Spark optimization and skew that may help with a few of these issues if you ever decide to revisit it:
- Partitioning
- Data Skew
- Optimization
- Tidyverse Replacements
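To be concrete about the AQE point, the skew handling I had in mind is just Spark SQL configuration. Here is a minimal sketch of turning it on from sparklyr, assuming a Spark 3.2+ cluster reached over YARN — the master, version, and the sparklyr route itself are my assumptions, not details from the post:

```r
library(sparklyr)

# Sketch only: master/version are placeholders for whatever your EMR setup uses.
config <- spark_config()
config[["spark.sql.adaptive.enabled"]]                    <- "true"  # AQE (on by default in 3.2+)
config[["spark.sql.adaptive.skewJoin.enabled"]]           <- "true"  # split oversized skewed partitions at join time
config[["spark.sql.adaptive.coalescePartitions.enabled"]] <- "true"  # merge tiny post-shuffle partitions

sc <- spark_connect(master = "yarn", version = "3.2", config = config)
```

With those settings, Spark re-plans shuffles at runtime using the actual partition sizes, which is exactly the unbalanced-partition situation you described.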
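And on the filtering point, here's a rough way to compare the three approaches. The data frame below is a made-up stand-in for one of your chunks (not your real genotype data), so treat it as a sketch to rerun on your actual chunk sizes:

```r
library(dplyr)
library(vctrs)
library(bench)

# Toy stand-in for one chunk: ~10M rows keyed by a SNP-like integer id
df  <- data.frame(snp = sample(1e5, 1e7, replace = TRUE), value = rnorm(1e7))
hit <- 42L

bench::mark(
  base_index   = df[df$snp == hit, ],
  dplyr_filter = dplyr::filter(df, snp == hit),
  vec_slice    = vctrs::vec_slice(df, which(df$snp == hit)),
  check = FALSE  # same rows come back, but row names/attributes differ across methods
)
```

Whichever wins on this toy data may not win on your real chunks, which is why I'd benchmark against the actual files before refactoring anything.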