
Live Free or Dichotomize - Using AWK and R to parse 25tb

https://livefreeordichotomize.com/2019/06/04/using_awk_and_r_to_parse_25tb/

Love this very much. I use a lot of awk, and I have a few suggestions. If you are reaching for grep, remember that $0 ~ /grep search/ is the awk equivalent. Also, when loading multiple files, ENDFILE (a gawk extension) is your friend, because you can do work in BEGIN, the main {} block, ENDFILE, and END.
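A minimal sketch of that shape (gawk, since ENDFILE is GNU-specific; the /err/ pattern and file names are just placeholders):

```
gawk '
  BEGIN      { print "scanning..." }
  $0 ~ /err/ { n++; total++ }                        # grep-style match on the whole record
  ENDFILE    { printf "%s: %d\n", FILENAME, n; n=0 } # per-file summary, then reset
  END        { printf "total: %d\n", total }
' file1.tsv file2.tsv
```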

I have found Bash 5's array structures with mapfile to be an excellent way to manage this and parallelize it. Concerned about CPU? Wrap the procedure in an until loop keyed to $(grep -c processor /proc/cpuinfo) and adjust accordingly; see the sketch below.
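Something like this (a sketch of the pattern as I understand it; the find expression, the awk pattern, and the output names are placeholders, and wait -n needs Bash 4.3+):

```
#!/usr/bin/env bash
# Fill an array with the work list, then fan out one awk job per file,
# capped at the machine's core count.
mapfile -t files < <(find ./data -name '*.tsv')
cores=$(grep -c processor /proc/cpuinfo)

for f in "${files[@]}"; do
  # Throttle: wait until a slot frees up before launching another job.
  until (( $(jobs -rp | wc -l) < cores )); do
    wait -n   # block until any one background job finishes
  done
  awk '$0 ~ /grep search/' "$f" > "${f}.out" &
done
wait   # let the last batch of jobs drain
```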

I have seen the exact things you have. I have done awk work on massive datasets where a one-hour data pull took a Hive query 11 minutes, and parallelizing it didn't help. Meanwhile, I did the same thing with awk on a node with less horsepower than my cellphone and got the whole day done in tens of seconds.