rom1504/cc2dataset

Consider optionally moving dedup and shuffle to a second step

rom1504 opened this issue · 2 comments

The mapping step, if run on its own, needs only S3, CPU and network resources, and very little RAM and disk.

If everything works perfectly it makes sense to do it all in one stage, but it might be good to offer a multi-step option for reliability.
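A rough sketch of what the split could look like, in plain Python rather than the actual cc2dataset code. All names here (`extract_links`, `map_step`, `dedup_shuffle_step`, `run_two_steps`) are hypothetical, and a local temp directory stands in for S3. Step 1 only maps and writes intermediate parts; step 2 reads them back, dedups and shuffles, so it can be rerun on its own if it fails.

```python
import json
import random
import tempfile
from pathlib import Path


def extract_links(record):
    """Hypothetical mapper: turn one input record into (url, text) rows."""
    return [{"url": url, "text": text} for url, text in record]


def map_step(records, out_dir):
    """Step 1: map each record and write one intermediate part file per record."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, record in enumerate(records):
        part = out_dir / f"part-{i:05d}.jsonl"
        with part.open("w", encoding="utf-8") as f:
            for row in extract_links(record):
                f.write(json.dumps(row) + "\n")


def dedup_shuffle_step(in_dir, seed=0):
    """Step 2: read all parts, drop exact duplicates, then shuffle the rows."""
    seen = set()
    rows = []
    for part in sorted(Path(in_dir).glob("part-*.jsonl")):
        with part.open(encoding="utf-8") as f:
            for line in f:
                row = json.loads(line)
                key = (row["url"], row["text"])
                if key not in seen:
                    seen.add(key)
                    rows.append(row)
    random.Random(seed).shuffle(rows)
    return rows


def run_two_steps(records, seed=0):
    """Run both steps back to back, using a temp dir as the intermediate store."""
    with tempfile.TemporaryDirectory() as tmp:
        map_step(records, tmp)
        return dedup_shuffle_step(tmp, seed=seed)
```

In this sketch the memory-heavy part (the `seen` set and the shuffle) lives only in step 2, which is the point of the split: step 1 streams and writes, nothing more.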

Actually, this isn't really needed, since dedup is fast on smaller parts.

No, let's do #18 instead.