ArtPoon/pangolin

Reduce memory footprint

Closed this issue · 1 comment

Originally I was not able to process 20K+ sequences because my workstation ran out of memory while running pangolearn.py. There seem to be two memory-intensive steps in this script:

  1. loading and encoding the sequence data as "one-hot" vectors
  2. generating a pandas data frame from these vectors (a rough sketch of both steps follows)
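
To illustrate where the memory goes, here is a minimal sketch of those two steps. This is not the actual pangolearn.py code; the alphabet, dtype, and flat layout are assumptions, but the arithmetic in the comments shows why 20K full-length genomes don't fit:

```python
import numpy as np
import pandas as pd

ALPHABET = "ACGT-"  # assumed encoding alphabet


def one_hot(seq):
    """Encode a nucleotide sequence as a flat one-hot vector over ALPHABET."""
    vec = np.zeros(len(seq) * len(ALPHABET), dtype=np.float64)
    for i, base in enumerate(seq):
        j = ALPHABET.find(base)
        if j >= 0:
            vec[i * len(ALPHABET) + j] = 1.0
    return vec


# ~30,000 sites x 5 states x 8 bytes is about 1.2 MB per sequence, so 20K
# sequences approach 24 GB before the DataFrame step makes another copy.
seqs = ["ACGT-" * 6000] * 3              # stand-in for the real alignment
matrix = np.vstack([one_hot(s) for s in seqs])   # step 1: encode everything
df = pd.DataFrame(matrix)                        # step 2: wrap in a data frame
```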

I am attempting to reduce the RAM consumed by this script by (1) filtering sequences to the required sites (indices) on load, and (2) processing subsets of the filtered data as pandas data frames of a fixed maximum size.
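
Roughly, the change looks like the sketch below. Names such as `keep_sites`, `iter_chunks`, and the chunk size are placeholders for illustration, not the actual patch:

```python
import numpy as np
import pandas as pd

ALPHABET = "ACGT-"
CHUNK_SIZE = 1000  # maximum number of sequences per data frame (placeholder)


def one_hot_filtered(seq, keep_sites):
    """One-hot encode only the sites (indices) the trained model requires."""
    vec = np.zeros(len(keep_sites) * len(ALPHABET), dtype=np.uint8)
    for k, site in enumerate(keep_sites):
        j = ALPHABET.find(seq[site])
        if j >= 0:
            vec[k * len(ALPHABET) + j] = 1
    return vec


def iter_chunks(seqs, keep_sites, chunk_size=CHUNK_SIZE):
    """Yield fixed-size pandas data frames instead of one monolithic frame."""
    batch = []
    for seq in seqs:
        batch.append(one_hot_filtered(seq, keep_sites))
        if len(batch) == chunk_size:
            yield pd.DataFrame(np.vstack(batch))
            batch = []
    if batch:
        yield pd.DataFrame(np.vstack(batch))


# usage: run the classifier on each chunk and concatenate the predictions,
# e.g. results = [model.predict(chunk) for chunk in iter_chunks(seqs, keep_sites)]
```

Filtering on load keeps the per-sequence vector proportional to the number of sites the model actually uses, and chunking caps the size of any single data frame regardless of how many sequences are in the input.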

Reading the entire UK data set still requires too much memory (over 16 GB), even after filtering out unused sites.