Optimize removeIndices
Closed this issue · 2 comments
ArtPoon commented
These lines:
# for each entry in dataList, remove the irrelevant columns
while len(dataList) > 0:
    line = dataList.pop(0)
    finalLine = []
    for index in range(len(line)):
        if index in indiciesToKeep:
            finalLine.extend(line[index].vector)
    finalList.append(finalLine)
are unnecessarily iterating over every position of each genome. It should be faster to iterate over indiciesToKeep only:
for index in indiciesToKeep:
    if index < len(line):
        finalLine.extend(line[index].vector)
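To make the comparison concrete, here is a minimal, self-contained sketch of both loops on toy data. The `Site` class, the function names, and the sample values are hypothetical stand-ins; only the `.vector` attribute and the name `indiciesToKeep` come from the snippets above. It assumes `indiciesToKeep` holds non-negative indices in ascending order without duplicates, since otherwise the two loops could produce differently ordered output.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Site:
    """Hypothetical stand-in for one genome position; only `.vector` is from the issue."""
    vector: List[int]


def keep_by_scanning(line: Sequence[Site], indiciesToKeep: Sequence[int]) -> List[int]:
    # Original approach: visit every position and test membership each time.
    finalLine: List[int] = []
    for index in range(len(line)):
        if index in indiciesToKeep:
            finalLine.extend(line[index].vector)
    return finalLine


def keep_by_index(line: Sequence[Site], indiciesToKeep: Sequence[int]) -> List[int]:
    # Proposed approach: visit only the positions to keep, skipping any
    # index that falls beyond the end of this genome.
    finalLine: List[int] = []
    for index in indiciesToKeep:
        if index < len(line):
            finalLine.extend(line[index].vector)
    return finalLine


# Toy genome of 6 positions with made-up two-element vectors.
line = [Site([i, i + 10]) for i in range(6)]
indiciesToKeep = [1, 4, 9]  # 9 is out of range and is skipped by both versions

scanned = keep_by_scanning(line, indiciesToKeep)
indexed = keep_by_index(line, indiciesToKeep)
print(scanned == indexed)
```

If `indiciesToKeep` is a list, the first version also pays for a linear membership test at every position, so its cost per genome grows with the genome length times the number of kept indices, while the second grows only with the number of kept indices.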
ArtPoon commented
Timing with 100 genomes sampled from UK, original code:
(pangolin) art@orolo:~/work/sc2-clustering/data$ pangolin --outfile uk100.out uk100.fa
...
reading in data 07/27/2020, 11:50:22
removing unnecessary columns 07/27/2020, 11:50:26
loading model 07/27/2020, 11:56:07
generating predictions 07/27/2020, 11:56:08
With modified version:
(pangolin) art@orolo:~/work/sc2-clustering/data$ pangolin --outfile uk100-2.out uk100.fa
...
reading in data 07/27/2020, 11:46:04
removing unnecessary columns 07/27/2020, 11:46:08
constructing data frame07/27/2020, 11:46:09
loading model 07/27/2020, 11:46:25
generating predictions 07/27/2020, 11:46:26
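For reference, the elapsed times can be read off the two logs above. The original log has no "constructing data frame" line, so this sketch compares the span from "removing unnecessary columns" to "loading model" in both runs; the helper name `elapsed_seconds` is made up for illustration.

```python
from datetime import datetime

# Timestamp format used in the log lines above.
FMT = "%m/%d/%Y, %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# "removing unnecessary columns" -> "loading model" in each run.
original = elapsed_seconds("07/27/2020, 11:50:26", "07/27/2020, 11:56:07")
modified = elapsed_seconds("07/27/2020, 11:46:08", "07/27/2020, 11:46:25")
print(original, modified)
```

These are single runs with one-second timestamp resolution, so the figures are a rough indication only.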
Outputs are identical:
(pangolin) art@orolo:~/work/sc2-clustering/data$ diff uk100.out uk100-2.out
ArtPoon commented
Filing pull request