bug: correctly parse PaxDB if taxonomic ID > 4 digits
Soratake-HirotakaYajima opened this issue · 1 comments
In calculateFfactor.m, line 57, the regular expression only matches numbers with exactly four digits.
I suggest changing it as shown below.
genes = regexprep(genes,'(\d{4}).','');
==> genes = regexprep(genes,'(\d+).','');
I'm copying here the code block for reference:
% Gather Uniprot database for finding MW
uniprotDB = loadDatabases('uniprot', modelAdapter);
uniprotDB = uniprotDB.uniprot;
if ischar(protData) && endsWith(protData,'paxDB.tsv')
    fID         = fopen(fullfile(protData),'r');
    fileContent = textscan(fID,'%s','delimiter','\n');
    headerLines = sum(startsWith(fileContent{1},'#'));
    fclose(fID);
    %Read data file, excluding headerlines
    fID         = fopen(fullfile(protData),'r');
    fileContent = textscan(fID,'%s %s %f','delimiter','\t','HeaderLines',headerLines);
    genes       = fileContent{2};
    %Remove internal geneIDs modifiers
    genes       = regexprep(genes,'(\d{4}).','');
    level       = fileContent{3};
    fclose(fID);
    [a,b]       = ismember(genes,uniprotDB.genes);
    uniprot     = uniprotDB.ID(b(a));
    level(~a)   = [];
    clear protData
    protData.uniprot = uniprot;
    protData.level   = level;
end
If I understand it right, the role of that line is to remove the four leading digits and the period from each entry in the 2nd column of the provided file, by default 'paxDB.tsv':
GECKO/tutorials/full_ecModel/data/paxDB.tsv
Lines 11 to 13 in b512ea3
In this file, the column has indeed some numbers and a period preceding the gene ids that we would need.
The suggestion not to restrict it to exactly 4 digits makes the regex more generic, which is ideally what we want. My suggestion would be to improve this further by:
- removing the capturing group `( )`, since the captured text is never used
- anchoring the regex to the beginning of the string with `^`
- escaping the `.` character, since normally `.` means "any character", but here we specifically want it to mean a literal period
The end result would then be:
genes = regexprep(genes,'^\d+\.','');
The line above needs testing, as I am not fully confident in the way Matlab interprets regular expressions.
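As a starting point for that testing, here is a minimal sketch that runs the current and the proposed pattern side by side. The two sample strings are made up for illustration (they only mimic the `<number>.<geneID>` shape described above, and are not copied from paxDB.tsv), so the check should be repeated on the real file before changing calculateFfactor.m.

```matlab
% Illustrative inputs only: hypothetical entries shaped like '<number>.<geneID>'
genes = {'4932.YAL001C'; '559292.YBR123W'};

oldPattern = '(\d{4}).';   % pattern currently used in calculateFfactor.m
newPattern = '^\d+\.';     % pattern proposed in this thread

% regexprep accepts a cell array of char vectors and returns one of the same size
oldResult = regexprep(genes, oldPattern, '');
newResult = regexprep(genes, newPattern, '');

% Show input, old output and new output next to each other
disp([genes, oldResult, newResult]);
% Old pattern: 'YAL001C' and '2.YBR123W' (the 6-digit prefix is only partly removed)
% New pattern: 'YAL001C' and 'YBR123W'
```

If the output matches the comments, the anchored pattern handles prefixes of any length and, because of `^`, leaves digits inside the gene ID itself untouched.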