SysBioChalmers/GECKO

bug: correctly parse PaxDB if taxonomic ID > 4 digits

Soratake-HirotakaYajima opened this issue · 1 comments

In calculateFfactor.m, line 57, the regular expression only handles four-digit taxonomic IDs.

I suggest changing it as follows:
genes = regexprep(genes,'(\d{4}).','');
==> genes = regexprep(genes,'(\d+).','');

I'm copying here the code block for reference:

% Gather Uniprot database for finding MW
uniprotDB = loadDatabases('uniprot', modelAdapter);
uniprotDB = uniprotDB.uniprot;


if ischar(protData) && endsWith(protData,'paxDB.tsv')
    fID         = fopen(fullfile(protData),'r');
    fileContent = textscan(fID,'%s','delimiter','\n');
    headerLines = sum(startsWith(fileContent{1},'#'));
    fclose(fID);


    %Read data file, excluding headerlines
    fID         = fopen(fullfile(protData),'r');
    fileContent = textscan(fID,'%s %s %f','delimiter','\t','HeaderLines',headerLines);
    genes       = fileContent{2};
    %Remove internal geneIDs modifiers
    genes       = regexprep(genes,'(\d{4}).','');
    level       = fileContent{3};
    fclose(fID);
    [a,b]       = ismember(genes,uniprotDB.genes);
    uniprot     = uniprotDB.ID(b(a));
    level(~a)   = [];
    clear protData
    protData.uniprot = uniprot;
    protData.level   = level;
end

If I understand it correctly, the role of that line is to remove the first 4 digits and the period that follows them from the 2nd column of the provided file, by default 'paxDB.tsv':

#internal_id string_external_id abundance
1862335 4932.YKL060C 18406
1861564 4932.YHR174W 18184

In this file, the second column does indeed have a numeric prefix (the taxonomic ID) and a period preceding the gene IDs that we need.

The suggestion not to restrict the match to exactly 4 digits makes the regex more generic, which is what we want. I would improve it further by:

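  • removing the capturing group ( ), since the captured text is never used
  • anchoring the regex to the beginning of the string with ^, so that digits inside the gene ID itself (e.g. the 060C in YKL060C) cannot be matched
  • escaping the . character, since an unescaped . means "any character", but we specifically want a literal period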

The end result would then be:

genes = regexprep(genes,'^\d+\.','');

The line above needs testing, as I am not fully confident in the way Matlab interprets regular expressions.
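
One way to test this is to run the three candidate patterns on a couple of IDs and compare the results. The snippet below is only a sketch for that purpose, not GECKO code: the first ID is taken from the paxDB.tsv excerpt above, while the second is a hypothetical example with a 6-digit taxonomic ID that I am assuming for illustration. The variable names are made up as well.

```matlab
% Sketch for comparing the candidate patterns; not part of GECKO itself.
% '4932.YKL060C' comes from the paxDB.tsv excerpt in this thread.
% '511145.b0001' is a hypothetical ID with a 6-digit taxonomic prefix.
ids = {'4932.YKL060C'; '511145.b0001'};

% regexprep replaces every match by default, so unanchored patterns
% can also hit digits further along in the string.
fixedWidth = regexprep(ids, '(\d{4}).', '');   % current pattern
unanchored = regexprep(ids, '(\d+).',   '');   % first suggestion
anchored   = regexprep(ids, '^\d+\.',   '');   % anchored, escaped period

% Print the original ID followed by the three results
for i = 1:numel(ids)
    fprintf('%-14s  %-10s  %-10s  %s\n', ...
        ids{i}, fixedWidth{i}, unanchored{i}, anchored{i});
end
```

Tracing this by hand, and assuming I am reading the regexprep behaviour correctly, the current pattern gives YKL060C and 5.b0001, the unanchored \d+ pattern gives YKL and b (it also removes the digits inside the gene ID plus one following character), and the anchored pattern gives YKL060C and b0001. If that holds when run, only the anchored version is safe for both cases.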