patzaw/BED

[searchBeid.R] Non-robust String manipulations


Function searchBeid

The function relies heavily on low-level string manipulation, which makes the code inherently fragile.

A better alternative would be to write a parser for the string:

  • it would make the code more robust;
  • it may also be faster.

Sample Parser

See my GitHub page for sample code showing how to write a parser:

https://github.com/discoleo/R/blob/master/Stat/Tools.Code.R

The function parse.simple implements a minimal parser, including the processing and matching of specific brackets. It returns a data.frame with the positions of the various tokens in the original string (without generating intermediate strings). It is implemented (mostly) as a finite state machine.

The function extract.str then extracts the tokens. The code can be adapted (e.g. to extract the desired tokens directly).

Some examples are available in:

https://github.com/discoleo/R/blob/master/Stat/Tools.Code.Tests.R
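
To illustrate the general idea (positions first, extraction second), here is a minimal sketch. It is not the code of parse.simple or extract.str: the names scan_brackets and extract_tokens are hypothetical, only round brackets are handled by default, and the "state" is reduced to the current nesting depth.

```r
# Hypothetical sketch; NOT the actual parse.simple / extract.str code.
# The string is split into single characters once, then scanned in one pass.
scan_brackets <- function(x, open = "(", close = ")") {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  start <- integer(0)  # positions of top-level opening brackets
  end   <- integer(0)  # positions of the matching closing brackets
  depth <- 0L          # state of the machine: current nesting depth
  for (i in seq_along(chars)) {
    ch <- chars[i]
    if (ch == open) {
      if (depth == 0L) start <- c(start, i)
      depth <- depth + 1L
    } else if (ch == close) {
      if (depth == 0L) stop("unmatched closing bracket at position ", i)
      depth <- depth - 1L
      if (depth == 0L) end <- c(end, i)
    }
  }
  if (depth != 0L) {
    stop("unmatched opening bracket at position ", start[length(start)])
  }
  data.frame(start = start, end = end)
}

# Extract the content between each pair of recorded positions.
extract_tokens <- function(x, pos) {
  substring(x, pos$start + 1L, pos$end - 1L)
}

x <- "f(a, b) + g((c))"
pos <- scan_brackets(x)
tokens <- extract_tokens(x, pos)
print(tokens)  # "a, b" "(c)"
```

Keeping the scan and the extraction separate means the position table can be inspected or filtered before any substring is created, and malformed input fails with an explicit error instead of silently producing a wrong split.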

For a much simpler version of such a parser, see the function parseParenth in the file TextMining.R. It can be used to parse PubMed abstracts and detect non-matching brackets. The function extractParenth extracts the content between the parentheses.

https://github.com/discoleo/R/blob/master/TextMining/Pubmed/TextMining.R
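
Detecting non-matching parentheses can be sketched with a small stack. This is only an illustration under my own assumptions, not the code of parseParenth: the name find_unmatched and the sample text are hypothetical.

```r
# Hypothetical helper; NOT the parseParenth function from TextMining.R.
# Returns the positions of all parentheses that have no partner.
find_unmatched <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  stack <- integer(0)      # positions of "(" still waiting for a ")"
  unmatched <- integer(0)  # positions of ")" with no preceding "("
  for (i in seq_along(chars)) {
    if (chars[i] == "(") {
      stack <- c(stack, i)
    } else if (chars[i] == ")") {
      if (length(stack) == 0L) {
        unmatched <- c(unmatched, i)
      } else {
        stack <- stack[-length(stack)]  # pop the most recent "("
      }
    }
  }
  sort(c(stack, unmatched))
}

# Made-up sample text: the second "(" is never closed.
abstract <- "Patients (n = 20) received drug A (10 mg."
bad <- find_unmatched(abstract)
print(bad)
```

An empty result means every parenthesis is matched; otherwise the returned positions point at the offending characters in the original string.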