Memory consumption problem
RuralHunter opened this issue · 4 comments
I tried to match a URL against a set of site URL patterns. The pattern for each site looks like this: "..google.com/.". I concatenate all the site patterns into one pattern like this: "(..google.com/.)|(..facebook.com/.)|...". When I test with 100 sites, building the automaton requires a huge amount of memory. I set the Java heap size to 8G, but it still fails with an OutOfMemory error. I didn't try a larger heap, since memory consumption this high wouldn't meet my requirements anyway. Is building an automaton for a pattern like this supposed to use so much memory?
See https://www.brics.dk/automaton/doc/dk/brics/automaton/StringUnionOperations.html (instead of using RegExp).
Sorry, GitHub may have eaten some characters in my comment. My expressions are patterns, not strings:
.*\.google\.com/.*
.*\.facebook\.com/.*
.*\.twitter\.com/.*
...
I tried this, but the match result is false instead of the expected true:
String[] patterns={".*\\.google\\.com/.*",".*\\.facebook\\.com/.*"};
Automaton am=BasicAutomata.makeStringUnion(patterns);
boolean r=am.run("http://www.google.com/aaa");
Instead of building one large regexp, make a separate regexp for each of the 100 expressions, convert each one to an automaton, and make sure each automaton is minimized. Then combine them with the 'union' method in groups (for example, 10 at a time), minimize the resulting automaton for each group, and finally combine the group automata into one.
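The grouped approach described above could be sketched roughly as follows. This is only an illustration, not code from the library or from this thread: the class name `GroupedUnion`, the helper `build`, and the group size of 10 are assumptions, and it relies on `RegExp.toAutomaton()`, `Automaton.minimize()`, the static `Automaton.union(Collection)`, and `Automaton.run(String)` from dk.brics.automaton. Whether it keeps memory within bounds for a given pattern set would need to be measured.

```java
import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
class GroupedUnion {

    // Converts each pattern to a minimized automaton, unions the automata
    // in groups of groupSize, minimizes each group, then unions the groups.
    static Automaton build(List<String> patterns, int groupSize) {
        List<Automaton> groups = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i += groupSize) {
            int end = Math.min(i + groupSize, patterns.size());
            List<Automaton> group = new ArrayList<>();
            for (String pattern : patterns.subList(i, end)) {
                Automaton a = new RegExp(pattern).toAutomaton();
                a.minimize(); // make sure each single automaton is minimal
                group.add(a);
            }
            Automaton groupUnion = Automaton.union(group);
            groupUnion.minimize(); // minimize the per-group union
            groups.add(groupUnion);
        }
        Automaton all = Automaton.union(groups);
        all.minimize(); // minimize the final combined automaton
        return all;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList(
                ".*\\.google\\.com/.*",
                ".*\\.facebook\\.com/.*",
                ".*\\.twitter\\.com/.*");
        Automaton am = build(patterns, 10); // group size 10 is an assumption
        System.out.println(am.run("http://www.google.com/aaa"));
    }
}
```

Unlike `BasicAutomata.makeStringUnion`, which accepts exactly the literal strings it is given, each entry here is parsed as a regular expression, so a URL such as the one from the earlier attempt should be reported as a match.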
ok, will try that. Thanks.