OSS web crawler pattern list needs to be split into multiple files instead of only one.
ZeroCool940711 opened this issue · 1 comments
Hi there. I started using OpenSearchServer a few days ago, and I have to say it's pretty impressive, but I found a problem. Maybe it's something I haven't configured correctly, but I think OSS places every URL you add to the crawler's pattern list into a single file called "patterns.xml". That was fine at first, but now I have a few million URLs on that list, and every time I open the tab to add or delete a URL, or add one through the API, OSS uses a lot of resources. It looks like it reads the whole file of URLs into RAM every time you modify it or visit a page that needs it.

It would be better if those URLs were split across different files so they can be loaded faster. You could probably load only the first 10 URLs, since those are the ones shown on the page; that way it wouldn't use so many resources.

Another thing I noticed is that when you modify the pattern list, the whole file is loaded into memory and then written back for every URL you add, so the file is read and written many times per second.

Also, if something interrupts a request to the API or the server crashes, the patterns.xml file gets corrupted and the whole pattern list becomes useless. That happened to me once when I was adding all the ".io" domains to the list.

I was testing how well OSS performs and found that all the reading and writing of the patterns.xml file is what makes OSS slow. The rest is so good that you could crawl and index all the ".io" and ".com" domains with a single server. I tested it on a server with 32 GB of RAM, 8 CPUs and 8 TB of space.
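To illustrate the suggestion above, here is a minimal sketch in Python of what splitting the list could look like. It is not how OSS works internally; the shard count, the `patterns-NNN.txt` file names and all function names are assumptions made up for this example. Each URL is hashed to one of several small files, a batch of URLs touches each file only once, and every file is written to a temporary file first and then renamed, so an interrupted write leaves the old file intact instead of corrupting it.

```python
import hashlib
import os
import tempfile
from pathlib import Path

SHARD_COUNT = 256  # hypothetical number of pattern files


def shard_path(root: Path, url: str) -> Path:
    """Map a URL pattern to one of SHARD_COUNT files (hypothetical layout)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    shard = int(digest[:8], 16) % SHARD_COUNT
    return root / f"patterns-{shard:03d}.txt"


def read_shard(path: Path) -> list[str]:
    """Load only one small file, never the whole pattern list."""
    if not path.exists():
        return []
    return path.read_text(encoding="utf-8").splitlines()


def write_shard_atomically(path: Path, patterns: list[str]) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write("".join(p + "\n" for p in patterns))
            tmp.flush()
            os.fsync(tmp.fileno())
        # If the process dies before this line, the old file is untouched.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def add_patterns(root: Path, urls: list[str]) -> None:
    """Add a batch of URLs, reading and writing each affected shard once."""
    by_shard: dict[Path, list[str]] = {}
    for url in urls:
        by_shard.setdefault(shard_path(root, url), []).append(url)
    for path, new_urls in by_shard.items():
        existing = read_shard(path)
        seen = set(existing)
        for url in new_urls:
            if url not in seen:
                seen.add(url)
                existing.append(url)
        write_shard_atomically(path, existing)
```

With this layout, adding one URL rewrites a file holding roughly 1/256 of the list rather than all of it, and adding a batch costs one read and one write per affected file instead of one per URL.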
One last thing: I'm not sure if you are using a compression library to compress the indexes, but the indexes use almost no space. I tested some other search engines out there and was not able to do half the things I did with OSS. The search speed is also pretty good. If you have time, I recommend you take a look at the ZPAQ compression algorithm found on this link. It's made for incremental backup and it's awesome; I think adding it as an extra would make OpenSearchServer a lot better than it is. Sorry if my English is not the best, I hope everything is understandable. Have a good day/night and thanks for your time.
All the things you mention are functions of a database, or issues that a database solves. It sounds like the patterns should be stored in a database instead of an XML file.
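As a rough sketch of that idea, assuming an embedded SQLite database accessed through Python's standard `sqlite3` module (the table name, column name and function names are hypothetical, and OSS itself does not necessarily use any of this): inserts run inside a transaction, so an interrupted request rolls back instead of corrupting the list, and the UI only fetches the page of 10 URLs it actually displays.

```python
import sqlite3


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the pattern store; the schema is a made-up example."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS patterns (url TEXT PRIMARY KEY)")
    return conn


def insert_patterns(conn: sqlite3.Connection, urls) -> None:
    """Insert a batch in one transaction; duplicates are silently skipped."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO patterns (url) VALUES (?)",
            ((u,) for u in urls),
        )


def delete_pattern(conn: sqlite3.Connection, url: str) -> None:
    """Remove a single pattern without touching the rest of the list."""
    with conn:
        conn.execute("DELETE FROM patterns WHERE url = ?", (url,))


def list_page(conn: sqlite3.Connection, page: int = 0, page_size: int = 10):
    """Return only the rows needed for one page of the UI."""
    rows = conn.execute(
        "SELECT url FROM patterns ORDER BY url LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    )
    return [row[0] for row in rows]
```

For example, `list_page(conn, 0)` returns at most the first 10 patterns in sorted order, however many millions of rows the table holds. `LIMIT`/`OFFSET` gets slower on very deep pages, so a real implementation might page by the last URL seen instead.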