MarginaliaSearch/MarginaliaSearch

(converter) Improve generator fingerprinting

The search engine fingerprints the webserver to try to figure out what sort of website it is. This is done in DocumentGeneratorExtractor by looking at the meta generator tag and various other tags.

Most website generators that can be fingerprinted with the generator tag should be picked up automatically, but it may be necessary to categorize them in the switch expression that starts with final GeneratorType type = switch (parts[0]) {
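For anyone new to the code, the categorization step might look roughly like the sketch below. It is not the actual DocumentGeneratorExtractor code: the enum constants, the classify helper and the generator names in the cases are assumptions made for illustration, and only the switch (parts[0]) shape is taken from the description above.

```java
import java.util.Locale;

class GeneratorSwitchSketch {
    // Hypothetical categories; the real GeneratorType enum may have different constants.
    enum GeneratorType { STATIC_SITE, BLOG, WIKI, FORUM, UNKNOWN }

    // Hypothetical helper: takes the content attribute of <meta name="generator">.
    static GeneratorType classify(String metaGeneratorContent) {
        if (metaGeneratorContent == null || metaGeneratorContent.isBlank()) {
            return GeneratorType.UNKNOWN;
        }

        // "Hugo 0.111.3" becomes ["hugo", "0.111.3"]; only the first token is used.
        String[] parts = metaGeneratorContent.trim()
                .toLowerCase(Locale.ROOT)
                .split("\\s+");

        // Adding a new generator means adding its name to one of these cases.
        final GeneratorType type = switch (parts[0]) {
            case "hugo", "jekyll", "pelican" -> GeneratorType.STATIC_SITE;
            case "wordpress", "ghost" -> GeneratorType.BLOG;
            case "mediawiki", "dokuwiki" -> GeneratorType.WIKI;
            case "discourse" -> GeneratorType.FORUM;
            default -> GeneratorType.UNKNOWN;
        };
        return type;
    }
}
```

In this sketch, categorizing a newly discovered generator is a one-line change: add its lower-cased name to the matching case.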

Generators that don't set a meta generator tag may instead be fingerprinted through comments, JavaScript, or other features in fingerprintByComments(). The task is to look through the HTML code for identifying features, and then check for them in that function. There are several examples of this in the code already.
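As a rough illustration of the approach, the sketch below checks HTML comments and a couple of markup features for telltale strings. It is not the existing fingerprintByComments() implementation: the class, the method signatures, the Optional return type and the chosen markers are all assumptions, and it uses plain string matching rather than whatever HTML parser the converter actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentFingerprintSketch {
    // Matches the body of each HTML comment, including multi-line comments.
    private static final Pattern COMMENT = Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    static List<String> extractComments(String html) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(html);
        while (m.find()) {
            comments.add(m.group(1).trim());
        }
        return comments;
    }

    // Hypothetical stand-in for fingerprintByComments(); returns a generator name if one is recognized.
    static Optional<String> fingerprint(String html) {
        // Illustrative comment markers; verify against real pages before relying on them.
        for (String comment : extractComments(html)) {
            String c = comment.toLowerCase(Locale.ROOT);
            if (c.contains("generated by doxygen")) return Optional.of("doxygen");
            if (c.contains("generated by javadoc")) return Optional.of("javadoc");
            if (c.contains("httrack website copier")) return Optional.of("httrack");
        }

        // Illustrative non-comment features: element ids and asset paths.
        if (html.contains("id=\"___gatsby\"")) return Optional.of("gatsby");
        if (html.contains("/_next/static/")) return Optional.of("next.js");

        return Optional.empty();
    }
}
```

Each new generator then amounts to finding a string that reliably appears in its output and adding one more check.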

Detection is especially poor for static site generators. Hugo is picked up fine, but there are many others, and some of them are not detected at this point.