hughsie/libxmlb

Quebracabezas! Magic words for silo/string-table corruption

Closed this issue · 12 comments

Hi!
I love this bug (it has cost me so much time and is so bizarre!). I was investigating some extremely odd AppStream behavior which I thought was a bug in my token processing code. Ultimately though it turned out that if you put in a bunch of magic words, the silo gets a corrupted string table and creates a completely weird DOM that kind of looks like memory was read at random.

I really hope others can reproduce this too!
The quickest way to reproduce this issue is to add these lines:

xb_builder_node_add_token (id, "paciencia");
xb_builder_node_add_token (id, "posta");
xb_builder_node_add_token (id, "prasentation");
xb_builder_node_add_token (id, "proyectos");
xb_builder_node_add_token (id, "puntuació");
xb_builder_node_add_token (id, "puzl");

xb_builder_node_add_token (id, "quebracabezas");
xb_builder_node_add_token (id, "raspakovnj");
xb_builder_node_add_token (id, "riječ");
xb_builder_node_add_token (id, "ripador");
xb_builder_node_add_token (id, "riverbero");

xb_builder_node_add_token (id, "sécurisé");
xb_builder_node_add_token (id, "seguretat");
xb_builder_node_add_token (id, "skanowani");
xb_builder_node_add_token (id, "slika");
xb_builder_node_add_token (id, "slika");
xb_builder_node_add_token (id, "slika");
xb_builder_node_add_token (id, "songtek");
xb_builder_node_add_token (id, "strategická");
xb_builder_node_add_token (id, "tahak");

xb_builder_node_add_token (id, "tallennin");
xb_builder_node_add_token (id, "teleporti");
xb_builder_node_add_token (id, "testu");
xb_builder_node_add_token (id, "torlő");
xb_builder_node_add_token (id, "traçar");
xb_builder_node_add_token (id, "udelat");
xb_builder_node_add_token (id, "vídeo");
xb_builder_node_add_token (id, "wiedergab");
xb_builder_node_add_token (id, "wtyczki");
xb_builder_node_add_token (id, "zvuk");

to line 2100 in the self test: https://github.com/hughsie/libxmlb/blob/main/src/xb-self-test.c#L2100

The test should then suddenly be unable to read the generated silo, failing with:

ERROR:../src/xb-self-test.c:2186:xb_builder_node_func: assertion failed ("<components origin=\"lvfs\">" "<component type=\"desktop\">" "<id>gimp.desktop</id>" "<icon type=\"stock\">dave</icon>" "<description>hello <em>world!</em>" "</description>" "</component>" "</components>" == xml):
("<components origin=\"lvfs\"><component type=\"desktop\"><id>gimp.desktop</id><icon type=\"stock\">dave</icon><description>hello <em>world!</em></description></component></components>" == "<onents in=\"\"><onent =\"top\"><con>.desktop</con>< =\"k\"></><ription>o <rigin>d!</rigin></ription></onent></onents>")

The weirdest thing is that if you remove some of these words, the test will suddenly pass again... I tried to remove as many terms as possible, and that's the smallest list that I could come up with.
@hughsie Do you have any idea what could be going on here? I suspect some Unicode or integer-overflow fun, but the former seems unlikely since even Japanese characters are processed just fine, and the latter doesn't make much sense either.

This was wild to debug.

Ah! I had no idea! Thanks for the quick fix! :-)
Especially when tokenizing a long text I can easily exceed 32 tokens though, like with description tags, but also if there are a lot of explicit keywords defined (which is how I found this issue originally). I don't think I have a good way to determine the 32 "best" tokens, so do you have an idea what to do here?

> like with description tags

I don't think the description makes very good search keywords, tbh.

I was getting a lot of complaints about search, but having a good set of tokens from the description addressed that. For example, Steam does not mention the term "game" at all in its name or summary or keywords, but it appears quite a bit in the program's description. Overall, including description tokens has made search a lot better, so I don't really want to give that up...

How do you filter out the useless keywords though? Just getting apps to add better Keywords in the desktop file seems to be the only real way to do this without making the search results random.

That's not going to happen. People really don't want to add manual keywords and expect search to work like it does on Google, where you fill in content and the search engine filters out results automatically.
I determine "good" search tokens by their length and whether the stemming algorithm was able to stem them - that seems to work okay actually.
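The filter described above can be sketched roughly as follows. This is only an illustration, not the actual libappstream code: `MIN_TOKEN_LEN`, `toy_stem()` and `token_is_good()` are hypothetical names, the toy stemmer merely strips a trailing "s", and "was able to stem" is interpreted here as "the stemmer returned a changed form", which is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimum length, chosen only for illustration. */
#define MIN_TOKEN_LEN ((size_t) 3)

/* Toy stand-in for a real stemmer: strips a single trailing "s".
 * Copies the (possibly stemmed) word into out and returns true
 * only if the word was actually changed. */
static bool toy_stem(const char *word, char *out, size_t out_sz)
{
    size_t len = strlen(word);

    if (len + 1 > out_sz)
        return false; /* does not fit, treat as "not stemmed" */
    memcpy(out, word, len + 1);
    if (len > 1 && out[len - 1] == 's') {
        out[len - 1] = '\0';
        return true;
    }
    return false;
}

/* A token is kept if it is long enough and the stemmer changed it. */
static bool token_is_good(const char *word)
{
    char stem[64];

    if (strlen(word) < MIN_TOKEN_LEN)
        return false;
    return toy_stem(word, stem, sizeof stem);
}
```

With this toy stemmer, `token_is_good("games")` is true while `token_is_good("a")` and `token_is_good("the")` are false. A real stemmer will classify words differently.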

> I determine "good" search tokens by their length

Not 100% convinced with that; "game" is not a great example there.

> Not 100% convinced with that; "game" is not a great example there.

Why not? The length filter's primary purpose is to get rid of things like "a" and "the", but even if those remained in there, that would actually be okay, since people rarely search for them alone. If they were part of a set of search terms, we have to match all of the search terms anyway, so the actual words would still be there to narrow down the results to what the user wanted.
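To illustrate the "match all the search terms" argument, here is a minimal sketch of AND matching over plain string tokens. The function names are hypothetical, and a real implementation would compare stemmed tokens or string-table indexes rather than call `strcmp()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if term appears in the component's token list. */
static bool has_token(const char *const *tokens, size_t n_tokens,
                      const char *term)
{
    for (size_t i = 0; i < n_tokens; i++) {
        if (strcmp(tokens[i], term) == 0)
            return true;
    }
    return false;
}

/* AND semantics: a component matches only if every search term
 * is present in its token list. */
static bool matches_all(const char *const *tokens, size_t n_tokens,
                        const char *const *terms, size_t n_terms)
{
    for (size_t i = 0; i < n_terms; i++) {
        if (!has_token(tokens, n_tokens, terms[i]))
            return false;
    }
    return true;
}
```

If two components both carry the token "the" but only one carries "game", a query for "the game" still matches only the latter, so a leftover stop word does not widen the result set on its own.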

Even comparing integers isn't 100% "free": if you have 1000 components with 10 tags each with 50 tokens, that's a whole ton of comparing, even with indexed, pre-stemmed text queries.
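As a back-of-the-envelope check of those numbers, the helper below (a hypothetical name, not part of libxmlb) multiplies the three figures. It assumes the worst case where every token of every tag is compared once per search term, which a real query engine may well avoid.

```c
#include <stdint.h>

/* Worst-case token comparisons for a single search term, assuming
 * every token of every tag of every component is visited once. */
static uint64_t comparisons_per_term(uint64_t components,
                                     uint64_t tags_per_component,
                                     uint64_t tokens_per_tag)
{
    return components * tags_per_component * tokens_per_tag;
}
```

For the figures quoted above, `comparisons_per_term(1000, 10, 50)` gives 500,000 comparisons per search term; capping each tag at 32 tokens brings that down to 320,000.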

I think I will have to resort to my "tokens-as-tags" hacks then, as suddenly not finding "Steam" anymore when searching for "games" is something the downstream software centers will find unacceptable, especially when it worked before...
I guess you don't want to expand the token limit to something higher, like 128 or 256? ;-)

> not finding "Steam" anymore

You don't add the ID as a keyword? The keyword limit is completely hardcoded, but it's kinda required as XbOpcode is allocated on the stack and so has to be a fixed (and not too large) size.
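The trade-off behind a hardcoded limit can be sketched like this. The struct below is a hypothetical stand-in and not the real `XbOpcode` layout; `TOKEN_MAX` simply mirrors the 32-token limit mentioned earlier in this thread. Because the array size is a compile-time constant, the struct can live on the stack, but every instance pays for the full array, and the add function has to refuse extra tokens instead of writing past the end.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the 32-token limit discussed in this thread. */
#define TOKEN_MAX 32

/* Hypothetical opcode-like struct, NOT libxmlb's actual layout. */
typedef struct {
    const char *tokens[TOKEN_MAX + 1]; /* NULL-terminated */
    size_t n_tokens;
} Opcode;

/* Appends a token, refusing (rather than overflowing) once the
 * fixed-size array is full. */
static bool opcode_add_token(Opcode *op, const char *token)
{
    if (op->n_tokens >= TOKEN_MAX)
        return false;
    op->tokens[op->n_tokens++] = token;
    op->tokens[op->n_tokens] = NULL;
    return true;
}
```

In this sketch, raising `TOKEN_MAX` to 256 would grow every stack-allocated `Opcode` by 224 pointers, which is the kind of cost that argues for keeping the limit small.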

So, I only have 16 components which have more than 256 tokens, and the majority have fewer than 100. That actually seems not terrible to me (and I should look at the ones with a large list of tokens, maybe there's some optimization potential there). The most tokens I have for a single component is 362.

> You don't add the ID as a keyword?

I do, and also the name, but then you have to search for "Steam" to find Steam, while at the moment software centers using libappstream will also find it in a top position if you search for "games". Which, I would argue, is a likely search and also the thing that Steam is all about. If you search for "games" in GNOME Software, you will not find Steam at all in the results list.
This is just one example though, there's actually quite a lot more like this where an essential keyword is present a lot in the description, but missing elsewhere.