symphonists/search_index

Using MATCH ... AGAINST is not always working well

simon-dt opened this issue · 2 comments

I have a database of names. Most of them, like "Henk", "Loes" and "Frederick", cause no problems, but there are also short names like "an", "wim", "bil" and "jo". To match these, the MySQL ft_min_word_len variable must be set to 3 or 2, and as far as I know there is no way to do this at the index level. Or is there?

The only workaround I have found is to set that variable to 2 in the MySQL config file, but I have no access to this file at the web host. Furthermore, I have other fulltext indexes where I do not want such a low minimum word length: news articles containing words like "the", "or" and "if".

An alternative would be to use the slower LIKE, but only in circumstances like those mentioned above. In short, we should be able to choose between matching algorithms on a per-index basis. This also means that when we search across several sections, the sections configured to use the LIKE method would run separately.
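A minimal sketch of that fallback idea, written in Python purely for illustration (the extension itself is not written in Python). It assumes the server runs with MySQL's default `ft_min_word_len` of 4; `pick_method` and `FT_MIN_WORD_LEN` are hypothetical names, not part of the extension.

```python
# Assumption: the server uses MySQL's default ft_min_word_len of 4.
FT_MIN_WORD_LEN = 4


def pick_method(query: str, min_len: int = FT_MIN_WORD_LEN) -> str:
    """Return "like" if any word is too short for the fulltext index.

    Words shorter than min_len are not indexed by fulltext search,
    so a query containing one falls back to the slower LIKE method.
    """
    words = query.split()
    if any(len(word) < min_len for word in words):
        return "like"
    return "boolean"
```

With this rule, a search for "jo" or "wim" would use LIKE, while "Frederick" would still go through MATCH ... AGAINST.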

I have no idea whether this is the way to go, just thinking out loud!

Thanks for taking the time to report this, zimmen. You're right that the minimum word length on fulltext boolean search is frustrating, especially if you're on a shared host. I would definitely recommend paying a little bit more for a VPS where you can modify the configuration yourself.

I have been working on a new version of this extension that allows the developer to choose the search method, as you suggest. There are several ways it might allow:

  • boolean fulltext (match/against)
  • LIKE, with optional wildcard %
  • REGEXP
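To make the three options concrete, here is a rough sketch of how each one might translate into a WHERE clause. It is written in Python for illustration only and is not the extension's actual code. The function names, the `data` column name and the `%s` placeholder style (as used by common Python MySQL drivers) are all assumptions.

```python
def escape_like(term: str) -> str:
    """Escape LIKE wildcards so the user's text is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_where(mode: str, term: str, column: str = "data", wildcard: bool = True):
    """Return (sql_fragment, params) for one of the three search methods.

    `column` is a hypothetical column name and must come from trusted
    code, never from user input, because it is interpolated directly.
    """
    if mode == "boolean":
        return f"MATCH({column}) AGAINST (%s IN BOOLEAN MODE)", [term]
    if mode == "like":
        pattern = escape_like(term)
        if wildcard:
            # Optional wildcard: match the term anywhere in the column.
            pattern = f"%{pattern}%"
        return f"{column} LIKE %s", [pattern]
    if mode == "regexp":
        # The term is passed through unchanged as a MySQL regular expression.
        return f"{column} REGEXP %s", [term]
    raise ValueError(f"unknown search mode: {mode!r}")
```

For example, `build_where("like", "jo")` yields the fragment `data LIKE %s` with the parameter `%jo%`, which would match short names that the fulltext index skips.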

The big problem with LIKE and REGEXP is that you can't easily order the results by relevance. Fulltext boolean search does this for you, ordering the results by the most frequent occurrences of the search term directly from MySQL.
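One possible workaround, sketched here in Python as an assumption rather than anything the extension does, is to rank LIKE or REGEXP matches in application code by counting how often the term occurs. This is only a crude stand-in for MySQL's own relevance score; `rank_by_occurrences` and the `(id, text)` row shape are hypothetical.

```python
def rank_by_occurrences(rows, term):
    """Order matching row ids by how often `term` occurs in their text.

    `rows` is an iterable of (row_id, text) pairs. Matching is
    case-insensitive and counts non-overlapping occurrences.
    Rows without a match are dropped; ties keep their input order.
    """
    needle = term.lower()
    if not needle:
        return []
    scored = [(text.lower().count(needle), row_id) for row_id, text in rows]
    hits = [(score, row_id) for score, row_id in scored if score > 0]
    # sorted() is stable, so equal scores stay in their original order.
    hits = sorted(hits, key=lambda pair: -pair[0])
    return [row_id for _, row_id in hits]
```

Counting in application code means fetching every matching row first, so this approach is likely only reasonable for small result sets.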

I don't think configuring this per section is the right way to go. Per-section stuff is only for indexing, not retrieval. The search itself should be the one defining how the data should be queried. If you're searching across multiple sections (using the custom data source bundled with the extension) then the same search algorithm needs to be used for all sections. So I see there being an option you pass as a hidden form field: "boolean", "like", "regexp", allowing you to decide how to search on the frontend.
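As a sketch of how such a hidden field might be handled, the value should be checked against a fixed list before it influences the query. The Python below is illustrative only; the field name `mode`, the default of "boolean" and the function name are assumptions, not the extension's actual behaviour.

```python
# The three values suggested for the hidden form field.
ALLOWED_MODES = ("boolean", "like", "regexp")
# Assumption: fall back to boolean fulltext when nothing valid is sent.
DEFAULT_MODE = "boolean"


def resolve_search_mode(form: dict, field: str = "mode") -> str:
    """Read the search mode from submitted form data.

    Anything outside ALLOWED_MODES is ignored, so a tampered hidden
    field cannot select an unsupported search method.
    """
    value = str(form.get(field, "")).strip().lower()
    return value if value in ALLOWED_MODES else DEFAULT_MODE
```

Because the chosen mode is resolved once per request, the same method then applies to every section included in the search.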

Similarly, for single-section search, the Search Index field might provide a dropdown when you add it to a section, allowing you to choose from the above options in the same way.

Hi Nick,
Unfortunately, some of our clients want to arrange the hosting themselves. If it were up to me, we would certainly add them to our own VPS, but that's not always possible. (Besides that, we use a Mediatemple DV server and they have been awfully slow lately.)

I agree that configuring by section isn't the best way to go. An index is an index, so setting it there would make no sense.

I think that losing the ability to order by relevance is less important than actually getting results, especially when searching for names.

I thin