Autocomplete's tokenizing sometimes produces blank words

Description

In the autocomplete code, autocomplete_phrase_split is supposed to split both the tags and their search strings into separate tokens. These tokens are used to organize data for the autocomplete (by collecting all information associated with a particular token for a particular autocomplete prefix), and to look up data in the autocomplete.

Unfortunately, autocomplete_phrase_split sometimes produces blank words/tokens for tags such as these:

  • Character (Fandom)

  • Book - Author

  • First Character & Second Character

  • Firstname "Nickname" Lastname

This means that, technically, all tags in any of those forms share the blank token and can show up in each other's autocompletes. The scoring methods used by the autocomplete code ensure that you most likely won't see them unless the other tokens in your query are extremely rare, but this behavior is still undesirable. (Plus, there are an awful lot of canonical tags of those forms, and every time you type one into the autocomplete, the autocomplete code has to iterate through ALL of them. This makes the autocomplete a lot less responsive, and makes the servers work a lot harder for results that are typically irrelevant.)
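A minimal sketch of how blank tokens can arise, assuming the splitter breaks on each delimiter character individually, so that two adjacent delimiters (such as a space followed by an opening parenthesis) leave an empty string between them. The delimiter set and the function names naive_split and split_without_blanks are hypothetical and are not taken from the actual autocomplete_phrase_split code; Python is used purely for illustration.

```python
import re

# Hypothetical delimiter set: whitespace runs, or any single punctuation
# delimiter. The real autocomplete_phrase_split may use a different set.
NAIVE_DELIMITER = re.compile(r'\s+|&|/|"|\(|\)|~|-')

# Same characters, but a whole run of mixed delimiters counts as one break.
DELIMITER_RUN = re.compile(r'[\s&/"()~-]+')


def naive_split(phrase):
    """Split on each delimiter separately; adjacent delimiters yield ''."""
    return NAIVE_DELIMITER.split(phrase.lower())


def split_without_blanks(phrase):
    """Split on delimiter runs and drop any leftover empty tokens."""
    return [token for token in DELIMITER_RUN.split(phrase.lower()) if token]


EXAMPLE_TAGS = [
    "Character (Fandom)",
    "Book - Author",
    "First Character & Second Character",
    'Firstname "Nickname" Lastname',
]

for tag in EXAMPLE_TAGS:
    print(repr(tag))
    print("  naive:", naive_split(tag))
    print("  fixed:", split_without_blanks(tag))
```

In this sketch, every one of the four example patterns produces at least one empty string under naive_split, while split_without_blanks returns only the real words. Whether the actual fix should merge delimiter runs, filter blanks afterwards, or both is a design choice this sketch does not settle.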

To replicate:

  1. Pick a media type, and scroll through the list of fandoms until you find a tiny fandom whose name contains a fairly unique-looking word.

  2. Click on the fandom, then enter the word in the "Other Tags" field to check how many canonical tags show up in the autocomplete. If there are more than 5 results, return to step 1 and try again.

  3. Now use the unique word twice in the same search query, using one of the patterns above – e.g. Automan (Automan). Don't include any other words.

I should see only those tags matching the word I typed in. But instead, I also see a whole bunch of other tags that don't contain the word. The only commonality between them is that they resemble one of the patterns listed above, and they're ridiculously popular (which is why they filtered to the top of the list).
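The replication steps above can be mimicked with a toy token index. This is only a sketch under stated assumptions: the real autocomplete stores and scores its data differently, and the tag list, the delimiter set, and the helper names (build_index, candidates, naive_split, split_without_blanks) are all hypothetical. It shows only that a shared blank token is enough to pull unrelated tags into the candidate set for a query like Automan (Automan).

```python
import re
from collections import defaultdict

# Hypothetical delimiter handling; not the actual autocomplete code.
NAIVE_DELIMITER = re.compile(r'\s+|&|/|"|\(|\)|~|-')
DELIMITER_RUN = re.compile(r'[\s&/"()~-]+')


def naive_split(phrase):
    return NAIVE_DELIMITER.split(phrase.lower())


def split_without_blanks(phrase):
    return [token for token in DELIMITER_RUN.split(phrase.lower()) if token]


def build_index(tags, tokenize):
    """Map each token to the set of tags that produced it."""
    index = defaultdict(set)
    for tag in tags:
        for token in tokenize(tag):
            index[token].add(tag)
    return index


def candidates(index, query, tokenize):
    """Collect every tag that shares at least one token with the query."""
    found = set()
    for token in tokenize(query):
        found |= index.get(token, set())
    return found


# Illustrative tags only; "Plain Tag" contains no punctuation delimiters.
TAGS = ["Automan", "Character (Fandom)", "Book - Author", "Plain Tag"]
QUERY = "Automan (Automan)"

naive_hits = candidates(build_index(TAGS, naive_split), QUERY, naive_split)
fixed_hits = candidates(
    build_index(TAGS, split_without_blanks), QUERY, split_without_blanks
)

print("naive:", sorted(naive_hits))
print("fixed:", sorted(fixed_hits))
```

With the naive splitter, the query and the two punctuated tags all contain the empty token, so those tags become candidates even though neither contains the word "automan". With blanks dropped, only the tag that actually contains the word remains.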

Environment:
Status:
Assignee: ticking instant
Reporter: ticking instant
Roadmap:
Tags:
Priority: Medium
Affects versions: 0.9.204
Fix versions:
Components: BackEnd
Difficulty: Medium
Milestone: Internal 0.9