specific extended characters are ignored in tag names when searching via /tags/search

Description

Originally reported on Google Code with ID 3770
What archive revision are you testing on? 0.9.9.6

If appropriate, enter the URL of a page where the problem can be seen:
http://archiveofourown.org/tags/search

What steps will reproduce the problem?
1. search for "Darkleer ♦ The Disciple"
2. get three results, only one using the diamond
3. search for "Darkleer ♦"
4. get a flood of results, as if the ♦ were not part of the query

This also happens with the characters ♥, ♠, and ♣, all of which are used as relationship markers in some fandoms, including Homestuck.
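
One possible way to get this symptom, sketched as a toy in Python: if the search analyzer keeps only word characters, then "Darkleer ♦" and "Darkleer" reduce to the same tokens, so the symbol cannot narrow the results. The functions `word_only_tokens` and `whitespace_tokens` are hypothetical stand-ins made up for illustration; they are not the Archive's code or Elasticsearch's actual analyzers.

```python
import re


def word_only_tokens(text):
    """Toy analyzer: lowercase, then keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())


def whitespace_tokens(text):
    """Toy analyzer: lowercase, then split on whitespace, keeping symbols."""
    return text.lower().split()


query = "Darkleer ♦"

# With the word-only analyzer the diamond disappears from the query.
print(word_only_tokens(query))

# With the whitespace analyzer the diamond survives as its own token.
print(whitespace_tokens(query))
```

If the real analyzer behaves like the first function, any tag containing "Darkleer" would match step 3, which would be consistent with the flood of results.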

Activity

ticking instant 
February 8, 2022 at 11:06 PM
(edited)

There’s something kind of odd going on with this.

  1. When I do a search for “Darkleer ♦ The Disciple”, I get exactly one matching tag.

  2. Similarly, when I do a search for “Dave ♦ Sawbuck”, I get exactly one matching tag.

  3. When I search for “Darkleer ♦”, I get no results whatsoever.

  4. When I search for just “♦”, I get 100 tags, all of which have a ♦ in them. But none of the matching tags is either “Darkleer ♦ The Disciple” or “Dave ♦ Sawbuck”. In fact, none of the tags at all have the word “Darkleer” in them.

  5. When I search for “Dave ♦”, I get an assortment of 12 tags, all of which have both the word “Dave” and the ♦ symbol in them. But none of those 12 tags are the “Dave ♦ Sawbuck” tag.

I’m wondering if one of the Elasticsearch upgrades changed the way the tokenizer works, so that any tags indexed after the upgrade work fine, and any tags indexed before the upgrade don’t handle the ♦ character properly.

EDIT: On staging, unlike in production, searching for “Dave ♦” does include “Dave ♦ Sawbuck” in the search results. So if this isn’t an indexing issue, there’s something different about the ES configuration for staging vs. production.

EDIT 2: But the “Darkleer ♦ The Disciple” tag has the exact same behavior on staging as on production. So it’s not a staging vs. production difference.

EDIT 3: Yup, this is an indexing issue. Sarken reindexed “Dave ♦ Sawbuck” in production, and it’s now showing up in the “♦” and “Dave ♦” searches (in addition to the “Dave ♦ Sawbuck” search, where it always appeared).
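
A rough sketch of why reindexing a single tag could fix it, assuming the stored tokens for a tag are only refreshed when that tag is reindexed. Everything here is hypothetical: `ToyIndex`, `old_analyzer`, and `new_analyzer` are invented for illustration, "Dave ♦ Example" is a made-up tag, and the real Elasticsearch analysis chain is certainly more involved. The toy also does not reproduce every observation above (for instance, the exact-name search that always worked).

```python
from collections import defaultdict


def old_analyzer(text):
    # Hypothetical pre-upgrade behavior: symbol-only tokens are dropped.
    return [t for t in text.lower().split() if t.isalnum()]


def new_analyzer(text):
    # Hypothetical post-upgrade behavior: symbols are kept as tokens.
    return text.lower().split()


class ToyIndex:
    def __init__(self):
        # Maps each token to the set of tags stored under it.
        self.postings = defaultdict(set)

    def index(self, tag, analyzer):
        # Drop any stale postings for this tag, then store fresh tokens.
        for tags in self.postings.values():
            tags.discard(tag)
        for token in analyzer(tag):
            self.postings[token].add(tag)

    def search(self, query, analyzer):
        # A tag matches only if it was stored under every query token.
        token_sets = [self.postings.get(t, set()) for t in analyzer(query)]
        return set.intersection(*token_sets) if token_sets else set()


idx = ToyIndex()
idx.index("Dave ♦ Sawbuck", old_analyzer)  # indexed before the upgrade
idx.index("Dave ♦ Example", new_analyzer)  # made-up tag, indexed after

before = idx.search("Dave ♦", new_analyzer)  # stale tag is missing
idx.index("Dave ♦ Sawbuck", new_analyzer)    # reindex the stale tag
after = idx.search("Dave ♦", new_analyzer)   # stale tag now appears
print(before, after)
```

Under these assumptions, the older tag has no entry under the ♦ token until it is reindexed, which is why it stays out of the "♦" and "Dave ♦" results while newer tags show up.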

Sarken 
January 30, 2018 at 4:21 AM

Looks like this is back to the original brokenness thanks to https://github.com/otwcode/otwarchive/pull/3224 \o/

Sarken 
December 11, 2017 at 3:46 AM

This appears to have gotten more broken with the Elasticsearch upgrade: now searching for "Dave ♦ Sawbuck" returns zero results rather than too many.

Sam Johnsson 
August 22, 2015 at 2:47 PM

  • *Labels added*: Roadmap-TagWrangling, Keep-SJ

alien 
October 25, 2013 at 11:13 AM

  • *Labels added*: Milestone-Internal0.9

Details

Roadmap: Search, Tags

Components: BackEnd

Created September 16, 2013 at 4:40 PM
Updated February 8, 2022 at 11:59 PM