Sinhala characters with extensions separated into two characters when posting from HTML (non-RTE) text field

Description

From Translation:

Sinhala has extensions that combine two sounds, which lets writers spell both sounds with a single letter. Apparently, all of these extensions get separated on AO3, so one character becomes two.

For example, in Sinhala it's possible to write the sound 'gri' with a single letter, but on AO3 'gri' appears as two letters, i.e. 'g' and 'ri', which is incorrect.

Note: For the test case, we are pasting from Google Docs. This probably isn't relevant, but could be.

If you paste the Sinhala word ප්‍රදානය into a text field that does not have the Rich Text editor active (e.g. comment form, news post form in HTML mode, work form in HTML mode) and press post, it separates the first character into two characters. The word becomes ප්රදානය.
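To see what actually changes between the two forms, it can help to list their code points. The sketch below is only an illustration, not AO3 code. It assumes the correct form contains a zero-width joiner (U+200D) between the virama and the following ර, and that the posted form is identical apart from that joiner. The strings are written as escapes so the invisible character is explicit; `describe` and `lost_joiners` are hypothetical helper names.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Assumed encoding of the correct word: ප ් ZWJ ර ද ා න ය
expected = "\u0db4\u0dca\u200d\u0dbb\u0daf\u0dcf\u0db1\u0dba"
# Assumed encoding of the posted word: the same sequence without the joiner
posted = expected.replace(ZWJ, "")


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


def lost_joiners(before, after):
    """Count how many zero-width joiners are in before but not in after."""
    return before.count(ZWJ) - after.count(ZWJ)


if __name__ == "__main__":
    for line in describe(expected):
        print(line)
    print("joiners lost after posting:", lost_joiners(expected, posted))
```

Running `describe` on text copied from a posted comment and comparing it with the pasted original would confirm or rule out this assumption.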

If, however, you either:

  • paste into the RTE

  • paste into the HTML editor and then toggle to the RTE

then the word remains in the correct 5-character form: ප්‍රදානය.

Toggling back to the HTML editor after doing one of the above (but before pressing post) will show you that, at some point in the process, it has been converted to ප්&zwj;රදානය, which allows it to display correctly when posted.

(Note that after posting, you won't see it as ප්&zwj;රදානය in the HTML editor, much like you wouldn't see &nbsp; but would instead see a space, as described in .)
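The RTE path appears to protect the joiner by writing it as an HTML entity before the text is submitted. A rough sketch of that idea follows, assuming the entity is the named reference `&zwj;` (with `&zwnj;` as its non-joiner counterpart). `encode_joiners` is a hypothetical helper, not an AO3 or RTE function.

```python
import html

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def encode_joiners(text):
    """Replace invisible joiners with named HTML entities so they survive as plain ASCII."""
    return text.replace(ZWJ, "&zwj;").replace(ZWNJ, "&zwnj;")


# Assumed encoding of the test word: ප ් ZWJ ර ද ා න ය
word = "\u0db4\u0dca\u200d\u0dbb\u0daf\u0dcf\u0db1\u0dba"

encoded = encode_joiners(word)   # joiner is now the visible text "&zwj;"
decoded = html.unescape(encoded)  # what an HTML renderer would reconstruct
```

Because the entity decodes back to the original code point, the rendered word is unchanged, which would match the behaviour seen when posting from the RTE.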

The desired behavior, of course, is that after posting, the word should continue to be the 5-character form ප්‍රදානය.
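The desired behavior can be phrased as a round-trip check: whatever processing a plain HTML field applies, the word should come out with the same code points it went in with. The sketch below is an assumption-laden illustration. The cause of the bug is not identified in this report, and both sanitizers are hypothetical stand-ins rather than AO3's actual code; `drop_format_chars` simply shows one way a joiner could be lost.

```python
import unicodedata

# Assumed encoding of the test word: ප ් ZWJ ර ද ා න ය
WORD = "\u0db4\u0dca\u200d\u0dbb\u0daf\u0dcf\u0db1\u0dba"

JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def survives_round_trip(sanitize, text=WORD):
    """Return True if sanitize() leaves the text's code points unchanged."""
    return sanitize(text) == text


def drop_format_chars(text):
    """Hypothetical lossy sanitizer: removes every Unicode format (Cf) character."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def keep_joiners(text):
    """Hypothetical alternative: removes format characters except ZWJ and ZWNJ."""
    return "".join(
        ch for ch in text
        if ch in JOINERS or unicodedata.category(ch) != "Cf"
    )


if __name__ == "__main__":
    print("lossy sanitizer preserves word:", survives_round_trip(drop_format_chars))
    print("joiner-aware sanitizer preserves word:", survives_round_trip(keep_joiners))
```

A check of this shape, run against each non-RTE form (comment, news post, work), would show whether a fix covers all of the affected fields.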

Assignee: Unassigned
Reporter: Sarken
Roadmap: Internationalization
Priority: Medium
Affects versions:
Fix versions: None
Components: BackEnd
Difficulty: Medium
Milestone: Internal 0.9