Extended Unicode (and all text after an Extended Unicode character) being removed

Description

Originally reported on Google Code with ID 4064
What archive revision are you testing on? 0.9.16.14

What steps will reproduce the problem?
1. Paste into a new work:

Here comes the emoji!
☚
☪
合
decimal
😃
hex
😃

2. Post the work

What is the expected output?
Here comes the emoji!
a hand character
a crescent moon with a dot
a Chinese character
decimal
a smiling face emoji
hex
the same smiling face
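
For reference, a minimal Ruby sketch of how the test characters differ. It assumes that "decimal" and "hex" refer to the decimal (128515) and hexadecimal (1F603) spellings of the same code point; `describe_char` is a hypothetical helper for illustration and is not part of otwarchive.

```ruby
# Hypothetical helper: reports the code point, the UTF-8 byte length,
# and whether the character lies beyond U+FFFF.
def describe_char(ch)
  cp = ch.ord
  {
    codepoint: format("U+%04X", cp),
    utf8_bytes: ch.bytesize,
    beyond_bmp: cp > 0xFFFF
  }
end

# The four test characters from the reproduction steps, as escapes.
samples = {
  "hand"          => "\u261A",
  "crescent moon" => "\u262A",
  "Chinese"       => "\u5408",
  "smiling face"  => "\u{1F603}"
}

samples.each do |name, ch|
  puts "#{name}: #{describe_char(ch)}"
end

# Assumption: the decimal and hex forms name the same code point.
decimal_form = [128515].pack("U")
hex_form     = [0x1F603].pack("U")
puts decimal_form == hex_form
```

The first three characters fit within U+FFFF and take three bytes in UTF-8, while the smiling face lies above U+FFFF and takes four.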

What do you see?
The text is cut off at the word 'decimal'. Neither the Unicode characters that follow nor
the word 'hex' is displayed.

At a guess, I suspect the cause is around
https://github.com/otwcode/otwarchive/blob/master/lib/html_cleaner.rb#L120

It seems that the text is cut off at any character beyond U+FFFF.
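
A rough Ruby sketch of the suspected behaviour, useful for checking input before posting. The helper names `first_astral_index` and `simulate_truncation` are hypothetical; this only mimics the reported symptom and is not the code in html_cleaner.rb.

```ruby
# Hypothetical helper: index of the first character above U+FFFF,
# or nil if the text contains none.
def first_astral_index(text)
  text.each_char.find_index { |ch| ch.ord > 0xFFFF }
end

# Mimics the reported symptom by keeping only the text that precedes
# the first character above U+FFFF.
def simulate_truncation(text)
  idx = first_astral_index(text)
  idx.nil? ? text : text[0...idx]
end

sample = "Here comes the emoji!\n\u261A\n\u262A\n\u5408\ndecimal\n\u{1F603}\nhex\n\u{1F603}"
puts simulate_truncation(sample)
```

If the guess is right, the output of `simulate_truncation` should match what the Archive displays: everything up to and including 'decimal', and nothing after it.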

Environment

None

Status

Assignee

james_

Reporter

CJ Record

Roadmap

None

Priority

Medium

Affects versions

None

Fix versions

None

Components

BackEnd

Difficulty

None

Required Access Level

None

Milestone

Internal 0.9

Google Code Issue ID

4064