Importer breaks off if the source contains the special character number 0

Originally reported on Google Code with ID 2254
What archive revision are you testing on?
Release 0.8.5

If appropriate, enter the URL of a page where the problem can be seen:
(spin-off from http://code.google.com/p/otwarchive/issues/detail?id=785#c39)

What steps will reproduce the problem?
1. Try to import a story from http://www.the-archive.net/index.php (any story).
2. Get gibberish and script tags, but no story.

What is the expected output? What do you see instead?

Since it's not a known source, I'd expect the whole body of the page to be imported.
Instead, only data from the head is imported.

The problem is that the source includes a NUL character, which the W3C Validator
reports as "non SGML character number 0". Nokogiri sees it as a string terminator
(which it is in C, and probably in other languages) and stops processing there.
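
One possible workaround, sketched here as an assumption rather than as the fix the archive actually adopted, is to strip NUL characters from the downloaded page before it reaches the parser. The helper name `strip_nul_chars` and the sample page are hypothetical, and the sketch assumes the page text is validly encoded.

```ruby
# Hypothetical helper, not part of the archive's importer.
# Removes every NUL character (U+0000) so that a C-based parser
# does not treat it as the end of the string.
def strip_nul_chars(html)
  html.delete("\u0000")
end

# Made-up page with a NUL character inside the head, similar to
# what the report describes.
page = "<html><head><title>t</title>\u0000</head>" \
       "<body><p>story</p></body></html>"

cleaned = strip_nul_chars(page)
```

The cleaned string could then be handed to Nokogiri (for example `Nokogiri::HTML(cleaned)`), so that the body after the former NUL position is still available to the parser.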