Originally reported on Google Code with ID 2254
What archive revision are you testing on?
If appropriate, enter the URL of a page where the problem can be seen:
(spin-off from http://code.google.com/p/otwarchive/issues/detail?id=785#c39)
What steps will reproduce the problem?
1. Try to import a story from http://www.the-archive.net/index.php (any story).
2. Get gibberish and script tags, but no story.
What is the expected output? What do you see instead?
Since it's not a known source, I'd expect the whole body of the page to be imported.
Instead only data from the head is imported.
The problem is that the source includes a NUL character (reported by the W3C
Validator as "non SGML character number 0"). Nokogiri treats it as a string
terminator (which it is in C, and probably in other languages) and stops
processing there, so nothing after that point in the page is parsed.
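
One possible workaround, sketched here only as an illustration, is to remove character number 0 (NUL) from the downloaded page before it is handed to the parser. The helper name strip_nul_bytes and the sample markup are hypothetical and not taken from the archive's code; the sketch also assumes the page has already been read into a validly encoded Ruby string.

```ruby
# Hypothetical helper (not part of the archive's code): drop NUL bytes
# so a C-based parser does not treat one as the end of the input.
def strip_nul_bytes(html)
  html.delete("\0")
end

# Made-up sample page with a NUL byte between the head and the body.
raw = "<html><head><title>Story</title></head>\0<body><p>Chapter one</p></body></html>"
clean = strip_nul_bytes(raw)

# The cleaned string can then be passed to the parser as usual,
# for example Nokogiri::HTML(clean).
```

Whether the importer should strip the character silently or warn the user that the source page is malformed is a separate design question.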