detecting improperly encoded text (in perl)
i need a way to detect when a string has been double-encoded into utf-8. that is, a string of utf-8 bytes that was mistaken for iso-8859-1 and run through an iso-8859-1 to utf-8 conversion a second time.
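one rough way to check, offered as a sketch under assumptions rather than what blo.gs actually runs: decode the bytes as utf-8, and if every resulting character fits in iso-8859-1, turn those characters back into bytes and see whether they are themselves valid utf-8. the name `looks_double_encoded` is made up for this example, and the sketch assumes the bad conversion really was iso-8859-1 (not windows-1252, which maps the 0x80-0x9f range differently).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# hypothetical helper: returns 1 if $bytes looks like utf-8 that was
# treated as iso-8859-1 and converted to utf-8 a second time, else 0.
sub looks_double_encoded {
    my ($bytes) = @_;

    # the outer layer must itself be valid utf-8.
    # (decode with a CHECK value may modify its input, so work on a copy.)
    my $outer = $bytes;
    my $chars = eval { decode('UTF-8', $outer, Encode::FB_CROAK) };
    return 0 unless defined $chars;

    # a double-encoded string decodes to characters in the latin-1 range
    # only, and at least one of them must be non-ascii.
    return 0 if $chars =~ /[^\x00-\xFF]/;
    return 0 unless $chars =~ /[\x80-\xFF]/;

    # undo the presumed iso-8859-1 to utf-8 step to get the inner bytes.
    my $inner = encode('ISO-8859-1', $chars);

    # if the inner bytes are valid utf-8 too, the string was probably
    # encoded twice.
    my $ok = eval { decode('UTF-8', $inner, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

# "é" encoded once is C3 A9; encoded twice it becomes C3 83 C2 A9.
print looks_double_encoded("\xC3\x83\xC2\xA9") ? "double\n" : "ok\n";
print looks_double_encoded("\xC3\xA9")         ? "double\n" : "ok\n";
```

this is only a heuristic: correctly encoded text that genuinely contains a sequence like "Ã©" will be flagged as well, so it is probably safest to treat a positive result as a hint rather than proof.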
this will help blo.gs deal with the encoding bugs in blogger.com's changes feed (which, unfortunately, is not consistently broken: sometimes the encoding is right, sometimes it is wrong. at least, i think it is sometimes right, although i can’t find any examples right now).
what would be even better, of course, would be for blogger to fix the bug. i reported it, and got a "we know, we hope to resolve the problem soon" response.
looks like they could take a lesson from joel spolsky's mini-tutorial on unicode. (i’ll admit to being surprised that blogger gets it wrong: i was under the impression that they used java, which i believe has pretty solid unicode support.)
Comments
hey Jim --
just found this page via google looking for the same thing myself. if you're still interested:
http://jmason.org/software/scripts/utf8lint.txt