detecting improperly encoded text (in perl)

i need a way to detect when a string has been double-encoded into utf-8. that is, a string that was already utf-8 bytes but was treated as iso-8859-1 and converted to utf-8 again.
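one way to approach this, as a rough sketch rather than a finished solution: decode the bytes as utf-8, and if every resulting character fits in iso-8859-1, turn those characters back into single bytes and check whether the result is itself valid utf-8 with at least one non-ascii character. the function name `looks_double_encoded` is made up for illustration, and the check assumes the stray conversion really was iso-8859-1 (not windows-1252). it is a heuristic: a correctly encoded string that happens to contain text like "Ã©" would be flagged too.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# hypothetical helper: returns 1 if $bytes (a string of utf-8 bytes) looks
# like utf-8 that went through an extra iso-8859-1 to utf-8 conversion,
# and 0 otherwise.
sub looks_double_encoded {
    my ($bytes) = @_;

    # the input has to be valid utf-8 to begin with. decode() with
    # FB_CROAK may modify its argument, so work on a copy.
    my $copy  = $bytes;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return 0 unless defined $chars;

    # pure ascii can't be double-encoded, and any character above U+00FF
    # can't have come from a single iso-8859-1 byte.
    return 0 unless $chars =~ /[^\x00-\x7F]/;
    return 0 if $chars =~ /[^\x00-\xFF]/;

    # undo the suspected extra conversion, then see whether what is left
    # is valid utf-8 on its own.
    my $inner = encode('iso-8859-1', $chars);
    my $ok = eval { decode('UTF-8', $inner, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}
```

for example, `looks_double_encoded("caf\xC3\x83\xC2\xA9")` (a double-encoded "café") returns 1, while `looks_double_encoded("caf\xC3\xA9")` (the correct encoding) returns 0, because the lone byte 0xE9 left after undoing the conversion is not valid utf-8.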

this will help blo.gs deal with the encoding bugs in blogger.com's changes feed (which, unfortunately, is not consistently broken: sometimes the encoding is right, sometimes it is wrong. at least, i think it is sometimes right, although i can’t find any examples right now).

what would be even better, of course, would be for blogger to fix the bug. i reported it, and got a “we know, we hope to resolve the problem soon” response.

looks like they could take a lesson from joel spolsky's mini-tutorial on unicode. (i’ll admit to being surprised that blogger gets it wrong: i was under the impression that they used java, which i believe has pretty solid unicode support.)

comments

hey Jim --

just found this page via google looking for the same thing myself. if you're still interested:

http://jmason.org/software/scripts/utf8lint.txt

» Justin Mason » june 9, 2005 6:48pm
