detecting improperly encoded text (in perl)

i need a way to detect when a string has been double-encoded into utf-8. that is, a string of utf-8 bytes that was mistaken for iso-8859-1 and run through an iso-8859-1 to utf-8 conversion a second time.
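here is a minimal sketch of one way to check for this, assuming the bogus step really was iso-8859-1 to utf-8 and using the core Encode module. the name fix_double_encoded is made up for illustration. the idea: strictly decode the bytes as utf-8, map the resulting characters back to iso-8859-1 bytes, and see whether those bytes are themselves valid utf-8 with at least one multi-byte sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# fix_double_encoded is a hypothetical helper, not part of any library.
# takes a byte string; returns the repaired character string if the
# input looks double-encoded, or undef if it does not.
sub fix_double_encoded {
    my ($bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

    # first pass: the input must at least be valid utf-8
    my $once = eval { decode('UTF-8', $bytes, $check) };
    return undef unless defined $once;

    # pure ascii can't be double-encoded, and a character above
    # U+00FF can't have come from an iso-8859-1 byte
    return undef unless $once =~ /[\x80-\xff]/;
    return undef if $once =~ /[^\x00-\xff]/;

    # undo the assumed iso-8859-1 -> utf-8 step, then check whether
    # what is left is itself valid utf-8 (undef if the decode croaks)
    my $inner = encode('iso-8859-1', $once);
    my $twice = eval { decode('UTF-8', $inner, $check) };
    return $twice;
}
```

this is a heuristic, not a proof: a correctly encoded string that happens to contain something like "Ã©" will be flagged too, although that is rare in real text. it also won't catch text that was mangled through windows-1252 rather than true iso-8859-1, since some of those characters end up above U+00FF.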

this will help blo.gs deal with the encoding bugs in blogger.com's changes feed, which, unfortunately, is not consistently broken: sometimes the encoding is right, sometimes it is wrong. (at least, i think it is sometimes right, although i can’t find any examples right now.)

what would be even better, of course, would be for blogger to fix the bug. i reported it, and got a "we know, we hope to resolve the problem soon" response.

looks like they could take a lesson from joel spolsky's mini-tutorial on unicode. (i’ll admit to being surprised that blogger gets it wrong: i was under the impression that they used java, which i believe has pretty solid unicode support.)

» Sunday, October 12, 2003 @ 2:28pm » blo.gs » 1 comment

Comments

hey Jim --

just found this page via google looking for the same thing myself. if you're still interested:

http://jmason.org/software/scripts/utf8lint.txt

» Justin Mason (link) » Thursday, June 9, 2005 @ 6:48pm