php’s dumb xml parsing behavior

steve minutillo, author of feed on feeds, runs headlong into the execrable character encoding behavior of php’s xml parsing functions. hey, i was complaining about that just last year... (via phil ringnalda.)

and a related link: this article from the w3c explains how to deal with encoding issues in forms, and has a nice regex that verifies whether a string is valid utf-8.
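for a rough idea of what a regex like that looks like, here is a minimal sketch in python. it is not copied from the w3c article; it just follows the usual byte ranges for well-formed utf-8, and the names `UTF8_PATTERN` and `is_valid_utf8` are made up for illustration. it also assumes you want to reject ascii control characters other than tab, lf and cr.

```python
import re

# hypothetical names; the pattern follows the standard byte ranges for
# well-formed utf-8 (no overlong forms, no surrogates, nothing past U+10FFFF)
UTF8_PATTERN = re.compile(
    rb"""
    (?:
        [\x09\x0A\x0D\x20-\x7E]             # ascii: tab, lf, cr, printable
      | [\xC2-\xDF][\x80-\xBF]              # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*
    """,
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to a well-formed sequence."""
    return UTF8_PATTERN.fullmatch(data) is not None
```

in python itself you would normally just try `data.decode("utf-8")` and catch the error (though that also accepts the control characters this pattern rejects). the regex form is mostly useful because it ports easily to other languages.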

here are some links culled from an i18n discussion on the twiki site:

Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:

i really need to write the slides for my talk at oscon, which will cover exactly this sort of thing.
