php’s dumb xml parsing behavior

steve minutillo, author of feed on feeds, runs headlong into the execrable character encoding behavior of php’s xml parsing functions. hey, i was complaining about that just last year... (via phil ringnalda.)

and a related link: this article from the w3c explains how to deal with encoding issues in forms, and includes a nice regex that verifies whether a string is valid utf-8.
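for the curious, a check along those lines can be sketched in python. this is a reconstruction of the general technique (a byte-level regex that only admits well-formed utf-8 sequences), not necessarily the exact expression from the w3c article, and the names `UTF8_PATTERN` and `is_valid_utf8` are made up for the example. one assumption worth flagging: the ascii branch only admits tab, newline, carriage return and printable characters, so it is stricter than plain utf-8 well-formedness.

```python
import re

# Byte-level pattern for well-formed UTF-8.
# Hypothetical name; a sketch of the technique, not a canonical expression.
UTF8_PATTERN = re.compile(
    rb"""
    \A(?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII: tab, LF, CR, printable
      | [\xC2-\xDF][\x80-\xBF]              # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*\Z
    """,
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte of data is part of a well-formed UTF-8 sequence."""
    return UTF8_PATTERN.match(data) is not None


if __name__ == "__main__":
    # The same character, once encoded as UTF-8 and once as ISO-8859-1.
    print(is_valid_utf8("caf\u00e9".encode("utf-8")))
    print(is_valid_utf8("caf\u00e9".encode("iso-8859-1")))
```

the point of doing it at the byte level is that you can run it on raw form input before deciding how to decode it: if the bytes don't pass, they were probably submitted in some legacy encoding rather than utf-8.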

here are some links culled from an i18n discussion on the twiki site:

Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:

i really need to write the slides for my talk at oscon, which will cover exactly this sort of thing.


I haven't really tried testing it, since I just ran across it in a Moz bug (the one where they decided to be evil and submit Windows-1252 when the page begs for ISO-8859-1, if the user pastes in a curly quote), but apparently both Moz and IE will fill in a charset value if you put a hidden input named "_charset_" in your form. I'm not sure how accurate it is, but it's interesting to see Moz obey the encoding in an XML declaration, even on a local file, while IE ignores it, and even over HTTP believes that in the absence of any declared (in a way that it obeys) encoding it should submit Windows-1252 rather than ISO-8859-1.

» Phil Ringnalda (link) » june 23, 2004 12:21am
