php’s dumb xml parsing behavior
steve minutillo, author of feed on feeds, runs headlong into the execrable character encoding behavior of php’s xml parsing functions. hey, i was complaining about that just last year... (via phil ringnalda.)
and a related link: this article from the w3c explains how to deal with encoding issues in forms, and includes a nice regex that verifies whether a string is valid utf-8.
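for the curious, here's a rough sketch of that kind of check in python. it isn't copied from the w3c article; it's a byte-level pattern built from the legal utf-8 byte ranges (no overlong forms, no surrogates, nothing past U+10FFFF), and the names `UTF8_PATTERN` and `looks_like_utf8` are made up for illustration.

```python
import re

# Byte-level pattern for well-formed UTF-8 (hypothetical helper, not the
# W3C article's exact regex). Each alternative matches one encoded character.
UTF8_PATTERN = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte, overlongs excluded
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, overlongs excluded
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, surrogates excluded
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
    """,
    re.VERBOSE,
)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte of data is part of a legal UTF-8 sequence."""
    return UTF8_PATTERN.fullmatch(data) is not None
```

in real python code you'd more likely just call `data.decode("utf-8")` and catch `UnicodeDecodeError`; the regex form is mostly useful because it ports to languages that don't have a strict decoder handy.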
here are some links culled from an i18n discussion on the twiki site:
Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:
- Frank Tang's charset detection links - includes simple Perl UTF-8 detector based on legal codings
- Excellent paper on Mozilla's 3-part algorithm using coding legality, character frequencies and two-character frequencies - detects the language as well as the encoding. Too complex for use on URLs, but looks very good.
- Discussion on IRC auto-detection of charsets
- Simple UTF-8 detector in C
- CPAN:Unicode::Japanese - includes auto-detection for various Japanese charsets
- CPAN:Encode::Guess - auto-detection from suitably dissimilar charsets (needs Perl 5.8)
- Browser detection for forms input datatypes including useful undocumented JavaScript to check IE's current charset (try this out now if you are using IE - see Sandbox.TestCharset).
- TextCat, tool for language detection - in Perl, OpenSource
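the simplest idea in that list, guessing by coding legality alone, can be sketched in a few lines of python. this is loosely in the spirit of the candidate-list approach, not a port of any of the tools above; `guess_encoding` and its default candidate list are assumptions made up for the example.

```python
def guess_encoding(data: bytes,
                   candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Return the candidate encodings under which data decodes cleanly.

    Hypothetical helper: it only checks coding legality, so it says nothing
    about character frequencies or language.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on illegal sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches
```

an empty result means none of the candidates fit, and more than one match means the guess is ambiguous. that's why the candidates need to be suitably dissimilar: something like iso-8859-1 accepts any byte sequence at all, so it would match every input and tell you nothing.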
i really need to write the slides for my talk at oscon, which will cover exactly this sort of thing.
Comments
I haven't really tried testing it, since I just ran across it in a Moz bug (the one where they decided to be evil and submit Windows-1252 when the page begs for ISO-8859-1, if the user pastes in a curly quote), but apparently both Moz and IE will fill in a charset value if you put a hidden input named "_charset_" in your form. I'm not sure how accurate it is. Still, it's interesting to see Moz obey the encoding in an XML declaration, even on a local file, while IE ignores it. Even over HTTP, IE believes that in the absence of any encoding declared in a way that it obeys, it should submit Windows-1252 rather than ISO-8859-1.