php’s dumb xml parsing behavior

steve minutillo, author of feed on feeds, runs headlong into the execrable character encoding behavior of php’s xml parsing functions. hey, i was complaining about that just last year... (via phil ringnalda.)

and a related link, this article from the w3c explains how to deal with encoding issues in forms and has a nice regex that verifies whether a string is valid utf-8.
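for a rough idea of what that kind of check looks like, here's a minimal sketch in python. it is a byte-level pattern in the spirit of the w3c one, not a verbatim copy of it; the names `UTF8_RE` and `looks_like_utf8` are made up for illustration. it rejects overlong forms, surrogates, and anything past plane 16, and (as an assumption of this sketch) it also rejects control characters other than tab, LF, and CR.

```python
import re

# Each alternative matches exactly one well-formed UTF-8 character.
# Hypothetical names; the pattern is a sketch, not a copy of the W3C regex.
UTF8_RE = re.compile(rb"""
    (?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII: tab, LF, CR, printable
      | [\xC2-\xDF][\x80-\xBF]               # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*
""", re.VERBOSE)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to a well-formed sequence."""
    return UTF8_RE.fullmatch(data) is not None


if __name__ == "__main__":
    print(looks_like_utf8("caf\u00e9".encode("utf-8")))    # valid UTF-8 bytes
    print(looks_like_utf8("caf\u00e9".encode("latin-1")))  # lone 0xE9 byte
```

a string that passes is only *probably* utf-8: short runs of bytes from other encodings can be legal utf-8 by coincidence, so treat this as a sanity check rather than proof.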

here are some links culled from an i18n discussion on the twiki site:

Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:

Frank Tang's charset detection links - includes simple Perl UTF-8 detector based on legal codings

Excellent paper on Mozilla's 3-part algorithm using coding legality, character frequencies and two-character frequencies - detects the language as well as the encoding. Too complex for use on URLs, but looks very good.

Discussion on IRC auto-detection of charsets

Simple UTF-8 detector in C

CPAN:Unicode::Japanese - includes auto-detection for various Japanese charsets

CPAN:Encode::Guess - auto-detection from suitably dissimilar charsets (needs Perl 5.8)

Browser detection for forms input datatypes including useful undocumented JavaScript to check IE's current charset (try this out now if you are using IE - see Sandbox.TestCharset).

TextCat, tool for language detection - in Perl, OpenSource
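the "coding legality" idea that several of these links rely on can be sketched in a few lines of python: try to decode the bytes strictly under each candidate charset and keep the ones that succeed. this is only an illustration of the general approach, not how Encode::Guess or the Mozilla detector is actually implemented; `guess_encodings` and its default candidate list are assumptions made up for this example.

```python
def guess_encodings(data: bytes, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return the candidate charsets under which data decodes without error.

    Hypothetical helper: legality checking only, no frequency analysis.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on illegal sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


if __name__ == "__main__":
    # Pure ASCII is legal in every candidate, so the result is ambiguous.
    print(guess_encodings(b"hello", ("ascii", "utf-8")))
    # A non-ASCII UTF-8 sequence rules out ASCII.
    print(guess_encodings("caf\u00e9".encode("utf-8"), ("ascii", "utf-8")))
```

more than one match means the input is ambiguous, which is why the candidates need to be suitably dissimilar. a single-byte charset such as ISO-8859-1 accepts any byte sequence at all, so it is useless as a candidate here and belongs only as a last-resort fallback.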

i really need to write the slides for my talk at oscon, which will cover exactly this sort of thing.

» Tuesday, June 22, 2004 @ 8:44pm » code


Comments

I haven't really tested it, since I just ran across it in a Moz bug (the one where they decided to be evil and submit Windows-1252 when the page begs for ISO-8859-1, if the user pastes in a curly quote), but apparently both Moz and IE will fill in a charset value if you put a hidden input named "_charset_" in your form. I'm not sure how accurate it is, but it's interesting to see Moz obey the encoding in an XML declaration, even on a local file, while IE ignores it. Even over HTTP, IE believes that in the absence of any encoding declared in a way that it obeys, it should submit Windows-1252 rather than ISO-8859-1.

» Phil Ringnalda » Wednesday, June 23, 2004 @ 12:21am