June 22, 2004 archives
php’s dumb xml parsing behavior
steve minutillo, author of feed on feeds, runs headlong into the execrable character encoding behavior of php’s xml parsing functions. hey, i was complaining about that just last year... (via phil ringnalda.)
and a related link, this article from the w3c explains how to deal with encoding issues in forms and has a nice regex that verifies whether a string is valid utf-8.
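for a feel of what that kind of check does, here is a rough python sketch of a byte-level utf-8 validity regex. it is not the article's regex: the name `is_valid_utf8` is made up for illustration, the byte ranges are taken from the utf-8 definition (no overlong forms, no surrogates, nothing above U+10FFFF), and it assumes every ascii byte is acceptable, control characters included.

```python
import re

# Hypothetical helper, not taken from the W3C article.
# Each alternative matches one well-formed UTF-8 byte sequence.
_UTF8_RE = re.compile(
    rb"""
    (?:
        [\x00-\x7F]                        # ASCII (assumption: control chars allowed)
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences, overlongs excluded
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte sequences, overlongs excluded
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # ordinary 3-byte sequences
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte sequences, surrogates excluded
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte sequences, planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte sequences, planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte sequences, plane 16
    )*
    """,
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string is well-formed UTF-8."""
    return _UTF8_RE.fullmatch(data) is not None
```

with this sketch, `is_valid_utf8(b"caf\xc3\xa9")` is true and `is_valid_utf8(b"caf\xe9")` (the same word in latin-1) is false.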
here are some links culled from an i18n discussion on the twiki site:
Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:
- Frank Tang's charset detection links - includes simple Perl UTF-8 detector based on legal codings
- Excellent paper on Mozilla's 3-part algorithm using coding legality, character frequencies and two-character frequencies - detects the language as well as the encoding. Too complex for use on URLs, but looks very good.
- Discussion on IRC auto-detection of charsets
- Simple UTF-8 detector in C
- CPAN:Unicode::Japanese - includes auto-detection for various Japanese charsets
- CPAN:Encode::Guess - auto-detection from suitably dissimilar charsets (needs Perl 5.8)
- Browser detection for forms input datatypes including useful undocumented JavaScript to check IE's current charset (try this out now if you are using IE - see Sandbox.TestCharset).
- TextCat, tool for language detection - in Perl, OpenSource
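the simplest idea behind several of these detectors, guessing by coding legality, can be sketched in python: try each candidate charset and keep the ones that decode without error. this is only an illustration under stated assumptions, not the code of any tool listed above; `guess_encodings` and its default candidate list are made up here.

```python
def guess_encodings(data: bytes, candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Return the candidate encodings under which `data` decodes cleanly.

    Hypothetical helper: the candidate list is an assumption, and a clean
    decode only shows the bytes are legal in that charset, not that the
    charset is the right one.
    """
    matches = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on illegal byte sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches
```

an empty result means no candidate fits, and more than one means the guess is ambiguous, which is why this only works for suitably dissimilar charsets. a single-byte charset like latin-1 accepts any byte string at all, so it is useless as a candidate here.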
i really need to write the slides for my talk at oscon, which will cover exactly this sort of thing.
perspective
it popped into my head to check something recently. the number of blogs added to blo.gs, per day, since june 15:
+------------+-----------+
| added      | new blogs |
+------------+-----------+
| 2004-06-15 |      8118 |
| 2004-06-16 |      8170 |
| 2004-06-17 |      7362 |
| 2004-06-18 |      2512 |
| 2004-06-19 |      4299 |
| 2004-06-20 |      7802 |
| 2004-06-21 |      9264 |
+------------+-----------+