June 22, 2004 archives

php’s dumb xml parsing behavior

steve minutillo, author of feed on feeds, runs headlong into the execrable character encoding behavior of php’s xml parsing functions. hey, i was complaining about that just last year... (via phil ringnalda.)

and a related link: this article from the w3c explains how to deal with encoding issues in forms, and includes a nice regex that verifies whether a string is valid utf-8.
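for the curious, here is a minimal sketch of that kind of check in python. it is my own rendering of the general idea (match only well-formed byte sequences, rejecting overlong forms and surrogates), not a copy of the article's pattern, and the name `is_valid_utf8` is made up for illustration. unlike some versions of this check, it accepts ascii control characters.

```python
import re

# Matches a byte string made up entirely of well-formed UTF-8 sequences.
# Assumption: all ASCII bytes (including control characters) are accepted.
UTF8_RE = re.compile(
    rb"""\A(?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte, non-overlong
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    return UTF8_RE.match(data) is not None
```

in practice `data.decode("utf-8")` inside a try/except does the same job in python; the regex form is mostly useful in languages where that is not available.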

here are some links culled from an i18n discussion on the twiki site:

Now that I've looked a bit more, there are many algorithms out there for charset detection, but most are aimed at HTML page auto-detection, and may well not work well for URLs:

Frank Tang's charset detection links - includes simple Perl UTF-8 detector based on legal codings

Excellent paper on Mozilla's 3-part algorithm using coding legality, character frequencies and two-character frequencies - detects the language as well as the encoding. Too complex for use on URLs, but looks very good.

Discussion on IRC auto-detection of charsets

Simple UTF-8 detector in C

CPAN:Unicode::Japanese - includes auto-detection for various Japanese charsets

CPAN:Encode::Guess - auto-detection from suitably dissimilar charsets (needs Perl 5.8)

Browser detection for forms input datatypes including useful undocumented JavaScript to check IE's current charset (try this out now if you are using IE - see Sandbox.TestCharset).

TextCat, tool for language detection - in Perl, OpenSource
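the simplest of the approaches mentioned above, detection by coding legality alone, can be sketched in a few lines of python. this is only an illustration of the idea, not how any of the linked tools work: the function name `guess_charset` and the default candidate list are assumptions, and the result depends heavily on the order of candidates, since a permissive encoding will accept almost any input.

```python
def guess_charset(data: bytes,
                  candidates=("ascii", "utf-8", "shift_jis", "euc_jp")):
    """Return the first candidate encoding that decodes data cleanly.

    Candidates are tried in order, so stricter encodings should come
    first. Returns None if no candidate accepts the input.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on illegal sequences
        except UnicodeDecodeError:
            continue
        return name
    return None
```

for example, `guess_charset(b"hello")` stops at `"ascii"`, while utf-8 encoded japanese text falls through to `"utf-8"`. short inputs are often legal in several encodings at once, which is why the more serious detectors also look at character frequencies.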

i really need to write the slides for my talk at oscon, which will cover exactly this sort of thing.

» Tuesday, June 22, 2004 @ 8:44pm » code

perspective

it popped into my head to check something recently. the number of blogs added to blo.gs, per day, since june 15:

+------------+-----------+
| added      | new blogs |
+------------+-----------+
| 2004-06-15 |      8118 |
| 2004-06-16 |      8170 |
| 2004-06-17 |      7362 |
| 2004-06-18 |      2512 |
| 2004-06-19 |      4299 |
| 2004-06-20 |      7802 |
| 2004-06-21 |      9264 |
+------------+-----------+

» Tuesday, June 22, 2004 @ 8:52pm » blo.gs

i like the character used on the metro not-allowed signs. (i tried to post a similar image weeks ago, but my phone didn't like it. maybe it will take this time.)

» Wednesday, June 23, 2004 @ 3:19pm » snapshot

they call it deflation

hotmail is increasing their mail quota to 250MB. it was nice of google to come along and shake up the webmail industry, which had apparently gotten quite complacent about how much disk space they offered. (as a side-note, i’ve received 771MB of email since march 12, excluding mailing lists.)

one of the great things about pair networks is that they have periodically increased the storage and bandwidth for each account level without increasing prices.

i’ve decided that one way to cause problems for someone you don’t like would be to publish their email address as a place to contact for free gmail invites. (i wonder how many requests for gmail invites the mere mention of them will attract?)

» Wednesday, June 23, 2004 @ 4:55pm

trainedmonkey