with 'blo.gs' tag

plus c'est la même chose

you might remember a little over a year ago, when automattic acquired blo.gs. no signs of life yet.

blo.gs lives?

so today i happened to stumble across the news from last week that yahoo “transferred” blo.gs to automattic.

i guess i can finally let lapse the domain names that i had planned to build a blo.gs-like service on.

time for something new

it has been over two years since yahoo! bought blo.gs and almost a year since the public-facing side of the site stopped updating.

i mentioned earlier that i stopped using blo.gs, so i figure it is about time to actually turn what i have done for myself into a full-fledged website and let others use it. it was john gruber’s mention of how he uses rss feeds that pushed me over the edge.

it will not be the same as blo.gs. blo.gs was sort of infrastructural, but this new thing will be focused on telling you what’s new.

it will be built more around polling than pinging, at least for now. (maybe if it is wildly successful, dealing with the cost of all the ping spam will be worth the benefit of not having to poll.)

it will make use of the amazon web services. maybe not at first, but eventually it will have to scale and i can already see how aws would make that easier.

it will probably be built in python. why not? it gives me an excuse to learn something new.

one thing i have not figured out is what will be free. there will probably be paid accounts of some kind. and free accounts of some kind. the current expenses are $30 in domain names and $20/month in hosting (shared by this site and my other projects, which offset about $5/month). i am not willing to eat much more than that. (yeah, i am still quite the cheeseparer.)

i even have a cute web 2.0 name for it. but i will wait until there is something to show before spilling the beans on that.

and i am not quitting my day job, so it may be weeks or months before there is anything to show.

bye bye, blo.gs

since blo.gs appears to have finally given up the ghost on monday, and the quality of its update information was deteriorating before that, i have finally gone and implemented my own crawler to figure out update times for the sites i still care about.

it shares a bit of code with the crawler for blog-la-sphere.

i have had it in the back of my head to implement something that works more or less like blo.gs did. but first i have a wedding to enjoy.

not quite a birthday celebration

blo.gs is offline again, and the only notice that’s up is the one from the last time it had an extended downtime. (or was it the time before that?)

in nine days, it will be one year since i sold the site to yahoo! (it was announced about two weeks after that.)

2005 in review: life

three best things about the last year: getting my camera, meeting shannon, and meeting celia.

honorable mention would go to selling blo.gs. (but after six months, it means the most to me as the catalyst for getting the camera.)

worst thing about the last year: the intense dissatisfaction.


blo.gs is back from the dead, including search. a quick test of searching for “los angeles” shows a bit of a spam problem, though. and the old stand-by for finding spam: searching for “paris hilton.”

it beats as it sweeps as it cleans

del.icio.us is the latest site to get hoovered up by yahoo. (and on that note, it looks like blo.gs is limping back to life. the xml version of my favorites is working again.)

somewhere i have a list of links that i was going to fold into a rumination on startups, lifestyle businesses, and the current trend of turning a small website into a job with a big web company. (and by that, i mean yahoo. has google made any of those sorts of acquisitions since blogger? i guess cnet has gotten into the game with their acquisition of consumating.)

one thing that is interesting to note is how many of these have sprung from people not located in the bay area. are they out of ideas up there?

proof of life

hey, they finally figured out how to post some sort of notice on the site:

Note to blo.gs users: While it’s been exciting to watch the number of blogs and feeds in the wild grow so rapidly, as you can imagine, this has created scaleability challenges that need to be addressed by some fundamental changes to the way blo.gs runs.

Availability of the ping and cloud interfaces has been restored, but the website is going to take a few more days. However, when it comes back up, search will be re-enabled, so the website should be restored to full functionality. Thanks for being patient with us.

The blo.gs team — December 7, 2005

bruised with adversity

apparently the blo.gs database tanked late last week, and while the site is mostly down still, the various xml flavors of the favorites lists are available. but they’ve (temporarily?) restarted the main database of blogs, so my list of favorites is truncated to only those entries that correspond to entries in the new database — but without those actually being the correct entries. so instead of jeremy zawodny leading to his site, it leads to this site.

it’s like what happened back in october, but this time they’ve (at least temporarily) reset the database completely, rather than rolling back to where it was in june.

word has it that blo.gs is still not “switched over to yahoo’s infrastructure yet,” by which i guess they mean it is running on some box under someone’s desk (figuratively, not likely literally) instead of a more properly-managed set of machines.

adding favorites on blo.gs

the blo.gs search is still down (i guess yahoo has just decided to abandon the user-facing part of the service), but you can still add a blog to your list of favorites using this bookmarklet.

it finally dawned on me i could use this to re-add the few blogs that got lost during the database mishap they had, which you can read about.. nowhere. they’ve never acknowledged it.

the next thing i need to do is write something to download my list of favorites, go out and actually check if the blogs it claims were updated were actually updated, and then generate my list of favorites with (more) accurate update information. jeremy (still) isn’t quite the blogging machine that blo.gs claims he is.
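a rough sketch of the checking part, in python. it assumes the favorites list has already been downloaded and the pages are fetched somewhere else; `UpdateChecker` and its in-memory store are names made up for illustration, not anything blo.gs provides.

```python
import hashlib


class UpdateChecker:
    """remember a fingerprint of each page and report when it really changes."""

    def __init__(self):
        self.seen = {}  # url -> sha-256 hex digest of the last body seen

    def really_updated(self, url, body):
        """return True if body (bytes) differs from the last one seen for url."""
        digest = hashlib.sha256(body).hexdigest()
        changed = self.seen.get(url) != digest
        self.seen[url] = digest
        return changed


checker = UpdateChecker()
first = checker.really_updated("http://example.com/blog/", b"<html>post one</html>")
again = checker.really_updated("http://example.com/blog/", b"<html>post one</html>")
later = checker.really_updated("http://example.com/blog/", b"<html>post two</html>")
```

a page that has never been seen counts as updated. hashing the whole page will over-report on sites with rotating sidebars or ads, so a real version would probably need to hash only the part of the page that holds the posts.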

worked a coin from the cold concrete

“breaking the web wide open!” is a long article by marc canter about new open standards on the internet. i’m named as a “mover and shaker” in the pinging space, which i think says volumes about the limits of that space.

what color is my parachute?

evan williams (of blogger and odeo) says that most people read blogs via their web sites, and not an rss reader. (he doesn’t cite any evidence, but i believe it is probably true.) this was in response to information from bloglines about how many feeds “matter.”

i’m one of those — i like reading personal blogs via their web pages, not in an rss aggregator. it is why i built blo.gs. it is why i might have to build something to replace blo.gs, now that blo.gs has become nearly useless for tracking updated blogs.

why do i prefer to read blogs via their sites and not rss? i like to see jason kottke’s remaindered links, reviews, and longer posts in their overall flow. i like to see the random pictures and latest comments on blogdowntown. i like to see what picture of herself shannon has decided to feature now. i like to look at the comments left on recent posts i found interesting by jeremy zawodny.

and for all of the blogs i read, the design of their site gives me a little context, and a little reminder of who they are. (not in the “who the hell is that?” sense, but the “hey, shannon likes pink!” sense.)

back to the past

looks like blo.gs has had some sort of major hiccup. the latest updates are from june 14, 2005. also known as the day yahoo took over.

what’s odd is that it is only reporting about 7.5 million blogs, and i’m pretty sure the database was bigger than that at the time.

maybe once they get it sorted out they’ll update that news page that has been so quiet these last few months. (and add that link to happygofun they’re supposed to put on the about page.)

the lack of communication is costing them users.

when leonard got in touch with me to pick my brain a little bit about the blo.gs acquisition experience, i didn’t have an inkling that he would be joining yahoo! along with the rest of the upcoming.org team.


when i look at the hobbled state of blo.gs, i lament, a little bit, my decision to not go to yahoo with the site.

ugh. so much more i could say. but no more angsty bullshit.

blo.gs t-shirt

the folks at yahoo! sent a couple of blo.gs t-shirts to me a while ago. i held off on showing it off here so that it would be a surprise when i sent one to albert, who designed the logo, and it took me way too long to get around to that.

now if only they’d fix the service. i’m sure the addition of the rss ping data has made it more useful (if overzealous) to those plugged into the feedmesh stuff and building services on top of that, but i’m finding it pretty useless for personal blog update tracking these days.

brad implemented a stream of atom updates from livejournal, which is sort of like the blo.gs cloud interface on steroids. nifty.

foo camp is this weekend, so that means that the feedmesh thing is a year old now. and speaking of that, put me in the jason shellen “not invited, but whatever” camp and not the russell beattie “not invited, and they kicked my puppy” camp.

blo.gs is less broken

the folks at yahoo did something to make blo.gs a little less broken. but some people’s blogs, like jeremy zawodny’s, still show spurious updates. based on the info page for jeremy’s blog, i bet it is because it has decided that his comment feed is his main rss feed. which isn’t a surprise, since that rss feed contains <link>http://jeremy.zawodny.com/blog/</link> in the <channel>.

it’s even worse when an entry gets hijacked by a feed that isn’t really the right one.

blo.gs is broken

the crazy folks at yahoo appear to have hooked up a new data source to blo.gs — maybe their rss ping service? — and now a bunch of blogs like this guy’s are constantly showing up as updated in my list of blogs.

blo.gs: forbes best of the web pick

blo.gs is a forbes.com best of the web pick, although it lost out to technorati as the meta blogs category favorite.

yahoo has extended the blo.gs stream interface so consumers can add filters. cool!

this will be good for cutting the bandwidth in the whole feedmesh process — a consumer could just tell blo.gs not to pass it entries for which it was the source. it’s not everything needed for duplicate suppression, but it is a good way to cut out big chunks of them.
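the filtering idea is simple enough to sketch in python. the shape of a ping here (a dict with "url" and "source" keys) is an assumption for illustration; it is not the actual format of the blo.gs stream or its filter syntax.

```python
def without_source(pings, excluded_source):
    """yield only the pings that did not originate from excluded_source."""
    for ping in pings:
        if ping.get("source") != excluded_source:
            yield ping


# hypothetical stream: two pings that consumer-x itself sent in, one it has not seen
stream = [
    {"url": "http://example.com/a/", "source": "consumer-x"},
    {"url": "http://example.com/b/", "source": "blo.gs"},
    {"url": "http://example.com/c/", "source": "consumer-x"},
]
kept = list(without_source(stream, "consumer-x"))
```

this only drops the echo of a consumer's own pings. the same update arriving by way of two other services would still show up twice.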

breathing room

(graph: drop in bandwidth after sale of blo.gs)

okay, one more little blo.gs tidbit: the effect of being rid of the service on the bandwidth usage of my server.

the final little spike in outgoing bandwidth (the blue line) is when the final dump of the data was downloaded by the yahoo folks.

the current bandwidth usage is in the 30kbits/sec range. it was generally over 1Mbits/sec before, or at least that’s how it looks from the graphs. (it may have been higher — it appears to get sort of flattened out over time in rrdtool.)


i owe some thanks.

to don, shane, scott, paul, lev, jeremy, and all the other yahoos that made this happen. they really made clear that they got the service from the beginning, and have a great trajectory for the future.

the other prospective buyers, who made the final decision a lot tougher than it could have been. there are a lot of great people and companies out there in this space, and it was good to know that i had options.

albert at happygofun for the most recent redesign of the site.

matt, dunstan, simon, and all of the other users of blo.gs who helped prod me into making things better. and reminded me to keep it simple: not everyone wants an aggregator.

bob of pubsub for jumping on the feedmesh bandwagon with both feet, even if he wasn’t invited to foo camp.

dave and dan for weblogs.com and blogtracker, respectively, for providing the original inspiration (and data feed) for blo.gs.

and finally, my mom and dad for raising me right.

blo.gs has been acquired by yahoo!

the sale of blo.gs has been completed, and i'm proud to announce that yahoo! has acquired the service. as of right now, give or take a few minutes, yahoo! is running blo.gs.

this is the sort of good home that i was looking for — yahoo! obviously has the resources to run and improve blo.gs in pace with the incredible growth of blogs (and syndication in general), and in talking with them it was also clear that we had some of the same vision for the future of the service and the ping/notification infrastructure.

for users of the website and the cloud interface, nothing much is changing. the service will continue to be completely open, and both yahoo! and i hope you continue to use it and help it grow.

even though i’ll no longer be operating blo.gs, i'm not going to disappear from the community. i’m still very interested in blogging and syndication, and believe that blo.gs will continue to have a major impact as a key player in the evolving ping and blogging infrastructure.

some people have asked about the privacy policy during the transition. yahoo! is keeping the blo.gs privacy policy. the data collected on blo.gs will continue to be subject to that privacy policy and you will be given the opportunity to consent to future changes.

more: some thanks for those who made this all happen.

looking for blo.gs listeners

as mentioned over on the blo.gs news page, i’m looking to get in touch with folks using the blo.gs cloud interface. so if that’s you, please drop me a line. thanks!

eweek has covered the feedmesh project and done a little editorializing about it.

that’s the sort of thing that sparks life on a project, of course.

blo.gs dashboard widget

dashboard is pretty neat, but it is a pretty raw development environment. i banged together a semi-functional blo.gs widget, and one thing that becomes clear very quickly is that apple should have packaged together a bunch of common stuff into an includable library.

a few points of frustration: grabbing values out of xml (rss in this case) using the dom is annoyingly verbose, the javascript date object is surprisingly limited (and apple didn’t have the smarts to make it easy to get localized date strings), and doing flexible layouts involves lots of tedious hand-coding.

those are just initial impressions, though. it’s possible that it is all much easier once you’ve got a widget or three under your belt.

matt mullenweg has heard that the new owner of blo.gs is cool.

i can’t wait to read on om’s weblog how much i got for it.

lessons learned

it doesn’t help matters that my php.net address (and the aliases i’m on) seems to be getting flooded with some sort of strange german political spam.

blo.gs sale to be finalized within next month.

so many bounce messages and vacation messages and spam whitelisting services and other automated responses. i was far too nice with my notify-before-data-transfer clause in the privacy policy.

i guess i know now what i’ll be doing with the next few weekends.


once i’ve finally found a new home for blo.gs (and i resisted the temptation to do some sort of april fools gag about that), i’ll once again have a dedicated server with some bandwidth to spare.

maybe i should dust off one of the domain names i have laying around, like microcommunity.net, and figure out what to do with it.

another domain i registered and actually started building the site for was battleforlosangeles.com. the idea was to do a site focused on los angeles politics and the mayoral race in particular. but then the mayoral primary election got underway, and i quickly realized that folks like kevin roderick of la observed and brian hay of the hertzberg daily digest were going to do a much better job of covering things than i had the time (or knowledge) to do.

and keymovies.com is the domain i registered to register for last year’s home entertainment expo, and to otherwise be able to do my video store research via something a little bit removed from this site.

but before i do anything new, i will probably tinker with this site some more. i need to make it possible to browse using the tags, and probably will restructure the archives while i’m at it. maybe i’ll think about packaging up the code for this site as yet another open source blogging tool.

or maybe i should do something that doesn’t involve programming. or computers.

end game

so i’ve hit the end game in selling blo.gs.

this is the part where i get to break out the magical measuring scale and weigh issues like the value of stock in startups vs. the value of selling to companies that are less likely to disappear (or at least change hands) in a year or two. (look, my dotcom scars are showing!)

this is a good problem to have, of course. multiple companies interested in buying the service that i am comfortable having own it.

part of me wishes one of them would manage to distinguish themselves with an offer that was clearly better than the others. but the other part of me is glad that none have, because i think i would be immediately suspicious of any such offer.

on an unrelated note, i found and fixed an ugly bug in libmysqld. it’s one of those bugs i found just by simple code inspection, and it was writing the test case that was the hard part.

more fun with stats

the number of unique blogs added to or updated in the blo.gs database, per day:


(the data for this is a couple of months old because i stopped logging this information. the way i was logging it was having trouble with the volume of data, and i lacked the interest to do it right.)


i know i was (and am) a jerk for saying “i don’t have a price in mind, so don’t ask.”

so i’m not surprised when people ask anyway.

but i was telling the truth.

at least as much truth as that statement can be.

i am tempted, when people ask, to give a number that i believe to be absurd.

after all, maybe somebody will not find it so absurd.

(and in case you don’t think you would, it would rhyme with cotillion.)

blo.gs growth

some people have been interested in the growth of the data that blo.gs collects. here’s some numbers. this is from a snapshot of the database from sunday morning, so it isn’t to-the-minute.

of the 6,602,676 entries in the blo.gs database snapshot i used, rss/atom feeds were known for 2,512,959 of them. that doesn’t mean more didn’t have rss/atom feeds, just that blo.gs didn’t know about them.

of the blogs that were updated in the last 30 days of the snapshot data, 71% had known rss/atom feeds.
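the overall share is not spelled out above, but it follows from the two counts. a quick bit of python arithmetic, using only the numbers already given:

```python
total_entries = 6_602_676      # entries in the snapshot
with_known_feeds = 2_512_959   # entries with a known rss/atom feed

share = with_known_feeds / total_entries
share_pct = round(share * 100, 1)
print(f"{share_pct}% of all entries had a known feed")
```

so the recently updated blogs were far more likely to have a known feed than the database as a whole.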

this is the number of blogs added to the database each month. i wasn’t tracking this before september 2003, so the earlier number is all the blogs prior to that.


the big jump in january is from getting a feed that includes the livejournal data from pubsub.com.

here’s the count of blogs last updated in each given month:


and based on the ip address, here’s the top hosts:


needless to say, the entries for the spammer are gone now. blogdrive.com shows up multiple times because they use multiple ip addresses.

here is where i would insert all sorts of caveats about how these numbers are derived if i cared to hold people’s hands when dealing with numbers like this. these are free numbers, and you’re getting what you paid for them.

price vs. personality

i’ve finally gotten through the first wave of emails from people interested in blo.gs.

one of the reasons i kept blo.gs going as long as i have is that i’ve liked having it out there as an independent service. it’s always been a neutral ground as far as the various vendors of blog publishing software, or blog search engines.

there’s one other crowd of people interested that i have very mixed feelings about: the “search engine optimization” folks.

there’s four components to the site that have people interested:

  1. the database of users
  2. the database of blogs
  3. the ping.blo.gs endpoint
  4. the service

my impression of each interested party is strongly colored by the interest they demonstrate in each component.

fortunately, it appears that there is enough serious interest from people and companies that i would be happy selling out to that i won’t really have to wrestle with this issue too seriously.

and here’s one thought that i passed along to a couple of people: united airlines lost $326 million last year, so perhaps your lowball offer “because you’ve lost so much already” (paraphrasing) would be even more attractive to them.

matt haughey noticed a bogus blogspot site in his spam. when i was on the suicide mission known as homepage.com (second-generation geocities), it was software pirates that caused us huge problems with their automated signups. i can imagine it’s only worse with the spammers and scam artists these days.

i’m surprised that blogger doesn’t seem to have done much to prevent this. you can see the automated crap via searches at blo.gs for things like “herbal” and “hilton”.

it would be helpful if services like blogspot published information on sites that are deleted as well as updated, so services like blo.gs, technorati, feedster, pubsub.com, etc could drop the sites from their databases, too.

on losing money

one of the dumb things i did on the initial page about the blo.gs sale was to just put the overall expense and income figures for the site, without listing the monthly net income (which is there now). i know better than to throw out numbers like that, and it has certainly influenced the numbers that people have thrown back at me. (one somewhat remarkable thing is how many people have picked the same number.)

i have a hard time writing and talking about this because the notion of not talking about money is so deeply ingrained in me. this came up in conversation with my parents last weekend, when my mom was talking about a neighbor who was very free with how much their kids made. i’m pretty sure my parents haven’t known how much i make since i graduated from college.

in any case, it should be pretty apparent that “losing” money is not a big deal for me. there are people in the world who put the accumulation of money at or near the center of their life’s agenda. i’m not one of those people. (and somehow i keep falling ass-backwards into it. go figure.)

from a blog entry about xfn from david berlind of zdnet: “Web-based RSS aggregation provider Blo.gs is one company that's ahead of the curve in this respect, explicitly providing XFN support in its blogroll generator (I haven't tested it.)” (via tantek çelik.)

okay, there’s a couple of things wrong with that sentence. but i’ll consider it timely press. and the entry is a really good overview of why xfn is cooler than the social networking services.

why sell

there have been two common responses to my decision to sell blo.gs. the first has just been “why, the hosting costs or the time?” the short answer is the time. while i tossed out $3500 as the cost of the service so far, it’s good to realize that goes back to november 2002, and includes the original cost of the server that it is running on. cash-flow wise, the service loses about $20 per month (together with this site, since they’re hosted on the same machine and both run google ads). if i keep this server, i’ll actually be spending more on hosting once blo.gs is gone.

but that’s not to say that the site takes a lot of time, either. the only real ongoing maintenance is handling requests to clean up duplicate or bad entries, and dealing with the occasional database hiccup. but it could take a lot of time, and this is where i think the site would be better off without me. i’m just not willing to put in the time to make the site and its services better.

so that brings me to the second common response, which has been offers to host the site while letting me keep control of it. as you might guess from my time vs. money explanation above, that doesn’t do anything to address why it is i’m selling the site. so while i appreciate those offers, they don’t interest me.

i’m sure i’ll have to say more about this later. the response so far has been interesting.

blo.gs is for sale

i’ve decided to unload blo.gs. there are more details on the site.

URI::Fetch is a new perl module from ben trott (of movable type renown) that does compression, ETag, and last-modified handling when retrieving web resources. the lazyweb delivers again.

speaking of that, i found i had to do one additional thing to my php code that fetches pages because of a non-existent workaround for server bugs in the version of curl i’m using. so when blo.gs fetches a page to verify a ping and gets a particular compression-related error, it goes back out and requests the page again without compression.
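the retry logic looks something like this. it is a python sketch rather than the php and curl code blo.gs actually runs; `CompressionError` and the `fetch` callable are stand-ins i made up so the example does not depend on any particular http library.

```python
class CompressionError(Exception):
    """stand-in for whatever compression-related error the http client raises."""


def fetch_with_fallback(fetch, url):
    """try a compressed fetch first, then retry once without compression."""
    try:
        return fetch(url, compressed=True)
    except CompressionError:
        return fetch(url, compressed=False)


def flaky_fetch(url, compressed):
    """fake server that chokes on compressed requests, for demonstration only."""
    if compressed:
        raise CompressionError("bad gzip stream from server")
    return b"<html>ok</html>"


body = fetch_with_fallback(flaky_fetch, "http://example.com/blog/")
```

only the one specific error triggers the retry. anything else (timeouts, 404s) is left to propagate, so a dead site does not get hit twice.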

congrats to mark fletcher and the bloglines team on being acquired by ask jeeves.

once upon a time, a business development person at ask jeeves called me to talk about blo.gs. as i recall, the gist of the conversation from my side was that the site made no money, i spent no time on it, and there really wasn’t much to it.

this is another of those obvious-in-retrospect services that i can now kick myself for not building. having the vision, courage, and patience to execute on the idea is the hard part, of course.

at last year’s foo camp, when the limping feedmesh thing kicked off, someone suggested that we set up a yahoo group for discussion, and i made some comment about preferring something “more real.” a funny thing to say when one of the guys who built it (mark) was in the room. but in hindsight, i’m glad nobody wasted the time trying to create anything more real.

who’s next? i would think pubsub.com would be a likely acquisition for someone.

i also find this acquisition funny because there was a time when i almost ended up working at ask jeeves because i knew the ceo at the time. i got an early-morning call (or what was early-morning for me those days) where i agreed to fly up to interview, but i turned around a few hours later and cancelled, once i was awake and realized that i had no desire to work at a place using iis and asp, or relocate to the bay area.

the bandwidth bump

blue is the outgoing bandwidth, green is the incoming bandwidth. i’m not sure why there was a big initial spike.

repeating myself

for the blo.gs cloud service, i had written a little server in php that took input from connections on a unix domain socket and repeated it out to a number of external tcp connections. but it would keep getting stuck when someone wasn’t reading data fast enough, and i couldn’t figure out how to get the php code to handle that.

so i rewrote it in c. it’s only 274 lines of code, and it seems to work just fine. it was actually a pretty straightforward port from php to c, although i had to spend some quality time refreshing my memory on how all the various socket functions work.
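this is not the c program, but the core of the slow-reader problem can be sketched in python: make each client socket non-blocking, keep a bounded buffer per client, and drop anyone who falls too far behind. the `Repeater` class and the 64k limit are assumptions for illustration, and the unix domain socket input side is left out.

```python
import socket


class Repeater:
    """fan one input stream out to many clients without letting a slow one stall the rest."""

    def __init__(self, max_pending=65536):
        self.max_pending = max_pending  # bytes we are willing to buffer per client
        self.pending = {}  # client socket -> bytearray of unsent data

    def add_client(self, sock):
        sock.setblocking(False)
        self.pending[sock] = bytearray()

    def broadcast(self, data):
        """queue data for every client, then send whatever each will take right now."""
        for sock in list(self.pending):
            buf = self.pending[sock]
            buf += data
            try:
                sent = sock.send(buf)
                del buf[:sent]
            except BlockingIOError:
                pass  # the client's socket buffer is full; keep the data queued
            except OSError:
                self.drop(sock)  # the client went away
                continue
            if len(buf) > self.max_pending:
                self.drop(sock)  # too far behind: disconnect instead of blocking

    def drop(self, sock):
        self.pending.pop(sock, None)
        sock.close()


# a socketpair stands in for one external tcp connection
server_side, client_side = socket.socketpair()
repeater = Repeater()
repeater.add_client(server_side)
repeater.broadcast(b"http://example.com/blog/\n")
received = client_side.recv(1024)
```

in this sketch leftover data only goes out on the next broadcast. a real server would also use select or poll to notice when a client becomes writable again.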

there’s a visible bump in the graph of my outgoing bandwidth from after this was fixed.

rumor du jour

rumor has it that six apart (makers of movable type blogging software and typepad blogging service) are going to buy live journal (and by live journal, i think they mean danga interactive). it seems like it would be a good fit from what i know of the people involved in both companies and their development platforms. (they’re both perl shops.) and six apart would be getting some of the folks doing the most interesting low-budget, open-source web scalability work that i’ve seen.

rumors that anyone is about to buy blo.gs are completely untrue. unless they aren’t.

weblog software pinging idea

it would be nice if weblog software did not start pinging services like blo.gs until the third entry or so. that would eliminate most of the requests i get to remove entries that just appeared because someone was testing new blogging software at a temporary location, or testing the install of new blogging software before switching their main site over to it. the same thing goes for blogs appearing in blogger’s list of changed sites.
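on the blogging software side, the check would be about this small. a python sketch; the function name and the idea of counting published entries are my assumptions about how a tool might implement it.

```python
def should_ping(published_entries, threshold=3):
    """only start notifying ping services once a blog has a few real entries."""
    return published_entries >= threshold


decisions = [should_ping(count) for count in (1, 2, 3, 4)]
```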

and i am endlessly amused by the variations of with-and-without index.php or index.html, or 'http://example.com/blog' vs. 'http://example.com/blog/'. i need to finish up the handling of redirects found when checking if the site was updated, and probably add code to handle the presence or absence of index.php (or index.html) like it handles the 'www.' prefix.
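here is roughly what that normalization could look like in python. this is a sketch, not the blo.gs code: `normalize_blog_url` is a made-up name, and it assumes the url points at a blog's front page rather than an individual post.

```python
from urllib.parse import urlsplit, urlunsplit

INDEX_FILES = ("index.php", "index.html")


def normalize_blog_url(url):
    """collapse the common spellings of one blog address into a single key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # keep the slash, drop the file name
            break
    # add a trailing slash unless the last segment looks like a file
    if not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        path += "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "http://example.com/blog",
    "http://example.com/blog/",
    "http://www.example.com/blog/index.php",
    "http://EXAMPLE.com/blog/index.html",
]
keys = {normalize_blog_url(v) for v in variants}
```

all four variants end up as the same key. it does not follow redirects, so two addresses that only a server-side redirect ties together would still count as different blogs.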

three out of four ain’t bad

mysql 4.1.8 is out and it includes a fix for the bug that had been plaguing blo.gs. it also contains a fix i made for another bug.

i now have code in linux, mysql, and php. if only my patch to apache had been accepted, i’d have code in the whole LAMP stack. (the CookieDomain configuration setting was finally added about two years later, but not using my patch.)

glenn re-raises the rss bandwidth issue. i’ll admit i’m lazy, and haven’t implemented if-modified-since handling for my own aggregator (which only polls hourly, and only news sites — i don’t read blogs via an aggregator), or if-modified-since handling for the rss feeds i produce (particularly the scraped feeds).

but one thing i have done is implement a system that serves up 403 responses to people who poll the scraped feeds more than once an hour. it blocks over ten thousand requests a day. what’s amazing to me is the number of persistent attempts in the face of repeated errors. for my polling, i am emailed errors when they occur, so i would know pretty quickly when i’ve been blocked.
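the blocking logic amounts to remembering when each client last got a real response. a python sketch of the idea (my actual implementation is not shown here; `PollLimiter` and keying on client plus feed are assumptions):

```python
class PollLimiter:
    """allow each client one fetch of a feed per interval; everything else gets a 403."""

    def __init__(self, min_interval=3600):
        self.min_interval = min_interval  # seconds; 3600 is the once-an-hour limit
        self.last_allowed = {}  # (client, feed) -> time of the last allowed request

    def status_for(self, client, feed, now):
        """return the http status to send for a request arriving at time now."""
        key = (client, feed)
        last = self.last_allowed.get(key)
        if last is not None and now - last < self.min_interval:
            return 403
        self.last_allowed[key] = now
        return 200


limiter = PollLimiter()
statuses = [
    limiter.status_for("10.0.0.1", "/feed.xml", now=0),
    limiter.status_for("10.0.0.1", "/feed.xml", now=1800),
    limiter.status_for("10.0.0.1", "/feed.xml", now=3600),
]
```

blocked requests do not reset the clock in this version, so a client hammering the feed every minute still gets one good response an hour.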

blo.gs does make if-modified-since requests. of course, since it is only making requests when there should be changes (it only does so in response to pings, it doesn’t poll), it doesn’t make much difference.
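for anyone who has not implemented it, a conditional request is not much code. this is a python standard library sketch, not the php that blo.gs uses; the function names are made up, and the etag handling is an extra beyond plain if-modified-since.

```python
import urllib.error
import urllib.request


def conditional_headers(last_modified=None, etag=None):
    """build the validator headers for a conditional GET."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_if_changed(url, last_modified=None, etag=None, timeout=10):
    """return (body, last_modified, etag); body is None when the server says 304."""
    request = urllib.request.Request(url, headers=conditional_headers(last_modified, etag))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("Last-Modified"),
                response.headers.get("ETag"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # not modified: keep the validators we already had
            return None, last_modified, etag
        raise


headers = conditional_headers("Sat, 01 Jan 2005 00:00:00 GMT")
```

the caller stores the returned last-modified and etag values and passes them back in on the next fetch.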

feedster now open for pinging

feedster now has a ping interface. no changes.xml or any other way for others to get the ping data, naturally. i almost feel guilty for the whole feedmesh thing not getting any traction. but blo.gs is currently hobbled by a bug in mysql, and i have next to no energy or enthusiasm for dealing with computers outside of work. which isn’t to say that it is much greater inside of work. i did finally make blo.gs self-regulating to the point that when the server crashes and has to rebuild, it doesn’t just grind to a halt.

one thing jeremy noted about web 2.0 is that he plugged feedmesh at the “dialing on the app tone” session. too bad the effort seems stuck. no other sites appear to have moved any further along on sharing the ping data that they are currently receiving. the only change is that i have introduced a stream of changes that nobody is using.

why i don’t offer paid accounts (and discourage javascript blogrolls)

the recent migration woes are why i don’t offer paid accounts and discourage javascript blogrolls. the latter are a pretty bandwidth-intensive service to provide, which could only really be justified by doing the former, which just creates people who have a legitimate reason to complain when you don’t let the service take over your life. (there’s reactions from some users to the original pre-weekend “all clear” when that turned out to not quite be the case.)

blo.gs pings via mod_pubsub

with a pubsub client, you can now get the feed of pings from http://localhost:8000/kn/pings. there's no end-user interface on it, but you can use the subscribe.py client from the mod_pubsub distribution.

this may change, or go away. right now, it only includes results from pings, not the result of aggregated update information.

i have taken this down. the thought from people that know is that something like xmpp or a custom protocol will be more appropriate.

the blo.gs cloud interface

there are three services that are currently hooked up to the blo.gs cloud interface. there are 96 different hosts that polled one of the changes files yesterday.

automated blogspot.com accounts

now here’s an experience i can relate to from my homepage.com days (it was a second-generation geocities-like service) — i just happened to stumble upon a large series of blogspot.com sites (a few hundred) that were all just a bunch of repeated links to another site. (which looks like a personal site, which is even more odd.)

now i just eliminated all of the blogs with the similar names from the blo.gs database (it was a couple of characters followed by a number), but if they had been clever enough to give each account truly distinct names, this is the sort of mess that would have been difficult to clean up. (for me — presumably blogger could simply nuke all of the blogspot.com sites that had repeated links to the sites to which these blogs were linking.)

it’s all about the benjamins

technorati has reportedly raised some venture capital (via anil dash), but the thing that struck me is the valuation and amount of investment: $12 million valuation, $6.5 million investment. if the $12 million is the post-money figure, $6.5 million buys about 54% of the company, which would put the VCs in control.

i wonder if i’m being dumb for not pursuing blogging-related dollars.

spammer fallout

one fallout effect of the ping spam i eliminated yesterday is that various blo.gs search queries appear as search results on the major search engines. so now when the search page gets a referral from one of those search engines, it redirects over to the fbi. originally, i was just making them submit the search manually. but these are obviously not bright people, and they just went ahead and did the search on blo.gs anyway. they really are that desperate for their free porn.

here’s a fun search that brought up a blo.gs search result: “pictures of nude men in fishing waders”. in fact, it’s the top listing on google right now. (soon to be supplanted by this entry, i suppose.)

needless to say, the search engines are now excluded from the search page, and this will be a non-issue once the entries drop out of the search databases.

spamming the blog listing services

so as i’m cleaning up and optimizing blo.gs, i’ve stumbled across a huge set of spammed entries that came in via weblogs.com. this particular bunch will be easy to block going forward, and it inspired yet another couple of ping-spamming defenses.

one of the things i’m working on is doing better logging of some things so that i can more easily run reports that will show the statistical outliers. for example, most IP addresses only ping blo.gs a few times a day, at most. so addresses that do more than that are ripe for investigation.
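a rough sketch of the kind of outlier report i have in mind, in python. the `(date, ip)` log format, the `noisy_addresses` name, and the threshold of ten pings a day are all assumptions made up for illustration, not what blo.gs actually logs or uses.

```python
from collections import Counter


def noisy_addresses(pings, max_per_day=10):
    """Return (date, ip, count) for addresses that pinged more than max_per_day times.

    `pings` is an iterable of (date, ip) tuples, one per ping received.
    Both the tuple layout and the default threshold are hypothetical.
    """
    # Count how many pings each address sent on each day.
    counts = Counter(pings)
    # Keep only the (date, ip) pairs above the threshold, in a stable order.
    return sorted(
        (date, ip, n) for (date, ip), n in counts.items() if n > max_per_day
    )
```

anything this returns is only a candidate for investigation. a busy hosted-blog service pinging on behalf of many users would show up here too.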

also, since all updates are now going through a common function, i don’t need to worry about updating rules to handle direct pings as well as information imported from weblogs.com and blogger.com.

i’m taking advantage of some nifty new features of mysql 4.1 to optimize some bits of the site, like b-tree indexes on in-memory tables. i’ve also converted everything over so that the database realizes that it is storing utf-8 data.

so the site is back up, and hopefully won’t need to be taken down for as long a period as it was today, at least for a while. i hope i’m done with the schema changes to the main blogs table, though. it takes about 20 minutes to rebuild the table each time i change the schema.


last month, the revenue from google adsense ads was almost exactly equal to my monthly hosting bill.

i’m getting too old for this

blo.gs appears to be under some sort of strange indirect attack. it’s being hammered by various robots trying to index search results. so there’s requests from the feedster bot, the technorati bot, and all sorts of bots i’ve never even heard of. oh, and the yahoo!, google, and msn bots have all decided to get in on the action.

it is extremely strange. perhaps it is supposed to be some sort of clever distributed denial of service?

traffic growth of blo.gs server

the thing that fascinates me most about the traffic growth on this server is how much incoming bandwidth it chews. (mostly the result of having to pull down web content in response to ping requests on blo.gs, i believe.)

i finally enabled compressed content handling for the pings. apparently my install of php and libcurl finally caught up with the times when i wasn’t looking. now we’ll see what blows up.
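for the curious, the client side of compressed content handling is small. the site itself does this through php and libcurl. what follows is only a python sketch of the same idea, and `decode_body` is a name made up for this example. the client sends `Accept-Encoding: gzip, deflate` and then undoes whatever the `Content-Encoding` response header says was applied.

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Undo the compression named by a Content-Encoding header value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # "deflate" is supposed to mean zlib-wrapped data...
            return zlib.decompress(raw)
        except zlib.error:
            # ...but some servers send a raw deflate stream instead.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No (or unknown) encoding: hand the bytes back untouched.
    return raw
```

the fallback for raw deflate streams is there because servers disagree about what "deflate" means, which is a good part of why gzip is the safer one to ask for.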

does LWP really not have gzip or deflate handling? that seems very odd. i couldn’t find any other perl package that implements it, either.

blogging-related business seeks stupid money

jeremy pointed out the spurt in investments of RSS-related companies lately (although i think blogging-related would be a better data set). if i can just convince some venture capital that opening a video store is part of the overall strategy for blo.gs, maybe i’ve solved the start-up capital problem.


it popped into my head to check something recently. the number of blogs added to blo.gs, per day, since june 15:

+------------+-----------+
| added      | new blogs |
+------------+-----------+
| 2004-06-15 |      8118 |
| 2004-06-16 |      8170 |
| 2004-06-17 |      7362 |
| 2004-06-18 |      2512 |
| 2004-06-19 |      4299 |
| 2004-06-20 |      7802 |
| 2004-06-21 |      9264 |
+------------+-----------+

well, that took longer than expected

mysql> alter table blogs type=innodb;
Query OK, 2027382 rows affected (1 hour 32 min 41.49 sec)
Records: 2027382  Duplicates: 0  Warnings: 0

i did catch an out-of-control blog notification bot that may have been chewing up memory and otherwise getting in the way for most of that period.

inbox victory

i now have only four items in my personal inbox: three mails mentioning various books to read (two were comments here), and one about getting blogdrive.com to do an authenticated ping of blo.gs so it doesn’t need to fetch the page to verify it was updated. (something i suggested but haven’t gotten around to implementing and documenting how to do.)

people using blo.gs ping data

if you create a new blog, and only ping blo.gs, you'll soon be visited by robots from: radio vox populi (apparently twice: once from a server at the mit media lab, and another from the media lab europe), mercubot, que pasa corporation (related to this?), ibm/sequent (which will try to fetch various filenames looking for rss or atom feeds), feedster (which will also try various filenames), and blogshares.

the only one which will fetch robots.txt is the one from quepasa.com. the robots from radio vox populi and ibm/sequent only identify themselves as libwww-perl.

go mavs!

okay, so i don’t really care about the mavericks. but mark cuban has a weblog, and it has started strong. i would add it to the list of blogs i read, but it doesn’t ping blo.gs (or weblogs.com). none of the weblogs, inc. blogs do. odd.

maybe i should finally hook up the plumbing so you can add non-pinging blogs to your favorites on blo.gs.

zeldman on top

one of the things i regret not doing with blo.gs is tracking the evolution of the most watched blogs. i did at least set up things so i can generate some numbers about the growth rate of blogs (about 5000 new ones per day), updates per day (about 80,000), and users (about 20 new ones per day).

(if i were able to maintain any focus at all, being more rigorous about logging data that would be interesting to graph would be a good thing to do in just about every facet of my life. where’s my focusyn?)

blog name handling at blo.gs

so some idiot is pinging blo.gs to change the name for various other blogs in the portuguese weblogging community (which just loves to slurp up my bandwidth — but that’s a different story).

i guess i’ll have to stop trusting the name being passed on pings, and start sniffing them out of the feeds and html pages (and ignoring changes that come via weblogs.com, which is subject to the same abuse).
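sniffing the name out of an html page could look something like this. it is a minimal python sketch using only the standard library, and it assumes the blog's name is whatever sits in the first `<title>` element. `TitleSniffer` and `sniff_name` are names invented for the example, and feeds would need their own handling.

```python
from html.parser import HTMLParser


class TitleSniffer(HTMLParser):
    """Collect the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def sniff_name(html_text):
    """Return the page's title, or None if it has none."""
    parser = TitleSniffer()
    parser.feed(html_text)
    parser.close()
    return parser.title or None
```

the point is that the name then comes from the page the blog's owner controls, not from whoever happens to send the ping.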

maybe this will finally push me over the edge into rewriting the ping-handling code in python so i can more easily re-use mark pilgrim’s feed finder.

atom on blo.gs

should i add another field for a site’s atom feed, or simply rename the rss field to feed, and let people choose which format to list? adding a field would require extending the ping interface.

pinging for s9y

i sent this to the serendipity mailing list, but because sourceforget has wretched mailing list archives (they haven’t been updated since the beginning of the month for the s9y list), i have no idea whether it actually made it. anyway, here is a patch to add support for pinging blo.gs, weblogs.com, blogrolling.com, and technorati.com to s9y. it is not a very slick integration, but i figured it would be enough to get the ball rolling for the s9y developers. (the patch is against yesterday’s snapshot.)

i now return you to your irregular scheduled programming.

more than a million

at some point in the last few days (on thanksgiving, it appears), blo.gs finally hit the one million blog mark. 149,539 have not been updated in more than six months. 692,205 haven’t been updated in more than a month. 147,021 were added within the last month. there’s been about three new blogs per minute for the last seven days.

these comments from mark pilgrim about dealing with weblog comment spam are a good healthy dose of realism. being on the opposite side of a problem from people who are willing to throw more time or money at it than you are can be an uncomfortable place to be.

(around here, all comments get emailed to me, and i can trivially delete them from within my mail client. but i could count the number of comment spams i’ve gotten without running out of fingers.)

side note: “for vcs, blogging is the next big thing” who needs vc? i got my first google adsense check last week. vegas, here i come!

detecting improperly encoded text (in perl)

i need a way to detect when a string has been double-encoded into utf-8. that is, a string of utf-8 bytes that was basically treated to an iso-8859-1 to utf-8 conversion.

this will help blo.gs deal with the encoding bugs in blogger.com's changes feed. (which, unfortunately, is not consistently broken: sometimes the encoding is right, sometimes the encoding is wrong. at least, i think sometimes the encoding is right, although i can’t find any examples right now.)
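here is one way the check could work, sketched in python rather than perl. the idea is to reverse the suspected mistake: encode the string back to iso-8859-1 bytes and see whether those bytes are themselves valid utf-8. `looks_double_encoded` and `repair` are names made up for this sketch, and it is a heuristic, not a proof. a string that legitimately contains something like "Ã©" would be flagged too.

```python
def looks_double_encoded(text):
    """Guess whether text is UTF-8 that was run through an ISO-8859-1 to UTF-8 conversion."""
    if text.isascii():
        # Pure ASCII reads the same however many times it was encoded.
        return False
    try:
        # Undo the suspected bogus conversion, then check that what is
        # left over is well-formed UTF-8.
        text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside ISO-8859-1, or the
        # recovered bytes are not UTF-8: probably encoded correctly.
        return False
    return True


def repair(text):
    """Return text with one layer of double-encoding removed, if any is detected."""
    if looks_double_encoded(text):
        return text.encode("iso-8859-1").decode("utf-8")
    return text
```

correctly encoded text with accented characters almost never survives the round trip, since a lone byte like 0xe9 is not valid utf-8 on its own. that is what makes the test usable on a feed that is only sometimes broken.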

what would be even better, of course, would be for blogger to fix the bug. i reported it, and got a “we know, we hope to resolve the problem soon” response.

looks like they could take a lesson from joel spolsky's mini-tutorial on unicode. (i’ll admit to being surprised that blogger gets it wrong: i was under the impression that they used java, which i believe has pretty solid unicode support.)

matt haughey has written an excellent article about his experiences with google’s adsense on pvrblog.

for last month, the advertising on blo.gs came close to covering the hosting for the server. that’s not nearly as much money as matt is talking about for pvrblog, but i’m surprised that blo.gs pulls in anything at all, considering the site is so content-light. except for a two-day spike after the redesign, there hasn’t been a significant change in the amount of money being earned by the site.

new blo.gs design

thanks to the groovy albert reinhardt of happygofun, blo.gs finally is sporting a snappy new design.

one thing that is now buried (and basically non-existent) is a way to see the raw stream of blogs being updated. with the addition of the blogger.com info, there's just too damn many blogs for it to be useful. my numbers only go a few days back, but there's around 5000 new blogs per day being added.

the upside of not putting the stream of all updates front-and-center is that the front page is now relatively tiny. it will be interesting to see the impact on the site's bandwidth usage. (also interesting to see will be the effect of the redesign on the google adsense revenues.)

the blame for the less-than-optimal rendering on internet explorer falls squarely on my shoulders. or rather, the decision to only apply limited hacks to work around the abhorrently buggy internet explorer can be blamed on me. the fact that internet explorer can't render straightforward html and css is somebody else's fault.

the fight to get the aim bot to not get shut out continues. the trick i borrowed from jaimbot of having the bot send messages to itself proved unfruitful. eventually aim would only deliver those messages, and not any others.

no more blogsbot on yahoo?

it sounds like yahoo is retiring the old version of the instant messenger protocol, which may mean that the Net::YMSG package used by the blo.gs bot that relays info on updated blogs to yahoo! instant messenger users will stop working. (if it was working at all.)

meanwhile, the aim bot is still flaky, but i think i may have found the root of the problem, and am now figuring out how to fix it. (it takes a while for the problem to occur, so debugging and testing is slow.)