February 25, 2024 archives
Grinding the ArchiveBox
I have been playing around with setting up ArchiveBox so I could use it to archive pages that I bookmark.
I am a long-time, but infrequent, user of Pinboard and have been trying to get in the habit of bookmarking more things. And although my current paid subscription doesn’t run out until 2027, I’m not paying for the archiving feature. So as I thought about how to integrate my bookmarks into this site, I started looking at how I might add that functionality. Pinboard uses wget, which seems simple enough to mimic, and I also found other tools like SingleFile.
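For a sense of what that wget approach involves, here is a minimal sketch of fetching a page along with its assets; the flags are my guess at a reasonable invocation, not Pinboard’s actual setup:

```python
import subprocess
from pathlib import Path

def archive_with_wget(url: str, dest: Path) -> None:
    """Fetch one page plus the assets needed to render it offline."""
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "wget",
            "--page-requisites",   # also grab images, CSS, and JS
            "--convert-links",     # rewrite links to the local copies
            "--adjust-extension",  # add .html etc. so files open locally
            "--directory-prefix", str(dest),
            url,
        ],
        check=True,
        timeout=120,
    )
```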
That’s when I ran across a mention of ArchiveBox and decided it would be a way to have the archiving feature I want without having to expose it to the public. So I spun it up on my in-home server, downloaded my bookmarks from Pinboard, and that’s when the coding began.
ArchiveBox was having trouble parsing the RSS feed from Pinboard, and as I started to dig into the code I found that instead of using an actual RSS parser, it was either parsing the feed with regexes (the generic_rss parser) or with an XML parser (the pinboard_rss parser). Both of those seemed insane to me for a Python application to be doing when feedparser has practically been the gold standard of RSS/Atom parsers for 20 years.
After sleeping on it, I rolled up my sleeves, banged on some Python code, and produced a pull request that switches to using feedparser. (The big thing I didn’t tackle is adding test cases, because I haven’t yet wrapped my head around how to run them for the project when running it within Docker.)
Later, I realized that while the RSS feed of my bookmarks would be good for pulling on a schedule to keep archiving new ones, I actually needed to export my full list of bookmarks in JSON format and use that to get everything into the system from the start.
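For context, Pinboard’s JSON export is a flat array of bookmark objects; loading it looks like this, with the field names shown as I remember them from my export (treat them as illustrative):

```python
import json

with open("pinboard_export.json", encoding="utf-8") as f:
    bookmarks = json.load(f)

# Each record looks roughly like:
# {"href": "https://example.com/", "description": "Title",
#  "extended": "My notes", "tags": "python archiving",
#  "time": "2024-02-25T16:10:00Z"}
for bm in bookmarks:
    print(bm["time"], bm["href"])
```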
But that importer is broken, too. And again it’s because instead of just using the json module in the intended way, there was a hack to work around what appears to have been a poor design decision (ArchiveBox would prepend the filename to the file it read the JSON data from when storing it for later reading), and then another hack got piled on top of it when that decision was changed. The generic_json parser used to always skip the first line of the file, but when that stopped being necessary, the line-skipping wasn’t just removed; it was replaced with code that suddenly expected the JSON file to look a certain way.
Now I’ve been reading more Python code, writing a little bit, and starting to get more comfortable with some of the idioms. I didn’t make a full pull request for it, but my comment on the issue shows a different strategy: try to parse the file as-is, and if that fails, skip the first line and try again. That should handle any JSON file with garbage in the first line, such as the ones ArchiveBox used to write. And maybe there is some system out there that exports bookmarks in a format it calls JSON that actually has garbage on the first line. (I hope not.)
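Here is a minimal sketch of that fallback strategy, my own rendition rather than the exact code from my comment:

```python
import json

def load_json_lenient(path: str):
    """Parse a JSON file, tolerating one line of leading garbage."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back: older ArchiveBox output prepended the filename as
        # a first line, so drop line one and try exactly once more.
        first_newline = text.find("\n")
        if first_newline == -1:
            raise
        return json.loads(text[first_newline + 1:])
```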
So with that workaround applied locally, my Pinboard bookmarks still don’t load, because ArchiveBox uses the timestamp of the bookmark as a unique primary key and I have at least a couple of bookmarks that happen to share the same timestamp. I am glad to see that fixing that is on the project roadmap, but I feel like every time I dig deeper into trying to use ArchiveBox, it has me wondering why I didn’t start from scratch and put together what I wanted from more discrete components.
I still like the idea of using ArchiveBox, and it is a good excuse to work on a Python-based project, but sometimes I find myself wondering if I should pay more attention to my sense of code smell and just back away slowly.
(My current idea for working around the timestamp-collision problem is to add some fake milliseconds to the timestamps as they are added. That should avoid collisions within a single import. Or I could just edit my Pinboard export and cheat the times to duck the problem.)
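A sketch of that milliseconds idea, assuming the timestamps come in as epoch-seconds strings, which is roughly how ArchiveBox keys snapshots (the function name is mine):

```python
def dedupe_timestamps(timestamps: list[str]) -> list[str]:
    """Nudge duplicate timestamps apart by fake milliseconds."""
    seen: set[str] = set()
    result = []
    for ts in timestamps:
        candidate = ts
        bump = 1
        # While this key is already taken, add another millisecond.
        while candidate in seen:
            candidate = f"{float(ts) + bump / 1000:.3f}"
            bump += 1
        seen.add(candidate)
        result.append(candidate)
    return result

print(dedupe_timestamps(["1708873800", "1708873800", "1708873800"]))
# -> ['1708873800', '1708873800.001', '1708873800.002']
```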