Entries tagged 'ArchiveBox'
But I still haven’t found what I’m looking for
I’m still looking for a job.
It is a new month, so I thought it was a good time to raise this flag again, despite it being a bad day to try and be honest and earnest on the internet.
I wish I was the sort of organized that allowed me to run down statistics of how many jobs I have applied to and how many interviews I have gone through other than to say it has been a lot and very few.
Last month I decided to start (re)developing my Python skills because that seems to be much more in demand than the PHP skills I can more obviously lay claim to. I made some contributions to an open source project, ArchiveBox: improving the importing tools, writing tests, and updating it to the latest LTS version of Django from the very old version it was stuck on. I also started putting together a Python library/tool to create a single-file version of an HTML file by pulling in required external resources and in-lining them; my way of learning more about the Python culture and ecosystem.
That and attending SCALE 21x really did help me realize how much I want to be back in the open source development space. I am certainly not dogmatic about it, but I believe to my bones that operating in a community is the best way to develop software.
I think my focus this month has to be on preparing for the “technical interview” exercises that are such a big part of the tech hiring process these days, as much as I hate it. I think what makes me a valuable senior engineer is not that I can whip up code on demand for data structures and algorithms, but that I know how to put systems together, have a broader business experience that means I have a deeper understanding of what matters, and can communicate well. But these tests seem to be an accepted and expected component of the interview process now, so it only makes sense to polish those skills.
(Every day this drags on, I regret my detour into opening a small business more. That debt is going to be a drag on the rest of my life, compounded by the huge weird hole it puts in my résumé.)
Grinding the ArchiveBox
I have been playing around with setting up ArchiveBox so I could use it to archive pages that I bookmark.
I am a long-time, but infrequent, user of Pinboard and have been trying to get in the habit of bookmarking more things. And although my current paid subscription doesn’t run out until 2027, I’m not paying for the archiving feature. So as I thought about how to integrate my bookmarks into this site, I started looking at how I might add that functionality. Pinboard uses wget, which seems simple enough to mimic, and I also found other tools like SingleFile.
That’s when I ran across mention of ArchiveBox and decided that would be a way to have the archiving feature I want and don’t really need/want to expose to the public. So I spun it up on my in-home server, downloaded my bookmarks from Pinboard, and that’s when the coding began.
ArchiveBox was having trouble parsing the RSS feed from Pinboard, and as I started to dig into the code I found that instead of using an actual RSS parser, it was either parsing it using regexes (the generic_rss parser) or an XML parser (the pinboard_rss parser). Both of those seemed insane to me for a Python application to be doing when feedparser has practically been the gold standard of RSS/Atom parsers for 20 years.
After sleeping on it, I decided to roll up my sleeves, bang on some Python code, and produce a pull request that switches to using feedparser. (The big thing I didn’t tackle is adding test cases because I haven’t yet wrapped my head around how to run those for the project when running it within Docker.)
Later, I realized that the RSS feed of my bookmarks would be good for pulling on a schedule to keep archiving new bookmarks, but to get everything into the system from the start I actually needed to export my full list of bookmarks in JSON format and import that.
But that importer is broken, too. And again it’s because instead of just using the json parser in the intended way, there was a hack to work around what appears to have been a poor design decision (ArchiveBox would prepend the filename to the file it read the JSON data from when storing it for later reading) that then got another hack piled on top of it when that decision was changed. The generic_json parser used to just always skip the first line of the file, but when that stopped being necessary, that line-skipping wasn’t just removed, it was replaced with some code that suddenly expected the JSON file to look a certain way.
Now I’ve been reading more Python code and writing a little bit, and starting to get more comfortable with some of the idioms. I didn’t make a full pull request for it, but my comment on the issue shows a different strategy of trying to parse the file as-is, and if that fails, skipping the first line and trying again. That should handle any JSON files with garbage in the first line, such as what ArchiveBox used to store them as. And maybe there is some system out there that exports bookmarks in a format it calls JSON that actually has garbage on the first line. (I hope not.)
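For illustration, the parse-then-retry strategy could be sketched roughly like this. This is not the code from the issue comment or from ArchiveBox itself; the function name and the sample data are made up, and it assumes any garbage is confined to the first line.

```python
import json


def parse_json_with_fallback(text):
    """Parse text as JSON; if that fails, drop the first line and retry."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Assumption: the junk (for example a prepended filename) occupies
        # only the first line, so everything after it should be valid JSON.
        _first_line, _, rest = text.partition("\n")
        # If this second attempt fails too, the error propagates to the caller.
        return json.loads(rest)


# Hypothetical sample data, not a real export.
clean = '[{"href": "https://example.com/", "description": "Example"}]'
prefixed = "/data/sources/bookmarks.json\n" + clean

assert parse_json_with_fallback(clean) == parse_json_with_fallback(prefixed)
```

A file that is already valid JSON never reaches the fallback, so the retry only costs anything when the first attempt fails.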
So with that workaround applied locally, my Pinboard bookmarks still don’t load because ArchiveBox uses the timestamp of the bookmark as a unique primary key and I have at least a couple of bookmarks that happen to have the same timestamp. I am glad to see that fixing that is on the project roadmap, but I feel like every time I dig deeper into trying to use ArchiveBox it has me wondering why I didn’t start from scratch and put together what I wanted from more discrete components.
I still like the idea of using ArchiveBox, and it is a good excuse to work on a Python-based project, but sometimes I find myself wondering if I should pay more attention to my sense of code smell and just back away slowly.
(My current idea to work around the timestamp collision problem is to add some fake milliseconds to the timestamp as they are all added. That should avoid collisions from a single import. Or I could just edit my Pinboard export and cheat the times to duck the problem.)
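A rough sketch of the fake-milliseconds idea, assuming the timestamps have already been parsed into datetime objects (the function name is hypothetical and nothing here is ArchiveBox code):

```python
from datetime import datetime, timedelta, timezone


def dedupe_timestamps(timestamps):
    """Return the timestamps in order, nudging duplicates forward 1 ms at a time."""
    seen = set()
    result = []
    for ts in timestamps:
        # Keep adding a millisecond until this value is unique within the batch.
        while ts in seen:
            ts += timedelta(milliseconds=1)
        seen.add(ts)
        result.append(ts)
    return result


# Hypothetical batch with two bookmarks saved in the same second.
base = datetime(2024, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
fixed = dedupe_timestamps([base, base, base + timedelta(seconds=5)])

assert len(set(fixed)) == len(fixed)
assert fixed[1] - fixed[0] == timedelta(milliseconds=1)
```

This only guarantees uniqueness within one batch; a later import could still collide with timestamps that are already stored.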