It's the the sixth anniversary1 of my TechCrunch Feed Filter. In Internet time, that's ages. A lot has changed. Here's a look back.
TechCrunch was a great blog about innovation and entrepreneurship. As it grew, it published more articles than I cared to read. Like many savvy blog readers, I used a feed reader to present the latest articles to me, but TechCrunch was simply too profuse.
I created a service that'd visit TechCrunch's feed, and make note of who made which articles, what the articles were about, how many comments each article had, and how many Diggs2, Facebook likes and Facebook shares each article had.
With that data, the service would determine the median, mean, standard deviation, and create a minimum threshold for whether the article merited being seen by me. The raw data is stored in a live yaml file. There were some special rules, like, "If the article is by Michael Arrington, or has "google" in the tags field, automatically show it to me." Otherwise, other readers had to essentially vote the article high enough for it to pass the filter.
In the picture above, you can see that two posts out of seven met the criteria to be in the filtered feed. They're the ones with the gold stars. The threshold was calculated to be 116 shares, and you can see in the graph when each article had more than the threshold. (There's a red circle at the point the green shares line rose above the blue area that designates the criteria level.)
Once the service knew which posts were worthy of my attention, it listed them in its own filtered feed.
Changes over Time
In the beginning, TechCrunch used WordPress's commenting system. As such, its feed included the slash:comments tag. At the time, that was the best metric of how popular a TechCrunch post was, better than Facebook shares. But TechCrunch started experimenting with different commenting systems like Disqus and Facebook comments to combat comment spam. Neither of those systems used a standard mechanism to get comment counts, so every time it changed commenting systems, I had to change my service.
Digg, whose Diggs were once a great metric of the worthiness of a TechCrunch blog post, faded away. So I had to stop using Diggs.
So that left Facebook's metrics. They weren't ideal for assessing TechCrunch articles, but they were all that was left. Using Facebook likes and shared worked for a while. And then Facebook changed their APIs! They once had an API, FQL, that let you easily determine how many likes and shares an article had. The killed that API, leaving me with a slightly more complicated way to query the metrics I need for the service to do its work.
Not The End
I've had to continuously groom and maintain the feed filter over these past six years as websites rise, fade, and change their engines. And I'll have to keep doing so, for as long as I want my Feed Filter to work. But I don't mind. It's a labor of love, and it saves me time in the long run.
2 Remember Digg? No? Young'un. I still use their Digg Reader.
In June 2013, I investigated various online backup solutions, and decided on a hybrid solution: Use the free 5GB of iCloud and then backup the rest to DreamObjects.
That didn't work out.
- boto-rsync only uses file-size to decide to update a file. There'd be too many false negatives for my comfort level.
- I was still using DreamHost Backup which uses rsync proper, and the first 50GB were free. At that price, it was irresistible.
- If I used duplicity, I wasn't comfortable (yet) with the cost associated with the incremental backups to DreamObjects.
It's a new world. And it became time to revisit my remote backup strategy.
- Windows 10 will support Bash soon! This means I could use the same rsync scripts from each of my devices whether Raspbian, other Linux, Mac OS, or Windows. (And without having to use cygwin!)
- "If you're not the customer, you're the product." I observed this fact when I used OpenPaths, and its service became unavailable. And again while I was backing up to DreamHost Backup, and it was discontinued. No more free remote rsync for me!
And then I discovered rsync.net. I don't know why I hadn't noticed them before, they've been around for years. It turns out that theirs is the service that I've been waiting for.
- I'm the customer. I pay for the storage, so they're accountable to me. I've only heard good things about their customer support.
- And as far as rsync itself is platform agnostic, so is their service. And I'm already using rsync for local backup on all my computers. Setting rsync.net as my remote backup was very easy.
- I already treat confidential data specially, so even on local disk it's encrypted.
I'm really happy with rsync.net, and my Windows environment is going to be even more straightforward when I can use Windows Subsystem for Linux (WSL). The devices I'm still not certain about are the family's phones. Here's what we're doing so far.
- We're doing rare full phone backups to local computers.
- We're using Google Photos apps to sync photos more frequently from the phone to the cloud. (We're Google's product. It's a deal with the devil, I know.)
- I use a different source for Contacts and Calendar data, and the phone is merely a sync target, never a source.
Two oddities that WSL might make better
I have to have two strange lines of code because I use CygWin and rsync to back things up. Applications like PuTTY, which store preferences in the Windows Registry, and FileZilla which store preferences in %APPDATA% need special treatment.
I copy PuTTY's settings to my Documents folder.
regedit /e "%USERPROFILE%\Docs\PuTTY.reg" HKCU\Software\SimonTatham
I backup FileZilla's settings explicitly.
rsync $(cygpath $APPDATA/FileZilla) email@example.com:APPDATA/
Who knows? Maybe there'll be another update about my backup strategy in the future.
An RSS or Atom feed is a live document of recent events or items. For example, there are feeds of Twitter tweets, Facebook posts, and YouTube videos. Usually, other programs track the feeds, and then let you know when there's something new.
You can do other things with feeds, too. I've collated a few of my personal feeds into a Lifestream that updates automatically. It's a very convenient hands-off diary.
Companies like Twitter and Facebook have been trying to figure out if they should monetize feeds. They already inject ads into their main user interfaces (into Facebook's users' walls, and onto the Twitter timeline, for example), but what should they do about the feeds for the same information? So far, they've left feeds alone.
This week, Delicious tried injecting ads into its users' bookmarks feeds. The bookmarks feeds are activity feeds, because they reflect actions that the users have taken. On Delicious, users bookmark sites that they find noteworthy. So these new ads are clearly marked as Sponsored items, because otherwise it would look like everyone suddenly explicitly bookmarked the ad's product. That'd be straight up deception.
Still, there are problems with the way that Delicious is embedding ads into their users' activity feeds. To ensure that the ads are at the top of the list, they're always coded with a pubDate ("Published Date") of the current time. This goes against the original intent of these feeds, where each item's pubDate doesn't change if there's no actual change to the item itself. Old ads should get old, too.
Instead what happens is that every time the feed is fetched, since there's always a "new" item (the same old ad, but with a new, just-now pubDate), is that Delicious can no longer return HTTP code 304, which means, "I'm not providing the feed again, it hasn't changed since the last time you got it."
This is causing Delicious to be slammed with having to deliver full feeds for every request, and for all the clients to have to process what look like brand-new feeds. Beyond that, people who auto-tweet what they bookmark are all inadvertently tweeting ads now.
How did I discover this in the first place? My Lifestream started journalling that I've been repeatedly bookmarking the same ad over and over again.
That's not the way to monetize activity feeds.
Last week, there was an insinuation that I had died. In a way, I made the insinuation myself. It was produced by a five year old script that runs continuously on its own. ("But wait, that linked blog post isn't five years old!" you might say. Actually, the script had been cooking for a couple of years before I wrote the post.)
So what happened? The script couldn't detect any online activity from me for three days, and so sent out preliminary notification that it was worried about me. This worked perfectly. … Well, perfectly for technology that'd been hibernating for five years (which is more like fifty years in Internet time) and tried to talk with entities in the new world into which it emerged.
What happened to my script, waking up to a changed world, reminds me of the home automation in There Will Come Soft Rains. (It's a great little story. Go ahead and find it and read it. I'll wait.)
The travails of my little Rip Van Winkle Dead Man's Switch presents us with an opportunity. Let's recount what really happens when you write a web app that's supposed to survive on its own for an indefinite number of years. There are a lot of moving parts, let's see how they tie together.
In 2008, WordPress's dashboard had taken a turn for the worse. Michael Heilemann published a handful of screencasts perfectly illustrating the bad UX decisions WordPress had made and had become an Emeritus Member of the Habari Project Management Committee. Habari was an ascendant 2nd generation blogging platform that had learned from its predecessors. Its native remote API was the similarly ascendant AtomPub. WordPress also supported AtomPub.
In 2010, I wrote my Dead Man's Switch (DMS), and used AtomPub to have it publish to my two blogs (one Habari, one WordPress). This was just one of many foundational decisions I made based on the trajectory of technologies at the time. I ensured the DMS worked, and then left it to do its work: watch over me.
In 2015, it triggered. And couldn't do its job..
While it was idling, the world was changing. Here are some things that had changed:
- DreamHost is migrating its shared hosting accounts to PHP 5.5 and 5.6. 5.6 is a breaking change for certain XMLRPC implementations. XMLRPC's MetaWeblog is the primary alternative to AtomPub. This would have broken XMLRPC if I had chosen it for my DMS. Lucky I chose AtomPub. Right?
- WordPress dropped support AtomPub in 2012, in version 3.5. Apparently it wasn't so ascendant. This did break compatibility with the DMS.
- Twitter changed from using basic authentication to OAuth. I caught this. But later it changed its API protocol from HTTP to HTTPS. Sensibly, the HTTP URLs did not forward to their new analogues. But this did break compatibility with the DMS.
- Netflix dropped support for their Disks at Home feed. Until I fixed it, this prevented the DMS from noticing that I was actively watching Netflix movies. I wrote a replacement and made it open source.
- last.fm scrobbling stopped working. This prevented my DMS from noticing which songs and podcasts I was listening to.
- YouTube dropped its awesome support for RSS. Once upon a time, they supported various custom feeds. It was super handy. Of course, without it, my DMS can no longer tell what I've been watching or favoriting.
- Amazon dropped support for its wishlist API. So my DMS wouldn't know if I've added something to my wishlist.
- DreamHost dropped support for SMTP port 25. This would've prevented my DMS from emailing had I not caught it.
Here are some things that remained the same, for better or for worse:
- Amazon's acquired Shelfari site never added support for RSS. Once upon a time, Amazon was going to web-API everything. How would my DMS know when I've read a book?
- Amazon's acquired IMDB never completed its official API, either.
- Google+ created a read-only API. There's still no API to post messages, so my DMS wouldn't be able to post on my behalf..
- DreamHost was still there, running my DMS, and it successfully woke up.
- Sites like Facebook and Twitter were still there to watch and post to.
- Python 2 still has some support and incremental updates did not break the DMS.
I can't say that watching these breaking changes occur is unexpected. But I think there's value in documenting a real use case of web software that's meant to last on its own for so long. With any luck, I'll be here in another five years to give another status update. If not, hopefully my DMS can fill you in for me.
"Give. Me. The data."
That's what many Quantified Self adherents said back in 2011 when Apple admitted that they accidentally had their phone save a database of geopositional data over recent past months. Apple promised to delete most of the database in the next update.
That bug didn't sound bad to me, it sounded awesome! I grabbed that database as quickly as I could, and looked for a service to continue the geoposition scraping so I could implement a location predictor.
Users can securely store and manage their personal location data, and grant researchers access to portions of that data as they choose.
That's quite the magnanimous service! Neither the users who run the app nor the researchers who gather the data are paying anything for the service. We're users, but we're not paying customers. The NYTLabs must be running it out of the goodness of their own hearts, and the users and researched are indebted to them.
The downside became apparent pretty quickly. As there are no paying customers, the service doesn't get a lot of monitoring or attention. API calls can take a minute or more. Sometimes they fail, and sometimes the service simply goes down for days. The only appeal we can make as users are emotional appeals. We have no financial leverage.
Given that I'm not the customer of OpenPaths, but my data is the product, I went searching for a redundant service where I am the customer. I came across FollowMee, another app with the same feature set. It's a paid service, and the developer is responsive on their bulletin board. I've installed it as a backup and potential replacement for OpenPaths.
There's a saying that generally holds true of online services:
If you're not the customer, you're the product.
That's why I'm running both the free service OpenPaths, to share my geopositional data with researchers, and the paid service FollowMee, for some assurance the service will work while I pay for it. I'll continue to run FollowMee along OpenPaths while they both do their respective jobs.