Every few years, disparate web services seem to conspire to shutdown or break at the same time. It happened again last month. Some of them you may have heard of, but the others broke or disappeared without any prior announcement. I've been through this sort of thing before, so I knew it'd eventually take a little work to keep my services humming along again.
Google Plus announced that they'd shut down their service ten months before doing so. That gave me a few months to avoid addressing the issue, and then eventually getting around to backup up my posts. I love my backup, it's so much faster to load and navigate than Google's service.
Google Image Charts
Google Image Charts gave us seven years to prepare for their shutdown. They even staged one short and one long outage to draw our attention to the upcoming closure. I ignored it the whole time, and then scrambled to find a replacement after the permanent closure.
FourSquare Activity Feeds
FourSquare shut down their user activity feeds without any notice. They were in a sort of limbo as they tracked Swarm activity, but the URLs were still FourSquare. The feed was very important to me because I maintain a lifestream. I became a FourSquare developer to access my content, and discovered that I could make a more informative feed on my own than the feed they'd just killed. Yay! That turned out to be an unexpected win.
Internal Server Errors
My shared web hosting service started suffering intermittent "internal server errors" (HTTP code 500) at the same time as the above changes. I turned off FastCGI, but that didn't help. Then I upgraded PHP from 5.6 to 7.2, which broke my blog engine (Habari), so I fixed that, but the "500"s kept happening. As a last resort, DreamHost moved my site from Virginia to some new servers in Oregon. That fixed the issue, and even better, upgraded my account to a Linux distro that uses Python 3.6! Yay, that means I finally get to use f-strings at DreamHost! DreamHost was the last place where I couldn't use f-strings, so I'm really happy that they've caught up.
Despite the handful of shutdowns and failures, I ended up with a couple of unexpected wins over the previous situation, so I'm pretty happy about that.
Here's a puzzle my son put in my Father's Day card. He wrote, "In the figure below, fill in each of the sixteen numbers from 1 to 16 such that the four rows and three columns (use the lines as a guide) add up to 29."
You can tell from the pencil markings that I tried to solve it by hand. But it wasn't long before I thought about trying to solve it with a Python script. There's a naïve script that you could write:
for permutation in itertools.permutations(range(1, 17)): if rows_and_cols_add_up(permutation): print permutation
That one just tries every single ordering of the numbers and then presumably checks things like sum(p, p, p) == 29 in the testing function. The only problem with that approach is that the potato I'm using for a computer might not finish going through the permutations before the next Father's Day. (No, really. There are over 20 trillion permutations to check.)
Fast but Not Exactly Correct
So I decided to attack it from another angle, just look at the rows first. There are four rows, let's call them rows a, b, c, and d from top to bottom. Row a has a 3-tuple in it, b has a 4-tuple, c has a 4-tuple, and d has a 3-tuple. I wrote an algorithm that first finds all the 3-tuples and 4-tuples that add up to 29.
From those sets of valid 3-tuples and 4 tuples, the algorithm chooses a 3-tuple, a 4-tuple, another 4-tuple, and finally a 3-tuple. It assures that there are no duplicate numbers across any of the tuples. Then if that's true, it checks the three columns, and if they add up to 29, it prints the winning results.
And voilà, it worked, and demonstrated that there are multiple solutions, and found them in mere seconds! I saved that initial implementation as a gist. I thought I was done, but the code was bugging me. I knew that while it reflected the way I approached the problem, it was inefficient (it did too many comparisons), it included duplicate solutions (that's the "wrong" part of it), and it wasn't Pythonic.
Correct but Slower
So, a few hours later, I refactored the code, and realized I could replace four nested "for" loops (one for each row) with two loops that each choose two items from their containers (one container of 3-tuples, and one of 4-tuples). And I could check for no-duplicates on one line using a set() instead of explicitly checking for duplicates with "any(this in that)" calls.
The refactor replaced 22 lines of deeply nested code with four lines of simpler code. Not only were there fewer lines, I hoped the program became more efficient because it's iterating fewer times. That's the best kind of change! It's smaller, easier to reason about, and should be faster when executed. It's very satisfying to commit a change like that! The changes to the gist can be seen here:
The red lines are being replaced by the green lines. It's nice to see lots of code being replace by fewer lines of code.
Correct and Fast
I was wrong about it being faster!
Profiling showed that while the change above was more correct because it didn't include duplicate results, it was inefficient because it lacked the earlier pruning between the for loops from the first try. Once I restored an "if" call between the two new for loops, the program became faster than the first run.
for b, c in itertools.combinations(t4, 2): if len(set(b + c)) == 8: for a, d in itertools.combinations(t3, 2): if len(set(a + b + c + d)) == 14: test_rows(a, b, c, d)
It's very rewarding to bend your way of thinking to different types of solutions and to be able to verify those results. This feeling is one of the reasons I still code.
Here's the final version of the solution generating code.
TechCrunch was a great blog about innovation and entrepreneurship. As it grew, it published more articles than I cared to read. Like many savvy blog readers, I used a feed reader to present the latest articles to me, but TechCrunch was simply too profuse.
I created a service that'd visit TechCrunch's feed, and make note of who made which articles, what the articles were about, how many comments each article had, and how many Diggs2, Facebook likes and Facebook shares each article had.
With that data, the service would determine the median, mean, standard deviation, and create a minimum threshold for whether the article merited being seen by me. The raw data is stored in a live yaml file. There were some special rules, like, "If the article is by Michael Arrington, or has "google" in the tags field, automatically show it to me." Otherwise, other readers had to essentially vote the article high enough for it to pass the filter.
In the picture above, you can see that two posts out of seven met the criteria to be in the filtered feed. They're the ones with the gold stars. The threshold was calculated to be 116 shares, and you can see in the graph when each article had more than the threshold. (There's a red circle at the point the green shares line rose above the blue area that designates the criteria level.)
Once the service knew which posts were worthy of my attention, it listed them in its own filtered feed.
Changes over Time
In the beginning, TechCrunch used WordPress's commenting system. As such, its feed included the slash:comments tag. At the time, that was the best metric of how popular a TechCrunch post was, better than Facebook shares. But TechCrunch started experimenting with different commenting systems like Disqus and Facebook comments to combat comment spam. Neither of those systems used a standard mechanism to get comment counts, so every time it changed commenting systems, I had to change my service.
Digg, whose Diggs were once a great metric of the worthiness of a TechCrunch blog post, faded away. So I had to stop using Diggs.
So that left Facebook's metrics. They weren't ideal for assessing TechCrunch articles, but they were all that was left. Using Facebook likes and shared worked for a while. And then Facebook changed their APIs! They once had an API, FQL, that let you easily determine how many likes and shares an article had. The killed that API, leaving me with a slightly more complicated way to query the metrics I need for the service to do its work.
Not The End
I've had to continuously groom and maintain the feed filter over these past six years as websites rise, fade, and change their engines. And I'll have to keep doing so, for as long as I want my Feed Filter to work. But I don't mind. It's a labor of love, and it saves me time in the long run.
2 Remember Digg? No? Young'un. I still use their Digg Reader.
I'm going to retire a technical interview question that I've had ready to ask, but turns out to probably be too difficult. It's more a wizard-level interview question, or a casual technical discussion with peers who don't feel like they're under the gun.
How many ways are there to break a U.S. dollar into coins?
It seemed appropriate to ask this to some candidates, because once you know what you're trying to do, the algorithm in Python naturally reduces down to five or six lines of code.
def ways_to_break(amount, coins): this_coin = coins.pop(0) # If that was the last coin, only one way to break it. if not coins: return 1 # Sum the ways to break with the smaller coins. return sum([ways_to_break(amount - v, coins[:]) for v in range(0, amount+1, this_coin)]) print ways_to_break(100, [100, 50, 25, 10, 5, 1])
You could click on the gist with more comments and better variable names if you want to try to better understand that. It's not meant to be production code, it's a whiteboard snippet made runnable.
Here's what I hoped to look for, from the candidate: After maybe clarifying what I wanted, ("what about the silver dollar or half-dollar coins? Did you want permutations or combinations?"), the problem should seem well-defined, but hard. Then, I'd hope the candidate would break it down to the most trivial cases: "How many ways are there to break 5¢ if you have pennies and nickels? How about 10¢ with pennies, nickels and dimes?" Can these trivial cases be used to compose the bigger problem?
At that point, different areas of expertise would appear. Googlers might start jumping towards MapReduce, and already figure that Reduce is simply the summation function because the question is "how many", and have the thing worked out at scale.
Imperative programmers, having thought about the simplest cases, probably clue in that there's an iterative approach and a recursive approach.
Pythonistas who prefer functional programming probably have reduce(lambda x, y: x+y [x, y in ...]) ready to go.
That got me thinking. I didn't see the need for reduce and operator.add and enum. My solution above isn't above criticism, but I think it's easier to read than some of the alternatives. In truth, it really depends on who that reader is, and how their experience has conditioned them.
While writing the algorithm, I searched the question online, and was pleasantly surprised to see Raymond Hettinger mentioned for solving the puzzle. He's a distinguished Python core developer. Too bad he probably didn't have Python then, in 2001, to help him solve the problem in a scant few lines.
Taking it up a Notch
Adam Nevraumont proposed the following challenge:
Drop the penny, and try to break $1.01. The Canadian penny has been retired, after all.
He provided the answer: Check that the remaining coin can break the amount. Don't assume it can. This also relieves the restriction that the container of denominations be ordered!
def ways_to_break(amount, coins): this_coin = coins.pop() # If that was the only coin, can it can break amount? if not coins: if amount % this_coin: return 0 else: return 1 # Sum the ways to break with the other coins. return sum([ways_to_break(amount - v, coins[:]) for v in range(0, amount+1, this_coin)]) print ways_to_break(100, [1, 100, 10, 50, 25, 5])
(It's actually more efficient to pop the largest denominations first. But I wanted to show that you can use an unordered container now.)
There's a trope about the dad who works constantly on his favorite old piece-of-junk project in the garage. He may never get it running, but getting it running is not the only point of the project. A bigger point is that it's the thing he goes to to clear his head and recharge. It's something that he can focus intently on, and it's something that's entirely in his domain.
I have a number of slow-burning projects at home, too. But they're not in the garage, they're in the cloud. Usually, they're web services. At work, as part of a team, I write code that goes on embedded devices. But at home, the entire product is mine. It's my chance to be a "full stack" engineer, the CEO, and principal customer all at once. It's nice to take on these different roles.
Updating a File
Let's take a quick look at updating a file. It's one of the most useful and common operations in programming, so it's really very well understood. A naive Python snippet to update a file with new data would look like the following:
with open(filename, 'wb') as f: f.write(data)
What's beautiful about it is that it's cross-platform, and Python's "with" statement takes care of closing the file that was opened, even if an error occurs when writing the data.
But when you deploy code into the real world, you have to dive deeper, and think about what really can go wrong. For example, the automatic closing of the file I mentioned above? Python flushes data to the file, but doesn't necessarily sync the file to the physical disk. And in my case, it really does have to be cross platform, and I can only have one process access the file at a time, and I don't want IO errors corrupting the data. That means finding a cross-platform file lock, doing all the work on the side, and atomically moving the side file to the production file. The following snippet (elaborated a little more at this gist) fixes those problems.
with filelock.FileLock(filename): with tempfile.NamedTemporaryFile(mode='wb', dir=ntpath.dirname(filename), delete=False) as f: f.write(data) f.flush() if platform.system() == "Windows": os.fsync(f.fileno()) # slower, but portable if os.path.exists(filename): os.unlink(filename) # or else WindowsError else: os.fdatasync(f.fileno()) # faster, Unix only tempname = f.name os.rename(tempname, filename) # Atomic on Unix
On the one hand, it's no longer a two-liner. On the other hand, I've got a function that works in all the environments I need it to, and I know exactly how it's going to behave in any exceptional situation.
Removing a Photo from a Web Album
My latest project is a web-based photo album. I could pay Flickr, Google or Apple for their web albums, of course, but they've made UX changes I don't like, and their costs are too high. So, I'm writing my own. It'll never be as polished as the production photo albums, but it'll be my jalopy, and I'll be able to tweak it in any way I choose.
One of the things my album owners need to do is delete photos they no longer want in the album. So, how should I implement that?
That's the command to delete a file. You know it's not going to be that easy. Since we're talking about a web album, we'd want to consider a few things. We want an ideal user experience. It has to be fast, they have to be able to change their minds later, and concurrency can't be a problem. Here are some of the steps involved in deleting a photo from the album.
- Schedule the deletion of the entry in the local metadata file. It'll be done in a side thread or at a later time. Just make sure the user's web page's data refreshes quickly.
- Add a new entry to a list of timestamped-files-to-be-deleted-in-a-month.
- Run a cronjob that works on a regular cycle that actually does timestamp checking and the physical file deletion. (It's actually an S3 key deletion, which gets propagated to multiple cloud static-file servers behind the scenes. Even more remote complexity is encapsulated behind a single function call.)
This is fun! Amirite? This is what working on your own personal project is all about. So let's dive a little more deeply into step 2 above.
Removing an Item from a Container
The task of removing the photo data from a list can be simplified to the task of removing a record from a container (l) when the first field of that record matches a certain criteria (s).
There's the loop:
for item in l: if item == s: l.remove(item) break
That's a straight forward imperative language neutral non-Pythonic implementation. Iterate across the container until you find the item, and then delete it right away. Unfortunately, l.remove(item) is another O(n) function being called inside an iteration of the list. That can be fixed with the enumerate call:
for idx,item in enumerate(l): if item == s: del l[idx] break
This is better. The enumerate() call returns an index that allows us to use the del call which takes only O(1) time.
Although that'd work, let's try to find a more modern and efficient solution. Use the enumerate call in a generator that returns the index of the item to be deleted:
idx = next((i for i,v in enumerate(l) if v == s), None) if idx is not None: del l[idx]
The generator call is very efficient, and maybe we're done. No we're not! You see, it's a trick question. The actual answer is:
DELETE FROM l WHERE f = s LIMIT 1;
When it's your own project, you can change the domain! The data never had to be in a container that Python could process. Why not store it in a SQL database? Or, hey, just for grins, why not keep it in Python, but why was the container accessed like an unordered list? Was it ordered? Then do a binary search.
idx = bisect.bisect_left(l, [s, ]) if idx != len(l) and l[idx] == s: del l[idx]
Nice, O(log n). Or, wait. Maybe it doesn't have to be ordered. Each photo has a unique filename. That's a key. Each photo's data could be a record in a Python set. Python has a "set" container that resembles mathematical sets with all the performance features you'd hope for. So let's make "l" be a set, and "r" be the row that has s in it, then...
Yay, that'll usually occur in constant order time! So have we found the best solution?
Of course not. We can make the theoretical problem as simple as we like. But in practice, the web album sometimes wants a database with different primary keys, sometimes it wants an ordered list of items, and sometimes it only needs a set of unique items.
For that matter, sometimes my user will be on a desktop computer, sometimes they'll be on a tablet, and often they'll be on their phone. That's a lot of CSS to experiment with.
That's what fiddling with your own personal project is all about! Dive deeply into whichever problem piques your interest at that moment. Make something work better. Even if the rest of the world doesn't know why you bother. Sometimes it's just what you need to be doing.