I made a web page that predicts where I'm going. It was a fun little academic exercise, and I thought some of the challenges were interesting.
Ever since it came out that Apple was tracking and keeping location data on everyone's iPhones, and that owners could get to that data, my curiosity was piqued. Apple was quick to stop tracking so much data and to limit access to the database, but I saved off my phone's database of locations while I could. Later, I installed a new app, OpenPaths, that intentionally and continuously logs my phone's locations and makes that data available to me.
Now that I was logging my phone's locations in the background, I could ask myself what I wanted to do with the data. I knew what I wanted to do right away! I wanted the computer to answer the following question:
Given where I've been, where am I probably going now?
I'd make a webpage whose only job it was to display its answer to that question. It's a simple web page to look at, but the devil's in the details.
The human brain is wonderfully better at answering that sort of question than the computer. It's a matter of pattern recognition across at least three dimensions: time and two-dimensional space.
I started with the way I'd think about answering that question:
If there aren't any notable exceptions, like travelling for work or vacation, then I follow a bi-weekly schedule more closely than a weekly schedule. So I'd look at where I was at this time of day two weeks ago.
To translate that into an algorithm for the webpage, a few things need to happen. I have to codify what a "notable exception" is. Perhaps it's being more than 100 miles away from home for more than a day or two. Or, if I'm currently away on vacation, then the program shouldn't be looking at what I was doing two weeks ago at home.
Here's the algorithm that the computer uses:
- If I was near here two weeks ago, consider where I was going back then at this time.
- Otherwise, consider where I was at this day of the week last week.
- If I was away on each of those occasions, then how about where I was yesterday at this time?
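A minimal sketch of that fallback, not the site's actual code, might look like the following. It assumes the history is a list of `(datetime, lat, lon)` samples; `position_at`, `is_near`, `pick_reference_time` and the `NEAR_DEGREES` threshold are hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timedelta
from math import hypot

NEAR_DEGREES = 0.5  # hypothetical threshold for "near here", in degrees


def position_at(history, when):
    """Return the (lat, lon) of the sample closest in time to `when`, or None.

    A real version would probably reject samples that are too far away in time.
    """
    if not history:
        return None
    _, lat, lon = min(history, key=lambda sample: abs(sample[0] - when))
    return (lat, lon)


def is_near(a, b, limit=NEAR_DEGREES):
    """Crude planar closeness test on raw latitude/longitude degrees."""
    return hypot(a[0] - b[0], a[1] - b[1]) <= limit


def pick_reference_time(history, now, here):
    """Try two weeks ago, then last week; fall back to yesterday."""
    for days in (14, 7):
        then = now - timedelta(days=days)
        spot = position_at(history, then)
        if spot is not None and is_near(spot, here):
            return then
    return now - timedelta(days=1)
```

The function only picks which moment in the past to consult; reading the movement that followed that moment is a separate step.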
Simple enough. But the question of "two weeks ago at this time of day" is itself a bit ambiguous. Two things get in the way that the human brain just figures out automatically. One, time-of-day is local. If I am in California today, but I was in Hawaii two weeks ago, I can pretty easily calculate "breakfast time" for either. But the computer would first have to translate latitude and longitude coordinates into time zones on the Earth. Then it'd be able to calculate relative time-of-day for either week at either location. Two, daylight savings time throws off the way a program might naïvely calculate "two weeks ago."
Account for daylight savings time
"Two weeks ago at this time of day" is a loaded phrase. The naïve approach for a computer that keeps track of time by incrementing seconds would be to subtract the number of seconds in a day, and do that 14 times. But that doesn't account for daylight savings time, which would throw off the results for 4 weeks out of a year.
My program uses Python, which has a library to translate time from epoch timestamps (which are used by OpenPaths' library) to a calendar date and time-of-day format that's more native to the human mind. So the actual code calculates "number-of-seconds to this time-of-day two weeks ago" as follows:
now - time.mktime( (datetime.fromtimestamp( now ) - timedelta( days = 14, hours=0 )).timetuple() )
Huh, that's sort of wordy compared to the naïve alternative, but the important thing is that this approach is always correct. Once we've figured out when "two weeks ago" actually is, then we can calculate what "where" and "how far" actually mean...
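The one-liner above leans on the server's local time zone. As a sketch of the same idea with the time zone made explicit (which would also cover the California-versus-Hawaii case), something like this could work. The function name `seconds_back_to_same_wall_time` is hypothetical, and the caller is assumed to supply a `tzinfo` for the place in question.

```python
from datetime import datetime, timedelta, tzinfo


def seconds_back_to_same_wall_time(now_ts: float, tz: tzinfo, days: int = 14) -> float:
    """Seconds between now and the same wall-clock time `days` days ago in `tz`."""
    local_now = datetime.fromtimestamp(now_ts, tz)
    # Subtracting a timedelta from an aware datetime keeps the wall-clock time
    # and the tzinfo; the UTC offset is re-evaluated for the earlier date.
    local_then = local_now - timedelta(days=days)
    return now_ts - local_then.timestamp()
```

With `zoneinfo.ZoneInfo("America/Los_Angeles")` (Python 3.9 and later) and a `now_ts` that falls at 8:00 a.m. on March 18, 2012, this returns 1,206,000 seconds. That's an hour short of the naïve 1,209,600, because the clocks sprang forward on March 11.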
If you're near the equator, then calculating short distances using latitude and longitude can be approximated by an equation based on the Pythagorean theorem. What's funny is that if you search the web looking for the algorithm in Python, you'll usually see something like the following function:
def distance(p1, p2): return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)
But as long as you're importing the math module anyway, it'd be even more direct if you just used math's own "hypot" (short for "hypotenuse") function:
def distance(p1, p2): return math.hypot(p1[0] - p2[0], p1[1] - p2[1])
But that calculates planar distance, and we're not on a plane. We're essentially on a sphere. So it's better to use the Haversine formula if we want to get an accurate distance between two points defined by latitude/longitude coordinates.
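For reference, here's a compact sketch of the Haversine formula. It assumes points are `(latitude, longitude)` pairs in degrees and treats the Earth as a sphere with a mean radius of 6371 km; the name `haversine_km` is just for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the Earth isn't a perfect sphere


def haversine_km(p1, p2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    # Clamp to 1.0 so rounding error can't push asin() out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

A quarter of the way around the equator, `haversine_km((0, 0), (0, 90))`, comes out to a quarter of the circumference, about 10,008 km.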
Now that timedelta and the Haversine formula handle the "when" and "where" in my fuzzy algorithm, it's time to take a look at the presentation of the data.
The Webpage Itself
So much for the algorithm. What about the quality of the webpage itself?
It's a small webpage. So it's only sensible and intuitive that it'd be a quick and responsive webpage, too. But the webpage wouldn't work without making relatively long queries to two remote services.
- Retrieve new location data from the remote OpenPaths API service.
- Retrieve specific map data from the Google Maps API service.
In between the two big remote queries, the program needs to perform the actual prediction for where I'm going to be based on the OpenPaths data, and send the predicted points to Google Maps. There's no way to avoid the fact that the webpage is going to take a few seconds to do all its work.
The best work-around for that is two things:
- Ajax. The web server can quickly serve a simple HTML web page to the client which'll get displayed for the user right away. Then the browser can make another request to the server for just the data that takes a long time to calculate and retrieve.
- Caching. Once I've retrieved raw datapoints and made the prediction calculations, then that prediction shouldn't change for a few minutes. I can save off my prediction and immediately hand it back the next time the webpage is requested, if it's requested relatively soon.
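The caching half can be sketched in a few lines. This is an in-memory illustration only; the class name `TimedCache` and the five-minute lifetime are assumptions, and a CGI-style script would more likely save the prediction to a file between requests.

```python
import time


class TimedCache:
    """Remember one computed value and reuse it until it is `ttl_seconds` old."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._stamp = None
        self._value = None

    def get(self, compute):
        """Return the cached value, calling `compute()` only when it has expired."""
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = compute()
            self._stamp = now
        return self._value


# Hypothetical usage: make_prediction stands in for the slow OpenPaths work.
# cache = TimedCache(ttl_seconds=300)
# prediction = cache.get(make_prediction)
```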
It's critical to me that the webpage be small and simple. It has to get to the point as quickly as possible. But it's also important to me that I give credit to the tools and services I used to make it possible. That called for some credits to be put in a footer.
I wanted the footer to be relative to the browser's window viewing the page. But if that window was too short, then the footer would end up overwriting or being overwritten by the map or the text above the map. The fix for that was some clever CSS that put the footer at the bottom of the window, but never let it cover up the important part of the page, the map.
So far so good. But then I discovered something unexpected...
Sadly OpenPaths seems to collect bad data from my phone occasionally while it's at rest. All of the recorded and predicted movement in the map below is due to bogus data from OpenPaths.
All of the points along the same angle that extends to the south east are bad data. The phone didn't go anywhere all that time. I have no idea why it sometimes pretends to travel to that part of town, but I don't like it. This called for another interesting algorithm that's better suited for a human brain:
If the datapoints smell fishy, don't use them.
It's really easy for me to detect which datapoints are bad, and not only just because I know where my phone's been. It's because there's a certain pattern, the angle and distance traveled by the bogus points. So I've got a work-in-progress algorithm that elides points that smell fishy.
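The real filter is still a work in progress, so what follows is only a guess at its shape. It assumes the bogus points all lie along roughly one compass bearing from wherever the phone was resting. The bearing of 135 degrees (south east), the tolerance and the minimum distance are hypothetical values, as are the function names.

```python
import math


def bearing_and_km(origin, point):
    """Initial compass bearing (degrees) and approximate distance (km) from origin to point."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, point)
    dlon = lon2 - lon1
    y = math.sin(dlon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(dlon)
    bearing = math.degrees(math.atan2(y, x)) % 360
    # Equirectangular approximation; good enough for hops across town.
    km = 6371.0 * math.hypot(lat2 - lat1, dlon * math.cos((lat1 + lat2) / 2))
    return bearing, km


def drop_fishy(points, rest, bad_bearing=135.0, tolerance=10.0, min_km=0.5):
    """Keep points unless they sit along the suspicious bearing, away from `rest`."""
    kept = []
    for point in points:
        bearing, km = bearing_and_km(rest, point)
        # Smallest angular difference between the two bearings, in degrees.
        off_by = abs((bearing - bad_bearing + 180) % 360 - 180)
        if km >= min_km and off_by <= tolerance:
            continue  # smells fishy
        kept.append(point)
    return kept
```

A filter this blunt would also throw away genuine trips to the south east, which is one reason to treat it as a starting point rather than a fix.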
A Secret Mode
The main point to the site was the prediction. But as long as I had all this historical data, it seemed like a shame if I couldn't easily look it up, too. So there's a secret mode, impossible to find and discover. (Since nowadays people don't read long blogs or actually type in the URL bar of their browsers.)
If you add an HTTP "GET" parameter, t, to the URL, the website will return a corresponding location history instead of a prediction of where it thinks I'm going to be. t can take one of three different forms: a UNIX timestamp, an RFC 2822 date and time, or a negative number of days to look back. Here are some examples:
I flew in to Los Angeles on that day. Timestamps are good if you're already dealing with them or want a relatively short token to represent an absolute time. Otherwise, they're an epoch fail waiting to happen.
Took the kids to Disneyland that day. Fun! That date format is handy if you want to browse my location history and are thinking in terms of calendar dates.
What'd I do last week? This is handy if I don't need an absolute time and date, but just want an offset from the current time and date.
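Telling the three forms of t apart doesn't take much. Here's one way it might be done, assuming the value arrives as a string; `parse_t` is a hypothetical name, and the look-back is a plain subtraction of 86,400-second days, which ignores daylight savings time.

```python
import time
from email.utils import parsedate_to_datetime


def parse_t(value, now=None):
    """Turn the t parameter into a UNIX timestamp.

    Accepts a UNIX timestamp, a negative number of days to look back,
    or an RFC 2822 date and time.
    """
    now = time.time() if now is None else now
    try:
        number = float(value)
    except ValueError:
        # Not a number, so treat it as an RFC 2822 date such as
        # "Mon, 09 Jan 2012 05:57:32 +0000".
        return parsedate_to_datetime(value).timestamp()
    if number < 0:
        return now + number * 86400  # e.g. "-7" means one week back
    return number
```

A malformed date makes `parsedate_to_datetime` raise an exception, which a real page would want to catch and turn into a friendly error.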
Finally, I've got a micro site that was really fun to build and with which I'm quite pleased.
An infuriating design decision of the Kindle Touch was to put the power button on the slick and slightly angled bottom of the device. This design decision confounds most attempts to stand up the Kindle on a hard surface and lean it against something so that the user can do his reading with his hands free.
Two things usually go wrong when you try to stand up an unmodded Kindle Touch for reading with your hands free.
One: The Kindle is likely to slide down, since the bottom surface of the Kindle is relatively slick.
Two: Once you balance the Kindle just right, you're balancing it on the power button, depressing it continuously, putting the Kindle into a reboot mode.
The fix for this is to cut little sections of vinyl bumper surface guards, and adhere them on the flat bottom of the Kindle Touch. The best part of the vinyl bumper to cut is the molding around the actual bumpers. It's just the right height to protect the power button from being accidentally depressed.
Even better, the vinyl has a high coefficient of friction, so the Kindle Touch won't slide down when you try to stand it up anymore. Both problems are fixed by this one simple mod!
I'm burying my parents' ashes this week. I miss them. I miss them individually, and I miss them as a pair. And my missing of them is an active, conscious thing, not a passive background thing. I remember them vividly, and I could use their advice and love now, if they were here. This is going to be a hard week.
It's been a hard couple of months.
I'm not sure what the difference between grieving and depression is. What I mean is, I don't know where "the line" is, and how not to cross it. I know that what happened is the natural order of things. We are supposed to survive our parents. I also know that I'm working through this.
Rie Fu's song, "Life is like a Boat," came on while I was playing a favorite playlist. It's a love song, but it also touches on the hardships of life, and working your way through them to the other side. In the song, we "are all rowing the boat of fate / the waves keep on comin' and we can't escape."
The next part resonates with me:
You make me wanna strain at the oars
And soon I will see the shore
When will I see the shore?
At the same time she writes about fate and not being able to escape it, she sings about never giving up her effort. She doesn't see the shore, but she'll strain at the oars in the hopes that she soon will.
I'm straining at the oars, too.
I wrote a dead man's switch to update some of my online accounts after I die.
What It Is
The basic idea is that if I pass away unexpectedly, I'd want my online friends to know, rather than for my accounts to go silent without any explanation at all. I wrote a program to take notice of whether or not I seem to still be alive, and once it's determined that I've died, it'll follow instructions that I've left in place for it. It'll do this over the course of a few days. Well, I won't tell you when it'll stop; that'd be taking some of the surprise out of it.
Two things caused me to do this. First, I wrote a lifestream. Essentially, I already wrote a computer program (a cron job, technically) that takes note of nearly everything I do online already. It was a handy thing to have, and it seemed like it could do just a little bit more with hardly any effort.
Second, I read the books Daemon and Freedom™. A character in those books also wrote a program (a daemon in his case) to watch over its creator's life, and then to take certain actions upon its creator's death. The idea got under my skin, and I just had to write a similar program of my own.
How It Works
This section will get technical, but it'll be of interest for those who also want to write their own.
Everything is in Python. The lifestream I have uses feedparser to read in and process each of the feeds affected by my online activity (sometimes called user "activity feeds"). It stores certain information in a yaml file. Here's an excerpt from the file itself.
- etag         : 2KJcCROqtyI4nqaQEg34109rfx4
  feed         : "http://my.dlma.com/feed/"
  latest_entry : 1326088652
  modified     : 1326090583.0
  name         : mydlma
  style        : journal
  url          : "http://my.dlma.com"
The most relevant item in the file is that there's a field called, "latest_entry", and the data for that field is a timestamp. The latest "latest_entry" would then be the most recent time I've been observed doing anything online.
Given that, all I had to do was write a new script that watched the latest "latest_entry", and when it became too long ago, it would assume that something bad had happened to me. (Which would be wrong, of course, if I was merely vacationing in Bora-Bora, and didn't have internet access for a couple of weeks.)
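The watching half boils down to one comparison. This sketch assumes the yaml file has already been loaded into a list of dictionaries shaped like the excerpt above; `seems_offline` and the five-day threshold are hypothetical.

```python
import time


def seems_offline(feeds, max_quiet_days=5, now=None):
    """True when the newest latest_entry across all feeds is older than the threshold."""
    now = time.time() if now is None else now
    latest = max(feed["latest_entry"] for feed in feeds)
    return now - latest > max_quiet_days * 86400
```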
This new script would do something like the following:
1. Continue to step 2 if David hasn't done anything online for a few days. Otherwise keep waiting.
2. Decide which posts to make at which times, and make sure that those posts themselves don't make it look like David's still alive, which would deactivate the switch.
Once the script thinks I've been offline for too long, it writes a cookie to file, and then goes from watching mode to posting mode.
In posting mode, the script looks over its entire payload of messages to deploy. I used the filesystem to maintain the payload, much like dokuwiki does. (Others might think that a database would be preferable. Sure, that'd be fine, too.) My payload files encode data into the filename, too. The filename is composed like so: [delay_to_post]-[service_to_post_to]-[extra_info].txt. That way, when I display a listing of the directory, I can see an ordered list of which messages go when.
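Pulling the fields back out of a filename like that is short work. In this sketch the delay is assumed to be a whole number (the unit isn't specified here), the example filenames are made up, and `parse_payload_name` and `due_messages` are hypothetical helpers.

```python
from pathlib import PurePath


def parse_payload_name(filename):
    """Split '[delay]-[service]-[extra].txt' into (delay, service, extra)."""
    stem = PurePath(filename).stem
    # Split on the first two hyphens only, so the extra info may contain hyphens.
    delay, service, extra = stem.split("-", 2)
    return int(delay), service, extra


def due_messages(filenames, elapsed):
    """Payload files whose delay has passed, ordered by delay."""
    parsed = [(parse_payload_name(name), name) for name in filenames if name.endswith(".txt")]
    return [name for (delay, _, _), name in sorted(parsed) if delay <= elapsed]


# Made-up listing for illustration:
# due_messages(["10-plurk-note.txt", "0-twitter-farewell.txt"], elapsed=3)
```

Sorting on the parsed number rather than the raw name keeps "10" after "3", which a plain alphabetical directory listing would not do unless the delays were zero-padded.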
Messages that go to blogging services like WordPress or Habari use the AtomPub API. Messages that go to other services generally use OAuth 2.0 for authentication, then use a custom API to deliver the message.
Once a message has been successfully delivered to its service, it gets moved or renamed in such a way that it's no longer a candidate to get delivered again.
The script runs as a cronjob, and usually just updates a status file. If it runs into a problem, it sends an email. (Fat lot of good that'll do if I'm already dead. But while I'm still alive, I'd like it to let me know if it's not happy.)
While I'm alive, I might still add posts to post later. When I add those new messages to post after my passing, I need to ensure that I didn't do anything wrong. (For example, the Habari payload is contained in an XML CDATA section, but the WordPress payload is plain XML, so I can't write any malformed messages.)
That's why as a part of routine maintenance, my dead man's switch also does payload data validation.
During script refactoring, I may want it to display certain diagnostic info directly to stdout. For that case, the script has optional debug, test, verbose and validate flags.
There's always the false positive, where it thinks I've died, but the rumour was greatly exaggerated. I'm actually looking forward to a few false positives, because they'll remind me that the dead man's switch is actually still running.
Another serious risk is that my dead man's switch relies on the successful continuous operation of my lifestream script. There's an element to that lifestream script that degrades over time. Hopefully, I'll get around to mitigating that risk.
And yet another risk to my dead man's switch is continuously changing APIs. As I upgrade my WordPress and Habari blogs, will they still accept AtomPub like they did when I wrote the switch in 2012? Will Twitter and Plurk still use the same OAuth protocol and API calls? Heck, will Dreamhost not upgrade Python to a version that's incompatible with my script?
Will I still have active accounts at the time of my passing? Will it then be illegal to continue to function online after you're dead? (Some bozo might die before me and do something stupid after he passes.)
There's a lot that could go wrong. But if these things don't go wrong, and my dead man's switch works correctly, that'd be pretty neat.
Have I got things to say to you!
Beware: I am a real neophyte when it comes to internet security. Having said that, I couldn't have fared any worse than Sony Pictures. They lost 1,000,000 plain-text passwords when a SQL injection vulnerability was discovered. I've been protecting against that attack since 2005. (At the part, "Is the password secure?", I say the passwords aren't stored in plain text. SQL injections have been the subject of security jokes for a long time, too. Ah, Little Bobby Tables.)
There have been and continue to be large breaches of personal data on the internet. Nathan Yau shares an infographic of the largest data breaches of all time.
My immediate family and I need a way to keep each other up to date with our changed account info and ID numbers. We need a solution that meets the following usability criteria:
- Accessible anywhere, from any device. It has to be practically just one click away.
- Trivial, memorable URL. We may be typing it directly into the URL bar.
- Always up-to-date. Any change made from anywhere is accessible immediately from any other client.
If it's not that easy to use, it won't be used, and there'd be no point in making it. On the other hand, it has to have the following security criteria:
- Accessible anywhere, from any device. It has to be secure even over a public wifi network.
- Secure from remote client attacks. It has to handle attacks over the internet.
- Secure from local attacks. It has to protect against disgruntled hosting company employees.
With all that in mind, I've decided to roll my own information vault. Here are some goals and notes from that venture:
Be A Low Value Target
My first line of defense is that my information vault is just for me and my family. This'll never store enough data of real value to make it a target for the economics of it. I might get attacked, but it'd only be for the idle challenge of it.
Block Direct Access of Data Files
Move data files off the server, even though they're encrypted, or into directories tightly controlled by permission settings and .htaccess instructions. Test both attacks. If your encrypted files can fall into your attacker's hands, they can try a local brute force attack. (More on that below.)
Use HTTP Secure
For any data that is accessible, use HTTPS. This is the first line of defense if you want your data accessible over a public wifi network.
Unique and Long Master Password
Force your users to use a long random, impossible-to-guess master password. Prevent any sort of social attack: No names, dates, or places. In my case, since I'm the creator of the tool, I can do this.
Use a Hard-To-Compute Hash for the Master Password
Related: Do not store the master password anywhere. And the salted hash you use for it should be secure. Refer to this Wikipedia article on cryptographic hash functions to see relative weaknesses of the functions. I've considered throwing in with a hashing algorithm that adapts to faster hardware to frustrate brute-force attacks.
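As one illustration of a deliberately slow, salted hash, here's a sketch using PBKDF2 from Python's standard library. It isn't necessarily what the vault uses. The iteration count is a hypothetical value that would be raised as hardware gets faster, and the function names are made up.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # hypothetical work factor; raise it as hardware gets faster


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, iterations, digest); store these, never the password itself."""
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected):
    """Recompute the hash and compare in constant time."""
    _, _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)
```

Storing the iteration count next to the salt means old hashes can still be verified after the work factor goes up.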
Don't Store any Data in Plain Text
This is a defense against a local attack from someone who can obtain file-level access, like a company employee with admin access.
Sony Pictures stored private data in plain text format, and thus enabled this interesting analysis of passwords in the Sony Pictures security breach. Consider your encryption algorithm carefully. I used AES, but am keeping my options open. I can change my backend at any time.
Limit Cookie Scope
Limit your HTTPS cookie scope with morsels like max-age, httponly, domain, path and secure.
While you're at it, it doesn't hurt to salt cookie and session data with an identifier associated with the request. In Python you could use os.environ['REMOTE_ADDR'].
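In Python 3 those morsels live in `http.cookies`. The sketch below sets them on a session cookie and shows one way a token might be tied to the requesting address. The cookie name, domain, lifetime and the `bind_to_client` helper are all hypothetical.

```python
import hashlib
import hmac
from http.cookies import SimpleCookie


def build_session_cookie(token, domain, path="/", max_age=900):
    """Return a Set-Cookie header line with a tightly limited scope."""
    cookie = SimpleCookie()
    cookie["session"] = token
    morsel = cookie["session"]
    morsel["max-age"] = max_age   # expire after a few minutes
    morsel["domain"] = domain
    morsel["path"] = path
    morsel["secure"] = True       # only ever sent over HTTPS
    morsel["httponly"] = True     # hidden from page scripts
    return cookie.output(header="Set-Cookie:")


def bind_to_client(token, remote_addr, secret):
    """Mix the requester's address into the token so it is useless from elsewhere."""
    message = (token + "|" + remote_addr).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()
```

Under CGI the `remote_addr` argument would come from `os.environ['REMOTE_ADDR']`. Binding to an address can log out users whose address changes mid-session, so it's a trade-off.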
Know what kinds of attacks can be performed. Encode characters that have special meaning for the languages you use, like the quote characters, <, >, and &, among others. In Python, the bare minimum you'd use for that is cgi.escape (html.escape in Python 3, since cgi.escape was removed in Python 3.8), but you'd want to use other functions depending on where and how the data is travelling or being displayed.
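For the HTML case, the bare minimum looks something like this in Python 3, where `html.escape` does the job. The `render_note` wrapper is a hypothetical example, and other destinations such as SQL, URLs or JavaScript need their own escaping.

```python
from html import escape


def render_note(text):
    """Wrap untrusted text in a paragraph, encoding &, <, > and both quote characters."""
    return "<p>" + escape(text, quote=True) + "</p>"
```

So `render_note('<b>"Bobby" & co</b>')` comes back as `<p>&lt;b&gt;&quot;Bobby&quot; &amp; co&lt;/b&gt;</p>`, which a browser displays as text instead of interpreting as markup.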
Analyze and Act Upon Suspicious Activity
It's not enough that your server is passively logging each access. Your site needs to analyze recent activity and take action (like email you or ban certain origins) when preset triggers are tripped.
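What a trigger looks like depends on the site. As one example, here's a sketch that flags origins with too many failed logins in a recent window. The event format `(timestamp, origin, succeeded)` and both thresholds are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def origins_to_ban(events, now, window_seconds=600, max_failures=5):
    """Origins with at least `max_failures` failed attempts inside the recent window."""
    recent_failures = Counter(
        origin
        for timestamp, origin, succeeded in events
        if not succeeded and now - timestamp <= window_seconds
    )
    return sorted(origin for origin, count in recent_failures.items() if count >= max_failures)
```

The returned list could feed an email alert or a deny list, whichever action the trigger calls for.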
"Security is not a product, but a process." --Bruce Schneier, author of "Applied Cryptography"
This blog entry may have already fallen out-of-date with new measures I've taken to protect our information vault.
If I'm missing a vector of attack, or you have some practical advice for me, I'd appreciate hearing from you.