This blog post is based on a set of challenges presented at SecTalks Canberra. You can have a go at solving the challenges here – this post will have some spoilers. 

Here’s a trick question: if I have a negative number, and I multiply it by negative one, will I always get a positive number? If we’re talking about the real world, then the answer seems obvious: of course I will. But when it comes to computers, things get a bit trickier, and sometimes I can end up with a negative number. Welcome to the weird world of data types.

We software developers use basic data types every day: numbers, text, dates, times, and so on. They’re neat abstractions that make our lives significantly easier, saving us from having to spend all our time dealing with raw 1s and 0s. But many of these basic data types are a bit “rough around the edges”, so to speak: they don’t represent their real-world counterparts perfectly. This can cause problems when we forget about those limitations. Looking back over the bugs I have found in my own code, many of them came about because I had assumed the abstractions “just worked”.

Integers

Take integers for example. Many developers are aware of integer overflow: because of some (very sensible) limitations of computer memory, there’s a limit to how big a number we can store. Once we go above that limit, because of how numbers are represented in a computer, it essentially loops around and becomes negative:
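
In Java, for example (a minimal sketch; most languages with fixed-size integers behave the same way):

    int number = Integer.MAX_VALUE;   // 2,147,483,647, the largest value an int can hold
    number = number + 1;              // one more than the maximum...
    System.out.println(number);       // ...prints -2147483648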

But integer overflow gets weirder. If I have a negative number, and I multiply it by negative one, I would expect that it should always become positive. But again, computers do not live up to their real-world counterparts.

The Integer type has a quirk whereby its range of valid values can go slightly further negative than it can go positive. The maximum integer value is 2,147,483,647; but the minimum value is -2,147,483,648. So what happens if we try to multiply the minimum value by -1? The answer, again, is overflow, and it will actually end up back at -2,147,483,648. So any code that takes the negative of a negative number, and assumes the result will be positive, may be buggy.
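
A quick Java sketch of that behaviour (Math.abs has the same surprise):

    System.out.println(Integer.MIN_VALUE * -1);      // prints -2147483648, not 2147483648
    System.out.println(Math.abs(Integer.MIN_VALUE)); // also prints -2147483648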

If this seems pedantic, consider things from an attacker’s point of view. An attacker doesn’t care about the billions of cases where the code works correctly; they are interested in the one case where it doesn’t.

A term I like for this class of quirk is “leaky abstraction”, coined by Joel Spolsky. These data types are essentially claiming “you don’t need to worry about how I’m implemented – just treat me like the real thing”. And 99.99% of the time, that will just work. It’s the edge cases we forget to deal with that often create the bugs attackers can leverage.

Floating Point

Floating point numbers are another oft-used data type, and there is some unexpected behaviour here too. A relatively well-known quirk of this representation is “floating point error”: very small rounding errors add up as we do maths, and we end up with answers that are ever-so-slightly incorrect. Say we take a number, and divide and multiply it so that it should end up back where we started:
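
For example, in Java (49 is just a divisor that happens to show the effect nicely; plenty of others do too):

    double original = 1.0;
    double roundTripped = (original / 49) * 49;    // mathematically this should be 1.0 again
    System.out.println(roundTripped);              // prints 0.9999999999999999
    System.out.println(roundTripped == original);  // prints false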

But the floating point type has some lesser-known quirks:
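
Consider comparison code along these lines (a Java sketch, where x and y are doubles that have arrived from elsewhere in the program):

    if (x < y) {
        System.out.println("x is less than y");
    } else if (x > y) {
        System.out.println("x is greater than y");
    } else {
        System.out.println("x is equal to y");   // the only remaining possibility... right?
    }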

In this code, we compare two numbers, x and y. If the first two checks fail, then the only logical conclusion must be that x is equal to y.

Except that there is another possibility that is rarely considered: either x or y (or both) might hold a special value: NaN. “NaN” (Not a Number) is the result of certain mathematical operations, such as zero divided by zero, or the square root of negative one. The quirk of NaN is that any comparison with it – less than, greater than, or equal to – comes back false. In fact, not even NaN is equal to NaN.
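
You can see this in Java (or just about any language that uses IEEE 754 floating point):

    double notANumber = 0.0 / 0.0;                 // dividing zero by zero gives NaN rather than an error
    System.out.println(notANumber < 1.0);          // prints false
    System.out.println(notANumber > 1.0);          // prints false
    System.out.println(notANumber == notANumber);  // prints false: NaN is not even equal to itself
    System.out.println(Double.isNaN(notANumber));  // prints true: the reliable way to check for NaN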

Now so far, the leaky abstractions we’ve looked at have been a result of sensible memory and speed tradeoffs. The computer scientists who came up with these abstractions could have decided to make it so that integers and floating point numbers didn’t have these limitations – but as a result, calculations would have been significantly slower, and memory usage would have been significantly higher.

But there’s another reason that our abstractions do unexpected things; and that is the inherent complexity of the concepts that they try to represent.

Time

Take time, for example. We’d like to think that time always goes forwards – and in a purely physical sense, of course it does. But our human-made ways of measuring time are quite complex. The current time can jump significantly just by walking a few metres (time zones). The current time sometimes goes backwards in a predictable way (daylight savings) – although governments often decide to change their daylight savings rules, so hopefully you’re keeping up to date with that. Sometimes time can jump forwards or backwards in a less predictable way, as computers synchronise their clocks with each other. Then there are leap seconds. And if you’re dealing with relativistic time dilation, well, good luck to you.

As we represent these concepts in our computers, we’ve tried to deal with most of these things – but we need to be careful. I often see this sort of code measuring the amount of time that a calculation takes:
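
In Java it might look something like this (LocalDateTime is the local wall-clock time, and doSomeCalculation is just a stand-in for whatever work is being measured):

    import java.time.Duration;
    import java.time.LocalDateTime;

    LocalDateTime start = LocalDateTime.now();   // local wall-clock time, with no time zone attached
    doSomeCalculation();                         // the work we want to time
    LocalDateTime end = LocalDateTime.now();
    System.out.println("Took " + Duration.between(start, end).toMillis() + " ms");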

In many programming languages, this style of code is buggy. The vanilla implementation of dates is often not aware of any of the complexities mentioned above. If this runs while daylight savings switches over, it will be off by an hour – unless you’re on Lord Howe Island, in which case you’ll be off by half an hour. It also doesn’t deal with NTP time synchronisation or leap seconds.
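
In Java, for example, the measurement side of this is best done with System.nanoTime, a monotonic timer that isn’t tied to the wall clock at all:

    long start = System.nanoTime();                // monotonic timer, not related to wall-clock time
    doSomeCalculation();                           // the same stand-in as before
    long elapsedNanos = System.nanoTime() - start;
    System.out.println("Took " + (elapsedNanos / 1_000_000) + " ms");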

There are ways to deal with these concerns: for the measurement itself, use a monotonic timer as in the sketch above, and for the complexities around time zones, third-party libraries like Joda-Time and Noda Time can help. But the important point is that these abstractions are often more complex than they might first appear, and using them without deeply understanding them can lead to bugs.

Text

Written language is another thing that is incredibly messy, and our best attempt at representing it, Unicode, is necessarily complex. For those of us in the English-speaking world it can be easy to miss this complexity, as the most complex thing we have to deal with regularly is probably capitalisation. But even capitalisation works differently in different languages. Many languages have diacritic marks, and some are written right-to-left. In Arabic scripts, letters join together, so adding a letter to the end of a word can change how the previous letter is rendered – meaning the text can actually appear shorter as a result of adding characters. In Mongolian (and emoji, for that matter), the same letter can have multiple representations.

Unicode attempts to deal with all of these complexities. There are also areas of Unicode reserved for vendor-specific use, so the Apple logo shares the same Unicode representation as a letter from the Klingon alphabet, and which one you see depends on the computer you’re running on. And this is before we even get into the multiple ways you can encode Unicode, the most commonly used of which take up different amounts of space for different characters.
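
To give a flavour of this in code: Unicode lets you write an accented letter like “é” either as one pre-composed character or as a plain “e” followed by a combining accent, and naive string comparison treats the two as completely different. A small Java sketch, using the built-in java.text.Normalizer:

    import java.text.Normalizer;

    String composed = "\u00E9";    // "e with acute" as a single pre-composed code point
    String decomposed = "e\u0301"; // a plain "e" followed by a combining acute accent; renders identically
    System.out.println(composed.equals(decomposed));  // prints false
    System.out.println(composed.length());            // prints 1
    System.out.println(decomposed.length());          // prints 2
    // Normalising both strings to the same form (NFC here) makes them comparable again
    System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed));  // prints true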

All of this complexity is just asking for attackers to find novel ways to interact with software that handles Unicode naively. And the big companies aren’t immune – just look at iOS’s many “receiving this text will crash your phone” bugs, most of which were caused by poor handling of Unicode edge cases.

Modern technology is unbelievably complex, and abstractions are absolutely necessary for technical practitioners to be remotely productive. But having a deep understanding of the abstractions we use is important for creating robust software.

Credit to Jon Skeet, whose “Humanity: Epic fail” talk got me thinking about some of these quirks.