On predictability of seemingly simple code
Does the time taken to count integers (less than a threshold) in an unordered array remain the same, or does it vary?
This post is a copy of the gist I posted in response to the WeekendDevPuzzle of 2022-02-05, shown below.
As the thread above has submissions from people who chose to share their perspectives on the puzzle, I'll be linking this post to this Twitter thread, to avoid polluting the original one.
Motivation for today's topic
We all have a mental model of how something works. For today's puzzle, the relevant mental model is: the same series of instructions should result in the same execution time. The astute would be quick to add: only in non-noisy environments. And they'd be correct.
But then where's the flaw in this model? Let's start.
Dissecting the puzzle
The puzzle seems simple enough: a program reads in an array of integers and counts how many of them are less than a threshold (another parameter to the program). We've been told that there are no noisy-neighbour issues, that we're measuring only the counting loop's time, and that we're ignoring all I/O.
Given that there's no guarantee of order in the array, there's no way to avoid a full linear pass. We will need to compare every single integer with the threshold and increment the count accordingly.
On the face of it, given that we're always going to make a full linear pass, it would seem it'll always take the same time.
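To make this concrete, here's a minimal C++ sketch of the kind of counting loop we're talking about (the puzzle's sample code was in Python; this rendering, and the names in it, are mine):

```cpp
#include <cstddef>
#include <vector>

// Count how many elements of `data` are below `threshold`: one comparison
// (and possibly one increment) per element, in a single linear pass.
std::size_t count_below(const std::vector<int>& data, int threshold) {
    std::size_t count = 0;
    for (int v : data) {
        if (v < threshold) {
            ++count;
        }
    }
    return count;
}
```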
Thinking aloud
It's difficult to be objectively honest about "thinking aloud" when one is the designer of the puzzle, but I'll try to be as sincere as I can about how my mind was working when I came up with it.
There are six potential drivers of differences that jump out to me:
- Any potential instruction/data caching that accelerates execution.
- Any pipeline hazard related to branch prediction (as we have conditional logic here).
- Compiler optimisations.
- Any JIT-related optimisations that might be done in runtimes like Java, etc.
- The number of times the count is incremented.
- CPU architectural differences.
At this point, I do see the answer, but given that it's a simple enough program, let's just write it out and execute it. We can try explaining the behaviour as we go.
Testing for caching differences
Any differences due to caching would clearly be linked to the size of our data. So let's explore different data sizes; to generate the datasets, we'll use a short Python script.
Let's start with something that fits into a standard 64 KB L1 cache. We do this by keeping the dataset size to 15k integers (each 4 bytes in size, so ~60 KB). Oh, and let's write this in C++, just to avoid any JIT-related optimisations for now. We're also setting the times parameter to 10k, which repeats the counting loop that many times, so that we get a decently sized time interval.
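A minimal sketch of such a harness follows. The parameter values are hard-coded for brevity, and the random dataset is a stand-in for the one generated by the Python script; only the counting loops sit inside the timed window.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    // Illustrative values; the real program takes the dataset and these
    // parameters as inputs.
    const std::size_t n = 15'000;   // ~60 KB of 4-byte ints
    const int times = 10'000;       // number of passes over the data
    const int threshold = 500'000;

    // Stand-in dataset: uniform random ints.
    std::vector<int> data(n);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 999'999);
    for (int& v : data) v = dist(rng);

    // Time only the counting loops; everything else stays outside the window.
    // (A serious benchmark would also guard against the compiler hoisting the
    // loop-invariant work out of the `times` loop; this is just a sketch.)
    std::size_t total = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < times; ++t) {
        std::size_t count = 0;
        for (int v : data) {
            if (v < threshold) ++count;
        }
        total += count;
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - start;

    std::printf("total=%zu loop time=%.3fs\n", total, elapsed.count());
    return 0;
}
```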
Hmm, we're getting loop times of ~0.5s. Now, let's run the same on a dataset of 15M. To compensate, we'll reduce the times parameter proportionately, to 1k.
Almost the same as before. But is this really surprising? Given that we're going over the data sequentially, all we're doing is encountering a cache miss every n bytes, where n is the size of a cache line; with 4-byte integers and a typical 64-byte cache line, that's at most one miss every 16 elements, and hardware prefetching hides much of even that. It's still a very simplified picture, of course, but you can see why the effect of the cache is not very pronounced.
Testing for branch prediction
If you don't understand pipeline hazards, or how CPU pipelines work, here's a high-level picture: the CPU doesn't wait for one instruction to complete before initiating the next. This is needed so that an instruction that takes, say, 4 clock cycles doesn't stall the instructions behind it unnecessarily. So the CPU pipelines them. If for any reason this cannot be done, or the pipeline needs to be scratched and restarted, things become slower. One of the reasons this can happen is branching (if-else).
We can test for this by simply varying the parameters that change how often the condition is true or false. E.g. setting the threshold to a very high int ensures the condition is always true. We can compare this to a scenario where the threshold is such that the outcome flips between true and false quite often (this prevents the branch predictor from being effective, so more pipeline hazards are encountered).
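In the harness sketched above, the only thing that changes between the two runs is the threshold constant; the values below are illustrative, not the exact ones from my runs.

```cpp
#include <climits>

// Scenario A: the condition is always true, so the branch is perfectly predictable.
const int threshold_always_true = INT_MAX;

// Scenario B: with data uniform in [0, 1'000'000), a mid-range threshold makes
// each comparison close to a coin flip, which defeats the branch predictor.
const int threshold_coin_flip = 500'000;
```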
Let's do this!
What?! A ~3.2x difference between perfect branch prediction and poor branch prediction! I'll be honest: every time I encounter the extremes of branch prediction, I know what to expect, but it never ceases to amaze me.
So, clearly, if the threshold parameter is such that the condition flips very few times in consecutive iterations, the loop will execute much faster.
Compiler optimisations
Compilers may use vector operations to optimise execution, or use conditional instructions that sit well with pipelines (e.g. CINC on ARM). Which compilers enable what, through which flags, is something I'll leave as an exercise.
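One way to see the idea in source form is to write the loop body without an explicit branch. An optimising compiler can often turn either form into a conditional increment (such as ARM's CINC) or into SIMD compares; which compiler does what, at which optimisation level, varies, so treat this as a sketch of the shape the compiler is aiming for rather than a guarantee.

```cpp
#include <cstddef>
#include <vector>

// Branchless formulation: the comparison yields 0 or 1 and is added directly,
// so there is no conditional jump for the branch predictor to mispredict.
std::size_t count_below_branchless(const std::vector<int>& data, int threshold) {
    std::size_t count = 0;
    for (int v : data) {
        count += (v < threshold);  // bool promotes to 0 or 1
    }
    return count;
}
```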
JIT related optimisations
I tested with Kotlin, and at least in my tests, the JVM produced a similar difference in results (>3x). I was a little surprised, as I was expecting the JVM to produce vectorised instructions for faster execution. I'm not digging deeper into this for now, but if you do find more details, please share them in the comments below.
On Python, you won't see any difference between thresholds if you use standard CPython, as it runs mostly as an interpreter. However, if you use a JIT-compiled Python (like pypy3), you'll see very similar differences to the ones above. Did you notice that in the puzzle, I'd shared Python code as a sample? That was a little cheat on my part, for the lazy ones out there :)
Number of times count is incremented
One may think there'd be a small difference between scenarios where the condition mostly evaluates to true vs mostly false; after all, we're incrementing a variable on every true. Yes, but incrementing the variable on every true does not mean that a memory address is written to every time. Incrementing a register (or pipeline intermediate registers) is super fast and hardly makes any difference. My tests seem to confirm this (except when you're using a Python-interpreter sort of setup).
CPU architectural differences
I expect the difference to be smaller on ARM (the numbers above are from x86-64). I'll update this once I run the tests.
Conclusion
So, as you can see, we can get substantial differences in run times for the same dataset, based on different threshold values. This was first called out in unambiguous terms by Smit Hinsu in this tweet, though others had been discussing branch prediction prior to that too.
This puzzle is more than just a theoretical exercise. It tells us that program executions can be greatly affected by the inputs we receive at runtime, even in the simplest of cases (like a loop of conditional increments).
Hope you had fun! If you've got a comment to share, head over to the gist of this post.