Lately, I’ve been growing increasingly obsessed with this problem. While my solution is very fast (O(n) is pretty fast!), I’ve been concerned about a few possible issues with it:

- First, I wasn’t certain it was anything better than locally-optimal. It’s guaranteed not to produce a *bad* result, but will it produce a *good* result? I couldn’t be sure.
- Second, it relies on floating-point arithmetic, while most other solutions don’t.

It has the nice upside of being able to operate in constant space (not including the O(k) output space), and linear time, but those two caveats are potentially problematic. If it turned out to be a really *bad* solution, who’d use it? And the floating-point numbers felt too fuzzy to be safe. So I started poking at it again.

## Fixing the Floats

I started over the last few days with the floating-point number problem, and I realized that the DDA (digital differential analyzer) really doesn’t need floats to operate correctly. It doesn’t need infinite precision: The floats are really fractions, simple rational values, not arbitrary real numbers; and the denominator is small (`numPartitions * 2`). So if we were to scale the fractions up to integers, we’d only need integers large enough to hold the total sum times the number of partitions times two.
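To make the scaling concrete, here’s a toy illustration (my own sketch, not code from the comparison program) of a single boundary test: the fractional comparison and its integer equivalent, with both sides multiplied by `numPartitions * 2`, always agree on exact inputs, but the integer form can never suffer rounding error.

```c
/* Does adding 'value' carry the running sum past the midpoint of the
 * first partition boundary (total / numPartitions)?
 *
 * Fractional form:  sumSoFar + value/2  >=  total / numPartitions
 * Integer form (both sides scaled by numPartitions * 2):
 *     2*numPartitions*sumSoFar + numPartitions*value  >=  2*total
 *
 * (Toy illustration only; the real algorithm carries a residual so the
 * same test works for every boundary, not just the first.) */
int CrossesBoundaryFloat(int sumSoFar, int value, int total, int numPartitions)
{
    return sumSoFar + value / 2.0 >= (double)total / numPartitions;
}

int CrossesBoundaryInt(int sumSoFar, int value, int total, int numPartitions)
{
    return 2 * numPartitions * sumSoFar + numPartitions * value >= 2 * total;
}
```

The integer version needs no floating-point unit at all, and it’s exact for any input whose scaled sum fits in an `int`.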

So I coded up a simple replacement for the original code that solves it in pure integer math. I made the first version of that in C#, and then to show that there’s *nothing* special or magical about it, I coded up a pure C version that invokes no functions and uses nothing internally but integers and basic arithmetic (not even division!).

Nothing up my sleeve — presto!

```c
/*
 * Generate linear partitions of the numbers in 'values' of count 'numValues',
 * and write the start indices of each partition to 'partitions'.
 *
 * This uses a DDA approach to slice up the values so that there are
 * 'numPartitions' partitions of nearly equal size; the result is locally
 * optimal (but not necessarily globally optimal), requires only two total
 * passes over the data, requires only six total integer local variables as
 * storage, and uses only addition, subtraction, multiplication, if-statements,
 * and loops, which means this compiles to very efficient assembly language
 * and can run well even on *extremely* resource-constrained hardware.
 */
void LinearPartition(int *values, int numValues, int *partitions, int numPartitions)
{
    int i;
    int size, targetSize;
    int currentSum;
    int partitionIndex;
    int isFirst;

    targetSize = 0;
    for (i = 0; i < numValues; i++)
        targetSize += values[i];
    targetSize *= 2;

    currentSum = 0;
    partitionIndex = 0;
    isFirst = 1;
    partitions[partitionIndex++] = 0;

    for (i = 0; i < numValues; i++) {
        size = values[i] * numPartitions;
        if (currentSum + size < targetSize || isFirst) {
            currentSum += size * 2;
            isFirst = 0;
        } else {
            partitions[partitionIndex++] = i;
            currentSum = size * 2 - (targetSize - currentSum);
            isFirst = 1;
        }
    }
}
```

Two passes over arrays, and pure integers. I don’t think it’s possible to make a solution that runs *faster* than this does, and it produces exactly the same output as the previous version. (If you happen to already have the total sum of the values available, you can even skip the first pass over the data!)
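To make that last parenthetical concrete, here’s a sketch of a single-pass variant (my own naming and return value, not code from the original program) that accepts the precomputed total as a parameter:

```c
/* Same integer DDA as above, but the caller supplies the precomputed total
 * of all the values, so only one pass over the data is needed. Returns the
 * number of partitions actually produced. (This variant and its name are my
 * own sketch, not part of the original comparison program.) */
int LinearPartitionWithSum(int *values, int numValues, int *partitions,
                           int numPartitions, int totalSum)
{
    int i, size;
    int targetSize = totalSum * 2;
    int currentSum = 0;
    int partitionIndex = 0;
    int isFirst = 1;

    partitions[partitionIndex++] = 0;

    for (i = 0; i < numValues; i++) {
        size = values[i] * numPartitions;
        if (currentSum + size < targetSize || isFirst) {
            currentSum += size * 2;              /* Value joins the current partition. */
            isFirst = 0;
        } else {
            partitions[partitionIndex++] = i;    /* Value starts a new partition. */
            currentSum = size * 2 - (targetSize - currentSum);
            isFirst = 1;
        }
    }
    return partitionIndex;
}
```

For example, splitting { 9, 5, 6, 8, 7 } (total 35) into two partitions produces start indices { 0, 3 }, i.e., {9, 5, 6} and {8, 7}.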

## Statistical Analysis

But there’s still the question of *how good is it*? It’s “pretty good,” in that we know it makes locally-optimal decisions for every value to be added (see part 1 where I explained how it works), but is it *globally* optimal? Does it beat Dr. Skiena’s dynamic-programming solution, or at least match his solution for quality?

No.

That’s a shame, given how fast it is, but no, it doesn’t.

I implemented several different possible solutions against Jonathan Palardy’s data set, all in C#, and I ran them to see what they’d do. I made a nice little wrapper that generated statistics so I could see how my solution compared to the others. And while my solution is hands-down the fastest, a few of the other solutions edge it out very slightly for quality of the result.

Here’s output from a recent run:

```
Linear partition by identical count:
    [ 2213, 1604, 1341, 1892, 1996, 2035, 1791, 1201, 1473, 1259 ]
    Partitions:10 (vs 10) Total:16805 Mean:1,680.50 (vs 1,680.50) StdDev:337.26 Min:1201 Max:2213 Worst:±532.50

Linear partition by greedy cutoff:
    [ 1455, 1510, 1450, 1543, 1348, 1384, 1488, 1672, 1594, 1607, 1606, 148 ]
    Partitions:12 (vs 10) Total:16805 Mean:1,400.42 (vs 1,680.50) StdDev:388.71 Min:148 Max:1672 Worst:±1,252.42

Linear partition by float DDA:
    [ 1839, 1616, 1703, 1583, 1490, 1947, 1672, 1594, 1607, 1754 ]
    Partitions:10 (vs 10) Total:16805 Mean:1,680.50 (vs 1,680.50) StdDev:128.44 Min:1490 Max:1947 Worst:±266.50

Linear partition by integer DDA:
    [ 1839, 1616, 1703, 1583, 1490, 1947, 1672, 1594, 1607, 1754 ]
    Partitions:10 (vs 10) Total:16805 Mean:1,680.50 (vs 1,680.50) StdDev:128.44 Min:1490 Max:1947 Worst:±266.50

Linear partition by integer DDA, C-style:
    [ 1839, 1616, 1703, 1583, 1490, 1947, 1672, 1594, 1607, 1754 ]
    Partitions:10 (vs 10) Total:16805 Mean:1,680.50 (vs 1,680.50) StdDev:128.44 Min:1490 Max:1947 Worst:±266.50

Linear partition by Skiena's dynamic programming algorithm:
    [ 1839, 1616, 1703, 1892, 1640, 1488, 1672, 1594, 1607, 1754 ]
    Partitions:10 (vs 10) Total:16805 Mean:1,680.50 (vs 1,680.50) StdDev:114.71 Min:1488 Max:1892 Worst:±211.50
```

For each algorithm, the first row lists the sums of the values in each partition. The second row shows:

- How many partitions were produced (notice that “greedy cutoff” gets this wrong);
- The sum of all values (a simple safety check to make sure none were dropped);
- The mean of the partition sizes (again, notice that “greedy cutoff” gets this wrong);
- The standard deviation of the partition sizes;
- The minimum and maximum partition sizes;
- And the worst difference from the mean.

“Identical count” and “greedy cutoff” are the same first two solutions on Jonathan Palardy’s page, and perform identically to his results. They make a good baseline. They’re also pretty bad, as measured by their means and standard deviations.

Dr. Skiena’s dynamic-programming solution edges in with the best standard deviation and best worst-difference. One would expect this, since it’s a provably-optimal solution. But it’s also O(k*n^2), which is pretty awful for performance; imagine trying to linear-partition a million items — you’d never finish!
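For reference, the heart of that dynamic program looks something like this (a minimal sketch of the textbook recurrence from Skiena’s *Algorithm Design Manual*, computing only the optimal cost; his full version also records the split points, and this is my own reimplementation, not the C# code from the comparison program):

```c
#include <stdlib.h>

/* Sketch of the classic O(k*n^2) linear-partition dynamic program:
 * cost[i][j] is the smallest achievable largest-partition sum when the
 * first i values are split into j contiguous partitions, via
 *     cost[i][j] = min over x of max(cost[x][j-1], sum(values[x..i-1])).
 * Only two rows are kept, since row j depends only on row j-1.
 * Returns the optimal cost for all n values in k partitions. */
int OptimalPartitionCost(const int *values, int n, int k)
{
    int i, j, x, best, result;
    int *prefix = malloc((n + 1) * sizeof(int));
    int *prev = malloc((n + 1) * sizeof(int));
    int *cur = malloc((n + 1) * sizeof(int));

    prefix[0] = 0;
    for (i = 1; i <= n; i++)
        prefix[i] = prefix[i - 1] + values[i - 1];

    /* One partition: the cost is just the prefix sum. */
    for (i = 0; i <= n; i++)
        prev[i] = prefix[i];

    for (j = 2; j <= k; j++) {
        for (i = 0; i <= n; i++) {
            best = prefix[n];   /* Safe upper bound on any cost. */
            for (x = j - 1; x < i; x++) {
                int cost = prev[x];
                int last = prefix[i] - prefix[x];
                if (last > cost) cost = last;
                if (cost < best) best = cost;
            }
            cur[i] = best;
        }
        int *tmp = prev; prev = cur; cur = tmp;
    }

    result = prev[n];
    free(prefix); free(prev); free(cur);
    return result;
}
```

The triple nesting (k rows, n cells, up to n candidate split points per cell) is exactly where the O(k*n^2) comes from.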

Palardy’s water-flow technique, which I didn’t try to reimplement, has an impressive standard deviation of 120.9, but it uses simulated annealing to get there: Again, that works well enough for small data sets, but it probably doesn’t scale to anything huge.

## Still, I’ll take it

So while my algorithm doesn’t match a couple of the better ones for *quality*, it knocks everybody else out of the park for *performance*, and its output is still *very* good. It manages to produce the same exact result as Dr. Skiena’s optimal version for seven out of the ten partitions, which isn’t too shabby for how simple it is and how fast it runs. It could realistically be used on a 16-bit microcontroller, or could have been used for text layout on the original Macintosh or an SNES; not much of anything else is going to compete against it for constrained environments like that.

It also has an interesting side benefit, in that it’s very nearly an *online* or *streaming* algorithm, because it only looks at each value once after summing them: If you’re able to calculate the total sum of the values upfront, you could actually perform reasonably-efficient linear partitioning on data that’s too big to fit in memory!
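A streaming version might look something like this (my own sketch, with my own names; it just restates the integer DDA’s inner loop as a feed-one-value-at-a-time state machine, assuming the total is known upfront):

```c
/* Streaming form of the integer DDA: initialize with the precomputed total,
 * then feed values one at a time. DdaFeed() returns 1 when the value it was
 * just given starts a new partition (the first value implicitly starts
 * partition zero). These names are my own, not from the original code. */
typedef struct {
    int targetSize;     /* Total of all values, times two. */
    int currentSum;
    int numPartitions;
    int isFirst;
} DdaStream;

void DdaInit(DdaStream *s, int totalSum, int numPartitions)
{
    s->targetSize = totalSum * 2;
    s->currentSum = 0;
    s->numPartitions = numPartitions;
    s->isFirst = 1;
}

int DdaFeed(DdaStream *s, int value)
{
    int size = value * s->numPartitions;
    if (s->currentSum + size < s->targetSize || s->isFirst) {
        s->currentSum += size * 2;
        s->isFirst = 0;
        return 0;
    }
    s->currentSum = size * 2 - (s->targetSize - s->currentSum);
    s->isFirst = 1;
    return 1;
}
```

Only four integers of state persist between values, so the data itself could live anywhere: a file, a network stream, even a tape.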

So in some ways, I’m a little disappointed that I didn’t beat the best output; but at the same time, for practical, real-world use, my algorithm might actually be a winner: It’s trivial to implement, it’s faster than everything else out there, and it produces output that — while not optimal — is at least highly-competitive with the best solutions.

In spending a bit of time thinking about it, I don’t think it’s possible to produce a better solution that only looks at each value individually, in isolation — this algorithm may be the best answer that can be found in O(n).

But I’m still eyeing that nice output that Dr. Skiena’s solution produces, and that crazy O(k*n^2) runtime he has. I *think* there might be a solution that can match his results faster than his solution can; but unlike my current, elegant O(n) solution, the kinds of techniques I’m thinking about for that are big and require complex secondary data structures. I *think* it might be possible to produce a globally-optimal solution in O(n lg n).

That said, for real-world use, I’m likely to stick with my pretty O(n) solution: It might not produce the *absolute* optimal output, but it produces *really good* output, and it runs circles around everything else.

## Source code

The C# source code for the example comparison program above can be found in my GitHub account and is covered under the MIT open-source license. And like the previous blog posting, I hereby place all of the code in this blog post *into the public domain*, so use it as you see fit, no questions asked. Steal and enjoy 🙂