Mathish - Properties of Code: Functional Complexity

Properties of Code: Functional Complexity

About a year ago, I began giving some serious thought to an article named Functional Complexity Modulo a Test Suite by Reg Braithwaite. Today, I think I have something to say on the matter.

Background

Suppose you’re a programmer embracing the trends of test or behavior driven development. You have a problem to solve and expectations on how the solution should behave. So, you think about the problem, enumerate the behaviors and write some tests to model this version of reality. You start writing code to satisfy these tests. Red becomes green, you refactor clumsy first drafts into terse and expressive statements that are almost a pleasure to read, when without warning, four years of undergraduate mathematics grab ahold of your brain and thrust you into theoretical domains. Your output of practical, working code is halted for the day, while you begin contemplating ways to measure “readability”, “maintainability”, “complexity” and how these ideas fit within the framework you have spent the past few years operating within. You immediately regret taking that second major in mathematics, but it’s an integral part of you now. You know you can’t run from it, so you do your best to appease it by firing up your favorite text editor, ensuring that your blog is running MathJax, and whipping out your $\LaTeX$ fu!

Some Preliminary Definitions

The following definitions have been directly derived from the original article linked above.

For our purposes, a program, $p$ , is a function of inputs that returns some kind of output. We will use $\mathcal{P}$ to denote the set of all programs.

A test, $t$ , is a function that maps a program, $p$ , to either 1, for true, or 0, for false: $t : \mathcal{P} \rightarrow \{0, 1\}$ .

A program, $p$ , is said to satisfy a test, $t$ , if and only if $t(p) = 1$ . We will represent this relationship with the symbol $\models$ :

$p \models t \iff t(p) = 1$

A test suite, $\sigma = \{ t_1, t_2, \ldots, t_n \}$ , is a set of $n$ tests. We denote the number of tests in a suite as $|\sigma| = n$ . A program is said to satisfy a test suite, $\sigma$ if it satisfies each of the tests in $\sigma$ . More formally, $p \models \sigma$ if, and only if $\forall t \in \sigma, p \models t$ .

We say two programs are functionally congruent modulo a test suite when they both satisfy the same test suite. When there is no danger of ambiguity, we may simply refer to two (or more) programs as being “congruent.” We represent this notion of congruence as follows:

$p_1 \equiv p_2 (\bmod \sigma) \iff p_1 \models \sigma, p_2 \models \sigma.$

Consider all of the programs that might satisfy the test suite. There are an infinite number of programs in this set (a trivial proof of this statement can be found in “Example Programs and Calculations.”) Now, let’s suppose we have a metric, $|p|$ , that measures the size of a program. As a trivial example, let’s suppose $|p|$ is the number of characters in the string representation of program $p$ . We could use a more interesting measurement, but for now let’s stick with the simple “string length” measurement. We take this measuring stick and apply it to each of the programs in the set of programs that satisfy a given test suite, and record the “shortest length” measured. This “shortest length” is the satisfaction complexity of the test suite. Given a metric for a program $|p|$ , the satisfaction complexity of a test suite, $D_{\sigma}$ , is given by,

$D_{\sigma} = \min \left\{ |p| : p \models \sigma \right\}$

And finally, we define the functional complexity of a program, modulo a test suite, $F_{\sigma}(p)$ , to be the satisfaction complexity of the test suite, $D_{\sigma}$ , if and only if the program satisfies the test suite. Alternatively,

$F_{\sigma}(p) = \left\{ \begin{array}{ll} D_{\sigma} & \text{if $p \models \sigma$} \\ \infty & \text{otherwise} \end{array} \right.$

The important point to take away from this is that every program that satisfies a given test suite has the same functional complexity relative to that test suite.

Example Programs and Calculations

Suppose we want a program that takes the sum of all of the integers from 1 to $k$ . We begin by writing some tests (note: I am going to make use of a hypothetical assert function that returns 1 if the block it is given evaluates to true, and 0 otherwise):

def test_1 program
  assert { program.call(10) == 55 }
end

def test_2 program
  assert { program.call(6) == 21 }
end

def test_3 program
  assert { program.call(83) == 3486 }
end

Our test suite consists of three tests, each checking that our program produces the correct sum for distinct values of $k$ . As mentioned earlier, there are an infinite number of programs that can satisfy this test suite, and here is the trivial proof:

def sum_with_while k
  i = 1
  sum = 0
  while i <= k
    sum += i
    i += 1
  end
  sum
end

def sum_with_while_1 k
  unused_local = 1
  sum = 0
  i = 1
  while i <= k
    sum += i
    i += 1
  end
  sum
end

def sum_with_while_2 k
  unused_local = 2
  sum = 0
  i = 1
  while i <= k
    sum += i
    i += 1
  end
  sum
end

Hopefully, the pattern is obvious: the program sum_with_while_<x> will set unused_local = <x>. This assignment adds nothing of value to the program, but the program still returns the appropriate result, and thus satisfies the test suite.

Now, let’s consider a few non-trivial variations:

p1 = lambda do |k|
  i = 1
  sum = 0
  while i <= k
    sum += i
    i += 1
  end
  sum
end

p2 = lambda do |k|
  a = b = 0
  while a < k
    b += (a += 1)
  end
  b
end

p3 = lambda do |k|
  (1..k).inject { |sum,i| sum + i }
end

# Only when Symbol#to_proc is available (eg: Ruby 1.8.7+)
p4 = lambda do |k|
  (1..k).inject(&:+)
end

# Only if Enumerable#sum is available (eg: active_support)
p5 = lambda do |k|
  (1..k).sum
end

p6 = lambda do |k|
  k * (k + 1) / 2
end

p7 = lambda do |k|
  case k
  when 10 then 55
  when 6 then 21
  when 83 then 3486
  end
end

Each of these programs, $p_1 \ldots p_7$ , satisfy our test suite. Programs $p_1$ and $p_2$ are very similar, but with some variables renamed and some operations switched about. Programs $p_3$ , $p_4$ , and $p_5$ are also similar: $p_4$ removes some verbosity by using Ruby’s Symbol#to_proc method while $p_5$ makes use of ActiveSupport’s Enumerable#sum method which in turn calls inject. The program $p_6$ calculates the sum analytically without iteration while program $p_7$ provides results only for the values tested for.

We will ignore the variable assignment and line ending characters when calculating the length of these programs. For example, when calculating the length of $p_6$ , we count only the characters in “lambda do |k|” (13 characters), “ k * (k + 1) / 2” (17 characters, including the two leading spaces), and “end” (3 characters), for a total of 33 characters. We ignore line ending characters (eg: \n) because they aren’t readily visible. Why make the process of verifying these numbers more tedious than it already is?

Below are the lengths of each of the programs based upon this method of counting characters:

$\begin{aligned} |p_1| & = 78 \\ |p_2| & = 65 \\ |p_3| & = 51 \\ |p_4| & = 36 \\ |p_5| & = 28 \\ |p_6| & = 33 \\ |p_7| & = 81 \end{aligned}$

We see that $p_5$ is the shortest of our programs, weighing in at 28 characters. One could argue that the size of $p_5$ is misrepresented because the sum method is not a native Ruby method, and we could address that concern by adding the length of the definition of sum to $|p_5|$ . When measured in the same way as our programs, the Enumerable#sum method found in ActiveSupport 3.0.7 weighs in at 112 characters, so let’s tack that on, $|p_5| = 140$ . Ensuring that the metric accounts for the program’s definition as well as any external dependencies keeps our measurements meaningful. Otherwise, we could create a very small program that satisfies our test suite:

p8 = lambda do |k|
  p1[k]
end

If our notion of size does not account for the size of all external dependencies, our metric (and that which we intend to build upon it) loses nearly all utility.

Taking into consideration the adjustment made to $|p_5|$ , the “smallest” example program that satisfies our test suite is $p_6$ . It is entirely possible that there exist programs even smaller that also satisfy the suite, but given that $p \models \sigma$ we can conclude

$\begin{aligned} D_{\sigma} & \leq |p_6| \\ F_{\sigma}(p) & \leq |p_6| \\ & \leq 33 \end{aligned}$

So, we only have an upper bound on $D_{\sigma}$ ? Close enough!

Kolmogorov Complexity and Functional Complexity

One way of measuring the complexity of a string is by measuring its Kolmogorov complexity. A quick overview of the process, taken straight from the linked Wikipedia article, is to take a string:

abababababababababababababababababababababababababababababababab

and search for a smaller representation of the it:

ab repeated 32 times

The first string has 64 characters, the second string has only 20. Our simplification of the original string may not be minimal, but it is substantially smaller, which suggests that the original string was not very complex. The actual Kolmogorov complexity of a string is the size of its minimal representation in some fixed universal description language. Provided that our language of choice is Turing complete, our measurements will vary from some other choice in language by a fixed constant. So, let’s pick Ruby.

'ab'*32

We now have a representation of our original string that is only 7 characters long. This representation may still not be minimal, but as with Functional complexity, we now have an upper bound — the Kolmogorov complexity of the original string in Ruby is at most 7. It’s been a while since I’ve thrown down some $\LaTeX$ , so if we let $K(s)$ represent the Kolmogorov complexity of a string, $s$ , and $d(s)$ represent a minimal description of $s$ in Ruby, then:

$\begin{aligned} K(s) & = |d(s)| \\ & \leq 7 \end{aligned}$

So, unsurprisingly, or original string is really not that complex, it can be greatly compressed, and it is very “un-random.” All three of those statements are roughly synonymous. Now, what is the relationship, if any, between Kolmogorov complexity and Functional complexity modulo a test suite? Kolmogorov complexity deals with individual strings whereas Functional complexity deals in programs that satisfy a test suite, so to find a relationship between the two, we need to get them both working in the same domain. In our example test suite, we have a pretty simple mapping between input and expected output that can be represented by many different strings. Here’s one:

"10=55,6=21,83=3486".split(',').each_with_index do |io, i|
  i, o = io.split('=')
  define_method :"test_#{i+1}" do |p|
    assert { p[i.to_i] == o.to_i }
  end
end

This test suite builder weighs in at 159 characters (excluding \n characters) compared to 169 characters in our original test suite. We could use a regular expression instead of multiple calls to String#split:

"10=55,6=21,83=3486".scan(/(\d+)=(\d+)/).each_with_index do |(i,o), k|
  define_method :"test_#{k+1}" do |p|
    assert { p[i.to_i] == o.to_i }
  end
end

to bring our length down to 147 characters. The best part of this little excursion is that none of the numbers I’ve thrown at you are important, I just like to count things. What really matters is that you now see how 10=55,6=21,83=3486 serves as a complete representation of our original test suite. We want to measure this string’s Kolmogorov complexity.

None of the programs we’ve written to satisfy our test suite will generate the string we’re now after, so we need to wrap them in a loving adapater:

adapter = lambda do |p|
  [10,6,83].map { |i| "#{i}=#{p[i]}" }.join ','
end

Excluding line endings — Do I need me to keep repeating that? Let’s assume it’s implied from here on out — our adapter is 63 characters long. We can now represent one 20 character string as a string of at least 63 characters. In terms of compression, we’re doing it wrong, but fret not, for soon things will get better. In the meantime, we now have a program that takes old programs and turns them into new programs capable of producing the string we seek: $p^{\prime} = A(p)$ . Taking $p_2$ as an example, we can do it all in Ruby with roughly 128 characters. it is clear that if $p \models \sigma$ , then $p^{\prime}$ generates our desired string.

Now comes the improvement to our compression fail: when measuring Kolmogorov complexity, we are free to pick our language. Instead of using Ruby, we could use Python or Pascal. We are also free to pick a superset of Ruby, say Ruby + adapter. We’re going to be little pickier than that, though. Our language of choice is the one where every program written is fed into the adapter. With this restriction, the program:

lambda { |*_| "10=55,6=21,83=3486" }

will not produce the desired string. Instead, it will be wrapped by adapter and evaluate to:

10=10=55,6=21,83=3486,6=10=55,6=21,83=3486,83=10=55,6=21,83=3486

Thus, if our wrapped program, $p^{\prime}$ , produces the desired string, then our original program satisfies the original test suite. Combine this with our earlier statement, and we’ve got an equivalence:

$p \models \sigma \iff p^{\prime} = s,$

where $s$ is our encoded test suite 10=55,6=21,83=3486. Thus, given a minimal description, $d(s)$ , of our encoded test suite $s$ in this adapter wrapped Ruby language, we can infer that $d(s) \models \sigma$ (perhaps with the help of eval.) We also know that $D_{\sigma} \approx d(s)$ , since $d(s)$ is minimal in length. So, $|d(s)|$ gives us our Kolmogorov complexity and our Functional complexity*.

From our previous excursions in counting characters and whitespace (but not newlines!), we see that all of the programs that satisfy the test suite are longer than the 20 characters. Bummer, we still fail at compression. But what happens if we add a few more tests? Let’s expand our encoded test suite to the following:

10=55,6=21,83=3486,99=4950,1019=519690,9001=40513501,15146=114708231

Now our desired string is 68 characters long. We know that $|d(s)| \leq |p_6| = 33$ , and we have finally un-failed at compression! In addition to finding a representation for the string that is half as long, we also kicked program $p_7$ out of the set of programs that satisfy our test suite. Keeping with the spirit of how $p_7$ satisfies the test suite, we can get it back in line with the following change:

p7 = lambda do |k|
  case k
  when 10 then 55
  when 6 then 21
  when 83 then 3486
  when 99 then 4950
  when 1019 then 519690
  when 9001 then 40513501
  when 15146 then 114708231
  end
end

In adjusting $p_7$ to satisfy the new test suite, we have increased its length to 175 characters. So when our test suite was 20 characters long, $p_7$ satisfied it in 81 characters. When the test suite grew by 48 characters, $p_7$ was forced to grow by 94 characters. Changing a test changes a program… I smell a potentially useful metric in there, and we will dig deeper into this next time.

The chosen problem of summing the integers from 1 to $k$ is certainly a trivial one, and because of its simplicity, encoding the test suite as a string was also pretty easy. However, with the right adapter we could encode any test suite as a string even if each test spends a lot of time setting up the initial state before verifying its expectations.

The goal here was to show a relationship between Kolmogorov and Functional complexity measurements, so we tried to keep the components simple and manageable. In doing so, we were able to find a direct relationship between the two by considering a description language that, informally, is the result of augmenting our host language (Ruby) with the expectations of our test suite (this is how the adapter was constructed.) With a dash of mathematics and a pinch of hand-waving, we have shown that if Kolmogorov complexity is a meaningful measurement of a program then so to is Functional complexity modulo a test suite.

A Taste of Things to Come

I had intended to explore the relationship between these measures of complexity and somewhat vague notions such as “readability” and “maintainability” in this article, but it’s already quite a bit longer than I had anticipated. I will definitely be talking about “maintainability” next time: I believe there is a relationship between how much code must change when test suites are modified, but I need some time to think more about the math (I may have to whip out calculus or difference equations.) I would also like to explore “readability,” though I may refer to it as “comprehensibility” instead, by inverting a lot of the work done in this article, but I may have to turn this series of articles into a trilogy to do so. Stay tuned… or don’t, I’m still going to write it anyway!

Foot Noted

Note: Approximate Functional Complexity

I played a little fast and loose with notation earlier. The description string, $d(s)$ , is just that a string. As I mentioned, we may need the help of eval to map it from the world of characters to the domain of programs, so $D_{\sigma}$ may be a few characters longer than $|d(s)|$ . However, the number of additional characters is constant, so I’m okay with saying the two measures of complexity are roughly equivalent. [ jump back ]