# Mathish

the two halves of my tasty brain

## Refactoring OnStomp

Lately, I’ve been giving some thought to a few issues with the OnStomp gem and how I want to address them. I’ll start by over-explaining the issues, and then wrap up with how I plan to address them.

### Handling IO Exceptions

In the base connection class, there’s a fair bit of exception rescuing going on, but there is a problem with it. For example, let’s have a look at connections/base.rb starting at line 209

begin
yield frame if block_given?
end
end
rescue Errno::EINTR, Errno::EAGAIN, Errno::EWOULDBLOCK
# do not
rescue EOFError
triggered_close $!.message rescue Exception # TODO: [omitted for now] triggered_close$!.message, :terminated
raise
end

The first rescue handles exceptions that can be raised when reading data would cause the system to block. We handle this situation by doing nothing and waiting until later to try reading again. The second exception we address is an EOFError, which isn’t necessarily an error. For example, if the client has told the server it is all done and intends to disconnect, the server may shutdown the connection which can cause this exception to be raised. Even when an EOFError is unexpected, it always tells us that it’s time to shutdown the connection and move on.

This brings us to the final rescue block, which handles every other kind of exception. The system responds by shutting down the connection, firing a terminated event, and then re-raising the exception. I chose to re-raise the exception to give additional feedback to developers, which might have been handy if it weren’t for the fact that the exception is being raised within a separate IO processing thread. Thus, the exception can only be rescued when the thread is joined, which doesn’t happen until the developer has called disconnect on the OnStomp::Client object. By that time, it is far too late to handle the error in any meaningful way.

So, it’s pretty obvious that re-raising exceptions in this fashion is useless, now I want to show you why it’s also problematic. Suppose that we’ve got this bit of code:

# Pull messages from a file, database, etc.
messages = get_messages_from_persistent_storage

client.transaction do |t|
messages.each do |msg|
t.send msg.destination, msg.body
# Delete the message from file, database, etc
remove_message_from_persistent_storage m
end
end

client.disconnect

If an Errno::ECONNRESET exception is raised before all of the messages have been sent, the developer won’t know it until client.disconnect has been called. Now, from the server’s perspective everything behaves as expected: the transaction starts with a BEGIN frame, some messages are received, and the connection is closed before a COMMIT frame is received, so the received messages are discarded. The developer, on the other hand, gets screwed: the exception is raised in a separate thread from where the transaction block is evaluated. All of the messages have been deleted from the persistent storage and not one of them has actually been accepted by the server, which is a dick move on my part. Now, to be fair, I never intended the transaction block to work in this way. The intent of transaction was to deal with exceptions generated within the block itself and either commit or abort the transaction automatically. In other words, if remove_message_from_persistent_storage raised an exception, the transaction block would be there to rescue it, send an ABORT frame to the server, and then halt any further transaction processing. My original intentions aside, it sure would be nice if the developer had an easy way to verify that the transaction had been committed before deleting those persisted messages. This is the crux of a comment by celesteking, and I think it’s a pretty good idea.

We could handle this issue with something like the following:

client.on_commit do |commit_frame, *args|
# Delete all messages from file, database, etc
remove_all_messages_from_persistent_storage
end

# Pull messages from a file, database, etc.
messages = get_messages_from_persistent_storage

client.transaction do |t|
messages.each do |msg|
t.send msg.destination, msg.body
end
end

client.disconnect

However, this only works if we’re performing a single transaction. If we have multiple transactions, then we need to keep track of transaction IDs and our on_commit call turns into a giant case statement. Clearly this is a less than desirable solution.

A better approach might take the following form:

# Pull messages from a file, database, etc.
messages = get_messages_from_persistent_storage

client.transaction do |t|
t.on_abort do
# Called when the transaction is explicitly aborted or an IO
# exception prevents sending the final COMMIT frame.
end
t.on_commit do
remove_all_messages_from_persistent_storage
end

messages.each do |msg|
t.send msg.destination, msg.body
end
end

client.disconnect

I suppose these blocks could also be supplied as parameters to the transaction method (note: I’m using a Ruby 1.9 style hash)

client.transaction(on_abort: lambda { ... },
on_commit: lambda { ... }) do |t|
# ...
end

Personally, I prefer the former over the latter, but it requires a bit more effort to implement if I want to guarantee that any on_abort callback will be called even if it is defined after a the ABORT frame has already been generated (or if an IO exception has occurred while processing the transaction.)

Hopefully this illustrates why triggering a terminated event on the connection and re-raising the exception is both a useless and inadequate way of handling errors, and why something better is needed.

### Event Dispatching

The second serious issue is one I’ve touched on before: there are some major issues with how event callbacks are invoked.

1. Not all event callbacks will be invoked in the same thread, so your callbacks may run into synchronization issues.
2. Unless you spend a fair amount of time with the code base, you’ll probably have no idea which thread an event is going to be invoked within.
3. Changing the client’s state in a callback (e.g. re-connecting within the on_connection_terminated event) can produce errors that are spectacularly difficult to trace.
4. IO processing stops until the callbacks are completed. A long-running callback could very easily choke the connection.
5. If an exception is raised within a callback it may percolate up to the threaded processor, which will kill the IO loop and ruin everyone’s day.

I don’t think I really need to explain this problem further, so let’s move on to the last issue.

### The Code Base

There is a whole lot of code within OnStomp that has nothing to do with the actual Stomp protocol, which makes it difficult to figure out exactly what OnStomp is doing. Also, some of the library’s packaging makes no sense to me now – fortunately, that’s mostly my problem and not something anyone else should have to worry about.

### Fixing It

I’ve begun work on the proper fixes for these problems. The first step will be factoring out all the non-blocking IO stuff into a separate gem: io_unblock. Rather than a connection that dispatches events, this will be working with lambdas, procs or blocks that are always invoked from within the IO processor thread. Within OnStomp, these callbacks will not be available to developers, instead they will be used to enqueue events that an event dispatcher will invoke later. I don’t know if a callback invoking, threaded IO-ish object will be of use to anyone else, but as it will in no way be tied to the OnStomp gem or the Stomp protocol, I see no harm in releasing it on its own.

I’m toying with the idea of putting all of the event dispatching code into a separate gem, as well. However, I won’t know if this is warranted until I actually start mucking about with it. In any case, I do plan on running the dispatcher within yet another thread to resolve the issues surrounding the current event dispatch process.

There are a couple of things I need to keep in mind while making these changes. First, being able to safely reconnect within a on_connection_closed event callback should not be the exercise in coding gymnastics that it is right now. Second, I need to preserve the ability developers have to use the before_transmitting or before_<frame command> events to change frames before they are serialized and sent on to the broker. Right now this is trivial because the client triggers those events and after the callbacks complete, it passes the frame off to the connection. If event dispatching is handled within a separate thread, some care must be taken to ensure that the dispatcher has triggered those events before the frames are serialized.

I think that’s enough for now, I’ll post more when I start making some actual headway.

## Learn You a Haskell for Great Good

To be completely honest, I’m not much of a reader. My reading speed lies somewhere between “painfully slow” and “potentially illiterate,” so reading fiction, in particular, is more chore than hobby. I love works of reference when they have well constructed indices or, better yet, hyperlinks between related topics — my bookshelves and wikipedia usage can attest to this. I’m also fascinated with “meta-reading” fiction, where I read summaries, critiques and so forth of some work to gain insights into what it’s all about, but without reading the actual source material. This fascination is probably the only reason I didn’t fail every high school English class. The point of this, which I’m dangerously close to completely losing, is that my thoughts on a book are generally not of much use to anyone else.

With all of that in mind, I will say that Learn You a Haskell for Great Good! is probably the best book I’ve purchased in a while. I’m only into the 6th chapter now — again, painfully slow reading speed — but the tone, narrative, examples and pretty pictures make it an enjoyable read for me. There are externalities working in LYAH’s favor, namely my strong interest in Haskell as it appeals to “mathematics degree” brain, and a recent uneasiness I’ve developed regarding some elements of the Ruby/Rails world; however, I think the book would be just as engaging without these factors.

So, if you’re the kind of person who buys books based upon reviews that could have been written by an 11 year old, pick up a copy of Learn You a Haskell for Great Good! right now. Thank you, Miran Lipovača, for such a unique Haskell introduction.

On the topic of books, but not directly related to Learn You a Haskell, if you want a great introduction to combinatory logic, grab a copy of To Mock a Mockingbird by Raymond Smullyan. It’s entertaining, a fairly easy read and, unfortunately, out of print, so you’ll probably have to buy it second hand.

## Terrible Ideas

While lurking on the #ruby-lang channel on freenode.net, oddmund suggested creating a method_missing handler that would “auto-correct” misspelled method. His suggestion was of course a joke, but it was a pretty good one, so he implemented it. I provided my own implementation, along with some other terrible ideas.

I will never push terrible_things to RubyGems for two reasons:

1. It is a joke, obviously.
2. Its features are incompletely implemented.

The goal of implementing this gem was to have a bit of fun and maybe explore something potentially useful along the way. One useful bit came from Ryan Davis, Gem::Text includes a Levenshtein distance calculator, so that’s handy. The “semi-monadic” treatment of NilClass actually wasn’t all that interesting. The idea has been done a lot, and simply returning itself for method_missing doesn’t make it a real “Maybe monad.” The fun and interesting part has been the lazy evaluation bits. By overloading Object::new, all object instantiation can be evaluated lazily. With a bit of enumerator handling, we can work with infinite lists in Ruby by virtue of lazy evaluation.

Others have done similar things with Ruby, but most of them tend to only defer evaluation once, if further methods are invoked on the lazy “proxy”, then its fulfilled to a real object. To be fair, a useful lazy evaluation library really can’t be both correct and indefinitely lazy in Ruby — special care has to be taken when dealing with methods that mutate their object versus ones that return new objects.

Regardless, the whole exercise has been quite a bit of fun, and being able to transform, filter, and take from infinite lists is pretty sweet. I can definitely think of a cleaner implementation than the one derived from lazy evaluation, though it ultimately relies on the same trick of using Enumerator#next so that finite numbers of elements can be plucked from these lists that would be impossible to directly compute.

This post is a bit light, both in length and in technical content, my apologies. I would like to say that so far I’m impressed with so-called “e-cigarettes.” They still provide nicotine, so I doubt they’ll be much help in breaking my addiction, but at least I’m not getting my fix by breathing deeply of burning organic matter. Now if only I could get nicotine in an easily digested gel tablet.

## Bayesian Classification

### A Brief Distraction

After using the classifier I originally laid out in this post, I discovered that my method of calculating $P(D)$ was very flawed. I have made the appropriate revisions.

### The Overview

So, as everyone knows by now, Bayes’ Theorem is expressed as:

What I’m going to talk about isn’t new or novel, but it’s something that interests me, so there. I want to look at using Bayes’ theorem to classify documents. This has been done to death with regard to classifying spam, but I’m going to take a more general look. For our purposes, a document is a sequence of terms such as: “This cat isn’t made of butterscotch”; a paragraph, or the HTML source of this page. The idea is to classify a bunch of documents by hand to “train” our classifier so that when new documents come along, we can ask the classifier where they belong. One way to accomplish this is through the use of Bayes’ theorem.

### The Real Work

To start, we’ll need a way to tokenize a document into terms. The tokenizer’s implementation will depend upon the type of documents under consideration. To keep things somewhat simple, let’s assume we’re working with English phrases; thus, the terms produced by our tokenizer are simply words. Let’s suppose we have a set of categories, $\mathfrak{C} = \{ C_1, C_2, \ldots , C_n \}$. So, what we’re then after is the probability that a document belongs in a particular category, $C_m$, given these extracted terms:

We’re letting $D$ represent our document, and then expanding the document into its individual terms, $T_1,\ldots,T_n$. You’ll notice that we’ve skipped the expansion of $P(D)$, the justification for this will be explained shortly.

We can expand this using the rules of conditional probability:

This formulation is available on the Wikipedia page on Naive Bayes classifiers. As we are taking a naive approach, we assume there is no interdependence between terms, ie:

Now, when it comes to words, this is a obviously poor assumption. For instance, it is far more likely that the word “poor” would appear before “assumption” than the word “Faustian” in day-to-day phrases. However, we’re going to go ahead with the naive assumption for now, which simplifies our equation:

For any term, $T_i$, we take $P(T_i \mid C_m)$ to mean the probability of that term occurring given we’re working with category $C_m$, and we calculate it thusly:

where $t(x, y)$ is the number of times term $x$ occurs in category $y$. This brings us to $P(C_m)$, meaning the probability of choosing category $C$ without taking the document into consideration. There are a number of ways we could calculate $P(C_m)$, such as assuming that all categories are equally likely:

where $|\mathfrak{C}|$ denotes the number of categories in our classifier. While this works, a better measure of might take into account the number of documents a category has been trained with:

where $d(y)$ indicates the number of documents belonging to category $y$. All other things being equally, $P_2(C_m)$ will favor categories that have been fed the most documents. If each category is trained with the same number of documents, then we’re back to the uniform probability given earlier.

Basing $P(C_m)$ on document counts does pose a problem. Suppose the following documents belonged to the same category:

this document is pretty small
small this document is pretty
pretty small this document is
is pretty small this document
document is pretty small this

These are all distinct permutations of the same string, and thus distinct documents. The problem is that the classifier we are in the process of constructing deals with terms. Our assumption of the independence between terms means that their ordering has no bearing on the classifier. From this position, we could argue that each of the five documents listed above are actually the same document as far as our classifier is concerned, and if we relied on document counts for $P(C_m)$, this set of documents would give some category an unfair bias. So, let’s consider another alternative for calculating $P(C_m)$:

This approach is calculating the probability of a given category as the sum of occurrences of all terms in the category and the sum of occurrences of all terms across all categories. Unfortunately, this form may also be unfairly biased by document repetition (e.g. if we fed a category the five documents shown earlier.) Let’s consider one more alternative:

where $u(x, y)$ and $u^*(x)$ are defined as:

These definitions may seem complicated, but they are both very simple concepts: $u(x,y)$ is just counting the number of distinct terms in category $y$ and $u^*(x)$ is counting the total number of distinct terms from all categories.

While each of the four possibilities presented for $P(C_m)$ are very different from one another, they are all valid approaches. Personally, I feel $P_1(C_m)$ is too naive in its assumption that all categories are equally likely to be selected, and as I intend on keeping track of only terms and categories, I can also rule out $P_2(C_m)$. As mentioned earlier, $P_3(C_m)$ can be biased by repeated terms, so I’m going to opt for $P(C) = P_4(C_m)$. Again, I wish to stress that this is a personal choice, your Bayesian needs may differ from mine.

With that all taken care of, we can now expand our classifier to:

Our problem is now just one of counting. The expression may look horrible, but that is largely a result of each expression being explicitly defined. Now that we know what the expressions mean, let’s do an informal clean up of this warlock by substituting simpler symbols for our sweet mess of expressions. Before we do, we need to take a closer look at $P(D)$, which in earlier revisions of this post was explicitly, and incorrectly, defined. For a given document, $P(D)$ will be the same, regardless of the category under consideration. When our classifier is asked to classify a document, it will iterate over each of its categories and perform this calculation, which each of the final category probabilities will be scaled by this same term. So, our first simplification will be to substitute $P(D)$ with $\delta$, which we will not explicitly define as we will eventually discard it.

Now, let’s let $\upsilon$ be the number of unique terms in the category we are considering and $\Upsilon$ be the number of unique terms in all of our categories. Finally, we let $g(T_i)$ be the number of occurrences of term $T_i$ in the current category and $G$ be the occurrences of all terms in all of our categories. With these simplifications, our equation becomes:

Hopefully it is now painfully obvious that each category will be scaled by a common factor, namely $\frac{1}{\delta}$. The purpose of our classifier is to pick the category a given document most likely belongs in, and if they are all scaled by this common term we can disregard it. In an effort to preserve mathematical accuracy, I have opted to multiply both sides of the equation by $\delta$.

All we’ve gained from this simplification is some mental manageability. I wanted to derive the explicit calculations involved in Bayesian classification, but products of fractions containing summations don’t tend to be very memorable. Any additional calculations we may need to perform on the explicit form would have made things even messier, which serves as a nice segue into our next step.

In a world where we are free to perform these calculations without loss of precision, this expression is just fine. However, given that we are multiplying a series of numbers all in the range $[0, 1]$, a computer may eventually round a product to 0, and totally ruin our day. We can correct this with a little help from a logarithm*:

Is this logarithm business really necessary? Consider a document with 324 unique terms where, thanks the awesome power of contrived examples, $\frac{g(T_i)}{G} = 0.1$ for each term. When we take the product of all these terms, we end up with $0.1^{324}$. That’s a pretty small number, so small that Ruby mistakes it for 0:

0.1 ** 324 # => 0.0

Now, we did factor $\frac{1}{G^n}$ out of our product earlier, but $\frac{1}{G^{324}}$ is also bound to be very small. This post alone is up to 608 unique terms, just to give an example of how likely it is that our classifier might experience floating-point underflow.

### Digging Deeper

There are a few edge cases to consider, here are five I can think of off of the top of my head:

1. What if there are no known terms for a particular category?
2. What if there are no known terms for the classifier as a whole?
3. What if a term is unknown to a particular category?
4. What if a term is unknown to the classifier as a whole?
5. How do we handle repeated terms (i.e. $T_i = T_j$) in a document that our classifier is attempting to classify for us?

We can easily address the first two cases. If any category has not been trained, we assume $P(C | T_1,\ldots,T_n) = 0$, thus if none of our categories have been trained (case #2), we assume all categories are equally likely (or unlikely) and either return an arbitrary one or none at all. The third and fourth cases are consequences of the fact that we are estimating probabilities. Through training, our classifier is building approximations for these various probabilities. It is entirely possible, perhaps even very likely, that we will encounter documents we wish to classify containing terms that were not included in our training. If we rely solely on the calculations presented here, a foreign term will result in zeroing the overall probability or produce an error — either from a division by zero or from the undefined $\log(0)$ calculation. To prevent garbage results, we will need to assign a non-zero probability, $\epsilon$, to terms unknown to a particular category.

The fifth case deserves its own paragraph. To handle this situation, we defer to simple rules for conditional probabilities. Consider the following:

If $B$ and $C$ are describing the same event, then the expression simplifies to:

The same holds true for documents we wish to classify. We could argue that repeated terms do not describe the same event because the terms are occurring in distinct positions; however, we have not worked positional information into our classifier. It is important to note that when we are training our classifier with documents, we do take repeated terms into consideration.

Now that virtually every email client has, at one point or another, made use of some form of Naive Bayes classification, I realize it’s no longer the hot topic it once was. However, while reviewing the Ruby classifier, I discovered that their Bayes implementation was wrong. This was independently confirmed by Brian Muller. While working on a fork of it with Zeke, I discovered that while the approach is well known, there are nuances that can significantly impact the results — examples being the choice in $P(C)$ and the default probability, $\epsilon$.

### Foot Noted

#### Note: Logarithms

I would like to thank Brian Muller for explaining the use of logarithms in a Naive Bayes classifier he’s developing in Ruby, even if we don’t share the same views on $P(C)$. My Numerical Analysis professors would probably be displeased that I forgot about this trick. [ jump back ]

## Properties of Code: Challenges of Readability

I’ve written the first follow-up to Properties of Code: Functional Complexity about 3 times now and have scrapped it 3 times. Every attempt has been less “mathy” than the start of the series; each has contained an interesting point or two, but those nuggets get buried under a mountain of meandering.

The biggest hinderance has been that there has been too much “feeling” and “guess work” in my efforts. My desire to show a relationship between Functional (or, more precisely, Satisfaction) complexity and Kolmogorov complexity provided a pretty clear path for me to follow. Readability is a bit more vague.

What I need is more empirical data to direct my thoughts and efforts. I have some hypotheses, some suspicions and very little data. Espousing a position without the backing of mathematical formalisms or strong data makes me uneasy, and that tends to lead to a rambling and incoherent narrative. I do find something amusing in the fact that my efforts at writing on readability have, thus far, been virtually unreadable.

As I am a single-task kind of guy, I feel compelled to work out my thoughts on readability before moving on to anything else. However, it is pretty clear to me that doing so will totally stall this site. So, rather than fixate on writing a narrative around claims I cannot support at this time, I’m going to list my thoughts and start collecting data. Once I have some meaningful data to work with, I’ll dig into it and see what comes out. If I find support for my claims, great. If the data refutes some or all of my thoughts, even better. This will give me something to work from and hopefully produce a much more meaningful essay than the shit I’ve attempted to cobble together so far.

At any rate, here are my suspicions, naked and, for all intents and purposes, completely unfounded at this time:

1. Any link between Kolmogorov Complexity and Readability is a trivial one (eg: a program has high Kolmogorov complexity and is unreadable because it is nearly a random string of characters.)
2. The link between a “readable” story and “readable” code is probably stronger, but the two are not equivalent.
3. The visualization of quantitative data has implications in readability of code, and potentially the readability of programming languages.
4. Code produces two products: a solution to some problem to be consumed by users, and the code itself, to be consumed by developers. Writing code that only solves the problem is a partial victory.

Now that I’ve put a few thoughts out there, I’m mentally free to think about, and write on other topics, while I collect data and look for evidence.