<h1><a href="http://mathish.com/">Mathish</a>, by Ian D. Eccles</h1>
<h2 id="refactoring-onstomp"><a href="http://mathish.com/2012/01/08/io-unblock">Refactoring OnStomp</a> (2012-01-08)</h2>
<p>Lately, I’ve been giving some thought to a few issues with
the <a href="https://github.com/meadvillerb/onstomp">OnStomp</a> gem and how I want
to address them. I’ll start by over-explaining the issues, and then
wrap up with my plans for fixing them.</p>
<h3 id="handling-io-exceptions">Handling IO Exceptions</h3>
<p>In the base connection class, there’s a fair bit of exception rescuing
going on, but there is a problem with it. For example, let’s have a
look at <code>connections/base.rb</code>
<a href="https://github.com/meadvillerb/onstomp/blob/master/lib/onstomp/connections/base.rb#L209">starting at line 209</a></p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">begin</span>
<span class="k">if</span> <span class="n">data</span> <span class="o">=</span> <span class="n">read_nonblock</span>
<span class="vi">@read_buffer</span> <span class="o"><<</span> <span class="n">data</span>
<span class="vi">@last_received_at</span> <span class="o">=</span> <span class="no">Time</span><span class="o">.</span><span class="n">now</span>
<span class="n">serializer</span><span class="o">.</span><span class="n">bytes_to_frame</span><span class="p">(</span><span class="vi">@read_buffer</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">frame</span><span class="o">|</span>
<span class="k">yield</span> <span class="n">frame</span> <span class="k">if</span> <span class="nb">block_given?</span>
<span class="n">client</span><span class="o">.</span><span class="n">dispatch_received</span> <span class="n">frame</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">rescue</span> <span class="no">Errno</span><span class="o">::</span><span class="no">EINTR</span><span class="p">,</span> <span class="no">Errno</span><span class="o">::</span><span class="no">EAGAIN</span><span class="p">,</span> <span class="no">Errno</span><span class="o">::</span><span class="no">EWOULDBLOCK</span>
<span class="c1"># do not</span>
<span class="k">rescue</span> <span class="no">EOFError</span>
<span class="n">triggered_close</span> <span class="vg">$!</span><span class="o">.</span><span class="n">message</span>
<span class="k">rescue</span> <span class="no">Exception</span>
<span class="c1"># TODO: [omitted for now]</span>
<span class="n">triggered_close</span> <span class="vg">$!</span><span class="o">.</span><span class="n">message</span><span class="p">,</span> <span class="ss">:terminated</span>
<span class="k">raise</span>
<span class="k">end</span></code></pre></div>
<p>The first <code>rescue</code> handles exceptions that can be raised when reading
data would cause the system to block. We handle this situation by doing
nothing and waiting until later to try reading again. The second
exception we address is an <code>EOFError</code>, which isn’t necessarily
an error. For example, if the client has told the server it is all done
and intends to disconnect, the server may shutdown the connection which
can cause this exception to be raised. Even when an EOFError is
unexpected, it always tells us that it’s time to shut down the connection
and move on.</p>
<p>This brings us to the final <code>rescue</code> block, which handles every other
kind of exception. The system responds by shutting down the connection,
firing a <code>terminated</code> event, and then re-raising the exception. I chose
to re-raise the exception to give additional feedback to developers,
which might have been handy if it weren’t for the fact that the
exception is being raised within a separate IO processing thread.
Thus, the exception can only be rescued when the thread is joined, which
doesn’t happen until the developer has called <code>disconnect</code> on the
<code>OnStomp::Client</code> object. By that time, it is far too late to handle
the error in any meaningful way.</p>
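<p>The problem is easy to reproduce with plain threads; here is a toy sketch (not OnStomp code) of an exception raised in a background thread surfacing only at <code>join</code>:</p>

```ruby
# A background thread standing in for the IO processor.
io_thread = Thread.new do
  sleep 0.01
  raise Errno::ECONNRESET, "simulated connection reset"
end

sleep 0.05  # the main thread keeps working, unaware of the failure

caught = nil
begin
  io_thread.join  # only here does the exception resurface
rescue Errno::ECONNRESET => e
  caught = e      # far too late to salvage any in-flight work
end
```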
<p>So, it’s pretty obvious that re-raising exceptions in this fashion is
useless; now I want to show you why it’s also problematic. Suppose that
we’ve got this bit of code:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Pull messages from a file, database, etc.</span>
<span class="n">messages</span> <span class="o">=</span> <span class="n">get_messages_from_persistent_storage</span>
<span class="n">client</span><span class="o">.</span><span class="n">transaction</span> <span class="k">do</span> <span class="o">|</span><span class="n">t</span><span class="o">|</span>
<span class="n">messages</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
<span class="n">t</span><span class="o">.</span><span class="n">send</span> <span class="n">msg</span><span class="o">.</span><span class="n">destination</span><span class="p">,</span> <span class="n">msg</span><span class="o">.</span><span class="n">body</span>
<span class="c1"># Delete the message from file, database, etc</span>
<span class="n">remove_message_from_persistent_storage</span> <span class="n">m</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">disconnect</span></code></pre></div>
<p>If an <code>Errno::ECONNRESET</code> exception is raised before all of the
messages have been sent, the developer won’t know it until
<code>client.disconnect</code> has been called. Now, from the server’s perspective
everything behaves as expected: the transaction starts with a <code>BEGIN</code>
frame, some messages are received, and the connection is closed before
a <code>COMMIT</code> frame is received, so the received messages are discarded.
The developer, on the other hand, gets screwed: the exception is raised
in a separate thread from where the <code>transaction</code> block is evaluated.
All of the messages have been deleted from the persistent storage and
not one of them has actually been accepted by the server, which is a
dick move on my part. Now, to be fair, I never intended the
<code>transaction</code> block to work in this way. The intent of <code>transaction</code>
was to deal with exceptions generated within the block itself and
either commit or abort the transaction automatically. In other words, if
<code>remove_message_from_persistent_storage</code> raised an exception, the
<code>transaction</code> block would be there to rescue it, send an <code>ABORT</code> frame
to the server, and then halt any further transaction processing. My
original intentions aside, it sure would be nice if the developer had
an easy way to verify that the transaction had been committed before
deleting those persisted messages. This is the crux of
<a href="https://github.com/meadvillerb/onstomp/issues/13#issuecomment-3251571">a comment by celesteking</a>,
and I think it’s a pretty good idea.</p>
<p>We could handle this issue with something like the following:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">client</span><span class="o">.</span><span class="n">on_commit</span> <span class="k">do</span> <span class="o">|</span><span class="n">commit_frame</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="o">|</span>
<span class="c1"># Delete all messages from file, database, etc</span>
<span class="n">remove_all_messages_from_persistent_storage</span>
<span class="k">end</span>
<span class="c1"># Pull messages from a file, database, etc.</span>
<span class="n">messages</span> <span class="o">=</span> <span class="n">get_messages_from_persistent_storage</span>
<span class="n">client</span><span class="o">.</span><span class="n">transaction</span> <span class="k">do</span> <span class="o">|</span><span class="n">t</span><span class="o">|</span>
<span class="n">messages</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
<span class="n">t</span><span class="o">.</span><span class="n">send</span> <span class="n">msg</span><span class="o">.</span><span class="n">destination</span><span class="p">,</span> <span class="n">msg</span><span class="o">.</span><span class="n">body</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">disconnect</span></code></pre></div>
<p>However, this only works if we’re performing a single transaction. If
we have multiple transactions, then we need to keep track of transaction
IDs and our <code>on_commit</code> call turns into a giant <code>case</code> statement.
Clearly this is a less than desirable solution.</p>
<p>A better approach might take the following form:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1"># Pull messages from a file, database, etc.</span>
<span class="n">messages</span> <span class="o">=</span> <span class="n">get_messages_from_persistent_storage</span>
<span class="n">client</span><span class="o">.</span><span class="n">transaction</span> <span class="k">do</span> <span class="o">|</span><span class="n">t</span><span class="o">|</span>
<span class="n">t</span><span class="o">.</span><span class="n">on_abort</span> <span class="k">do</span>
<span class="c1"># Called when the transaction is explicitly aborted or an IO</span>
<span class="c1"># exception prevents sending the final COMMIT frame.</span>
<span class="k">end</span>
<span class="n">t</span><span class="o">.</span><span class="n">on_commit</span> <span class="k">do</span>
<span class="n">remove_all_messages_from_persistent_storage</span>
<span class="k">end</span>
<span class="n">messages</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">msg</span><span class="o">|</span>
<span class="n">t</span><span class="o">.</span><span class="n">send</span> <span class="n">msg</span><span class="o">.</span><span class="n">destination</span><span class="p">,</span> <span class="n">msg</span><span class="o">.</span><span class="n">body</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">disconnect</span></code></pre></div>
<p>I suppose these blocks could also be supplied as parameters to
the <code>transaction</code> method (note: I’m using a Ruby 1.9 style hash)</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">client</span><span class="o">.</span><span class="n">transaction</span><span class="p">(</span><span class="ss">on_abort</span><span class="p">:</span> <span class="nb">lambda</span> <span class="p">{</span> <span class="o">.</span><span class="n">.</span><span class="o">.</span> <span class="p">},</span>
<span class="ss">on_commit</span><span class="p">:</span> <span class="nb">lambda</span> <span class="p">{</span> <span class="o">.</span><span class="n">.</span><span class="o">.</span> <span class="p">})</span> <span class="k">do</span> <span class="o">|</span><span class="n">t</span><span class="o">|</span>
<span class="c1"># ...</span>
<span class="k">end</span></code></pre></div>
<p>Personally, I prefer the former over the latter, but it requires a bit
more effort to implement if I want to guarantee that any <code>on_abort</code>
callback will be called even if it is defined after the <code>ABORT</code> frame
has already been generated (or if an IO exception has occurred while
processing the transaction).</p>
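<p>To make that late-registration guarantee concrete, here is a minimal sketch (hypothetical names, not OnStomp’s API) of a transaction object that records its outcome so callbacks added after the fact still fire:</p>

```ruby
# Hypothetical sketch: a transaction that remembers whether it committed
# or aborted, so a callback registered late still gets invoked.
class SketchTransaction
  def initialize
    @state     = :pending                      # :pending, :committed, :aborted
    @callbacks = Hash.new { |h, k| h[k] = [] }
  end

  def on_commit(&blk)
    register(:committed, blk)
  end

  def on_abort(&blk)
    register(:aborted, blk)
  end

  def commit!
    settle(:committed)
  end

  def abort!
    settle(:aborted)
  end

  private

  # Fire immediately if the outcome is already known, otherwise queue.
  def register(state, blk)
    if @state == state
      blk.call
    else
      @callbacks[state] << blk
    end
  end

  def settle(state)
    @state = state
    @callbacks[state].each(&:call)
  end
end
```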
<p>Hopefully this illustrates why triggering a <code>terminated</code> event on the
connection and re-raising the exception is both a useless and
inadequate way of handling errors, and why something better is needed.</p>
<h3 id="event-dispatching">Event Dispatching</h3>
<p>The second serious issue is
<a href="/2011/04/24/events-on-a-train.html">one I’ve touched on before</a>:
there are some major issues with how event callbacks are invoked.</p>
<ol>
<li>Not all event callbacks will be invoked in the same thread, so your
callbacks may run into synchronization issues.</li>
<li>Unless you spend a fair amount of time with the code base, you’ll
probably have no idea which thread an event is going to be invoked
within.</li>
<li>Changing the client’s state in a callback (e.g. re-connecting
within the <code>on_connection_terminated</code> event) can produce
<a href="https://github.com/meadvillerb/onstomp/issues/12">errors that are spectacularly difficult to trace</a>.</li>
<li>IO processing stops until the callbacks are completed. A
long-running callback could very easily choke the connection.</li>
<li>If an exception is raised within a callback it may
percolate up to the threaded processor, which will kill the IO loop
and ruin everyone’s day.</li>
</ol>
<p>I don’t think I really need to explain this problem further, so let’s
move on to the last issue.</p>
<h3 id="the-code-base">The Code Base</h3>
<p>There is a whole lot of code within OnStomp that has <em>nothing</em> to do
with the actual Stomp protocol, which makes it difficult to figure out
exactly what OnStomp is doing. Also, some of the library’s packaging
makes no sense to me now – fortunately, that’s mostly my problem and
not something anyone else should have to worry about.</p>
<h3 id="fixing-it">Fixing It</h3>
<p>I’ve begun work on the proper fixes for these problems. The first step
will be factoring out all the non-blocking IO stuff into a separate
gem: <a href="https://github.com/iande/io_unblock">io_unblock</a>. Rather than
a connection that dispatches events, <code>io_unblock</code> will work with
lambdas, procs, or blocks that are always invoked from within
the IO processor thread. Within OnStomp, these callbacks will not be
exposed to developers; instead, they will be used to enqueue events
that an event dispatcher will invoke later. I don’t know if a callback
invoking, threaded IO-ish object will be of use to anyone else, but
as it will in no way be tied to the OnStomp gem or the Stomp protocol,
I see no harm in releasing it on its own.</p>
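<p>The enqueue-then-dispatch split can be sketched with a plain <code>Queue</code> (a toy illustration, not io_unblock’s actual code): the IO thread’s callbacks only push events, and a single dispatcher thread pops and handles them, so a slow user callback can’t stall IO processing.</p>

```ruby
events   = Queue.new
received = []

# Dispatcher thread: the only place user-facing callbacks ever run.
dispatcher = Thread.new do
  while (event = events.pop)  # nil acts as a shutdown sentinel
    received << event
  end
end

# Stand-in for the IO processor: its callbacks enqueue, never invoke.
io_thread = Thread.new do
  3.times { |i| events << [:frame_received, i] }
  events << nil
end

io_thread.join
dispatcher.join
# received now holds all three events, handled on one thread, in order
```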
<p>I’m toying with the idea of putting all of the event dispatching code
into a separate gem, as well. However, I won’t know if this is
warranted until I actually start mucking about with it. In any case,
I do plan on running the dispatcher within yet another thread to
resolve the issues surrounding the current event dispatch process.</p>
<p>There are a couple of things I need to keep in mind while making these
changes. First, being able to safely reconnect within a
<code>on_connection_closed</code> event callback should not be the exercise in
coding gymnastics that it is right now. Second, I need to preserve
the ability developers have to use the <code>before_transmitting</code> or
<code>before_&lt;frame command&gt;</code> events to change frames before they are
serialized and sent on to the broker. Right now this is trivial because
the client triggers those events and after the callbacks complete, it
passes the frame off to the connection. If event dispatching is
handled within a separate thread, some care must be taken to ensure
that the dispatcher has triggered those events before the frames are
serialized.</p>
<p>I think that’s enough for now; I’ll post more when I start making some
actual headway.</p>
<h2 id="learn-you-a-haskell"><a href="http://mathish.com/2011/06/07/learn-you-a-haskell">Learn You a Haskell for Great Good</a> (2011-06-07)</h2>
<p>To be completely honest, I’m not much of a reader. My reading speed lies
somewhere between “painfully slow” and “potentially illiterate,” so reading
fiction, in particular, is more chore than hobby. I love works of reference
when they have well constructed indices or, better yet, hyperlinks
between related topics — my bookshelves and wikipedia usage can
attest to this. I’m also fascinated with “meta-reading” fiction, where I read
summaries, critiques and so forth of some work to gain insights into what
it’s all about, but without reading the actual source material. This
fascination is probably the only reason I didn’t fail every high school
English class. The point of this, which I’m dangerously close to completely
losing, is that my thoughts on a book are generally not of much use to anyone
else.</p>
<p>With all of that in mind, I will say that
<em><a href="http://learnyouahaskell.com/">Learn You a Haskell for Great Good!</a></em> is
probably the best book I’ve purchased in a while. I’m only into the 6th
chapter now — again, painfully slow reading speed — but the
tone, narrative, examples and pretty pictures make it an enjoyable read for
me. There are externalities working in <em>LYAH</em>’s favor, namely my strong
interest in Haskell as it appeals to my “mathematics degree” brain, and a recent
uneasiness I’ve developed regarding some elements of the Ruby/Rails world;
however, I think the book would be just as engaging without these factors.</p>
<p>So, if you’re the kind of person who buys books based upon reviews that
could have been written by an 11-year-old, <a href="http://nostarch.com/lyah.htm">pick up a copy</a>
of <em>Learn You a Haskell for Great Good!</em> right now.
Thank you, Miran Lipovača, for such a unique Haskell introduction.</p>
<p>On the topic of books, but not directly related to <em>Learn You a Haskell</em>,
if you want a great introduction to combinatory logic, grab a copy of
<em><a href="http://www.amazon.com/Mock-Mockingbird-Raymond-Smullyan/dp/0192801422/ref=sr_1_1?ie=UTF8&s=books&qid=1307464282&sr=8-1">To Mock a Mockingbird</a></em>
by Raymond Smullyan. It’s entertaining, a fairly easy read and, unfortunately,
out of print, so you’ll probably have to buy it second hand.</p>
<h2 id="terrible-ideas"><a href="http://mathish.com/2011/06/05/terrible-ideas">Terrible Ideas</a> (2011-06-05)</h2>
<p>While lurking on the #ruby-lang channel on freenode.net,
<a href="https://github.com/jraregris">oddmund</a> suggested creating a <code>method_missing</code>
handler that would “auto-correct” misspelled methods. His suggestion was of
course a joke, but it was a pretty good one, so
<a href="https://github.com/jraregris/torispelling">he implemented it</a>. I provided
my own implementation, along with some other
<a href="https://github.com/iande/terrible_ideas">terrible ideas</a>.</p>
<p>I will never push <code>terrible_ideas</code> to RubyGems for two reasons:</p>
<ol>
<li>It is a joke, obviously.</li>
<li>Its features are incompletely implemented.</li>
</ol>
<p>The goal of implementing this gem was to have a bit of fun and maybe explore
something potentially useful along the way. One useful bit came from
<a href="http://blog.zenspider.com/">Ryan Davis</a>: <code>Gem::Text</code> includes a Levenshtein
distance calculator, so that’s handy. The “semi-monadic” treatment of
<code>NilClass</code> actually wasn’t all that interesting. The idea has been done
a lot, and simply returning <code>self</code> from <code>method_missing</code> doesn’t make it
a real “Maybe monad.” The fun and interesting part has been the lazy
evaluation bits. By overloading <code>Object::new</code>, all object instantiation can
be evaluated lazily. With a bit of enumerator handling, we can work with
infinite lists in Ruby by virtue of lazy evaluation. </p>
<p><a href="https://rubygems.org/search?utf8=%E2%9C%93&query=lazy">Others have done similar things</a>
with Ruby, but most of them tend to defer evaluation only once: if further
methods are invoked on the lazy “proxy”, it is fulfilled to a real object.
To be fair, a useful lazy evaluation library really can’t be both correct and
indefinitely lazy in Ruby — special care has to be taken when dealing
with methods that mutate their object versus ones that return new objects.</p>
<p>Regardless, the whole exercise has been quite a bit of fun, and being able
to transform, filter, and take from infinite lists is pretty sweet. I can
definitely think of a cleaner implementation than the one derived from lazy
evaluation, though it ultimately relies on the same trick of using
<code>Enumerator#next</code> so that finite numbers of elements can be plucked from
these lists that would be impossible to directly compute.</p>
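<p>The infinite-list trick can be seen with nothing but core Ruby; here is a sketch using <code>Enumerator</code> (and <code>Enumerator::Lazy</code>, available since Ruby 2.0):</p>

```ruby
# An infinite list of naturals; elements are produced only on demand,
# so filtering and mapping never try to realize the whole list.
naturals = Enumerator.new do |yielder|
  n = 0
  loop { yielder << n; n += 1 }
end

# Take the squares of the first five even naturals.
squares_of_evens = naturals.lazy.select(&:even?).map { |n| n * n }.first(5)
# => [0, 4, 16, 36, 64]
```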
<p>This post is a bit light, both in length and in technical content, my
apologies. I would like to say that so far I’m impressed with so-called
“e-cigarettes.” They still provide nicotine, so I doubt they’ll be much help
in breaking my addiction, but at least I’m not getting my fix by breathing
deeply of burning organic matter. Now if only I could get nicotine in an
easily digested gel tablet.</p>
<h2 id="bayesian-classification"><a href="http://mathish.com/2011/05/26/bayes-classifying">Bayesian Classification</a> (2011-05-26)</h2>
<h3 id="a-brief-distraction">A Brief Distraction</h3>
<p>After using the classifier I originally laid out in this post, I
discovered that my method of calculating \(P(D)\) was very flawed.
I have made the appropriate revisions.</p>
<h3 id="the-overview">The Overview</h3>
<p>So, as everyone knows by now, <a href="http://en.wikipedia.org/wiki/Bayes'_theorem">Bayes’ Theorem</a>
is expressed as:</p>
\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \]
<p>What I’m going to talk about isn’t new or novel, but it’s something that
interests me, so there. I want to look at using Bayes’ theorem to
classify <dfn>documents</dfn>. This has been done to death with regard
to classifying spam, but I’m going to take a more general look. For our
purposes, a <dfn>document</dfn> is a sequence of terms, such as
“This cat isn’t made of butterscotch,” a paragraph, or the HTML source of this
page. The idea is to classify a bunch of documents by hand to “train” our
classifier so that when new documents come along, we can ask the classifier
where they belong. One way to accomplish this is through the use of Bayes’
theorem.</p>
<h3 id="the-real-work">The Real Work</h3>
<p>To start, we’ll need a way to tokenize a document into <dfn>terms</dfn>. The
tokenizer’s implementation will depend upon the type of documents under
consideration. To keep things somewhat simple, let’s assume we’re working
with English phrases; thus, the terms produced by our tokenizer are simply
words. Let’s suppose we have a set of categories,
\(\mathfrak{C} = \{ C_1, C_2, \ldots , C_n \}\).
So, what we’re then after is the probability that a document belongs
in a particular category, \(C_m\), given these extracted terms:</p>
\[\begin{aligned}
P(C_m \mid D) &= \frac{P(C_m) P(D \mid C_m)}{P(D)} \\
P(C_m \mid T_1,\ldots,T_n) &= \frac{P(C_m) P(T_1,\ldots,T_n \mid C_m)}{P(D)}
\end{aligned}\]
<p>We’re letting \(D\) represent our <dfn>document</dfn>, and then
expanding the document into its individual terms, \(T_1,\ldots,T_n\).
You’ll notice that we’ve skipped the expansion of \(P(D)\); the
justification for this will be explained shortly.</p>
<p>We can expand this using the rules of conditional probability:</p>
\[\begin{aligned}
P(C_m \mid T_1, \ldots, T_n) &= \frac{P(C_m) P(T_1, \ldots, T_n \mid C_m)}{P(D)} \\
&= \frac{P(C_m) P(T_1 \mid C_m) P(T_2, \ldots, T_n \mid C_m, T_1)}{P(D)} \\
&= \frac{P(C_m) P(T_1 \mid C_m) P(T_2 \mid C_m, T_1) P(T_3, \ldots, T_n \mid C_m, T_1, T_2)}{P(D)} \\
&= \frac{P(C_m) P(T_1 \mid C_m) P(T_2 \mid C_m, T_1) \ldots P(T_n \mid C_m, T_1, \ldots, T_{n-1})}{P(D)}
\end{aligned}\]
<p>This formulation is available on the Wikipedia page on <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes classifiers</a>.
As we are taking a naive approach, we assume there is no interdependence
between terms, i.e.:</p>
\[ P(T_i \mid T_j) = P(T_i) \]
<p>Now, when it comes to words, this is an obviously poor assumption. For instance, it is far
more likely that the word “poor” would appear before “assumption” than the
word “Faustian” in day-to-day phrases. However, we’re going to go ahead with
the naive assumption for now, which simplifies our equation:</p>
\[\begin{aligned}
P(C_m \mid T_1, \ldots, T_n) &= \frac{P(C_m) P(T_1 \mid C_m) P(T_2 \mid C_m)
\ldots P(T_n \mid C_m)}{P(D)} \\
&= \frac{P(C_m)}{P(D)} \prod_{i=1}^n P(T_i \mid C_m)
\end{aligned}\]
<p>For any term, \(T_i\), we take \(P(T_i \mid C_m)\) to
mean the probability of that term occurring given we’re working with
category \(C_m\), and we calculate it thusly:</p>
\[ P(T_i \mid C_m) = \frac{t(T_i, C_m)}{\sum_k t(T_k,C_m)} \]
<p>where \(t(x, y)\) is the number of times term \(x\)
occurs in category \(y\). This brings us to \(P(C_m)\),
meaning the probability of choosing category \(C_m\) without taking
the document into consideration. There are a number of ways we could
calculate \(P(C_m)\), such as assuming that all categories are
equally likely:</p>
\[ P_1(C_m) = \frac{1}{|\mathfrak{C}|} \]
<p>where \(|\mathfrak{C}|\) denotes the number of categories in our
classifier. While this works, a better measure might take into account the
number of documents a category has been trained with:</p>
\[ P_2(C_m) = \frac{d(C_m)}{\sum_k d(C_k)} \]
<p>where \(d(y)\) indicates the number of documents belonging to
category \(y\). All other things being equal,
\(P_2(C_m)\) will favor categories that have been fed the most
documents. If each category is trained with the same number of documents,
then we’re back to the uniform probability given earlier.</p>
<p>Basing \(P(C_m)\) on document counts does pose a problem. Suppose
the following documents belonged to the same category:</p>
<pre><code>this document is pretty small
small this document is pretty
pretty small this document is
is pretty small this document
document is pretty small this
</code></pre>
<p>These are all distinct permutations of the same string, and thus distinct
documents. The problem is that the classifier we are in the process of
constructing deals with terms. Our assumption of the independence between
terms means that their ordering has no bearing on the classifier. From this
position, we could argue that each of the five documents listed above are
actually the same document as far as our classifier is concerned, and if we
relied on document counts for \(P(C_m)\), this set of documents
would give some category an unfair bias. So, let’s consider another
alternative for calculating \(P(C_m)\):</p>
\[ P_3(C_m) = \frac{\sum_k t(T_k, C_m)}{\sum_j \sum_k t(T_k, C_j)} \]
<p>This approach calculates the probability of a given category as the
ratio of the occurrences of all terms in the category to the occurrences
of all terms across all categories. Unfortunately, this form may also be
unfairly biased by document repetition (e.g. if we fed a category the
five documents shown earlier). Let’s consider one more alternative:</p>
\[ P_4(C_m) = \frac{\sum_k u(T_k, C_m)}{\sum_k u^*(T_k)} \]
<p>where \(u(x, y)\) and \(u^*(x)\) are defined
as:</p>
\[\begin{aligned}
u(x, y) &= \left\{
\begin{array}{ll}
1 & \text{if \(x \in y\)} \\
0 & \text{otherwise}
\end{array}
\right. \\
u^*(x) &= \left\{
\begin{array}{ll}
1 & \text{if \(\exists y \in \mathfrak{C} : x \in y\)} \\
0 & \text{otherwise}
\end{array}
\right.
\end{aligned}\]
<p>These definitions may seem complicated, but they are both very simple
concepts: summing \(u(x,y)\) over terms just counts the number of distinct terms
in category \(y\), and summing \(u^*(x)\) counts the total
number of distinct terms from all categories.</p>
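<p>In code, \(P_4\) reduces to two distinct-term counts. A sketch under an assumed data layout (my choice, not the post’s actual code):</p>

```ruby
# Assumed layout: term_counts = { category => { term => occurrence count } }
# P4 only cares which terms are present, not how often they occur.
def p4(term_counts, category)
  distinct_here    = term_counts[category].keys.size
  distinct_overall = term_counts.values.flat_map(&:keys).uniq.size
  distinct_here / distinct_overall.to_f
end

counts = {
  spam: { "viagra" => 10, "free" => 3 },
  ham:  { "free" => 1, "lunch" => 2, "meeting" => 5 }
}
p4(counts, :spam)  # 2 of 4 distinct terms => 0.5
```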
<p>While each of the four possibilities presented for \(P(C_m)\) is
very different from the others, they are all valid approaches. Personally,
I feel \(P_1(C_m)\) is too naive in its assumption that all
categories are equally likely to be selected, and as I intend on keeping
track of only terms and categories, I can also rule out
\(P_2(C_m)\). As mentioned earlier, \(P_3(C_m)\) can
be biased by repeated terms, so I’m going to opt for
\(P(C_m) = P_4(C_m)\). Again, I wish to stress that this is a
personal choice; your Bayesian needs may differ from mine.</p>
<p>With that all taken care of, we can now expand our classifier to:</p>
\[\begin{aligned}
P(C_m \mid T_1, \ldots, T_n) &= \frac{P(C_m)}{P(D)} \prod_{i=1}^n P(T_i \mid C_m) \\
&= \frac{\sum_k u(T_k, C_m)}{P(D) \sum_k u^*(T_k)} \left ( \prod_{i=1}^n
\frac{t(T_i, C_m)}{\sum_k t(T_k,C_m)} \right )
\end{aligned}\]
<p>Our problem is now just one of counting. The expression may look horrible,
but that is largely a result of each expression being explicitly defined. Now
that we know what the expressions mean, let’s do an informal clean up of
this warlock by substituting simpler symbols for our sweet mess of expressions.
Before we do, we need to take a closer look at \(P(D)\), which in
earlier revisions of this post was explicitly, and incorrectly, defined.
For a given document, \(P(D)\) will be the same, regardless of the
category under consideration. When our classifier is asked to classify a
document, it will iterate over each of its categories and perform this
calculation, and each of the final category probabilities will be scaled
by this same term. So, our first simplification will be to substitute
\(P(D)\) with \(\delta\), which we will not
explicitly define as we will eventually discard it.</p>
<p>Now, let’s let \(\upsilon\) be the number of unique terms in the
category we are considering and \(\Upsilon\) be the number of
unique terms in all of our categories. Finally, we let \(g(T_i)\)
be the number of occurrences of term \(T_i\) in the current
category and \(G\) be the total occurrences of all terms in the current
category. With these simplifications, our equation becomes:</p>
\[\begin{aligned}
P(C_m \mid T_1,\ldots,T_n) &= \frac{\upsilon}{\delta \Upsilon}
\prod_{i=1}^n \frac{g(T_i)}{G} \\
&= \frac{\upsilon}{\delta \Upsilon G^n} \prod_{i=1}^n g(T_i) \\
&= \frac{1}{\delta} \frac{\upsilon}{\Upsilon G^n} \prod_{i=1}^n g(T_i) \\
\delta P(C_m \mid T_1,\ldots,T_n) &= \frac{\upsilon}{\Upsilon G^n}
\prod_{i=1}^n g(T_i)
\end{aligned}\]
<p>Hopefully it is now painfully obvious that each category will be scaled
by a common factor, namely \(\frac{1}{\delta}\). The purpose
of our classifier is to pick the category a given document most likely
belongs in, and if they are all scaled by this common term we can disregard
it. In an effort to preserve mathematical accuracy, I have opted to multiply
both sides of the equation by \(\delta\).</p>
<p>All we’ve gained from this simplification is some mental manageability. I
wanted to derive the explicit calculations involved in Bayesian
classification, but products of fractions containing summations don’t tend
to be very memorable. Any additional calculations we may need to perform
on the explicit form would have made things even messier, which serves
as a nice segue into our next step.</p>
<p>In a world where we are free to perform these calculations without loss of
precision, this expression is just fine. However, given that we are
multiplying a series of numbers all in the range \([0, 1]\), a
computer may eventually round a product to 0, and totally ruin our day. We
can correct this with a little help from a
<a href="#note-logarithms" id="ref-logarithms">logarithm*</a>:</p>
\[\begin{aligned}
\log \left( \delta P(C_m \mid T_1,\ldots,T_n) \right) &= \log \left (
\frac{\upsilon}{\Upsilon G^n} \prod_{i=1}^n g(T_i)
\right) \\
\log \delta + \log P(C_m \mid T_1,\ldots,T_n) &=
\log \left( \frac{\upsilon}{\Upsilon G^n} \right) +
\log \left( \prod_{i=1}^n g(T_i) \right) \\
&= \log \left( \frac{\upsilon}{\Upsilon} \right) - n \log G +
\sum_{i=1}^n \log \left( g(T_i) \right)
\end{aligned}\]
<p>Is this logarithm business really necessary? Consider a document with
324 unique terms where, thanks to the awesome power of contrived examples,
\(\frac{g(T_i)}{G} = 0.1\) for each term. When we take the
product of all these terms, we end up with \(0.1^{324}\). That’s
a pretty small number, so small that Ruby mistakes it for 0:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="mi">0</span><span class="o">.</span><span class="mi">1</span> <span class="o">**</span> <span class="mi">324</span> <span class="c1"># => 0.0</span></code></pre></div>
<p>Now, we did factor \(\frac{1}{G^n}\) out of our product earlier, but
\(\frac{1}{G^{324}}\) is also bound to be very small. This
post alone is up to 608 unique terms, just to give an example of how likely
it is that our classifier might experience floating-point underflow.</p>
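<p>Putting the pieces together, the whole log-domain scoring rule fits in a few lines. A sketch under my own assumed data layout (not the post’s actual classifier); here \(G\) is the total term count of the current category, matching the denominator of \(P(T_i \mid C_m)\), and unknown terms fall back to a small epsilon count so we never take \(\log(0)\):</p>

```ruby
# score(C_m) = log(υ/Υ) - n·log(G) + Σ log(g(T_i))
# term_counts = { category => { term => occurrence count } }
def log_score(term_counts, category, terms, epsilon = 0.01)
  counts  = term_counts[category]
  upsilon = counts.keys.size                               # υ: distinct terms here
  big_u   = term_counts.values.flat_map(&:keys).uniq.size  # Υ: distinct terms anywhere
  big_g   = counts.values.reduce(0, :+)                    # G: occurrences here
  Math.log(upsilon.to_f / big_u) -
    terms.size * Math.log(big_g) +
    terms.reduce(0.0) { |sum, t| sum + Math.log(counts.fetch(t, epsilon)) }
end

counts = {
  spam: { "viagra" => 10, "free" => 3 },
  ham:  { "free" => 1, "lunch" => 2, "meeting" => 5 }
}
doc = %w[viagra free]
# Log-scores are comparable across categories; the largest wins.
best = counts.keys.max_by { |c| log_score(counts, c, doc) }
```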
<h3 id="digging-deeper">Digging Deeper</h3>
<p>There are a few edge cases to consider; here are five off the top of
my head:</p>
<ol>
<li>What if there are no known terms for a particular category?</li>
<li>What if there are no known terms for the classifier as a whole?</li>
<li>What if a term is unknown to a particular category?</li>
<li>What if a term is unknown to the classifier as a whole?</li>
<li>How do we handle repeated terms (i.e. T_i = T_j) in a
document that our classifier is attempting to classify for us?</li>
</ol>
<p>We can easily address the first two cases. If a category has not been trained, we
assume P(C | T_1,\ldots,T_n) = 0; thus, if none of our
categories have been trained (case #2), we assume all categories are equally
likely (or unlikely) and either return an arbitrary one or none at all.
The third and fourth cases are consequences of the fact that we are
estimating probabilities. Through training, our classifier is building
approximations for these various probabilities. It is
entirely possible, perhaps even very likely, that we will encounter documents
we wish to classify containing terms that were not included in our training.
If we rely solely on the calculations presented here, a foreign term will
zero out the overall probability or produce an error —
either from a division by zero or from the undefined \log(0)
calculation. To prevent garbage results, we will need to assign a non-zero
probability, \epsilon, to terms unknown to a particular
category.</p>
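<p>A small sketch of that floor in Ruby — the method name and the particular value of \epsilon are my own assumptions here, not something the derivation fixes:</p>

```ruby
EPSILON = 1e-10  # assumed floor probability for unknown terms

# counts: term => number of times the term was seen for this category
# total:  total term occurrences seen for this category
def log_term_probability(counts, total, term)
  count = counts.fetch(term, 0)
  if count.zero? || total.zero?
    Math.log(EPSILON)  # sidesteps both division by zero and log(0)
  else
    Math.log(count.to_f / total)
  end
end
```

<p>A known term yields its estimated log-probability; an unknown term (or an untrained category) yields \log \epsilon instead of blowing up.</p>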
<p>The fifth case deserves its own paragraph. To handle this situation, we
defer to simple rules for conditional probabilities. Consider the following:</p>
P(A | B, C, D) = P(A | B \cap C \cap D)
<p>If B and C are describing the same event, then
the expression simplifies to:</p>
P(A | B \cap D)
<p>The same holds true for documents we wish to <em>classify</em>. We could argue that
repeated terms do not describe the same event because the terms are occurring
in distinct positions; however, we have not worked positional information into
our classifier. It is important to note that when we are <em>training</em> our
classifier with documents, we do take repeated terms into consideration.</p>
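<p>In code, the distinction amounts to a single <code>uniq</code> at classification time. A sketch (the method names are mine):</p>

```ruby
# Training counts every occurrence of every term...
def training_counts(terms)
  terms.tally
end

# ...while classification collapses repeats: without positional
# information, repeated terms describe the same event.
def classification_terms(terms)
  terms.uniq
end
```
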
<h3 id="final-business">Final Business</h3>
<p>Now that virtually every email client has, at one point or another, made
use of some form of Naive Bayes classification, I realize it’s no longer
the hot topic it once was. However, while reviewing the Ruby
<a href="https://rubygems.org/gems/classifier">classifier</a> gem, I discovered that its
Bayes implementation was wrong. This was independently confirmed by
<a href="https://github.com/bmuller">Brian Muller</a>. While working on a fork of it
with <a href="https://github.com/ezkl">Zeke</a>, I discovered that although the approach is
well known, there are nuances that can significantly affect the results
— examples being the choice of P(C) and the default
probability, \epsilon.</p>
<h3 id="foot-noted">Foot Noted</h3>
<h4 id="note-logarithms">Note: Logarithms</h4>
<p>I would like to thank <a href="https://github.com/bmuller">Brian Muller</a> for
explaining the use of logarithms in a
<a href="https://github.com/livingsocial/ankusa/blob/master/lib/ankusa/naive_bayes.rb">Naive Bayes classifier</a>
he’s developing in Ruby, even if we don’t share the same views on
P(C). My Numerical Analysis professors would probably be
displeased that I forgot about this trick.
[ <a href="#ref-logarithms">jump back</a> ]</p>
Properties of Code: Challenges of Readability2011-05-21T00:00:00-04:00http://mathish.com/2011/05/21/empirically-challenged<p>I’ve written the first follow-up to <a href="http://mathish.com/2011/05/08/mother-functional.html">Properties of Code: Functional Complexity</a>
about 3 times now and have scrapped it 3 times. Every attempt has been less
“mathy” than the start of the series; each has contained an interesting
point or two, but those nuggets get buried under a mountain of meandering.</p>
<p>The biggest hindrance has been too much “feeling” and
“guesswork” in my efforts. My desire to show a relationship between
Functional (or, more precisely, Satisfaction) complexity and Kolmogorov
complexity provided a pretty clear path for me to follow. Readability is a
bit more vague.</p>
<p>What I need is more empirical data to direct my thoughts and efforts. I have
some hypotheses, some suspicions and very little data. Espousing a position
without the backing of mathematical formalisms or strong data makes me uneasy,
and that tends to lead to a rambling and incoherent narrative. I do find
something amusing in the fact that my efforts at writing on readability have,
thus far, been virtually unreadable.</p>
<p>As I am a single-task kind of guy, I feel compelled to work out my thoughts
on readability before moving on to anything else. However, it is pretty
clear to me that doing so will totally stall this site. So, rather than
fixate on writing a narrative around claims I cannot support at this time,
I’m going to list my thoughts and start collecting data. Once I have some
meaningful data to work with, I’ll dig into it and see what comes out. If
I find support for my claims, great. If the data refutes some or all of my
thoughts, even better. This will give me something to work from and
hopefully produce a much more meaningful essay than the shit I’ve attempted
to cobble together so far.</p>
<p>At any rate, here are my suspicions, naked and, for all intents and purposes,
completely unfounded at this time:</p>
<ol>
<li>Any link between Kolmogorov Complexity and Readability is a trivial one
(eg: a program has high Kolmogorov complexity and is unreadable because it
is nearly a random string of characters.)</li>
<li>The link between a “readable” story and “readable” code is probably
stronger, but the two are not equivalent.</li>
<li>The visualization of quantitative data has implications in readability of
code, and potentially the readability of programming languages.</li>
<li>Code produces two products: a solution to some problem to be consumed by
users, and the code itself, to be consumed by developers. Writing code
that only solves the problem is a partial victory.</li>
</ol>
<p>Now that I’ve put a few thoughts out there, I’m mentally free to think about,
and write on other topics, while I collect data and look for evidence.</p>
The Un-Ruby2011-05-19T00:00:00-04:00http://mathish.com/2011/05/19/un-ruby<p>A follow up to <a href="http://mathish.com/2011/05/08/mother-functional.html">Properties of Code: Functional Complexity</a>
is coming. It’s much less “mathy” than its predecessor but serves as a
jumping off point for the next in the series. However, there is something
that has been nagging at me after watching some of the talks at this year’s
RailsConf. It’s a ubiquitous and seemingly trivial thing, but it bothers
the hell out of me: <code>ActiveSupport::Concern</code>.</p>
<p>For as long as I’ve been using Rails, a common idiom for plugins that extend
the functionality of <code>ActiveRecord::Base</code> and other Rails classes follows:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">module</span> <span class="nn">MyRailsExtension</span>
<span class="k">def</span> <span class="nc">self</span><span class="o">.</span><span class="nf">included</span> <span class="n">base</span>
<span class="n">base</span><span class="o">.</span><span class="n">extend</span> <span class="no">ClassMethods</span>
<span class="n">base</span><span class="o">.</span><span class="n">send</span> <span class="ss">:include</span><span class="p">,</span> <span class="no">InstanceMethods</span>
<span class="k">end</span>
<span class="k">module</span> <span class="nn">ClassMethods</span>
<span class="k">def</span> <span class="nf">my_dsl_extension</span>
<span class="c1"># ...</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">module</span> <span class="nn">InstanceMethods</span>
<span class="c1"># ...</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">class</span> <span class="nc">MyModel</span> <span class="o"><</span> <span class="no">ActiveRecord</span><span class="o">::</span><span class="no">Base</span>
<span class="kp">include</span> <span class="no">MyRailsExtension</span>
<span class="c1"># ...</span>
<span class="k">end</span></code></pre></div>
<p>I came to Ruby by way of Rails, and many of my early plugins used this
pattern. As time went on, I began doing more straight Ruby coding and
learned to embrace both <code>extend</code> and <code>include</code>. They send clear signals as
to how the mixed in modules will behave. Using the <code>self.included</code> Railsism
obscures that for the sake of adding instance and singleton methods with
a single line. This bothers me a bit.</p>
<p>This pattern is so common that <code>ActiveSupport::Concern</code> was added to Rails to
simplify the process:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">module</span> <span class="nn">MyRailsExtension</span>
<span class="kp">extend</span> <span class="no">ActiveSupport</span><span class="o">::</span><span class="no">Concern</span>
<span class="n">included</span> <span class="k">do</span>
<span class="c1"># custom stuff</span>
<span class="k">end</span>
<span class="k">module</span> <span class="nn">ClassMethods</span>
<span class="k">def</span> <span class="nf">my_dsl_extension</span>
<span class="c1"># ...</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">module</span> <span class="nn">InstanceMethods</span>
<span class="c1"># ...</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="k">class</span> <span class="nc">MyModel</span>
<span class="kp">include</span> <span class="no">MyRailsExtension</span>
<span class="c1"># ...</span>
<span class="k">end</span></code></pre></div>
<p>This will automatically <code>extend ClassMethods</code> and <code>include InstanceMethods</code>
and evaluate whatever code is put into the <code>included do ... end</code> block.
This module has been in the Rails code base for about 2 years now, and it
bothers me. I get that</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">class</span> <span class="nc">MyModel</span>
<span class="kp">include</span> <span class="no">MyRailsExtension</span>
<span class="k">end</span></code></pre></div>
<p>is more terse than</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">class</span> <span class="nc">MyOtherModel</span>
<span class="kp">extend</span> <span class="no">MyRailsExtension</span><span class="o">::</span><span class="no">ClassMethods</span>
<span class="kp">include</span> <span class="no">MyRailsExtension</span><span class="o">::</span><span class="no">InstanceMethods</span>
<span class="k">end</span></code></pre></div>
<p>but the latter gives us details that are especially helpful when we need to
track down what an extension is really doing.</p>
<p>Given the frequency with which this idiom is used in the Rails world, I am
probably voicing a minority opinion — as time goes on I find myself
embracing more Rubyisms than Railsisms — but this is my self-serving
blog. I’ll piss and moan on the topics I want to piss and moan about.</p>
Properties of Code: Functional Complexity2011-05-08T00:00:00-04:00http://mathish.com/2011/05/08/mother-functional<p>About a year ago, I began giving some serious thought to an article named
<a href="https://github.com/raganwald/homoiconic/blob/master/2009-06-02/functional_complexity.md">Functional Complexity Modulo a Test Suite</a>
by <a href="http://raganwald.posterous.com/">Reg Braithwaite</a>. Today, I think I have
something to say on the matter.</p>
<h3 id="background">Background</h3>
<p>Suppose you’re a programmer embracing the trends of test or behavior driven
development. You have a problem to solve and expectations on how the
solution should behave. So, you think about the problem, enumerate the
behaviors and write some tests to model this version of reality. You start
writing code to satisfy these tests. Red becomes green, you refactor clumsy
first drafts into terse and expressive statements that are almost a pleasure
to read, when without warning, four years of undergraduate mathematics grab
ahold of your brain and thrust you into theoretical domains. Your output of
practical, working code is halted for the day, while you begin contemplating
ways to measure “readability”, “maintainability”, “complexity” and how these
ideas fit within the framework you have spent the past few years operating
within. You immediately regret taking that second major in mathematics, but
it’s an integral part of you now. You know you can’t run from it, so you do
your best to appease it by firing up your favorite text editor, ensuring that
your blog is running <a href="http://www.mathjax.org/">MathJax</a>, and whipping out
your \LaTeX fu!</p>
<h3 id="some-preliminary-definitions">Some Preliminary Definitions</h3>
<p>The following definitions have been directly derived from the original
article linked above.</p>
<p>For our purposes, a <dfn>program</dfn>, p, is a function of inputs that
returns some kind of output. We will use \mathcal{P} to denote
the set of all programs.</p>
<p>A <dfn>test</dfn>, t, is a function that maps a program, p,
to either 1, for true, or 0, for false: t : \mathcal{P} \rightarrow \{0, 1\}.</p>
<p>A program, p, is said to <dfn>satisfy</dfn> a test,
t, if and only if t(p) = 1. We will represent
this relationship with the symbol \models:</p>
p \models t \iff t(p) = 1
<p>A test suite, \sigma = \{ t_1, t_2, \ldots, t_n \}, is a set of
n tests. We denote the number of tests in a suite as
|\sigma| = n. A program is said to <dfn>satisfy</dfn> a test suite,
\sigma, if it satisfies each of the tests in \sigma.
More formally, p \models \sigma if, and only if
\forall t \in \sigma, p \models t.</p>
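<p>These definitions translate almost directly into Ruby. In this sketch, a test is a lambda that maps a program to 0 or 1, exactly as defined above:</p>

```ruby
# p satisfies t  iff  t(p) = 1
def satisfies?(program, test)
  test.call(program) == 1
end

# p satisfies sigma  iff  p satisfies every t in sigma
def satisfies_suite?(program, sigma)
  sigma.all? { |t| satisfies?(program, t) }
end
```
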
<p>We say two programs are <dfn>functionally congruent modulo a test
suite</dfn> when they both satisfy the same test suite. When there is no
danger of ambiguity, we may simply refer to two (or more) programs as being
“congruent.” We represent this notion of congruence as follows:</p>
p_1 \equiv p_2 (\bmod \sigma) \iff p_1 \models \sigma, p_2 \models \sigma.
<p>Consider all of the programs that might satisfy the test suite. There are an
infinite number of programs in this set (a trivial proof of this statement can
be found in “<a href="#example-programs-and-calculations">Example Programs and Calculations</a>.”)
Now, let’s suppose we have a metric, |p|, that measures the
size of a program. As a trivial example, let’s suppose |p| is
the number of characters in the string representation of program p.
We could use a more interesting measurement,
but for now let’s stick with the simple “string length” measurement. We
take this measuring stick and apply it to each of the programs in the set
of programs that satisfy a given test suite, and record the “shortest length”
measured. This “shortest length” is the <dfn>satisfaction complexity</dfn> of
the test suite. Given a metric for a program |p|,
the satisfaction complexity of a test suite, D_{\sigma}, is
given by,</p>
D_{\sigma} = \min \left\{ |p| : p \models \sigma \right\}
<p>And finally, we define the <dfn>functional complexity of a program,
modulo a test suite</dfn>, F_{\sigma}(p), to be the satisfaction
complexity of the test suite, D_{\sigma}, if and only if the
program satisfies the test suite.
Alternatively,</p>
F_{\sigma}(p) = \left\{
\begin{array}{ll}
D_{\sigma} & \text{if \(p \models \sigma\)} \\
\infty & \text{otherwise}
\end{array}
\right.
<p>The important point to take away from this is that every program that
satisfies a given test suite has the same functional complexity relative to
that test suite.</p>
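<p>Restricted to a finite pool of candidate programs (the true D_{\sigma} minimizes over <em>all</em> programs, so this only yields an upper bound), both quantities are short Ruby methods. The <code>Candidate</code> struct and the string-length metric are assumptions of this sketch:</p>

```ruby
Candidate = Struct.new(:source, :fn)

# D_sigma: the smallest |p| among candidates satisfying the suite
def satisfaction_complexity(candidates, sigma)
  candidates.select { |c| sigma.all? { |t| t.call(c.fn) == 1 } }
            .map { |c| c.source.length }
            .min
end

# F_sigma(p): D_sigma if p satisfies the suite, infinity otherwise
def functional_complexity(candidate, candidates, sigma)
  if sigma.all? { |t| t.call(candidate.fn) == 1 }
    satisfaction_complexity(candidates, sigma)
  else
    Float::INFINITY
  end
end
```
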
<h3 id="example-programs-and-calculations">Example Programs and Calculations</h3>
<p>Suppose we want a program that takes the sum of all of the integers from
1 to k. We begin by writing some tests (note: I am going to
make use of a hypothetical <code>assert</code> function that returns 1 if the block it
is given evaluates to true, and 0 otherwise):</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">def</span> <span class="nf">test_1</span> <span class="n">program</span>
<span class="n">assert</span> <span class="p">{</span> <span class="n">program</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="o">==</span> <span class="mi">55</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nf">test_2</span> <span class="n">program</span>
<span class="n">assert</span> <span class="p">{</span> <span class="n">program</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span> <span class="o">==</span> <span class="mi">21</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nf">test_3</span> <span class="n">program</span>
<span class="n">assert</span> <span class="p">{</span> <span class="n">program</span><span class="o">.</span><span class="n">call</span><span class="p">(</span><span class="mi">83</span><span class="p">)</span> <span class="o">==</span> <span class="mi">3486</span> <span class="p">}</span>
<span class="k">end</span></code></pre></div>
<p>Our test suite consists of three tests, each checking that our program
produces the correct sum for distinct values of k. As
mentioned earlier, there are an infinite number of programs that can satisfy
this test suite, and here is the trivial proof:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="k">def</span> <span class="nf">sum_with_while</span> <span class="n">k</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">k</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">i</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">sum</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nf">sum_with_while_1</span> <span class="n">k</span>
<span class="n">unused_local</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">k</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">i</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">sum</span>
<span class="k">end</span>
<span class="k">def</span> <span class="nf">sum_with_while_2</span> <span class="n">k</span>
<span class="n">unused_local</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">k</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">i</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">sum</span>
<span class="k">end</span></code></pre></div>
<p>Hopefully, the pattern is obvious: the program <code>sum_with_while_&lt;x&gt;</code> will set
<code>unused_local = &lt;x&gt;</code>. This assignment adds nothing of value to the program,
but the program still returns the appropriate result, and thus satisfies the
test suite.</p>
<p>Now, let’s consider a few non-trivial variations:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">p1</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">sum</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">i</span> <span class="o"><=</span> <span class="n">k</span>
<span class="n">sum</span> <span class="o">+=</span> <span class="n">i</span>
<span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">end</span>
<span class="n">sum</span>
<span class="k">end</span>
<span class="n">p2</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">a</span> <span class="o"><</span> <span class="n">k</span>
<span class="n">b</span> <span class="o">+=</span> <span class="p">(</span><span class="n">a</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">end</span>
<span class="n">b</span>
<span class="k">end</span>
<span class="n">p3</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="p">(</span><span class="mi">1</span><span class="o">.</span><span class="n">.k</span><span class="p">)</span><span class="o">.</span><span class="n">inject</span> <span class="p">{</span> <span class="o">|</span><span class="n">sum</span><span class="p">,</span><span class="n">i</span><span class="o">|</span> <span class="n">sum</span> <span class="o">+</span> <span class="n">i</span> <span class="p">}</span>
<span class="k">end</span>
<span class="c1"># Only when Symbol#to_proc is available (eg: Ruby 1.8.7+)</span>
<span class="n">p4</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="p">(</span><span class="mi">1</span><span class="o">.</span><span class="n">.k</span><span class="p">)</span><span class="o">.</span><span class="n">inject</span><span class="p">(</span><span class="o">&</span><span class="ss">:+</span><span class="p">)</span>
<span class="k">end</span>
<span class="c1"># Only if Enumerable#sum is available (eg: active_support)</span>
<span class="n">p5</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="p">(</span><span class="mi">1</span><span class="o">.</span><span class="n">.k</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span>
<span class="k">end</span>
<span class="n">p6</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="n">k</span> <span class="o">*</span> <span class="p">(</span><span class="n">k</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">end</span>
<span class="n">p7</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="k">case</span> <span class="n">k</span>
<span class="k">when</span> <span class="mi">10</span> <span class="k">then</span> <span class="mi">55</span>
<span class="k">when</span> <span class="mi">6</span> <span class="k">then</span> <span class="mi">21</span>
<span class="k">when</span> <span class="mi">83</span> <span class="k">then</span> <span class="mi">3486</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></div>
<p>Each of these programs, p_1 \ldots p_7, satisfies our test suite.
Programs p_1 and p_2 are very similar, but with
some variables renamed and some operations switched about. Programs
p_3, p_4, and p_5 are also
similar: p_4 removes some verbosity by using Ruby’s
<code>Symbol#to_proc</code> method while p_5 makes use of ActiveSupport’s
<code>Enumerable#sum</code> method, which in turn calls <code>inject</code>. The program
p_6 calculates the sum analytically, without iteration, while
program p_7 provides results only for the values we tested.</p>
<p>We will ignore the variable assignment and line ending characters when
calculating the length of these programs. For example, when calculating the
length of p_6, we count only the characters in “lambda do |k|”
(13 characters), “ k * (k + 1) / 2” (17 characters, including the
two leading spaces), and “end” (3 characters), for a total of 33 characters.
We ignore line ending characters (eg: <code>\n</code>) because they aren’t readily
visible. Why make the process of verifying these numbers more tedious than
it already is?</p>
<p>Below are the lengths of each of the programs based upon this method of
counting characters:</p>
\begin{aligned}
|p_1| & = 78 \\
|p_2| & = 65 \\
|p_3| & = 51 \\
|p_4| & = 36 \\
|p_5| & = 28 \\
|p_6| & = 33 \\
|p_7| & = 81
\end{aligned}
<p>We see that p_5 is the shortest of our programs, weighing in
at 28 characters. One could argue that the size of p_5 is
misrepresented because the <code>sum</code> method is not a native Ruby method, and we
could address that concern by adding the length of the definition of <code>sum</code> to
|p_5|. When measured in the same way as our programs, the
<code>Enumerable#sum</code> method found in ActiveSupport 3.0.7 weighs in at 112
characters, so let’s tack that on, |p_5| = 140. Ensuring that
the metric accounts for the program’s definition as well as any external
dependencies keeps our measurements meaningful. Otherwise, we could create
a very small program that satisfies our test suite:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">p8</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="n">p1</span><span class="o">[</span><span class="n">k</span><span class="o">]</span>
<span class="k">end</span></code></pre></div>
<p>If our notion of size does not account for the size of all external
dependencies, our metric (and that which we intend to build upon it) loses
nearly all utility.</p>
<p>Taking into consideration the adjustment made to |p_5|, the
“smallest” example program that satisfies our test suite is
p_6. It is entirely possible that even smaller programs
exist that also satisfy the suite, but given that
p \models \sigma we can conclude</p>
\begin{aligned}
D_{\sigma} & \leq |p_6| \\
F_{\sigma}(p) & \leq |p_6| \\
& \leq 33
\end{aligned}
<p>So, we only have an upper bound on D_{\sigma}? Close enough!</p>
<h3 id="kolmogorov-complexity-and-functional-complexity">Kolmogorov Complexity and Functional Complexity</h3>
<p>One way of measuring the complexity of a string is by measuring its
<a href="http://en.wikipedia.org/wiki/Kolmogorov_complexity">Kolmogorov complexity</a>.
A quick overview of the process, taken straight from the linked Wikipedia
article, is to take a string:</p>
<pre><code>abababababababababababababababababababababababababababababababab
</code></pre>
<p>and search for a smaller representation of it:</p>
<pre><code>ab repeated 32 times
</code></pre>
<p>The first string has 64 characters, the second string has only 20. Our
simplification of the original string may not be minimal, but it is
substantially smaller, which suggests that the original string was not very
complex. The actual Kolmogorov complexity of a string is the size of its
minimal representation in some fixed universal description language. Provided
that our language of choice is Turing complete, our measurements will vary
from some other choice in language by a fixed constant. So, let’s pick Ruby.</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="s1">'ab'</span><span class="o">*</span><span class="mi">32</span></code></pre></div>
<p>We now have a representation of our original string that is only 7 characters
long. This representation may still not be minimal, but as with Functional
complexity, we now have an upper bound — the Kolmogorov complexity of
the original string in Ruby is at most 7. It’s been a while since I’ve thrown
down some \LaTeX, so if we let K(s) represent
the Kolmogorov complexity of a string, s, and d(s)
represent a minimal description of s in Ruby, then:</p>
\begin{aligned}
K(s) & = |d(s)| \\
& \leq 7
\end{aligned}
<p>So, unsurprisingly, our original string is really not that complex: it can be
greatly compressed, and it is very “un-random.” All three of those statements
are roughly synonymous. Now, what is the relationship, if any, between
Kolmogorov complexity and Functional complexity modulo a test suite?
Kolmogorov complexity deals with individual strings whereas Functional
complexity deals in programs that satisfy a test suite, so to find a
relationship between the two, we need to get them both working in the same
domain. In our example test suite, we have a pretty simple mapping between
input and expected output that can be represented by many different strings.
Here’s one:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="s2">"10=55,6=21,83=3486"</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">','</span><span class="p">)</span><span class="o">.</span><span class="n">each_with_index</span> <span class="k">do</span> <span class="o">|</span><span class="n">io</span><span class="p">,</span> <span class="n">i</span><span class="o">|</span>
<span class="n">i</span><span class="p">,</span> <span class="n">o</span> <span class="o">=</span> <span class="n">io</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'='</span><span class="p">)</span>
<span class="n">define_method</span> <span class="ss">:"test_</span><span class="si">#{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="ss">"</span> <span class="k">do</span> <span class="o">|</span><span class="nb">p</span><span class="o">|</span>
<span class="n">assert</span> <span class="p">{</span> <span class="nb">p</span><span class="o">[</span><span class="n">i</span><span class="o">.</span><span class="n">to_i</span><span class="o">]</span> <span class="o">==</span> <span class="n">o</span><span class="o">.</span><span class="n">to_i</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></div>
<p>This test suite builder weighs in at 159 characters (excluding <code>\n</code>
characters) compared to 169 characters in our original test suite. We could
use a regular expression instead of multiple calls to <code>String#split</code>:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="s2">"10=55,6=21,83=3486"</span><span class="o">.</span><span class="n">scan</span><span class="p">(</span><span class="sr">/(\d+)=(\d+)/</span><span class="p">)</span><span class="o">.</span><span class="n">each_with_index</span> <span class="k">do</span> <span class="o">|</span><span class="p">(</span><span class="n">i</span><span class="p">,</span><span class="n">o</span><span class="p">),</span> <span class="n">k</span><span class="o">|</span>
<span class="n">define_method</span> <span class="ss">:"test_</span><span class="si">#{</span><span class="n">k</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="ss">"</span> <span class="k">do</span> <span class="o">|</span><span class="nb">p</span><span class="o">|</span>
<span class="n">assert</span> <span class="p">{</span> <span class="nb">p</span><span class="o">[</span><span class="n">i</span><span class="o">.</span><span class="n">to_i</span><span class="o">]</span> <span class="o">==</span> <span class="n">o</span><span class="o">.</span><span class="n">to_i</span> <span class="p">}</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></div>
<p>to bring our length down to 147 characters. The best part of this little
excursion is that none of the numbers I’ve thrown at you are important, I just
like to count things. What really matters is that you now see how
<code>10=55,6=21,83=3486</code> serves as a complete representation of our original
test suite. We want to measure this string’s Kolmogorov complexity.</p>
<p>None of the programs we’ve written to satisfy our test suite will generate
the string we’re now after, so we need to wrap them in a loving adapter:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">adapter</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="nb">p</span><span class="o">|</span>
<span class="o">[</span><span class="mi">10</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">83</span><span class="o">].</span><span class="n">map</span> <span class="p">{</span> <span class="o">|</span><span class="n">i</span><span class="o">|</span> <span class="s2">"</span><span class="si">#{</span><span class="n">i</span><span class="si">}</span><span class="s2">=</span><span class="si">#{</span><span class="nb">p</span><span class="o">[</span><span class="n">i</span><span class="o">]</span><span class="si">}</span><span class="s2">"</span> <span class="p">}</span><span class="o">.</span><span class="n">join</span> <span class="s1">','</span>
<span class="k">end</span></code></pre></div>
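To sanity-check the adapter, we can feed it a stand-in program (the body of p_2 is not shown here, so the summing lambda below is just a guess at its shape; any program satisfying the suite would do):

```ruby
# Hypothetical stand-in for one of the earlier programs.
p2 = lambda { |k| k * (k + 1) / 2 }

# The adapter from above.
adapter = lambda do |p|
  [10, 6, 83].map { |i| "#{i}=#{p[i]}" }.join ','
end

adapter[p2]  # => "10=55,6=21,83=3486"
```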
<p>Excluding line endings — Do I need to keep repeating that? Let’s
assume it’s implied from here on out — our adapter is 63 characters
long. We can now represent one 20-character string as a string of at least
63 characters. In terms of compression, we’re doing it wrong, but fret not,
for soon things will get better. In the meantime, we now have a program
that takes old programs and turns them into new programs capable of producing
the string we seek: p^{\prime} = A(p). Taking
p_2 as an example, we can do it all in Ruby with roughly 128
characters. <span id="kc-iff-1">Hopefully</span> it is clear that if
p \models \sigma, then p^{\prime} generates our
desired string.</p>
<p>Now comes the improvement to our compression fail: when measuring Kolmogorov
complexity, we are free to pick our language. Instead of using Ruby, we
could use Python or Pascal. We are also free to pick a superset of Ruby, say
Ruby + <code>adapter</code>. We’re going to be a little pickier than that, though. Our
language of choice is the one where every program written is fed into the
adapter. With this restriction, the program:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="nb">lambda</span> <span class="p">{</span> <span class="o">|*</span><span class="n">_</span><span class="o">|</span> <span class="s2">"10=55,6=21,83=3486"</span> <span class="p">}</span></code></pre></div>
<p>will not produce the desired string. Instead, it will be wrapped by <code>adapter</code>
and evaluate to:</p>
<pre><code>10=10=55,6=21,83=3486,6=10=55,6=21,83=3486,83=10=55,6=21,83=3486
</code></pre>
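We can verify that wrapping behavior directly, restating the adapter:

```ruby
adapter = lambda do |p|
  [10, 6, 83].map { |i| "#{i}=#{p[i]}" }.join ','
end

# A program that tries to cheat by returning the encoded suite verbatim.
cheat = lambda { |*_| "10=55,6=21,83=3486" }

adapter[cheat]
# => "10=10=55,6=21,83=3486,6=10=55,6=21,83=3486,83=10=55,6=21,83=3486"
```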
<p>Thus, if our wrapped program, p^{\prime}, produces the desired
string, then our original program satisfies the original test suite. Combine
this with our <a href="#kc-iff-1">earlier statement</a>, and we’ve got an equivalence:</p>
p \models \sigma \iff p^{\prime} = s,
<p>where s is our encoded test suite <code>10=55,6=21,83=3486</code>. Thus,
given a minimal description, d(s), of our encoded test suite
s in this adapter wrapped Ruby language, we can infer that
d(s) \models \sigma (perhaps with the help of <code>eval</code>.) We also
know that D_{\sigma} \approx d(s), since d(s) is
minimal in length. So, |d(s)| gives us our Kolmogorov
complexity and our <a href="#note-approximate-functional-complexity" id="ref-approx">
Functional complexity*</a>.</p>
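To make the <code>eval</code> step concrete, here is a sketch; the description string below is purely illustrative and certainly not minimal:

```ruby
# An illustrative description string (not an actual minimal d(s)!).
description = "lambda { |k| k * (k + 1) / 2 }"

# eval maps the string from the world of characters to a program...
program = eval(description)

# ...which we can then run against a test from the suite.
program[10]  # => 55
```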
<p>From our previous excursions in counting characters and whitespace (but not
newlines!), we see that all of the programs that satisfy the test suite are
longer than the 20-character string. Bummer, we still fail at compression. But
what happens if we add a few more tests? Let’s expand our encoded test suite
to the following:</p>
<pre><code>10=55,6=21,83=3486,99=4950,1019=519690,9001=40513501,15146=114708231
</code></pre>
<p>Now our desired string is 68 characters long. We know that
|d(s)| \leq |p_6| = 33, and we have finally un-failed at
compression! In addition to finding a representation for the string that is
half as long, we also kicked program p_7 out of the set of
programs that satisfy our test suite. Keeping with the spirit of how
p_7 satisfies the test suite, we can get it back in line with
the following change:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">p7</span> <span class="o">=</span> <span class="nb">lambda</span> <span class="k">do</span> <span class="o">|</span><span class="n">k</span><span class="o">|</span>
<span class="k">case</span> <span class="n">k</span>
<span class="k">when</span> <span class="mi">10</span> <span class="k">then</span> <span class="mi">55</span>
<span class="k">when</span> <span class="mi">6</span> <span class="k">then</span> <span class="mi">21</span>
<span class="k">when</span> <span class="mi">83</span> <span class="k">then</span> <span class="mi">3486</span>
<span class="k">when</span> <span class="mi">99</span> <span class="k">then</span> <span class="mi">4950</span>
<span class="k">when</span> <span class="mi">1019</span> <span class="k">then</span> <span class="mi">519690</span>
<span class="k">when</span> <span class="mi">9001</span> <span class="k">then</span> <span class="mi">40513501</span>
<span class="k">when</span> <span class="mi">15146</span> <span class="k">then</span> <span class="mi">114708231</span>
<span class="k">end</span>
<span class="k">end</span></code></pre></div>
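Incidentally, every pair in the expanded suite agrees with the closed-form sum k(k+1)/2, which a couple of lines can verify (a side check, not part of the original argument):

```ruby
sum_to = lambda { |k| k * (k + 1) / 2 }
suite  = "10=55,6=21,83=3486,99=4950,1019=519690,9001=40513501,15146=114708231"

suite.scan(/(\d+)=(\d+)/).all? { |i, o| sum_to[i.to_i] == o.to_i }
# => true
```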
<p>In adjusting p_7 to satisfy the new test suite, we have
increased its length to 175 characters. So when our test suite was 20
characters long, p_7 satisfied it in 81 characters. When
the test suite grew by 48 characters, p_7 was forced to
grow by 94 characters. Changing a test changes a program… I smell a
potentially useful metric in there, and we will dig deeper into this next
time.</p>
<p>The chosen problem of summing the integers from 1 to k is
certainly a trivial one, and because of its simplicity, encoding the test
suite as a string was also pretty easy. However, with the right <code>adapter</code>
we could encode any test suite as a string even if each test spends a lot of
time setting up the initial state before verifying its expectations.</p>
<p>The goal here was to show a relationship between Kolmogorov and Functional
complexity measurements, so we tried to keep the components simple and
manageable. In doing so, we were able to find a direct relationship between
the two by considering a description language that, informally, is the result
of augmenting our host language (Ruby) with the expectations of our test suite
(this is how the <code>adapter</code> was constructed.) With a dash of mathematics and
a pinch of hand-waving, we have shown that if Kolmogorov complexity is a
meaningful measurement of a program, then so too is Functional complexity
modulo a test suite.</p>
<h3 id="a-taste-of-things-to-come">A Taste of Things to Come</h3>
<p>I had intended to explore the relationship between these measures of
complexity and somewhat vague notions such as “readability” and
“maintainability” in this article, but it’s already quite a bit longer than
I had anticipated. I will definitely be talking about “maintainability”
next time: I believe there is a relationship between how much code must
change when test suites are modified, but I need some time to think more
about the math (I may have to whip out calculus or difference equations.)
I would also like to explore “readability,” though I may refer to it as
“comprehensibility” instead, by inverting a lot of the work done in this
article, but I may have to turn this series of articles into a trilogy to do
so. Stay tuned… or don’t, I’m still going to write it anyway!</p>
<h3 id="foot-noted">Foot Noted</h3>
<h4 id="note-approximate-functional-complexity">Note: Approximate Functional Complexity</h4>
<p>I played a little fast and loose with notation earlier. The description
string, d(s), is just that: a <em>string</em>. As I mentioned, we may
need the help of <code>eval</code> to map it from the world of characters to the domain
of programs, so D_{\sigma} may be a few characters longer than
|d(s)|. However, the number of additional characters is
constant, so I’m okay with saying the two measures of complexity are roughly
equivalent.
[ <a href="#ref-approx">jump back</a> ]</p>
Let's Get Wonky!2011-05-05T00:00:00-04:00http://mathish.com/2011/05/05/lets-get-wonky<p>So, the email I (and scads of other people) got from Slicehost the other day
was pretty uninformative, referenced a “forum” for continuing the conversation
and didn’t link to it, and did very little to ease many of the anxieties it
raised. After determining that the email (with a Rackspace letterhead) was
referring to the Slicehost forum, I spent some time reading up on the coming
changes, and was underwhelmed. There are a lot of reassurances, a fair bit
of hand-waving and talk of everything moving to the Rackspace Cloud. Clouds
are grand, what with their puffy and ephemeral qualities, but I’m just after
simple web hosting. This site uses almost no real bandwidth (a testament
to my inability or disinclination to actually engage an audience) and serves up
static HTML. My requirements are pretty minimal. I really don’t need the
ability to dynamically add instances to handle the load.</p>
<p>Don’t get me wrong, cloud computing interests me and I can see its utility,
just not here. Not for a static site that Jekyll lovingly compiles into
HTML every time I <code>git push deploy</code>.</p>
<p>So I, and I suspect many others are doing the same, jumped ship. I’d heard
good things about <a href="http://www.linode.com/">Linode</a>, so I signed up. The
account management system is already familiar to me, and I get more resources
at $0.05 less each month.</p>
<p>Unfortunately, a new host comes with a new set of IP addresses, which means
updating DNS records, which in turn means stale DNS caches. You might have
problems getting here over the next couple days, or you might be running
with the new DNS entries mere moments after I push the changes through my
registrar. It’s hard to say, but eventually the updates will make their
way to you, so sit tight and we’ll return after this short break!</p>
<p>Please enjoy the following identity: e^{i\theta} = \cos\theta + i\sin\theta</p>
A Thought on Books in the Cloud2011-05-03T00:00:00-04:00http://mathish.com/2011/05/03/on-a-problem<h3 id="background">Background</h3>
<p>Suppose we want to create a service that allows authors to upload digital
books to be stored in “the cloud.” Let’s also suppose that each book weighs
in at 5 MB, on average. We also want to compile some meta-data for each book
so we can categorize an author’s library, select an excerpt from each book to
serve as a good summary of its topic, and expose an author to others who write
about similar subjects. Finally, let’s suppose that we have attracted the
attention of scads of prolific authors, perhaps a million authors each having
a hundred books to their name. Assuming no duplicates, we’re looking at
about 475 TB of data that need to be stored and processed.</p>
<h3 id="overview-of-a-solution">Overview of a Solution</h3>
<p>We begin with <a href="#figure-overview">an overview</a> of the components
involved in receiving and processing all of these books. In the figure,
concrete entities are represented by rounded rectangles while octagons are
used to represent pools or clusters of entities. Solid lines are used to
indicate the direction of communication between components. As a brief aside,
when we refer to “authors”, we are really referring to authors and the
software applications they are using to interact with our system.</p>
<figure class="display-mode" id="figure-overview">
<img src="/images/cloud-overview.png" alt="Figure: Interaction between components" />
<figcaption>Interaction between components.</figcaption>
</figure>
<p>The steps involved in uploading a book to the system are:</p>
<ol>
<li>An <code>Author</code> tells the <code>Dispatcher</code> that she has a book to upload.</li>
<li>The <code>Dispatcher</code> replies with the location of a <code>Receiver</code> that is ready
to handle the upload.</li>
<li>The <code>Author</code> connects to the given <code>Receiver</code> and begins transferring
the book which the <code>Receiver</code> writes to <code>File Storage</code>.</li>
<li>When the upload completes, an available <code>Extractor</code> goes to work on the
uploaded book, extracting the necessary meta-data.</li>
<li>The <code>Author</code> tells the <code>Dispatcher</code> that she has another book to upload
while the <code>Extractor</code> is culling information and storing it in
<code>Meta-Data Storage</code>.</li>
</ol>
<p>In the following sections, we will explore these interactions in greater
detail and suggest some possible improvements to this system.</p>
<h3 id="transfer-and-storage-of-the-books">Transfer and Storage of the Books</h3>
<p>The process begins when an author tells the central dispatcher that she has
some books to upload. The dispatcher then makes an informed decision as to
which receiver is available to handle the file transfer. The author connects
to the receiver and sends her books which the receiver happily stores in
“cloud” storage.</p>
<p>For simplicity, we’ll assume each receiver runs on its own server, virtual
instance, etc. Receiving a single file across the Internet is typically not
a CPU intensive operation because the CPU can process the data considerably
faster than it can be transferred. This means a single receiver is capable
of handling multiple uploads concurrently. The actual number of files a
receiver can safely handle depends on the amount of bandwidth available to
the server and the estimated transfer speed of an author.</p>
<p>When the dispatcher needs to find an available receiver to handle an upload,
it <a href="#note-topics-and-queues" id="ref-topics-and-queues">broadcasts*</a>
its request through the message queue. If a receiver thinks it can handle
the request it replies to the dispatcher, informing it of its current
load and bandwidth availability. The dispatcher picks a sufficient receiver
from these replies and routes the author’s request accordingly.</p>
<p>Once the upload has completed, the receiver passes a message to the extractors
letting them know that a new book is ready to be processed and where that book
resides within the file storage. The first available extractor consumes this
message and quickly gets to work.</p>
<p>The use of a message queue to handle event dispatching incurs more overhead
than if all of the components communicated directly; however there are some
important advantages to this approach that will be covered in the section
<a href="#the-value-of-a-message-queue">“The Value of a Message Queue”</a>.</p>
<h3 id="extracting-the-meta-data">Extracting the Meta-data</h3>
<p>Once a receiver has finished a transfer and written the book to the file
storage, an available extractor begins harvesting the meta-data contained
within the freshly uploaded book. The information it gathers is written to
the meta-data storage (perhaps an RDBMS, or a NoSQL document store if we
want more freedom in what constitutes meta-data.) When it finishes the
extraction, the author now has a categorized and excerpted book in the
system that connects them to other authors who have written something similar.</p>
<p>Again for simplicity, we’ll assume each extractor runs on its own
server or VM instance. While the system’s file storage is likely network
based, the transfer speeds will be considerably faster than the original
transfer across the Internet. Further, unlike copying a file, the meta-data
extraction process is bound to require more computing power. As a result,
a single extractor will not be able to handle as many books as a single
receiver, but it may be able to handle several books concurrently, depending
upon the resources available to it.</p>
<p>If the meta-data is successfully extracted, the extractor will notify the
system of this event. This notification can be used to monitor overall
system performance or to let a web application know that it should update
the author’s profile. If the extractor fails to fully process the book, it
could pass a message to the other extractors letting them know what data still
needs to be extracted. If the book can’t be processed, for instance if it is in
an unknown format, the extractor can pass along this information so that
the author can be informed of the problem through an email, a web application
or through the software they are using to upload their books.</p>
<h3 id="the-value-of-a-message-queue">The Value of a Message Queue</h3>
<p>As mentioned earlier, using a message queue to handle all of the communication
between components does incur some overhead. Each message has to be sent
from one component to the broker where it is then dispatched to the other
components that are listening for the message. The messages may be small, but
there is still some non-zero propagation delay between one component sending
the notification and another component receiving the message and responding
to it. However, the additional flexibility a message queue offers will
outweigh this overhead, especially when we need to quickly add or remove
receivers and extractors.</p>
<p>If we take the intermediate broker out of our example, our dispatcher needs
to be directly aware of all of the receivers. Further, the receivers need to
be aware of all of the extractors. Typically this will require configuration
files mapping out all of these relationships, and should you want to add more
workers to the pool, you will need to update the config and deploy it to all
existing workers so your new workers can start sharing in the load.</p>
<p>By having the message queue in place, any particular worker only needs to
know where to find the message queue and the storage devices it needs to
directly interact with. Receivers only need to know where the message
queue is and where to store the books that they are receiving. Extractors
need to know about the message queue, the meta-data storage and where to
find books that have been uploaded. When a new worker comes online, it tells
the message queue what notifications it cares about through subscriptions.</p>
<p>In our hypothetical book storage system, the dispatcher will subscribe to
an “available receivers” topic, and broadcast messages to a
“request for receiver” topic when it needs to connect an author with a
receiver. Each receiver will subscribe to the “request for receiver” topic,
waiting for the dispatcher to ask it for help, and will pass a message to
a “book uploaded” queue every time a new book is uploaded. Finally, our
extractors subscribe to the “book uploaded” queue and a “retry extraction”
queue. As these are queues instead of topics, the messages passed here
will be consumed by one and only one extractor. If the extraction completes
successfully, the extractor can pass a message to “book completed” topic. If
the extraction could not be finished it can signal this through the
“retry extraction” queue, allowing another extractor to consume the message
and continue the work.</p>
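To make the queue/topic distinction in that flow concrete, here is a toy in-memory broker. The names and API are entirely illustrative; a real deployment would use a dedicated broker such as RabbitMQ or ActiveMQ:

```ruby
# Toy in-memory broker illustrating topics (fan-out) vs. queues
# (single consumer). Purely a sketch, not production code.
class Broker
  def initialize
    @topics = Hash.new { |h, k| h[k] = [] }
    @queues = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe_topic(name, &handler)
    @topics[name] << handler
  end

  def subscribe_queue(name, &handler)
    @queues[name] << handler
  end

  # Topics fan out: every subscriber sees every message.
  def publish(topic, message)
    @topics[topic].each { |h| h.call(message) }
  end

  # Queues deliver each message to exactly one subscriber
  # (simple round-robin here).
  def enqueue(queue, message)
    handlers = @queues[queue]
    return if handlers.empty?
    handlers.rotate!
    handlers.first.call(message)
  end
end
```

A “book completed” notification published to a topic reaches every interested monitor, while each “book uploaded” message enqueued is consumed by exactly one extractor.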
<p>A big advantage here is that if a surge of new authors discover our service
and start uploading their works, we can spin up new workers to handle the
increase in demand without doing much work. The message queue maintains an
internal list of active sessions and their corresponding subscriptions so the
workers just need to know where to find the broker.</p>
<p>Another advantage to having this infrastructure in place is that it makes
introducing and responding to new events fairly trivial. Let’s say we want
our system to automatically grow and shrink in response to demand so if a
flood of new authors show up at 4 AM on a Saturday, no one’s trying to spin
up new virtual instances with a hangover. One way to accommodate this
feature would be to introduce a few system monitors; we’d want more than one
for redundancy’s sake. These monitors subscribe to the
“request for receiver,” “available receivers,” and “book completed” topics.
From here, they can measure how often the dispatcher is making requests for
receivers, what kind of load each available receiver is under and how long
it is taking the extractors to do their magic, and that’s enough information
to anticipate the need for more (or fewer, if demand drops) receivers and
extractors. The monitors can now spin up and spin down other instances
<a href="#note-monitors" id="ref-monitors">without changing*</a>
how the other components in the system operate.</p>
<h3 id="glossed-over-details-and-suggested-improvements">Glossed Over Details and Suggested Improvements</h3>
<h4 id="tracking-the-overall-job">Tracking the Overall Job</h4>
<p>The system outlined here performs its work based upon the production and
consumption of simple event messages passed through the message queue.
When the receiver finishes storing a new book, it only tells an extractor
where to find the file. We have not given the system a way to determine the
overall progress of a particular “job.” Fortunately, most message queues allow
for headers to be attached to any message, and this will allow us to better
track where a book is in the processing pipeline. If the dispatcher gives
an author the address of a receiver and a session key when she starts the
uploading process, she can in turn provide the receiver with this key. Now,
every message the receiver passes to the queue will contain this session
key as a message header and all consumers (namely the extractors) of these
messages will copy that header to the messages they produce. From there, we
can either passively monitor the progress of a book by tracking the session
key header, or we can actively poll workers for status updates for a given
session key. The latter will require some additional logic to be built
in to our workers.</p>
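As a small sketch of the passive-tracking idea (the message shape, key, and path here are entirely hypothetical):

```ruby
# A receiver's "book uploaded" message carries the session key the
# dispatcher handed out when the upload began.
uploaded = {
  headers: { "session_key" => "sess-42" },          # hypothetical key
  body:    { "book" => "/storage/books/123.epub" }  # hypothetical path
}

# An extractor copies the header onto every message it produces, so a
# monitor can trace the whole pipeline by session key alone.
completed = {
  headers: { "session_key" => uploaded[:headers]["session_key"] },
  body:    { "status" => "meta-data extracted" }
}
```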
<h4 id="working-in-bulk">Working in Bulk</h4>
<p>One way to reduce the relative chatter between components and the message
queue is for the system to prefer to work on batches of books whenever possible
instead of processing one book at a time. This doesn’t reduce the number of
messages being passed, but it does mean that fewer messages are being passed
per book.</p>
<h4 id="fault-tolerant-dispatching">Fault Tolerant Dispatching</h4>
<p>As outlined here, the dispatcher is a single point of failure. We can
improve upon this design by creating a dispatcher pool and using
<a href="http://www.tcpipguide.com/free/t_DNSNameServerLoadBalancing.htm">DNS load balancing</a>
to distribute the requests. Alternatively, it may be possible to remove the
need for dispatchers all together by having authors connect directly to
receivers by way of some content delivery network, though this option will
depend upon the particular “cloud service provider” being used.</p>
<h4 id="letting-authors-pull-their-weight">Letting Authors Pull Their Weight</h4>
<p>Generating a social graph for an author based upon the subjects of her works
requires information that is available only within our service. However,
personal categorization of her library and extracting excerpts from her books
can be done using only the information available on the author’s computer.
Why not let her computer share in the work? After her library has been
analyzed, she’ll transfer the book along with the meta-data produced locally,
giving our extractors more free time to spend with their families! Our
simplified message passing would need to be adjusted to accommodate this
scenario, of course, but being able to reduce the number of VM instances
probably warrants the adjustment.</p>
<p>Maybe we can go further. Authors probably like to keep their books organized,
maybe the software they use to write books allows them to add meta-data
elements like keywords, an abstract, and so forth. Wouldn’t it be nice if
the software they use to interact with our service could talk to the software
they already use to keep tabs on their books? Maybe their software doesn’t
have all of the meta-data we’re after, maybe its ability to organize books
is pretty limited (Kindle…) Our extractors may have to do some work, but
any meta-data we can provide before our system starts processing will
certainly reduce the time it takes an author to have their full library
available in the cloud.</p>
<h3 id="footnotes">Footnotes</h3>
<h4 id="note-topics-and-queues">Note: Topics and Queues</h4>
<p>The particulars will vary between message queue services,
but it is common for message queues to support two types of destinations:
<code>queues</code> and <code>topics</code>. Messages passed to a <code>queue</code> are delivered to a
single subscriber (a receiver in our example) while those passed to a
<code>topic</code> are delivered to all subscribers.
[ <a href="#ref-topics-and-queues">jump back</a> ]</p>
<h4 id="note-monitors">Note: Monitors</h4>
<p>I lied a little bit, though I’m choosing to call it a simplification. In
reality, if a monitor were to spin down a receiver while it was handling an
upload, the system we’ve been discussing would break. The monitor should,
instead, politely ask that a receiver stop taking new requests and when
said receiver has completed its current requests, it would gracefully shut
itself down. While we’re on the subject of omissions through simplification,
if a receiver were to fail to receive a book, we could have it send an
“upload failed” notification, providing the dispatcher with the opportunity
to notify the author that she needs to re-send the book to a newly determined
available receiver.
[ <a href="#ref-monitors">jump back</a> ]</p>
Events on a Train, Seeking a Ruby 1.9 World2011-04-24T00:00:00-04:00http://mathish.com/2011/04/24/events-on-a-train<p>So, true to my nature of nothing ever being quite “good enough,” I’m already
looking to add new features to <a href="http://mdvlrb.com/onstomp/">OnStomp</a> as well
as making plans for what version 2.0 will look like.</p>
<h3 id="events-on-a-separate-loop">Events on a separate loop</h3>
<p>First, the new features, which is to say a new <strong>feature</strong>. One thing that’s
<a href="https://github.com/meadvillerb/onstomp/issues/8">been bothering me</a> is that
most events are dispatched from the IO thread of an <code>OnStomp::Client</code>
instance. This means that long-running (or a long chain of short
running) event handlers, once triggered, will have to finish running before
further IO processing can occur. Another issue is that if an exception is
raised in any of these callbacks, it will generally close the connection. In
either case, IO can be negatively impacted by the programming approach the
gem tries to encourage.</p>
<p>A second issue, slightly more subtle but just as significant, is that
not all events are triggered in the same thread. The events that get
triggered outside the IO processing thread are <code>before_transmitting</code> and
<code>before_&lt;frame&gt;</code>. Let’s jump into an example:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="n">main_thread</span> <span class="o">=</span> <span class="no">Thread</span><span class="o">.</span><span class="n">current</span>
<span class="n">client</span> <span class="o">=</span> <span class="no">OnStomp</span><span class="o">.</span><span class="n">connect</span> <span class="s2">"stomp://localhost"</span>
<span class="n">client</span><span class="o">.</span><span class="n">before_transmitting</span> <span class="k">do</span> <span class="o">|</span><span class="n">f</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span><span class="o">|</span>
<span class="c1"># Thread.current == main_thread</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">before_send</span> <span class="k">do</span> <span class="o">|</span><span class="n">send_frame</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span><span class="o">|</span>
<span class="c1"># Thread.current == main_thread</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">on_send</span> <span class="k">do</span> <span class="o">|</span><span class="n">send_frame</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span><span class="o">|</span>
<span class="c1"># Thread.current != main_thread</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">after_transmitting</span> <span class="k">do</span> <span class="o">|</span><span class="n">f</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span><span class="o">|</span>
<span class="c1"># Thread.current != main_thread</span>
<span class="c1"># The current thread is the same here as in 'on_send'</span>
<span class="k">end</span>
<span class="n">client</span><span class="o">.</span><span class="n">send</span> <span class="s2">"/queue/test"</span><span class="p">,</span> <span class="s2">"Hello World!"</span></code></pre></div>
<p>Now, <code>before_transmitting</code> and <code>before_send</code> will be invoked (in that order)
before the actual <code>SEND</code> frame is sent off to a dark and mysterious buffer
where the IO processor will eventually get around to writing it to the
socket. This means, you don’t have to worry about mutex locking and whatnot
between these two groups of events. However, it still displeases me, as event
handling will be split across two distinct threads.</p>
<p>So, to solve these issues I’m probably going to drop in a second thread.
There are a few issues that need careful consideration. I’ll need to ensure
that all <code>before_*</code> events are triggered before a client-generated frame
gets sent to the IO write buffer. Also, it would be nice if all events
triggered within the failover extension used the same thread as well by
sharing an event dispatcher amongst all of the clients in the pool. This
will keep the overall thread count down, and resolve some of its finer quirks
that appear to be the result of events being triggered in a particular
client’s IO processing thread.</p>
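The plan could be sketched as a dedicated dispatch thread draining a queue of callbacks (illustrative only; this is not OnStomp's actual implementation):

```ruby
# Toy sketch of a dedicated event-dispatch thread: all callbacks run
# on one thread no matter which thread fired the event.
class EventDispatcher
  def initialize
    @queue = Queue.new
    @thread = Thread.new do
      # A nil sentinel shuts the loop down gracefully.
      while (event = @queue.pop)
        handler, args = event
        handler.call(*args)
      end
    end
  end

  # Safe to call from any thread (e.g. a client's IO processor).
  def trigger(handler, *args)
    @queue << [handler, args]
  end

  def stop
    @queue << nil
    @thread.join
  end
end
```

With every client in a failover pool sharing one such dispatcher, all event handlers would observe the same <code>Thread.current</code>, sidestepping the split-across-two-threads problem described above.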
<p>In short, I’ll follow the lead of
<a href="http://www.youtube.com/watch?v=HLjS3gzHetA">Arthur “Two Sheds” Jackson</a>.</p>
<h3 id="welcome-to-the-world-of-tomorrow">Welcome to the world of tomorrow!</h3>
<p>I’m eagerly awaiting the day when JRuby has full Ruby 1.9.2 support,
including non-blocking IO for OpenSSL connections. On that day, OnStomp 2.0
will hit the shelves, and it will require Ruby 1.9+. I have no intention
of totally abandoning Ruby 1.8.7, and the OnStomp 1.0 branch will always
support Ruby 1.8.7. That said, I am still looking forward to dropping all of
the conditional code and strange shit like:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="no">ENUMERATOR_KLASS</span> <span class="o">=</span> <span class="p">(</span><span class="no">RUBY_VERSION</span> <span class="o">>=</span> <span class="s1">'1.9'</span><span class="p">)</span> <span class="p">?</span> <span class="no">Enumerator</span> <span class="p">:</span> <span class="no">Enumerable</span><span class="o">::</span><span class="no">Enumerator</span></code></pre></div>
<p>It might even provide an opportunity to make use of <code>Fiber</code>s. It’s going to
be pretty sweet.</p>
<p>Next time, I’m getting off at Willoughby.</p>
Hello, Jekyll2011-04-23T00:00:00-04:00http://mathish.com/2011/04/23/hello-jekyll<p>I had been pretty happy using <a href="http://www.enkiblog.com/">Enki</a> to drive this
site. It’s minimal but sufficient, and very easy to tweak. However, one of
Enki’s major design decisions drove me into the loving arms of
<a href="https://github.com/mojombo/jekyll">Jekyll</a>: the use of OpenID for
authentication. To be fair, I’m mostly pissed at
<a href="https://www.myopenid.com/">myOpenID</a>, which has failed to authenticate me
several times over the last few months and today appears to have lost my
account entirely (sort of.) Nonetheless, Enki relies on OpenID for
authentication, and myOpenID divorced me, took the kids, and sold my house.</p>
<p>On the up side, moving to Jekyll gave me an excuse to play around with HTML5
and a new layout that I’ll no doubt be tweaking for some time to come.</p>
<p>If you’re using a browser that doesn’t support HTML5 elements, this new
layout will probably look like total bullshit, but that’s okay for a couple
reasons. First, I really do want to dive further into HTML5 and CSS3.
Second, I don’t think this site’s content warrants maximum cross-browser
compatibility.</p>
<p>That’s all for now, more to come soon now that I can blog with the combined
powers of <code>TextMate</code> and <code>git</code>.</p>
<p><em>Postscript</em>: I’m now using <a href="http://disqus.com/">disqus</a> for comments.
Hopefully I don’t end up regretting this decision.</p>
<p><em>Postscript 2</em>: I’ve deleted what was left of my myOpenID account. Farewell
OpenID, you were an interesting experiment, but I think your utility is going
to be relegated largely to the fringe.</p>
OnStomp 1.0.02011-04-03T00:00:00-04:00http://mathish.com/2011/04/03/onstomp-1-0-0<p>The <a href="https://github.com/meadvillerb/onstomp">OnStomp</a> gem version 1.0.0 has been
released. It deprecates my previous Ruby stomp client
<a href="https://github.com/iande/stomper">Stomper</a>.</p>
<p>OnStomp supports both the 1.0 and 1.1 STOMP protocols, sports an event-driven
interface and reads and writes sockets through non-blocking goodness.
You install the gem with</p>
<pre><code>gem install onstomp
</code></pre>
<p>While you’re at it, why not have a cup of coffee and peruse
<a href="http://mdvlrb.com/onstomp">the docs</a>. I’m in the process of writing user
and developer “narratives,” but the code itself has been documented with
<a href="http://yardoc.org/">YARD</a> (Yay!).</p>
<p>While I am the only contributor / author of the gem, I’m putting this under
the <a href="https://github.com/meadvillerb">meadvillerb</a> community to inflate its
codebase and maybe foster some Meadville Ruby action.</p>
<p>You can report any issues through the
<a href="https://github.com/meadvillerb/onstomp/issues">bugtracker</a>.</p>
<p>Note to self: stick with one markup language; using Textile and Markdown in
different contexts results in a lot of editing.</p>
Note to Self - Linux Server & Apple Clients2011-03-26T00:00:00-04:00http://mathish.com/2011/03/26/note-to-self-linux-server-apple-clients<p>Install <code>cups</code> and <code>splix</code> to get support for the Samsung ML-2510 Laser
Printer.</p>
<p>Follow <a href="http://www.kremalicious.com/2008/06/ubuntu-as-mac-file-server-and-time-machine-volume/">Ubuntu as Mac File Server</a> howto.</p>
<p>Note: you do not need to build netatalk from source; also add <code>uams_dhx2.so</code>
and <code>uams_guest.so</code> to the config file <code>afpd.conf</code>.</p>
Stomper 2.1.maybe?2011-03-24T00:00:00-04:00http://mathish.com/2011/03/24/stomper-2-1-maybe<p>After a significant re-tooling of Stomper’s IO handling, I’ve got something
that seems very fast and very stable, using non-blocking IO.</p>
<p>I’m debating how I want to approach this, though. The changes are
significant enough to warrant incrementing the minor number, potentially even
the major number, rather than just the bugfix/patch number. However, there’s
also a python library named stomper, so we’ll see what happens as I revise
tests and verify that everything that should work does.</p>
<p>Some preliminary tests suggest I can relay 5,000 frames each with a 1kb body
in about 10 seconds, which pleases me.</p>
<p>So, I’m definitely leaning toward a rename, going from Stomper to OnStomp.</p>
Stomper 2.0.[3-5]2011-03-09T00:00:00-05:00http://mathish.com/2011/03/09/stomper-2-0-3-5<p>In working out the particulars of a
<a href="https://github.com/iande/stomper-failover">failover</a> extension to
<a href="https://github.com/iande/stomper">Stomper</a>, I discovered a few subtle but
frustrating bugs. As a result, I pushed out 3 versions of the
<a href="https://rubygems.org/gems/stomper">Stomper gem</a> today.
This may be the greatest accomplishment of my life.</p>
<p>At the risk of sounding crazy, I’ve replaced my toes with grapes.</p>
STOMP 1.12011-02-09T00:00:00-05:00http://mathish.com/2011/02/09/stomp-1-1<p>Every post I make to the stomp-spec group that isn’t a “+1” or “me, too!”
is often followed up with a thought of “you’re a moron” about 10 minutes
later. I’m going to improve this situation by spending some time hashing out
my thoughts here, and if they still look good a couple hours later, then I’ll
post them.</p>
<h3 id="connect-and-connected-header-issues">CONNECT and CONNECTED Header Issues</h3>
<p>STOMP 1.0 does not allow the octets ‘:’ and LF to appear in header names.
It also does not allow the LF octet in a header value. These conditions make
sense, ‘:’ serves as a delimiter between a header name and its value, and LF
serves as a delimiter between headers. However, this means certain
characters cannot be used within headers, and that can be a problem, for
example, if you want to stuff a JSON encoded string into a header value.</p>
<p>STOMP 1.1 rectifies this by specifying escape sequences: ‘\n’ (two octets,
not the single LF character) for LF, and ‘\c’ for ‘:’. Naturally, since ‘\’
serves as the escape sequence indicator, we also need an escape sequence for
literal ‘\’ octets, and thus have ‘\\’.</p>
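<p>In Ruby, those three rules come down to a pair of <code>gsub</code>s. The helper names
below are my own, not from the spec or from any particular client:</p>

```ruby
# Sketch of the STOMP 1.1 header escaping rules.
ESCAPES   = { "\\" => "\\\\", "\n" => "\\n", ":" => "\\c" }
UNESCAPES = ESCAPES.invert

def escape_header(str)
  # Block form of gsub, so the replacement text is taken literally.
  str.gsub(/[\\\n:]/) { |ch| ESCAPES[ch] }
end

def unescape_header(str)
  # An escape sequence is a backslash plus one octet; fetch raises on
  # sequences the mapping does not define.
  str.gsub(/\\./) { |seq| UNESCAPES.fetch(seq) }
end
```

<p>Doing it in a single regex pass avoids the classic ordering bug where a
freshly escaped ‘\’ gets escaped a second time.</p>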
<p>The problem is that until the protocol version is fully negotiated, header
escaping/unescaping may produce undesirable results. That is where the
CONNECT and CONNECTED frames come in to play.</p>
<h4 id="scenario-1">Scenario 1</h4>
<p>A client could send a CONNECT frame with escaped headers. Eg:</p>
<pre><code>CONNECT
login:DOMAIN\\me
passcode:s6842FW!\c4284\\$
some\cheader\cname:value\\of\nthe header
accept-version:1.1
^@
</code></pre>
<p>The client has indicated that it will only accept a 1.1 connection and has
used escaping appropriate for that version. The server, however, does not
know that the client will only accept version 1.1 until it has already read
the escaped header from the stream.</p>
<h4 id="scenario-2">Scenario 2</h4>
<p>A client could send a CONNECT frame that accepts either version 1.0 or 1.1,
and use only STOMP 1.0 compliant headers. However, even before the server
issues a CONNECTED frame, it already knows which version of the spec is going
to be used. The client, however, does not. If the server decides on STOMP
1.1, and sends a CONNECTED frame to the client such as:</p>
<pre><code>CONNECTED
session:D\chostname-63348-1297283114292-4\c0
some\cheader\cname:value\\of\nthe header
version:1.1
^@
</code></pre>
<p>The server already knows we’re using STOMP 1.1, but the client will not figure that out until it reads the ‘version’ header.</p>
<h4 id="resolutions">Resolutions</h4>
<ol>
<li>Header processing of CONNECT frames by the server has to be deferred until
‘accept-version’ is read. If the list of versions includes 1.0, it would be bad
form for the client to have escaped headers according to 1.1, so no unescaping need
be done. However, if the client accepts only version 1.1, the server may have
to unescape headers. Similarly, a client will have to defer processing of
all headers of a received CONNECTED frame until ‘version’ is read. From
the value given, the client can then unescape headers and values (1.1), or
leave them untouched (1.0). In either case, the spec may benefit from an
explicit statement regarding the treatment of CONNECT/CONNECTED headers.</li>
<li>Essentially follow Resolution 1, but require the first header of CONNECT
to be ‘accept-version’ and the first header of CONNECTED to be ‘version’,
this allows servers and clients to decide on the escaping rules to follow
after reading the first header.</li>
<li>Require that CONNECT/CONNECTED frames follow STOMP 1.0 header rules,
regardless of the desired protocol version (ie: no escape sequences get
generated, no unescaping gets done, LF and ‘:’ are not allowed in header
names, LF is not allowed in header values)</li>
</ol>
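<p>Resolution 1’s deferred processing, for instance, might look like this on the
client side; the method names are illustrative, not OnStomp’s:</p>

```ruby
# Sketch of Resolution 1 from the client's side: keep the CONNECTED headers
# raw until 'version' is read, then unescape for 1.1 or take them verbatim
# for 1.0.
UNESCAPES = { "\\\\" => "\\", "\\n" => "\n", "\\c" => ":" }

def unescape(str)
  str.gsub(/\\./) { |seq| UNESCAPES.fetch(seq) }
end

def decode_connected_headers(raw_pairs)
  version_pair = raw_pairs.assoc('version')
  version = version_pair ? version_pair[1] : '1.0'
  if version == '1.1'
    raw_pairs.map { |name, value| [unescape(name), unescape(value)] }
  else
    raw_pairs # STOMP 1.0: no unescaping is defined
  end
end

decode_connected_headers([['session', 'D\\chost\\c0'], ['version', '1.1']])
```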
<p>Of these, Resolution 3 is the most appealing to me, because it doesn’t
require me to do anything else. Both Resolution 1 and Resolution 2 will
require some changes, though both are obviously doable. Resolution 1 doesn’t
require any further changes to the spec, except perhaps a warning about
deferring header decoding until you know what version is running. Resolution
2 is only helpful because of how I process headers on the stream, and is
probably just selfish.</p>
<p>Found the following in the <a href="http://stomp.github.com/stomp-specification-1.1.html#Protocol_Negotiation">Protocol Negotiation section</a>.</p>
<blockquote>
<p>The protocol that will be used for the rest of the session will be the
highest protocol version that both the client and server have in common.</p>
</blockquote>
<p>All that really remains then is clarifying when the “rest of the session”
begins (with the CONNECTED frame or after the CONNECTED frame.)</p>
Quirks of Array#each2011-02-01T00:00:00-05:00http://mathish.com/2011/02/01/quirks-of-array-each<p>I’m probably just late to the party, but at least with Ruby 1.8.7 and 1.9.2,
there’s nothing special that has to be done for each of these examples
to produce the same output:</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="o">[</span> <span class="o">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="o">]</span><span class="p">,</span> <span class="o">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="o">]</span><span class="p">,</span> <span class="o">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="o">]</span> <span class="o">].</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">trip</span><span class="o">|</span>
<span class="nb">puts</span> <span class="s2">"</span><span class="si">#{</span><span class="n">trip</span><span class="o">[</span><span class="mi">0</span><span class="o">]</span><span class="si">}</span><span class="s2"> / </span><span class="si">#{</span><span class="n">trip</span><span class="o">[</span><span class="mi">1</span><span class="o">]</span><span class="si">}</span><span class="s2"> / </span><span class="si">#{</span><span class="n">trip</span><span class="o">[</span><span class="mi">2</span><span class="o">]</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="o">[</span> <span class="o">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="o">]</span><span class="p">,</span> <span class="o">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="o">]</span><span class="p">,</span> <span class="o">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="o">]</span> <span class="o">].</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="o">|</span>
<span class="nb">puts</span> <span class="s2">"</span><span class="si">#{</span><span class="n">a</span><span class="si">}</span><span class="s2"> / </span><span class="si">#{</span><span class="n">b</span><span class="si">}</span><span class="s2"> / </span><span class="si">#{</span><span class="n">c</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span>
<span class="o">[</span> <span class="o">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="o">]</span><span class="p">,</span> <span class="o">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="o">]</span><span class="p">,</span> <span class="o">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="o">]</span> <span class="o">].</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">)</span><span class="o">|</span>
<span class="nb">puts</span> <span class="s2">"</span><span class="si">#{</span><span class="n">a</span><span class="si">}</span><span class="s2"> / </span><span class="si">#{</span><span class="n">b</span><span class="si">}</span><span class="s2"> / </span><span class="si">#{</span><span class="n">c</span><span class="si">}</span><span class="s2">"</span>
<span class="k">end</span></code></pre></div>
<p>I knew that wrapping the block parameters in parentheses worked when arrays
were yielded to the block. I did not realize that Ruby would automatically
do this when presented with a block that takes multiple parameters.</p>
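<p>The auto-splat only happens when the block receives a single array argument.
The moment a second argument shows up alongside it, the parentheses become
necessary again:</p>

```ruby
pairs = [[1, 2], [3, 4]]

# A single yielded array is auto-splatted across multiple block parameters:
flat = []
pairs.each { |a, b| flat << [a, b] }

# With each_with_index the block receives two arguments (element, index),
# so parentheses are required to destructure the inner array:
indexed = []
pairs.each_with_index { |(a, b), i| indexed << [a, b, i] }
```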
mathlib Found2011-01-27T00:00:00-05:00http://mathish.com/2011/01/27/mathlib-found<p>Roughly 5-6 years ago, I made an effort to take Fractal rendering code I
developed during college and refactor it into a general purpose Java library,
<a href="https://github.com/iande/mathlib">mathlib.jar</a>. A hard drive failure and
desktop replacement later, and I had assumed that code was lost. In fact,
I seem to recall a fight between my former wife and me over the matter, but at
any rate, it appears I was wrong.</p>
<p>I doubt this code is of much utility to anyone, even I probably won’t get
much direct use out of it as I don’t do much with Java anymore. However,
there were some novel bits in there, and implementations of a number of
non-trivial functions extended to
<em>f</em> : ℂ → ℂ. At the very least, it will
probably help with my work on implementing a Mandelbrot renderer in HTML5
using <code><canvas></code> and <code>WebWorker</code>s.</p>
Mandelbrot Sets in Javascript2011-01-17T00:00:00-05:00http://mathish.com/2011/01/17/mandelbrot-sets-in-javascript<p>I’m a big fan of fractals. From
<a href="http://en.wikipedia.org/wiki/L-system">Lindenmayer Systems</a> to variations on
the <a href="http://en.wikipedia.org/wiki/Mandelbrot_set">Mandelbrot set</a>, they all
have a special place in a statistically self-similar region of my brain.
Archimedes can keep his circles, I’ll stick with the striking complexity of
chaos. Given this mild obsession, it should not come as a surprise that one of
the first applications I enjoy making when working with a new GUI environment
is a fractal generator. With the addition of Web Workers and programmatic
drawing via <code><canvas></code> elements in modern JS implementations, I find the past
repeating itself with affine self-similarity.</p>
<p>I am currently writing the rendering code, relying on
<a href="http://jquery.com">jQuery</a> and the aforementioned recent javascript
additions. For those that do not share in my fractal-eroticism, what I
produce can still be useful as an example of using Web Workers for
semi-concurrent javascript code execution.</p>
pg gem on OS X2011-01-15T00:00:00-05:00http://mathish.com/2011/01/15/pg-gem-on-snow-leopard<h3 id="building-pg-gem-on-os-x-snow-leopard">Building <code>pg</code> gem on OS X Snow Leopard</h3>
<p>After PostgreSQL has been installed with ports, <code>pg_config</code> and other tools
can be found in <code>"/opt/local/lib/postgresql84/bin"</code>. Add this path to
<code>$PATH</code> and set <code>ARCHFLAGS="-arch x86_64"</code>.</p>
<p>These can be set within the same command line when installing the gem:</p>
<pre><code>PATH=${PATH}:/opt/local/lib/postgresql84/bin ARCHFLAGS="-arch x86_64" gem install pg
</code></pre>
<p>They can also be set in their own right:</p>
<pre><code>export PATH=${PATH}:/opt/local/lib/postgresql84/bin
export ARCHFLAGS="-arch x86_64"
bundle
</code></pre>
OpenSSL - Brief Notes2011-01-15T00:00:00-05:00http://mathish.com/2011/01/15/openssl-brief-notes<p>When setting up OpenSSL validation in Ruby, I ran into a few issues. I’ll
revisit this post later, but for my own memory, here’s the big one:</p>
<p>Use the <code>openssl</code> command that matches the version that Ruby was built
against. This isn’t a big issue for the most part, but the <code>c_rehash</code>
command, which creates symlinks to certs based on a hash, relies on different
hashing techniques in OpenSSL 0.9.x and 1.x.</p>
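<p>One way to check for this mismatch from Ruby itself is to compute the subject
hash that Ruby’s own OpenSSL build uses and compare it against the symlink
names <code>c_rehash</code> created. The throwaway self-signed certificate below just
keeps the sketch self-contained; in practice you’d load your existing PEM file:</p>

```ruby
require 'openssl'

# Build a throwaway self-signed certificate so the example runs on its own;
# normally you'd read a PEM with OpenSSL::X509::Certificate.new(File.read(...)).
key  = OpenSSL::PKey::RSA.new(2048)
name = OpenSSL::X509::Name.parse('/CN=broker/O=example')

cert = OpenSSL::X509::Certificate.new
cert.version    = 2
cert.serial     = 1
cert.subject    = name
cert.issuer     = name
cert.public_key = key.public_key
cert.not_before = Time.now
cert.not_after  = Time.now + 3600
cert.sign(key, OpenSSL::Digest.new('SHA256'))

# c_rehash names each symlink "<subject-hash>.0". Name#hash here comes from
# the OpenSSL library Ruby was linked against, so if this name doesn't match
# what c_rehash created on disk, the two OpenSSL versions differ.
expected_link = format('%08x.0', cert.subject.hash)
```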
<p>Also, migrating keys between <code>openssl</code> and Java’s <code>keytool</code> is a lot like
having teeth pulled but without the novocaine and prescription pain killers
afterward.</p>
<p>Some links that were useful in this whole process:</p>
<ul>
<li><a href="http://www.ibm.com/developerworks/linux/library/l-openssl.html">http://www.ibm.com/developerworks/linux/library/l-openssl.html</a></li>
<li><a href="http://conshell.net/wiki/index.php/Keytool_to_OpenSSL_Conversion_tips">http://conshell.net/wiki/index.php/Keytool_to_OpenSSL_Conversion_tips</a></li>
<li><a href="http://activemq.apache.org/how-do-i-use-ssl.html">http://activemq.apache.org/how-do-i-use-ssl.html</a></li>
<li><a href="https://github.com/ruby/ruby/blob/trunk/sample/openssl/echo_cli.rb">https://github.com/ruby/ruby/blob/trunk/sample/openssl/echo_cli.rb</a></li>
<li><a href="http://andyjeffries.co.uk/articles/x509-encrypted-authenticated-socket-ruby-client">http://andyjeffries.co.uk/articles/x509-encrypted-authenticated-socket-ruby-client</a></li>
</ul>
<p>As anyone who has done any SSL work in Ruby knows,
<a href="http://ruby-doc.org/stdlib/libdoc/openssl/rdoc/index.html">Ruby’s OpenSSL Docs</a>
suck, but as I am not doing anything to directly improve them, I suppose bitching is rather pointless.</p>
Helpful Steps to Eat an Evening2011-01-15T00:00:00-05:00http://mathish.com/2011/01/15/helpful-steps-to-eat-an-evening<p>Why spend our free time doing things we enjoy when it could be better spent
fighting with calculating machines?</p>
<ol>
<li>Update rubygems via <code>gem update --system</code>, only to discover after the fact
that RubyGems 1.4 is not compatible with Ruby 1.9.x.</li>
<li>Uninstall all gems, all Ubuntu ruby packages and then
<code>rm -rf /usr/lib/ruby/</code> just for good measure.</li>
<li>Black out again, wake up in a car that is
<a href="http://www.youtube.com/watch?v=AUhE5KsJ5hk">Tokyo drifting everywhere</a></li>
<li>Install <a href="http://rvm.beginrescueend.com/">RVM</a></li>
<li>Re-configure <a href="https://github.com/capistrano/capistrano">capistrano</a> to use
the RVM plugin</li>
<li>Get the first successful <code>cap deploy</code> of the evening, but we’re running
the production site on a sqlite database.</li>
<li>Add <code>gem 'pg'</code> to the <code>Gemfile</code></li>
<li>Fight with getting bundle to update <code>Gemfile.lock</code> until I remember to add
the MacPorts’ postgres install directory to <code>PATH</code> and export
<code>ARCHFLAGS="-arch x86_64"</code></li>
<li>Re-<code>cap deploy</code>, Re-<code>cap deploy:migrate</code>.</li>
<li>Here we are.</li>
</ol>
<p>On the plus side, the newest version of
<a href="http://www.modrails.com/">Phusion Passenger</a> runs with RVM like a god
damned champ.</p>
OpenSSL in Ruby2011-01-14T00:00:00-05:00http://mathish.com/2011/01/14/openssl-in-ruby<p>The following code assumes that there is a subdirectory named <code>certs</code>
containing known certificates in PEM format, and a subdir <code>keys</code>
containing the client’s private RSA key. Further, there are lots of comments
specific to my actual needs, namely exporting keys generated in Java using
<code>keytool</code> for an <a href="http://activemq.apache.org/">Apache ActiveMQ</a> message
broker. Lastly, to use the <code>ca_path</code> method, the <code>certs</code> directory needs to
be properly indexed using <code>c_rehash</code> (make sure the underlying version of
<code>openssl</code> matches the version Ruby’s OpenSSL extension was built against,
otherwise the hash algorithm may not be the same.)</p>
<p>The code that follows was written for my own benefit in understanding the
mapping between the OpenSSL C API and the API available in Ruby. The actual
connection established is specific to my needs, but the OpenSSL setup should
be pretty common. The type of the private key will differ depending upon the
algorithm used during the generation of the certificate.</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"><span class="c1">#!/usr/bin/env ruby</span>
<span class="nb">require</span> <span class="s1">'socket'</span>
<span class="nb">require</span> <span class="s1">'openssl'</span>
<span class="no">SSL_HOST</span> <span class="o">=</span> <span class="s1">'localhost'</span>
<span class="no">SSL_PORT</span> <span class="o">=</span> <span class="mi">61612</span>
<span class="no">SSL_CERT_DIR</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">expand_path</span><span class="p">(</span><span class="s1">'certs'</span><span class="p">,</span> <span class="no">File</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="bp">__FILE__</span><span class="p">))</span>
<span class="no">SSL_BROKER_CERT</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">expand_path</span><span class="p">(</span><span class="s1">'broker.pem'</span><span class="p">,</span> <span class="no">SSL_CERT_DIR</span><span class="p">)</span>
<span class="no">SSL_CLIENT_CERT</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">expand_path</span><span class="p">(</span><span class="s1">'client.pem'</span><span class="p">,</span> <span class="no">SSL_CERT_DIR</span><span class="p">)</span>
<span class="no">SSL_CLIENT_KEY</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">expand_path</span><span class="p">(</span><span class="s1">'keys/client.key'</span><span class="p">,</span> <span class="no">File</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="bp">__FILE__</span><span class="p">))</span>
<span class="no">USE_BROKER_CERT_FILE</span> <span class="o">=</span> <span class="kp">false</span>
<span class="no">USE_CLIENT_CERT</span> <span class="o">=</span> <span class="kp">false</span>
<span class="c1"># Things to note:</span>
<span class="c1">#</span>
<span class="c1"># I am using Apache ActiveMQ to test this SSL stuff. It's java based, so there's</span>
<span class="c1"># a lot of `keytool` to `openssl` conversion going on, including an external Java</span>
<span class="c1"># program to dump the private key from the keytool keystore, because apparently</span>
<span class="c1"># keytool doesn't provide a way to do that.</span>
<span class="c1">#</span>
<span class="c1"># The commands used:</span>
<span class="c1">#</span>
<span class="c1"># 1. Active MQ client and broker cert generation:</span>
<span class="c1"># Create the broker cert/keypair</span>
<span class="c1"># > keytool -genkey -alias broker -keyalg RSA -keystore broker.ks</span>
<span class="c1"># Export the broker certificate (DER format)</span>
<span class="c1"># > keytool -export -alias broker -keystore broker.ks -file broker.der</span>
<span class="c1"># Create the client cert/keypair</span>
<span class="c1"># > keytool -genkey -alias client -keyalg RSA -keystore client.ks</span>
<span class="c1"># Add the broker cert to the client's trust-store (just to generate the trust stores,</span>
<span class="c1"># we don't use the client store on the client side, because we're using Ruby + OpenSSL)</span>
<span class="c1"># > keytool -import -alias broker -keystore client.ts -file broker.der</span>
<span class="c1"># Export client certificate</span>
<span class="c1"># > keytool -export -alias client -keystore client.ks -file client.der</span>
<span class="c1"># Import client cert into broker trust store</span>
<span class="c1"># > keytool -import -alias client -keystore broker.ts -file client.der</span>
<span class="c1"># This gets all the keys and whatnot set up for ActiveMQ. Indeed, these steps</span>
<span class="c1"># can be found at: http://activemq.apache.org/how-do-i-use-ssl.html</span>
<span class="c1">#</span>
<span class="c1"># Next, we need to get these keys into OpenSSL acceptable forms</span>
<span class="c1"># (see: http://conshell.net/wiki/index.php/Keytool_to_OpenSSL_Conversion_tips)</span>
<span class="c1"># Convert the broker keytool DER cert into a PEM cert</span>
<span class="c1"># > openssl x509 -out broker.pem -outform pem -in broker.der -inform der</span>
<span class="c1"># Convert the client keytool DER cert into a PEM cert</span>
<span class="c1"># > openssl x509 -out client.pem -outform pem -in client.der -inform der</span>
<span class="c1"># As I am using ActiveMQ, there isn't a need to generate anything more on the</span>
<span class="c1"># broker side. The client just needs the PEM form for SSL trust.</span>
<span class="c1"># However, when the broker requires ssl authentication (needClientAuth=true on the</span>
<span class="c1"># transport URI), we will need the client's private key from the keystore as well.</span>
<span class="c1"># Unfortunately, there is no keytool command (as far as I've seen so far) that will</span>
<span class="c1"># export this from a java keystore. So, we make use of the DumpKey program copied</span>
<span class="c1"># from http://www.herongyang.com/crypto/Migrating_Keys_keytool_to_OpenSSL.html and</span>
<span class="c1"># found in examples/DumpKey.java to export the private key.</span>
<span class="c1"># Finally, we convert the private key output to a form usable by OpenSSL:</span>
<span class="c1"># > openssl enc -in client_bin.key -out client.key -a</span>
<span class="c1"># And wrap the output file with "-----BEGIN/END PRIVATE KEY-----" as outlined in</span>
<span class="c1"># http://www.herongyang.com/crypto/Migrating_Keys_keytool_to_OpenSSL_4.html</span>
<span class="c1">#</span>
<span class="c1"># Quite a bit of work... thanks Java! Hopefully tests will require less work</span>
<span class="c1"># by using only OpenSSL within a stub broker.</span>
<span class="n">tcp_sock</span> <span class="o">=</span> <span class="no">TCPSocket</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="no">SSL_HOST</span><span class="p">,</span> <span class="no">SSL_PORT</span><span class="p">)</span>
<span class="n">ctx</span> <span class="o">=</span> <span class="no">OpenSSL</span><span class="o">::</span><span class="no">SSL</span><span class="o">::</span><span class="no">SSLContext</span><span class="o">.</span><span class="n">new</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">verify_mode</span> <span class="o">=</span> <span class="no">OpenSSL</span><span class="o">::</span><span class="no">SSL</span><span class="o">::</span><span class="no">VERIFY_PEER</span><span class="o">|</span><span class="no">OpenSSL</span><span class="o">::</span><span class="no">SSL</span><span class="o">::</span><span class="no">VERIFY_FAIL_IF_NO_PEER_CERT</span>
<span class="k">if</span> <span class="no">USE_BROKER_CERT_FILE</span>
<span class="c1"># Specify the cert file directly</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">ca_file</span> <span class="o">=</span> <span class="no">SSL_BROKER_CERT</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">ca_path</span> <span class="o">=</span> <span class="kp">nil</span>
<span class="k">else</span>
<span class="c1"># ... or the path to a series of cert files</span>
<span class="c1">#</span>
<span class="c1"># Theoretically, either method would work once c_rehash has been run on</span>
<span class="c1"># a directory containing the certs. However, with OpenSSL 1.0.0a, the</span>
<span class="c1"># ca_path setting appears to not work at all, so I cannot test this</span>
<span class="c1"># at the moment.</span>
<span class="c1">#</span>
<span class="c1"># Scratch that, OpenSSL didn't break anything, I did.</span>
<span class="c1"># Ruby's OpenSSL extension was built against the OS X OpenSSL version (0.9.8l)</span>
<span class="c1"># I was using c_rehash in such a way that it was calling on the 1.0.0a version</span>
<span class="c1"># installed via port. ca_path isn't broken, but the hashing mechanism between</span>
<span class="c1"># 0.9.x and 1.0.x is different.</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">ca_file</span> <span class="o">=</span> <span class="kp">nil</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">ca_path</span> <span class="o">=</span> <span class="no">SSL_CERT_DIR</span>
<span class="k">end</span>
<span class="k">if</span> <span class="no">USE_CLIENT_CERT</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">cert</span> <span class="o">=</span> <span class="no">OpenSSL</span><span class="o">::</span><span class="no">X509</span><span class="o">::</span><span class="no">Certificate</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="no">SSL_CLIENT_CERT</span><span class="p">))</span>
<span class="n">ctx</span><span class="o">.</span><span class="n">key</span> <span class="o">=</span> <span class="no">OpenSSL</span><span class="o">::</span><span class="no">PKey</span><span class="o">::</span><span class="no">RSA</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="no">SSL_CLIENT_KEY</span><span class="p">))</span>
<span class="k">end</span>
<span class="nb">puts</span> <span class="s2">"CA File: </span><span class="si">#{</span><span class="n">ctx</span><span class="o">.</span><span class="n">ca_file</span><span class="si">}</span><span class="s2">"</span>
<span class="nb">puts</span> <span class="s2">"CA Path: </span><span class="si">#{</span><span class="n">ctx</span><span class="o">.</span><span class="n">ca_path</span><span class="si">}</span><span class="s2">"</span>
<span class="n">ssl_sock</span> <span class="o">=</span> <span class="no">OpenSSL</span><span class="o">::</span><span class="no">SSL</span><span class="o">::</span><span class="no">SSLSocket</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">tcp_sock</span><span class="p">,</span> <span class="n">ctx</span><span class="p">)</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">sync_close</span> <span class="o">=</span> <span class="kp">true</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">connect</span>
<span class="k">begin</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">post_connection_check</span><span class="p">(</span><span class="s1">'Apache ActiveMQ'</span><span class="p">)</span>
<span class="k">rescue</span> <span class="o">=></span> <span class="n">ex</span>
<span class="nb">puts</span> <span class="s2">"!!!!!! WARNING !!!!!!!"</span>
<span class="nb">puts</span>
<span class="nb">puts</span> <span class="s2">"Post check failed"</span>
<span class="nb">puts</span> <span class="s2">"</span><span class="si">#{</span><span class="n">ex</span><span class="o">.</span><span class="n">inspect</span><span class="si">}</span><span class="s2">"</span>
<span class="nb">puts</span>
<span class="nb">puts</span> <span class="s2">"!!!!!! WARNING !!!!!!!"</span>
<span class="k">end</span>
<span class="nb">puts</span>
<span class="nb">puts</span> <span class="n">ssl_sock</span><span class="o">.</span><span class="n">peer_cert_chain</span><span class="o">.</span><span class="n">inspect</span>
<span class="nb">puts</span>
<span class="c1"># ssl_sock.puts 'GET / HTTP/1.1'</span>
<span class="c1"># ssl_sock.puts "Host: #{SSL_HOST}"</span>
<span class="c1"># ssl_sock.puts 'Connection: close'</span>
<span class="c1"># ssl_sock.puts</span>
<span class="c1"># </span>
<span class="c1"># in_body = false</span>
<span class="c1"># while line = ssl_sock.gets</span>
<span class="c1"># in_body ||= line.chomp.empty?</span>
<span class="c1"># puts line if !in_body</span>
<span class="c1"># end</span>
<span class="c1"># STOMP frames end with a NUL byte; puts also tacks a newline on after the</span>
<span class="c1"># \x00, which ActiveMQ tolerates as padding between frames.</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">puts</span> <span class="s2">"CONNECT</span><span class="se">\n\n\x00</span><span class="s2">"</span>
<span class="c1">#sleep(2)</span>
<span class="c1"># Reader thread: accumulate bytes until the NUL that terminates each frame,</span>
<span class="c1"># then print the frame. The `rescue nil` ends the loop once the socket is</span>
<span class="c1"># closed by the main thread.</span>
<span class="n">t</span> <span class="o">=</span> <span class="no">Thread</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">ssl_sock</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">s</span><span class="o">|</span>
<span class="n">read_it</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">while</span> <span class="p">(</span><span class="n">c</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">getc</span> <span class="k">rescue</span> <span class="kp">nil</span><span class="p">)</span>
<span class="k">if</span> <span class="n">c</span><span class="o">.</span><span class="n">ord</span> <span class="o">==</span> <span class="mi">0</span>
<span class="nb">puts</span> <span class="s2">"Read Frame: </span><span class="si">#{</span><span class="n">read_it</span><span class="si">}</span><span class="s2">"</span>
<span class="n">read_it</span> <span class="o">=</span> <span class="s2">""</span>
<span class="k">else</span>
<span class="n">read_it</span> <span class="o"><<</span> <span class="n">c</span>
<span class="k">end</span>
<span class="k">end</span>
<span class="c1">#s.puts "DISCONNECT\n\n\x00"</span>
<span class="c1">#puts "Finally: #{s.gets}"</span>
<span class="k">end</span>
<span class="c1"># Subscribe, then send to the same queue; the reader thread should print</span>
<span class="c1"># both the RECEIPT (for rcpt-001) and the MESSAGE frames.</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">puts</span> <span class="s2">"SUBSCRIBE</span><span class="se">\n</span><span class="s2">id:sub-001</span><span class="se">\n</span><span class="s2">destination:/queue/testing/ssl</span><span class="se">\n</span><span class="s2">ack:auto</span><span class="se">\n\n\000</span><span class="s2">"</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">puts</span> <span class="s2">"SEND</span><span class="se">\n</span><span class="s2">destination:/queue/testing/ssl</span><span class="se">\n</span><span class="s2">receipt:rcpt-001</span><span class="se">\n</span><span class="s2">content-length:5</span><span class="se">\n\n</span><span class="s2">hello</span><span class="se">\000</span><span class="s2">"</span>
<span class="nb">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="nb">puts</span>
<span class="nb">puts</span> <span class="s2">"Disconnecting"</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">puts</span> <span class="s2">"DISCONNECT</span><span class="se">\n\n\x00</span><span class="s2">"</span>
<span class="n">t</span><span class="o">.</span><span class="n">join</span>
<span class="n">ssl_sock</span><span class="o">.</span><span class="n">close</span></code></pre></div>
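<p>One caveat with the byte-at-a-time reader above: splitting frames on NUL
bytes works for these text frames, but a frame carrying a
<code>content-length</code> header may legally contain NULs in its body. A
sketch of a length-aware read (the <code>read_frame</code> helper is
hypothetical, not part of the script above):</p>
<div class="highlight"><pre><code class="language-ruby" data-lang="ruby"># Read headers up to the blank line, then honor content-length if present.
def read_frame(sock)
  head = ""
  head += sock.readline until head.end_with?("\n\n")
  if head =~ /^content-length:(\d+)$/
    body = sock.read($1.to_i)
    sock.getc  # consume the trailing NUL
  else
    body = sock.readline("\x00").chomp("\x00")
  end
  [head, body]
end</code></pre></div>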