Thursday 13 November 2008

Hash Wrangling

Someone once asked me, "Just what is a Symbol in Ruby?" While stumbling around the web, I found this: http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol

The way I personally think of it is as an immutable string that has a hashed internal representation, which probably corresponds to definition 4 from the web page above. Having coded to the Ruby C API, I also think of symbols in terms of that hashed internal representation and the C typedef ID (definitions 5 and 13).
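As a rough sketch of what "immutable string with a single internal identity" looks like from plain Ruby (the variable names are just for illustration, and this says nothing about how the C-level ID is actually stored):

```ruby
# The same symbol literal always refers to one shared object.
a = :name
b = :name
same_symbol = a.equal?(b)        # identity comparison, not just equality

# Two separately built strings with the same contents are distinct objects.
s1 = String.new("name")
s2 = String.new("name")
same_contents = (s1 == s2)
same_string   = s1.equal?(s2)

# Symbols cannot be modified in place; ordinary strings can.
symbol_frozen = :name.frozen?
s1 << "!"                        # mutates s1 only, s2 is untouched

# Converting a string hands back the existing symbol rather than a new one.
interned = "name".to_sym.equal?(:name)

puts "same symbol object: #{same_symbol}"
puts "same string contents: #{same_contents}"
puts "same string object: #{same_string}"
puts "symbol frozen: #{symbol_frozen}"
puts "to_sym returns existing symbol: #{interned}"
```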

But this also reminds me of something I've been thinking about for a while with respect to hashes. Let's call it hash-wrangling:

wrangle |ˈraŋg(ə)l|
noun: a dispute or argument, typically one that is long and complicated.
verb:
  1. [intrans.] have such a dispute or argument: [as n.] (wrangling) weeks of political wrangling. See note at quarrel.
  2. [trans.] round up, herd, or take charge of (livestock): the horses were wrangled early.
  3. another term for wangle.
ORIGIN late Middle English: compare with Low German wrangeln, frequentative of wrangen ‘to struggle’; related to wring.

Hashes are a great programming construct that I first encountered as awk's associative arrays. As a former C programmer, I found hashes great because they free you from the mundane simplicity of C's data structures and let you think about the relationships between different data. Gone are all thoughts of B-trees, search algorithms, etc., and you're left with just the hash key and its related value.

The problem I see with hashes, however, is that it is far too easy to get the key wrong. It's quite easy to misspell a key and have the resulting value stored in, or fetched from, the wrong place. The interpreter doesn't care what the hash keys are, and will just return a null (nil in Ruby) when the value is fetched from the wrong place.
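A minimal sketch of the failure mode, using a made-up `person` hash with deliberately misspelled keys. The `fetch` call at the end is an aside rather than part of the argument: it is the standard Hash method that raises on a missing key instead of returning nil.

```ruby
# Hypothetical record; the keys and values are only for illustration.
person = { :name => "Ada", :email => "ada@example.com" }

# A misspelled key on write silently creates a second entry
# and leaves the real :email value untouched.
person[:emial] = "new@example.com"

# A misspelled key on read silently returns nil.
value = person[:nmae]

puts "email is still: #{person[:email]}"
puts "keys are now:   #{person.keys.inspect}"
puts "misspelled read returned nil: #{value.nil?}"

# Hash#fetch complains about an unknown key instead of returning nil.
begin
  person.fetch(:nmae)
rescue KeyError => e
  puts "fetch raised: #{e.message}"
end
```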

I've sometimes considered creating forms of enumerated types for the hash keys, as a convenient way to verify that only the correct keys are being used for a given hash. But this gets very long-winded very quickly. And not only is it easy to get the key wrong, but with a hash using the "unknown" datatype (needed to allow nested hashes), it's also easy to get the type of the value wrong.

I think the only solution for this is to write test scripts that exercise every line of code. This sort of approach only goes so far, however, and doesn't necessarily catch those cases where the hash keys are generated at runtime. You end up having to write lots of code that validates the inbound hash keys.
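The kind of validation code I mean might look something like this. It is only a sketch: `ALLOWED_KEYS`, `validate_keys!` and the settings hash are names I've invented for the example.

```ruby
# Hypothetical set of keys that a connection-settings hash may contain.
ALLOWED_KEYS = [:host, :port].freeze

# Returns the hash unchanged if every key is known; raises otherwise.
def validate_keys!(settings)
  unknown = settings.keys - ALLOWED_KEYS
  unless unknown.empty?
    raise ArgumentError, "unknown keys: #{unknown.inspect}"
  end
  settings
end

# Keys built at runtime (here from strings, as if read from a file)
# go through the same check. "prot" is a deliberate misspelling.
incoming = { "host".to_sym => "example.com", "prot".to_sym => 8080 }

begin
  validate_keys!(incoming)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Every hash that crosses a boundary needs a call like this, which is where the volume of validation code comes from.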

This is where an object-oriented language like Ruby can help a lot, because you can build the validation into the class definition itself. The class is basically a hash structure that you've defined in your code up front, and that is validated every time you access it. A hash is then relegated to holding only those values that have been received from outside but not yet passed to the class for validation.
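Here is one way that could look, as a sketch rather than a recommended design. The `Person` class, its two attributes and the type rules are all assumptions made up for the example.

```ruby
# Hypothetical class standing in for a hash with :name and :age keys.
class Person
  attr_reader :name, :age

  # Accepts a raw hash from outside and validates it on the way in.
  def initialize(attrs)
    unknown = attrs.keys - [:name, :age]
    unless unknown.empty?
      raise ArgumentError, "unknown keys: #{unknown.inspect}"
    end
    self.name = attrs[:name]
    self.age  = attrs[:age]
  end

  # Each writer checks the type of the value as well as storing it.
  def name=(value)
    raise TypeError, "name must be a String" unless value.is_a?(String)
    @name = value
  end

  def age=(value)
    raise TypeError, "age must be an Integer" unless value.is_a?(Integer)
    @age = value
  end
end

ada = Person.new(:name => "Ada", :age => 36)
puts ada.name

# A misspelled "key" is now a missing method, so it fails loudly
# instead of quietly returning nil.
begin
  ada.nmae
rescue NoMethodError => e
  puts "caught: #{e.class}"
end
```

The key check and the type check each live in exactly one place, and every caller gets them for free.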

This is not to say that the structure of the class and its data should become static. There is tremendous power in the way a dynamic language like Ruby has all classes open for the addition of new methods all the time. So, even if your class validates the data through its definition, thus staying DRYer in your approach to data structures, you can easily add more capabilities later.
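To illustrate the open-class point, here is a small sketch with an invented `Account` class. The second `class Account` block stands in for code added later, perhaps in another file.

```ruby
# Original definition: validates its data up front.
class Account
  attr_reader :balance

  def initialize(balance)
    unless balance.is_a?(Numeric)
      raise ArgumentError, "balance must be numeric"
    end
    @balance = balance
  end
end

# An object created before the class is extended.
acct = Account.new(-5)

# Later on: reopen the same class and add a capability.
# The existing validation is untouched.
class Account
  def overdrawn?
    @balance < 0
  end
end

# Objects that already exist pick up the new method too.
puts acct.overdrawn?
```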