Friday, December 30, 2005

Moving my blog

I've moved my blog to

Sorry for the inconvenience.

Saturday, December 03, 2005

Ubuntu Below Zero Photos

I've put up my Ubuntu Below Zero pictures. These pictures manage to catch the last half of UBZ. Many of these photos were taken in darker areas, some with flash and some without.

I have also kept these photos in their original, untouched format. Please feel free to ask for help if you don't know how to remove red eye from photos.

On to the photos! Ubuntu Below Zero photos

Wednesday, November 30, 2005

Test Driven Development

I've spent the last week or so pairing with Robert Collins (Bazaar-NG, Squid). His sole goal was to force-feed me TDD whether I liked it or not. He wasn't interested in my concerns about whether TDD caused problems with design. He was deaf to my worries that the code we worked on wouldn't result in solid testing. Nope. Not Rob. "I'm not going to make you do this for the rest of your life, but you're going to do it this week."

I jumped in with the promise that this particular tunnel really did have an end. The worst that could happen was that I would have to load up on tests and rewrite the code so that it would be sufficiently flexible to evolve over its lifetime. The best that could happen was that I would have to swallow my pride and admit that he was right.

I didn't start writing test cases for my code until about a year ago. There wasn't much point; anything that I could write a test case for, I could test more flexibly by hand. I just didn't like the idea of putting aside real code for fake code whose sole purpose was making the real stuff do some functional equivalent of jumping jacks.

Somebody once pushed me to try using them anyway. He suggested. He prodded. He reminded. He pushed. He said something very similar to what Rob did: "Do it for a week and then decide whether or not there's use."

So I tried it. When I found a bug in a program, I wrote a test case. If I wrote code that was too complicated to test by hand, I wrote a test case. I even wrote test cases when submitting code to someone else.

Test cases are more like jumping jacks than I realized. Each test case is like one hop of a jumping jack - a little bit of effort and no real movement. But when one does a lot of jumping jacks, something magical happens. Bodies strengthen. Health improves. Sickness is fought off more easily. Lots of good things come from jumping jacks.

The exact same thing happens with test cases. Your codebase, with a lot of good test cases, becomes strong. A small change over here won't cause subtle breakage somewhere else. If it does, your test cases will scream -- loudly and clearly -- that something has gone terribly, terribly wrong.

Test Driven Development is more of the same thing with one notable exception: write test cases first and code second. After all, isn't it better to do a few jumping jacks at a time while you're still young and healthy than to wait until you've already gotten fat and lazy?
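To make that rhythm concrete, here's an entirely invented illustration using Python's unittest module (the slugify function and its tests are mine, not code from the week with Rob): the tests come first, fail, and then just enough code is written to make them pass.

```python
import unittest

# Step one: write the tests first. Running them before slugify() exists
# fails -- which is the point. The tests define the behavior we want.
class TestSlugify(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Ubuntu Below Zero"), "ubuntu-below-zero")

    def test_clean_input_is_unchanged(self):
        self.assertEqual(slugify("tdd"), "tdd")

# Step two: write just enough code to make the tests pass.
def slugify(title):
    return "-".join(title.lower().split())

if __name__ == "__main__":
    unittest.main(argv=["slugify-tests"], exit=False, verbosity=2)
```

The next change to slugify - whatever it is - starts with another failing test.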

Oh, I almost forgot to mention: the person that got me into testing code in the first place is the same Rob Collins that just turned me onto Test Driven Development.

Thanks Rob.

Sunday, October 02, 2005

Blog me Bentley

I just read Aaron Bentley's blog about Larry McVoy's latest infamous stunt. McVoy seems to be continuing the same nefarious activities as before; namely, threatening any company using BitKeeper that hires developers of free software revision control systems.

Whether or not McVoy can enforce these claims is a question that only the judicial system can answer. Regardless, It's. Just. Plain. Evil. Imagine if other companies behaved in this manner: McDonald's firing suppliers that ate at Taco Bell; American Airlines refusing to fly pilots of other airlines... the examples could go on.

Sunday, September 18, 2005

The Doldrums of History

In a recent paper I wrote that a revision control system with lossy storage is a possible way to get around the problems that typically affect revision control. The reactions were not very surprising, typically reducing to something along the lines of 'Dude, storing all of the past history is the whole point of revision control!'

Essentially, the RCS field has come to accept that storing the full history is a basic truth of revision control systems. The reasoning that leads to this position is generally pretty good. Several lines of reasoning support this assessment, the three most important being: merging capabilities are affected by the amount of history present; the person responsible for the codebase may come under legal scrutiny; and users expect that anything they store they will be able to retrieve.

This evidence, though it appears solid on the surface, is actually circumstantial. In this paper I'll examine the three reasons at length and illustrate that storing full history, while potentially useful, is not actually an inherent truth of revision control systems.

The first supporting reason for lossless revision storage is that merging is affected by the history present. This statement is certainly true with certain implementations; a user attempting to perform a merge without the history related to that merge in the RCS named "tla" is in for a hard row to hoe indeed. Other revision control systems, such as Bazaar, are able to partially sidestep the problem by searching for, and usually finding, a suboptimal but fully traceable history. These two tools have this obligation because they rely upon something called three-way merging. The details behind three-way merging are complex, though an oversimplified explanation will suffice for this conversation: a three-way merge figures out which patches to apply by finding a previous point in time at which the two branches were identical. The changes between the common ancestor and the branch to be merged are computed, and then these changes are applied to the other branch.
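That oversimplified description can itself be sketched in a few lines of Python. This is a toy line-based merge of my own invention - not tla's or Bazaar's actual algorithm - and it assumes the two sides changed disjoint regions of the ancestor:

```python
import difflib

def _edits(base, branch):
    """Map each changed region of base to the branch lines replacing it."""
    edits = {}
    ops = difflib.SequenceMatcher(None, base, branch).get_opcodes()
    for tag, b1, b2, c1, c2 in ops:
        if tag != "equal":
            edits[(b1, b2)] = branch[c1:c2]
    return edits

def three_way_merge(base, mine, other):
    """Toy line-based three-way merge.

    Finds what changed between the common ancestor and each branch,
    then replays the other branch's changes on top of mine. Toy
    simplification: assumes each side's changed regions are disjoint;
    a real tool must also detect overlapping regions as conflicts.
    """
    mine_edits = _edits(base, mine)
    other_edits = _edits(base, other)
    merged, i = [], 0
    while True:
        m = next((r for r in mine_edits if r[0] == i), None)
        o = next((r for r in other_edits if r[0] == i), None)
        if m and o:
            if m == o and mine_edits[m] == other_edits[o]:
                merged.extend(mine_edits.pop(m))  # both made the same change
                other_edits.pop(o)
                i = m[1]
            else:
                raise ValueError("conflict at base line %d" % i)
        elif m:
            merged.extend(mine_edits.pop(m))
            i = m[1]
        elif o:
            merged.extend(other_edits.pop(o))
            i = o[1]
        elif i < len(base):
            merged.append(base[i])  # untouched by either side
            i += 1
        else:
            return merged
```

Given a base of ["a", "b", "c"], a local branch that edited line two, and a remote branch that appended a line, the merge picks up both changes.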

Other revision control systems, such as bazaar-ng, are seriously discussing avoiding three-way merges entirely. The process is a bit different, but presumably also requires knowledge of what modifications have already been applied. By applying just the patches that are missing, one can perform the equivalent of a three-way merge without actually performing a merge. This means that finding a common ancestor is no longer obligatory.

The full history is not required to perform these merges. The only component that is strictly required to perform a three-way merge is a record of what has been previously applied, so as to avoid applying it again. This is not to say that revision control systems don't keep the full history (many of them, such as Bazaar, do so by default). But this extra information isn't kept strictly to perform a three-way merge; rather, it's kept in order to perform more advanced types of three-way merges, such as intentionally storing a false copy of accurate history (which I'll henceforth refer to as "a corrupted merge"). In simpler terms, *a record of what history has been seen is required, but not the history itself*.
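To make the distinction concrete, here is a toy sketch - the Branch class and patch IDs are my own invention, not Bazaar's API - in which a branch remembers only the IDs of the patches it has applied. The patch contents live elsewhere (a plain dict standing in for "fetch on demand") and could be discarded without breaking merges:

```python
class Branch:
    """Toy branch keeping only a record of which patches it has applied."""
    def __init__(self, text=""):
        self.text = text
        self.applied = set()  # the record: patch IDs only, not the patches

    def apply(self, patch_id, transform):
        if patch_id not in self.applied:  # skip anything already merged
            self.text = transform(self.text)
            self.applied.add(patch_id)

def merge_missing(target, source, patch_bodies):
    """Apply only the patches source has seen that target has not.

    patch_bodies maps patch IDs to transforms; in a real system these
    would be fetched on demand rather than stored forever locally.
    """
    for pid in sorted(source.applied - target.applied):
        target.apply(pid, patch_bodies[pid])
```

Merging twice is harmless: the second call finds no missing patch IDs, which is exactly the "avoid applying it again" guarantee the record exists to provide.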

Before going into this argument I should clearly point out that I do not qualify as a lawyer, a paralegal, or for that matter a filing clerk at the courthouse. The information contained herein is not intended to be legal advice. If you are seeking legal advice then I suggest you seek out a lawyer for lawyerly things.

The second listed reason for keeping indefinite history is that the person maintaining a branch or archive may come under legal scrutiny. A compelling example of this reasoning is a court case (still going on at the time of this writing) named SCO v. IBM. Though SCO's arguments are rather unclear, the central tenet seems to be that IBM misappropriated source code belonging to SCO and unlawfully gave it to the Linux kernel. The Linux community reacted quickly to these charges by scouring the source code for submissions that were potentially infringing. The result of this search is generally considered negative; the public was not able to identify infringing code. This case is very well known, so for the sake of discussion let's assume that infringing code had been found. We'll also assume, again for the sake of argument, that IBM was the proper entity to litigate against.

Under this assumption, the person that found the infringing code makes no difference, as legal responsibility flows upwards. The person ultimately legally responsible for an infringing action is not the person that performed the infringement, but the highest ranking executive (or lead developer, for those projects that do not have a formal organization) that had control over the organization's policies. The failure, though created by a typically low-ranking individual, should have been caught by policy and workflow monitoring that ensured such infringement could not happen without eventually being detected. This can be illustrated by examining the policies that Linus Torvalds put into place after the initial SCO allegations; had infringing code been found, he would have been even more culpable than IBM for having first accepted (via delegation), then propagated the infringement. To avoid this sort of repeat, Torvalds now requires that each person involved in the chain of merging a submission sign off that the code is not infringing.

This illustrates that identifying exactly who contributed what code at what time becomes less important over time: the longer the infringing code remains in the codebase, the more responsible the maintainer and the less responsible the submitter. Regardless, it's typically nice to be able to point the finger at somebody else on a moral level, even if legal responsibility cannot be evaded. This feel-good act is actually ultimately self-destructive.

The majority of programmers live under legal systems that are closely related to the British style of judicial reasoning. In these legal systems plausible deniability is an appropriate legal excuse. Any entity being sued by another loses the ability to destroy any of its history without risking charges of destruction of evidence, tampering with evidence, interfering with an investigation, and so on. Organizations typically evade this sort of problem by intentionally instituting policies that ensure the destruction of all possible company records once any legally mandated retention minimums have elapsed.

These two facts put side by side show that, from a legal position, indefinitely storing history is not a net win; rather, it's a drawback. Storing as much history as possible will not provide a legal defense, but will hinder one.

The final listed reason for storing full history is that the user of a revision control system typically expects to be able to pull out on demand the representation of his software at any arbitrary point in the past. Supposedly any system that cannot perform this action is not a revision control system at all, which poses interesting questions about whether a system like git is a revision control system at all.

This expectation does not match actual practice. The value of any given revision of software is inversely proportional to its distance from the current version. As a case in point, consider which version of his software a developer would rather lose: a version from two weeks ago or a version from two years ago. Additionally, the further back in time a developer looks, the less granularity he needs. A user may wish to distinguish the code from today from the code from last Thursday, or the code from the beginning of this month from the beginning of two months ago, or today's code from two years ago. However, a user comparing today's code against code from two years ago generally doesn't care whether the version from two years ago was saved on a Tuesday or a Wednesday.

Unfortunately most contemporary revision control systems treat all versions equally. The baggage that comes with storing the third revision alongside the six thousandth revision is for all practical purposes equivalent. Thus, a version of inherently less value costs just as much as the current software.

The equal cost for revisions means that as the number of revisions increases, the cost of working with the newer revisions increases too. In many RCSs the current version is defined, in part, as the sum of its previous versions with a little bit more added. Different RCSs handle past history differently; Arch and derivatives such as Bazaar store this previous history as changesets. Others, such as Bazaar-NG, store the changes on a per-file basis. Regardless, typically all previous revisions are there in some form or another, standing by just in case they are needed by the user or the RCS.

Many revision control systems, including Bzr for the not-most-current case, use a recursive algorithm: to get a particular version of a file or changeset, one applies a set of changes to the previous file or revision, which in turn applies a set of changes to the one before it. This recursive algorithm is the Achilles heel of revision control systems, and it shows itself in two ways. The first problem is that in order to use the latest versions one must carry around the older revisions as well. The second problem is that in order to access the latest versions one must access the older versions.

The result of these two problems is that the cost of any current version is the sum of the cost of all previous versions plus a little bit extra. Thus, as versions are saved, the cost goes up while the worth declines. Remember: any given version has less value as it ages. The resulting inflation, because of the recursive nature, affects the latest versions most strongly.
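The recursion can be sketched in miniature. This toy delta store is my own illustration, not any particular tool's storage format: each revision records only its parent and one small change, so checking out revision N must first rebuild every ancestor.

```python
# Toy delta store: rev -> (parent_rev, (line_index, new_line))
deltas = {}

def commit(rev, parent, change):
    deltas[rev] = (parent, change)

def checkout(rev, touched=None):
    """Rebuild a revision; `touched` collects every revision visited."""
    if touched is not None:
        touched.append(rev)
    if rev == 0:
        return []  # revision 0 is the empty initial tree
    parent, (i, line) = deltas[rev]
    lines = checkout(parent, touched)  # the cost: rebuild all ancestors first
    if i == len(lines):
        lines.append(line)  # the delta adds a line
    else:
        lines[i] = line     # the delta replaces a line
    return lines
```

Checking out revision 2 touches revisions 2, 1, and 0; every new commit makes the newest - and most valuable - revision a little more expensive to reach.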

Many people have been heard stating that a good revision control solution is similar to a good backup solution. Surely if this premise applies to a revision control system then it applies to a backup system as well. If these two types of systems are similar, then why doesn't one see this sort of lossy approach taken with backup systems? The simplest answer is that one does see *exactly this sort of behavior* in some backup systems. The traditional Unix tape rotation scheme is explicitly designed to provide long-term storage with looser granularity on older tapes. One will be able to find a tape for a specific day in the last two weeks, a tape for a specific week in the last month, a tape for a specific month in the last year, and so on. Rarely does one find a backup system run by a highly qualified admin that stores a daily version going back to the project's inception (or even daily for the last few years). Additionally, many modern backup systems handle this rotation automatically by systematically pruning data that resides close to other, older stored data.
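A rotation policy of that shape fits in a few lines. The thresholds below (daily for two weeks, roughly weekly for three months, roughly monthly for a year) are illustrative, not a standard scheme:

```python
def keep_snapshot(age_days):
    """Grandfather-father-son style retention: the older a snapshot,
    the coarser the granularity at which it is kept. The thresholds
    are illustrative choices, not a standard rotation."""
    if age_days <= 14:
        return True                # daily for the last two weeks
    if age_days <= 90:
        return age_days % 7 == 0   # roughly one per week after that
    if age_days <= 365:
        return age_days % 30 == 0  # roughly one per month for a year
    return False

def prune(snapshot_ages_in_days):
    """Return only the snapshot ages the policy keeps."""
    return [a for a in snapshot_ages_in_days if keep_snapshot(a)]
```

The same policy shape, applied to revisions instead of tapes, is what a lossy revision control store would implement.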

Maintaining all previous history is not required. The common reasons given for maintaining all history actually provide stronger arguments for tracking history rather than maintaining it. This leaves RCS developers with a much more interesting question: when should things be tracked, and when should things be maintained?

Tuesday, September 13, 2005

The Achilles heel of DRCS

One of the things that most distributed revision control systems do is keep a permanent, indefinite record of everything that's ever happened to a branch. This can be useful if some company in Santa Cruz decides to defame you or your company. Most of the time, though, this older information is unnecessary baggage that is kept around just in case it's ever needed.

This is the Achilles heel of distributed revision control. As time passes, the storage of past events grows indefinitely and becomes unwieldy to work with. Some tools, such as bazaar-NG, are working to address certain aspects of this.

What's needed is a way to conflate history by merging old revisions together. By conflating revisions, a bit of disk space is saved, network access is reduced, and the amount of work involved in building from ancestors is reduced.

Consider, for example, some sort of pool of patches. If access to these patches is monitored, then one can detect which patches are no longer very interesting. If two non-interesting patches are next to each other, merge them into a single superpatch. The two original patches can then be removed from the pool.

As time passes older, unused patches will continue to clump together. These multipatches can be used in the place of the individual patches in any place that all of them would have been applied anyways.

If the user wants a revision that has been merged into a multipatch, just go ahead and extract it and add it back to the pool. It can be munged with other patches just as before.
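The conflation step above can be sketched like this. The data model is purely illustrative - patches here are just (id, lines) pairs, and `is_cold` stands in for whatever access monitoring decides a patch is no longer interesting:

```python
def conflate(pool, is_cold):
    """Collapse runs of adjacent cold patches into one superpatch.

    `pool` is an ordered list of (patch_id, lines) entries. The
    superpatch keeps the combined content, so anything built from the
    original run of patches stays reproducible.
    """
    out, run = [], []

    def flush():
        if len(run) >= 2:  # only adjacent cold patches get merged
            out.append(("+".join(pid for pid, _ in run),
                        [line for _, lines in run for line in lines]))
        else:
            out.extend(run)
        run.clear()

    for pid, lines in pool:
        if is_cold(pid):
            run.append((pid, lines))  # extend the current cold run
        else:
            flush()                   # a hot patch breaks the run
            out.append((pid, lines))
    flush()
    return out
```

Run repeatedly over time, cold neighbors keep clumping into ever-larger superpatches while hot patches stay individually addressable.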

Bazaar-NG Hackfest coming up

A bazaar-NG hackfest is coming up this November. The event is going to be in Toronto, Canada and will be work, work, work. If you like to work your ass off for free and make cool things happen, then I'll meet you there. :)

If you would like more information about the hackfest, then you can read about it at The Ubuntu Below Zero Page.

Sunday, September 11, 2005

New article

Tonight I wrote a new article for LinuxGazette. This article covers why distributed revision control is a good thing. You can find the article at My first LinuxGazette article