Sunday, September 18, 2005

The Doldrums of History

In a recent paper I wrote that a revision control system with lossy storage is a possible way to get around the problems that typically affect revision control. The reactions to this were not very surprising, typically reducing to something along the lines of 'Dude, storing all of the past history is the whole point of revision control!'

Essentially, the RCS field has come to terms with the thought that storing the full history is a basic truth of revision control systems. The reasoning that leads to this position is generally pretty good. Several lines of reasoning support this assessment, three of the most important being: merging capabilities are affected by the amount of history present, the person responsible for the codebase may come under legal scrutiny, and users have the expectation that anything they store they will be able to retrieve.

This evidence, though it appears to be solid on the surface, is actually circumstantial. In this paper I'll examine the three reasons at length and illustrate that storing full history, while potentially useful, is not actually an inherent truth of revision control systems.

The first supporting reason for lossless revision storage is that merging is affected by the history present. This statement is certainly true with certain implementations; a user attempting to perform a merge when the history related to that merge is missing from the RCS named "tla" is in for a hard row indeed. Other revision control systems, such as Bazaar, are able to partially sidestep the problem by searching for, and usually finding, a suboptimal but fully traceable history. These two tools have this obligation because they rely upon something called three-way merging. The details behind three-way merging are complex, though an oversimplified explanation will suffice for this conversation: a three-way merge figures out which patches to apply by finding a previous point in time at which the two branches were identical. The changes between the common ancestor and the branch to be merged are figured out, and then these changes are applied to the other branch.
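To make the mechanics concrete, here is a minimal sketch of that idea in Python. The newest-first ancestry lists, the function names, and the use of a plain unified diff are my own illustration, not how tla or Bazaar actually implement merging:

    from difflib import unified_diff

    def common_ancestor(ancestry_ours, ancestry_theirs):
        """Return the most recent revision id shared by both branches.
        Ancestries are assumed to be lists ordered newest-first."""
        theirs = set(ancestry_theirs)
        for rev in ancestry_ours:
            if rev in theirs:
                return rev
        return None

    def changes_to_merge(ancestor_lines, their_lines):
        """Compute the changes the other branch has made since the common
        ancestor; a real merge tool would then apply these hunks to our
        branch, flagging a conflict wherever we changed the same region."""
        return list(unified_diff(ancestor_lines, their_lines,
                                 fromfile='ancestor', tofile='their-branch'))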

Other revision control systems, such as bazaar-ng, are seriously discussing avoiding three-way merges entirely. The process is a bit different, but presumably also requires knowledge of which modifications have already been applied. By applying just the patches that are missing, one can perform the equivalent of a three-way merge without actually performing a merge. This means that finding a common ancestor is no longer obligatory.
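A rough sketch of that patch-oriented approach, under the assumption of some apply_patch routine and globally unique patch ids (neither of which is tied to any particular RCS):

    def merge_by_patches(applied_ids, our_tree, their_patches):
        """Replay only the patches the other branch has that we have not
        seen yet. `their_patches` is a list of (patch_id, patch) pairs in
        the order the other branch applied them; `apply_patch` stands in
        for whatever patch machinery the RCS provides."""
        for patch_id, patch in their_patches:
            if patch_id in applied_ids:
                continue                 # already merged earlier; skip it
            our_tree = apply_patch(our_tree, patch)
            applied_ids.add(patch_id)
        return our_tree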

The full history is not required to perform these merges. The only fundamental component strictly required to perform a three-way merge is a record of what has previously been applied, so as to avoid applying it again. That is not to say that revision control systems don't keep the full history; many of them, such as Bazaar, do so by default. But this extra information isn't kept strictly to perform a three-way merge; rather, it's kept in order to perform more advanced types of three-way merges, such as intentionally storing a false copy of accurate history (which I'll henceforth refer to as "a corrupted merge"). In simpler terms, *a record of what history has been seen is required, but not the history itself*.
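Put another way, the only thing that must survive is a small, persistent record of patch identifiers. A sketch of such a record, purely of my own invention and not any existing RCS data structure, might be nothing more than:

    import json

    class MergeLog:
        """A minimal record of which patches this branch has incorporated.
        Only identifiers are stored; the patch contents themselves could
        be discarded once applied."""

        def __init__(self, path='merge-log.json'):
            self.path = path
            try:
                with open(path) as f:
                    self.seen = set(json.load(f))
            except FileNotFoundError:
                self.seen = set()

        def record(self, patch_id):
            self.seen.add(patch_id)
            with open(self.path, 'w') as f:
                json.dump(sorted(self.seen), f)

        def needs(self, patch_id):
            return patch_id not in self.seen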

Prior to going into this argument I should clearly point out that I do not qualify as a lawyer, a paralegal or for that matter a filing clerk at the courthouse. The information contained herein is not intended to be legal advice. If you are seeking legal advice then I suggest you seek a lawyer for lawyerly things.

The second listed reason for indefinitely storing history is that the person maintaining a branch or archive may come under legal scrutiny. A compelling example of this reasoning is a court case (still going on at the time of this writing) named SCO vs. IBM. Though SCO's arguments are rather unclear, the central tenet of their argument seems to be that IBM misappropriated source code belonging to SCO and unlawfully gave it to the Linux kernel. The Linux community reacted quickly to these charges by scouring the source code for submissions that were potentially infringing. The result of this search is generally considered negative; the public was not able to identify any infringing code. This case is very well known, so for the sake of discussion let's assume that infringing code had been found. We'll also assume, again for the sake of argument, that IBM was the proper entity to litigate against.

Assuming that infringing code had been found, the identity of the person who contributed it makes little difference, as legal responsibility flows upwards. The person ultimately legally responsible for an infringing action is not the person that performed the infringement, but the highest-ranking executive (or lead developer, for those projects that do not have a formal organization) that had control over the organization's policies. The failure, though created by a typically low-ranking individual, should have been caught by policies and workflow monitoring that ensure such misappropriation cannot happen without eventually being detected. This can be illustrated by examining the policies that Linus Torvalds put into place after the initial SCO allegations; had infringing code been found, he would have been even more culpable than IBM for having first accepted (via delegation), then propagated, the infringement. To avoid this sort of repeat, Torvalds now requires that each person involved in the chain of merging a submission sign off on it, certifying that the code is not infringing.

This illustrates that identifying exactly who contributed what code at what time becomes less important over time: the longer the infringing code remains in the codebase, the more responsible the maintainer is and the less responsible the submitter. Regardless, it's typically nice to be able to point the finger at somebody else on a moral level, even if legal responsibility cannot be evaded. This feel-good act, however, is ultimately self-destructive.

The majority of programmers live under legal systems that are very closely related to the British style of judicial reasoning. In these legal systems plausible deniability is an appropriate legal excuse. Any entity being sued by another loses the ability to destroy any of its history, or it risks charges of destruction of evidence, tampering with evidence, interfering with an investigation, and so on. Organizations typically evade this sort of problem by intentionally instituting policies that ensure the destruction of all possible company records once any legally mandated minimum retention periods have elapsed.

These two facts, put side by side, show that from a legal position indefinitely storing history is not a net win; rather, it's a drawback. Storing as much history as possible will not provide a legal defense, but will hinder one.

The final listed reason for storing full history is that the user of a revision control system typically expects to be able to pull out, on demand, the representation of his software at any arbitrary point in the past. Supposedly any revision control system that is not able to perform this action is not a revision control system at all, which poses interesting questions about whether a system like git is a revision control system at all.

This expectation does not match actual practice. The value of any given revision of software is inversely proportional to its distance from the current version. As a case in point, consider which version of his software a developer would rather lose: a version from two weeks ago or a version from two years ago. Additionally, the further back in time a developer looks, the less granularity he needs. A user may wish to distinguish today's code from last Thursday's, or the code from the beginning of this month from the code from the beginning of two months ago, or today's code from that of two years ago. However, a user comparing today's code against code from two years ago generally doesn't care whether the version from two years ago was committed on a Tuesday or a Wednesday.

Unfortunately most contemporary revision control systems treat all versions equally. The baggage that comes with storing the third revision is, for all practical purposes, equivalent to the baggage of storing the six thousandth revision. Thus, a version of inherently less value costs just as much as the current software.

The equal cost for revisions means that as the number of revisions increases, the cost of working with the newer revisions increases as well. In many RCSs the current version is in part defined as the sum of its previous versions with a little bit more added. Different RCSs handle past history differently; Arch and derivatives such as Bazaar store this previous history as changesets. Others, such as Bazaar-NG, store the changes on a per-file basis. Regardless, all previous revisions are typically there in some form or another, standing by just in case they are needed by the user or the RCS.

Many revision control systems, including Bzr for the not-most-current case, use a recursive algorithm: to get a particular version of a file or changeset, one applies a set of changes to the previous file or revision, which in turn requires applying a set of changes to the revision before that. This recursive algorithm is the Achilles heel of revision control systems, and it shows itself in two ways. The first problem is that in order to use the latest versions one must carry around the older revisions as well. The second problem is that in order to access the latest versions one must access the older versions.
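In pseudo-Python the pattern looks roughly like this; apply_delta, the forward-delta layout, and the single full copy at revision zero are simplifying assumptions of mine rather than any particular RCS's storage format:

    def reconstruct(revision_no, initial_text, deltas):
        """Rebuild a revision by starting from the first stored text and
        replaying every delta up to the requested revision. The cost of
        any revision is therefore the cost of all of its ancestors plus
        a little extra."""
        if revision_no == 0:
            return initial_text                # the only full copy kept
        previous = reconstruct(revision_no - 1, initial_text, deltas)
        return apply_delta(previous, deltas[revision_no])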

The result of these two problems is that the cost of any current version is the sum of the costs of all previous versions plus a little bit extra. Thus, as versions are saved, the cost goes up while the worth declines. Remember: any given version has less value as it ages. The resulting inflation, because of the recursive nature, affects the latest versions most strongly.

Many people have stated that a good revision control system is similar to a good backup solution. Surely if this premise applies to a revision control system then it applies to a backup system as well. If these two types of systems are similar, then why doesn't one see this sort of lossy approach taken with backup systems? The simplest answer is that one does see *exactly this sort of behavior* in some backup systems. The traditional unix tape rotation scheme is explicitly designed to provide long-term storage with looser granularity on older tapes. One will be able to find a tape for a specific day within the last two weeks, a tape for a specific week within the last month, a tape for a specific month within the last year, and so on. Rarely does one find a backup system run by a highly qualified admin that stores a daily version going back to the project's inception (or even daily for the last few years). Additionally, many modern backup systems handle this rotation automatically by systematically pruning older data that sits close to other stored data.
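For what it's worth, a rotation policy like that is easy to express. The sketch below picks which dated revisions to keep; the exact windows (daily for two weeks, weekly for three months, monthly beyond that) are my own arbitrary choice, not a quote of any real backup tool:

    from datetime import date

    def revisions_to_keep(revision_dates, today):
        """Select revisions to retain: daily copies for the last two
        weeks, one per week for the last three months, one per month
        before that."""
        keep, seen_weeks, seen_months = set(), set(), set()
        for d in sorted(revision_dates, reverse=True):        # newest first
            age = (today - d).days
            if age <= 14:
                keep.add(d)                                   # daily tier
            elif age <= 90 and d.isocalendar()[:2] not in seen_weeks:
                keep.add(d)                                   # weekly tier
                seen_weeks.add(d.isocalendar()[:2])
            elif (d.year, d.month) not in seen_months:
                keep.add(d)                                   # monthly tier
                seen_months.add((d.year, d.month))
        return keep

    # e.g. revisions_to_keep([date(2005, 9, n) for n in range(1, 18)],
    #                        date(2005, 9, 18))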

Maintaining all previous history is not required. The common reasons given for maintaining all history actually provide stronger arguments for tracking history rather than maintaining it. This leaves RCS developers with a much more interesting question: when should things be tracked and when should things be maintained?
