Monthly Archive for March, 2010

The Trunk – Subversion Conversion to Mercurial, Part 1

So you’ve got a bunch of code, the key to your company’s future, and like a good developer you’re keeping it under source control using Subversion. But you’ve heard about these new distributed version control systems, like Mercurial, and after doing some research you’ve decided to take the plunge. Now you face a challenge: how do you get all that juicy code into a Mercurial repository?

Never fear! The wonderful Mercurial developers have created an excellent extension that can convert Subversion repositories to Mercurial. Because I work on the Kiln tool that imports existing code from other source control systems into Mercurial repositories, I’ve been working to understand more about how the whole conversion process works. To get a good basis of understanding, let’s first look at how the extension will import a single line of development – the trunk. This is reasonable in cases where there are no branches, or where all branches have already been merged into trunk and you don’t care about which changes were made on which branches. I’ll explore the process of converting tags and branches in future posts.

A Generic Algorithm

self.ui.status(_("scanning source...\n"))
heads = self.source.getheads()
parents = self.walktree(heads)

The convert extension is designed to be a generic converter from many different repository types. The overall convert algorithm is handled at this generic level, while the details of retrieving specific revisions, files, and tags are handled by converter source objects that are specific to the source repository type. A destination converter object (in this case Mercurial-specific), does the work of writing the revisions, files, and tags to the new Mercurial repository.

So you fire up your trusty shell, and kick off:

hg convert svn://path/to/your/svn/repository --datesort

After parsing the command line, the converter creates the source and destination objects. We didn’t specify a filemap, but if a filemap had been specified, the converter would then wrap the source object, which is the subversion converter object, with a filemap source object. This filemap object uses the filemap file to adjust file paths before they are passed to or returned from the source repository object.

When you execute a typical hg convert command, the first output line you’ll see is this:

scanning source...

When this appears, the converter begins by asking the source object to get the heads, or the latest revisions, of that repository. Because we’re ignoring branches for now, the Subversion converter object will just get the latest revision under the trunk. After getting this revision, it walks backwards through the revisions until it reaches the beginning. As the converter retrieves each revision from the source converter object, it caches it and creates a map from the revision to a list of its parents. In the case of a single Subversion trunk, each revision will have at most one parent.
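The backwards walk can be sketched in a few lines. This is a simplified illustration rather than the extension's actual walktree code: the walk_tree function, its get_parents callback, and the three-revision trunk are all hypothetical names invented for the example.

```python
def walk_tree(get_parents, heads):
    """Walk backwards from the heads and return {revision: [parent revisions]}."""
    parents = {}
    to_visit = list(heads)
    while to_visit:
        rev = to_visit.pop()
        if rev in parents:
            continue  # already visited this revision
        parents[rev] = list(get_parents(rev))
        to_visit.extend(parents[rev])
    return parents


# A hypothetical trunk with three revisions: r1 <- r2 <- r3.
trunk = {"r3": ["r2"], "r2": ["r1"], "r1": []}
parent_map = walk_tree(trunk.__getitem__, heads=["r3"])
```

For a trunk-only history there is a single head, and every entry in parent_map holds at most one parent.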

Sorting the revisions

self.ui.status(_("sorting...\n"))
t = self.toposort(parents, sortmode)
num = len(t)

The converter needs to process the revisions in the right order. The hg convert command gives three sorting options: datesort, branchsort, sourcesort. The sourcesort option is not available when converting from Subversion. To perform either of the other sorts, the converter first creates a children map from the parents map, as well as a list of the roots, or revisions without parents. For our trunk-only conversion there will only be one root revision. Starting at this root revision, the converter chooses the next revision based on the ordering type. Then it adds any revisions whose parents are all in the ordering to the list of possibly next revisions, from which the next revision is chosen. For a subversion trunk-only conversion, there will only ever be one revision to choose from, regardless of the sort order. Therefore, I’ll discuss the differences between datesort and branchsort in part 2, on converting branches.
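Here is a minimal sketch of that ordering loop, assuming the parents map from the scanning step. The pick_next parameter is a hypothetical stand-in for whatever rule the chosen sort mode applies; it is not the name the extension uses.

```python
def toposort(parents, pick_next):
    """Order revisions so that every parent comes before its children."""
    # Invert the parents map into a children map.
    children = {}
    for rev, rev_parents in parents.items():
        for parent in rev_parents:
            children.setdefault(parent, []).append(rev)

    # Roots are the revisions without parents.
    candidates = sorted(rev for rev, ps in parents.items() if not ps)
    ordered = []
    done = set()
    while candidates:
        rev = pick_next(candidates)  # the sort mode decides which comes next
        candidates.remove(rev)
        ordered.append(rev)
        done.add(rev)
        # A child becomes a candidate once all of its parents are ordered.
        for child in children.get(rev, []):
            if all(p in done for p in parents[child]):
                candidates.append(child)
    return ordered


linear = {"r1": [], "r2": ["r1"], "r3": ["r2"]}
order = toposort(linear, pick_next=min)
```

With a linear trunk the candidate list never holds more than one revision, so any pick_next gives the same order.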

Importing changes

def copy(self, rev):
	commit = self.commitcache[rev]
	files, copies = self.source.getchanges(rev)
	parents = [self.map[p] for p in commit.parents]
	newnode = self.dest.putcommit(files, copies, parents, commit,
	                              self.source, self.map)
	self.source.converted(rev, newnode)
	self.map[rev] = newnode

Now that we’ve got this sorted list of revisions, the converter can start the process of converting each one individually. It does this by retrieving the appropriate changes from subversion and copying them to Mercurial.

When it initially walked the tree of changes, the Subversion converter object stored the paths of files in each revision as well as the parent revisions. Because we’re looking at a trunk-only conversion, each revision will only ever have one parent. As the conversion proceeds, each of these revisions has its paths expanded. Each path is checked to see if it is a file, a directory, or a deleted item. File paths are recoded appropriately. Paths representing directories are expanded to include all files in the directory at that revision, and records of copied files and directories are also stored.

The Mercurial converter object then goes through the files and copies and retrieves the contents of each file from the subversion converter object. It uses the file contents to create the revision to be committed to the destination repository. That’s it. Your conversion is all done! Or is it? What if someone makes more changes to the subversion repository after you already performed the conversion?

Multiple hg convert runs

# Record converted revisions persistently: maps source revision
# ID to target revision ID (both strings).  (This is how
# incremental conversions work.)
self.map = mapfile(ui, revmapfile)

The hg convert extension supports multiple executions against the same source and destination repositories. This can be useful if you did one run of hg convert, and then later wanted to pull in further development from your subversion repository. This feature is primarily made possible by the revmap, a file that hg convert saves in the destination’s .hg directory. The revmap is just a simple map from revision ids in the source repository to revision ids in the destination repository. The hg convert extension reads this revmap in (if it exists) before beginning conversion. It uses the revmap to determine which revisions have already been converted, and accordingly begins with revisions that come after those already converted. One option, when running hg convert, is to specify where the revmap is – or where to save it if this is the first run against a given repository.
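To make the incremental behavior concrete, here is a rough sketch of a revmap-driven run. It assumes a one-pair-per-line text file ("source-id destination-id"); that layout, along with the load_revmap, convert_pending, and make_converter helpers, is an illustration rather than the extension's own code.

```python
import os
import tempfile


def load_revmap(path):
    """Read 'source-id destination-id' pairs; a missing file means a first run."""
    revmap = {}
    if os.path.exists(path):
        with open(path) as fp:
            for line in fp:
                line = line.strip()
                if line:
                    source_id, dest_id = line.rsplit(" ", 1)
                    revmap[source_id] = dest_id
    return revmap


def convert_pending(ordered_revs, path, convert_one):
    """Convert only the revisions that the revmap does not already list."""
    revmap = load_revmap(path)
    with open(path, "a") as fp:
        for rev in ordered_revs:
            if rev in revmap:
                continue  # converted by an earlier run
            revmap[rev] = convert_one(rev)
            fp.write("%s %s\n" % (rev, revmap[rev]))
    return revmap


def make_converter(log):
    """Hypothetical stand-in for the real per-revision conversion."""
    def convert_one(rev):
        log.append(rev)
        return "hg-" + rev
    return convert_one


with tempfile.TemporaryDirectory() as tmp:
    revmap_path = os.path.join(tmp, "revmap")
    first_run, second_run = [], []
    convert_pending(["r1", "r2"], revmap_path, make_converter(first_run))
    final_map = convert_pending(["r1", "r2", "r3"], revmap_path,
                                make_converter(second_run))
```

The second call only does work for r3, because r1 and r2 are already recorded in the file.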

Another trick to consecutive hg convert runs is the authormap. The authormap is a file that allows you to change author names when converting from Subversion to Mercurial, which can be quite useful if you want to add additional information to Mercurial users, such as email addresses. The authormap, like the revmap, is stored in the destination .hg directory. On subsequent hg convert runs, this file is read in and used if no authormap is specified. If there is both an authormap specified on the command line and one in the destination .hg directory, the two are merged, with the one on the command line winning whenever there is a discrepancy.
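The merge rule is easy to picture with a small sketch. It assumes authormap lines of the form "source author=destination author"; the parse_authormap and merge_authormaps helpers and the sample names are hypothetical.

```python
def parse_authormap(lines):
    """Parse 'source author=destination author' lines into a dict."""
    authors = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source_author, sep, dest_author = line.partition("=")
        if not sep:
            continue  # ignore lines without a mapping
        authors[source_author.strip()] = dest_author.strip()
    return authors


def merge_authormaps(saved, from_command_line):
    """Combine both maps; command-line entries win on any discrepancy."""
    merged = dict(saved)
    merged.update(from_command_line)
    return merged


saved = parse_authormap([
    "alice=Alice <alice@old.example>",
    "bob=Bob <bob@example.com>",
])
command_line = parse_authormap(["alice=Alice Smith <alice@example.com>"])
authors = merge_authormaps(saved, command_line)
```

Here alice picks up the newer entry from the command line, while bob keeps the entry saved by the earlier run.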

How the filemap works

fmap = opts.get('filemap')
if fmap:
	srcc = filemap.filemap_source(ui, srcc, fmap)
	destc.setfilemapmode(True)

One last aspect of conversion deserves consideration – the filemap. Implementation of the filemap uses an interesting design. The code for handling the filemap is in a filemap converter object, much like the subversion converter object. This filemap converter wraps the subversion converter and does the mapping in a way that both the subversion converter and the hg converter can be oblivious to its presence.

The filemap converter object handles two major pieces of functionality. First, it takes care of renaming files. The renaming of files is done by a filemapper, which keeps a map of from and to filenames. Whenever filenames are passed to or from the converter object, it does the mapping necessary.
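As a rough illustration of the renaming half, the sketch below maps a source path to its destination name. It assumes renames apply to whole paths or directory prefixes and that the longest matching prefix wins; the map_filename function and the sample renames are invented for the example.

```python
def map_filename(name, renames):
    """Return the destination name for a source path, longest prefix first."""
    for prefix in sorted(renames, key=len, reverse=True):
        if name == prefix:
            return renames[prefix]
        if name.startswith(prefix + "/"):
            return renames[prefix] + name[len(prefix):]
    return name  # no rename applies


renames = {"lib": "src/lib", "lib/vendor": "third_party"}
mapped = [map_filename(name, renames)
          for name in ("lib/core.py", "lib/vendor/zlib.c", "README")]
```

A wrapper around the source converter would apply the same mapping to every path passing through it, so neither side has to know the names changed.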

The more interesting challenge is determining which files and revisions should actually be included in the conversion. First, the filemap converter checks to see if a given revision includes any files that are included in the filemap. If so, then the revision needs to be converted. But the revision also needs to have its parent updated to the correct revision. In Subversion, the parent of a given revision is simply the previous revision. Of course, that revision might not include any files in the filemap, and so be discarded during conversion. So the filemap converter also needs to reparent the new revision to the last included revision.
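For a linear trunk, the reparenting rule can be sketched as follows. The filter_and_reparent function, the wanted predicate, and the sample history are hypothetical, and the sketch ignores branches entirely.

```python
def filter_and_reparent(revisions, wanted):
    """Keep revisions that touch a wanted file and point each kept revision
    at the last kept revision before it (None for the first one)."""
    new_parents = {}
    last_kept = None
    for rev, files in revisions:  # revisions arrive oldest first
        if not any(wanted(name) for name in files):
            continue  # nothing of interest here, so the revision is dropped
        new_parents[rev] = last_kept
        last_kept = rev
    return new_parents


history = [
    ("r1", ["src/a.c"]),
    ("r2", ["docs/readme"]),
    ("r3", ["src/b.c"]),
]
kept = filter_and_reparent(history, lambda name: name.startswith("src/"))
```

Revision r2 touches nothing under src/, so it is dropped and r3 is reparented onto r1.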

Coming Soon …

This algorithm at its root is quite simple. But understanding what is going on in the simple case is essential to understanding what is happening when we make it more complicated with branches, tags, and the options associated with them. Part two will be a detailed look at how branches are converted from Subversion to Mercurial.

Commitments

In my post on how to care, I committed to start blogging more regularly, rather than sporadically. I began last week by spending an hour on my commute each day doing writing and research for the post I published Saturday. On Sunday I took a break from the entire internet. I plan to continue with that schedule. This coming week I’ll be eliminating some time-wasting websites from my life and replacing them with reviewing and acting on my next actions lists, including one for improving my blog. However, because I don’t want my blog to be primarily about blogging, or about my own personal growth, I’m not going to be publicly committing and reporting on my commitments here in the future. For those who care, they can see my current efforts on my public Daytum page.

Mercurial will make you a better developer

I’ve been learning about Mercurial since my first day at Fog Creek, because I’m working on Kiln. It was a big change from my work at Microsoft, where we used a VCS that was much closer to the Subversion model than the Mercurial model. One of my areas of focus in Kiln has been the import tool for teams migrating from Subversion. As I’ve tried to wrap my head around Subversion, Mercurial, and converting between the two, I’ve started to realize that many of the cultural differences between the two communities stem from the basic technical strengths and weaknesses of the two products. Feel free to substitute Git for Mercurial, if that’s your cup of DVCS tea.

You could argue that the cultural differences led to the technical differences between the two camps. I suppose that’s probably true for the earliest contributors to the products, but it’s more likely that the technical strengths and weaknesses of each product appealed to those who naturally thought in certain ways, thus leading to the natural congregation of people with similar outlooks on creating software.

But enough of that. On to the differences.

Single project repositories vs. multiple project repositories

One change you’ll run up against, which was initially quite disconcerting for me, is that each project in Mercurial is contained in its own repository. I was used to one huge repository with different subdirectories for different projects. Consequently, because the code for Kiln is broken up into 5-10 separate repositories, I’ve spent the last few months asking others on my team if it wouldn’t be better to just combine some or all of our repositories. About once a month. I admit that I still think some combining would be good, but I’m beginning to understand more fully the mindset that leads to lots of small repositories.

This difference is one of the easiest to trace to basic differences in how the products work. Mercurial is much more narrowly focused, as a product, than Subversion is. Mercurial is all about tracking changes to a set of files. Subversion is all about tracking changes to each file separately. Mercurial tracks some repository-wide information, such as branches, tags, and repository settings. Subversion allows you to branch, tag, and set properties on the whole repository or any subdirectory, or any random unrelated set of files, if you so desire.

Of the two, Subversion is far more general purpose in nature. It tracks changes to each file and directory separately, only keeping an overall revision number that tracks the chronological order of changes. Because of that, there are many features in Subversion that allow you to operate on a portion of the repository. You can check out a specific subdirectory, map files from another subdirectory, keep your working directory files at different revisions per file (called mixed revisions), and set properties on directories that apply to a directory and all of its children, such as which files to ignore when doing an svn status. It’s even possible to set different permissions for different parts of the repository.

In contrast, Mercurial manages a single set of files in a repository. Directories are not first-class objects in Mercurial, as they are in Subversion; they’re just artifacts of file names. Although Mercurial internally tracks changes to each file separately, there is no way to put the working directory into a mixed-revisions state. The DAG cannot handle that type of freedom. Of course, the fact that Mercurial requires you to download the entire repository history to create your own working directory also puts downward pressure on the size of a repository. And because permissions are the same throughout a single repository, if different code needs different permissions, it also needs to be in a different Mercurial repository. The same is true for other settings, such as which files to ignore when doing hg status.

All of these differences naturally lead to smaller repositories that typically contain one project in Mercurial, and larger repositories that typically contain many, if not all projects, in Subversion. If you’re coming from Subversion, you’re going to want to get used to it. Fortunately it appeals to your innate desire to componentize — that is an innate desire, right? For me it is, and Mercurial makes it easier to do it at the project level. Of course, because Mercurial does less, it leaves to other systems the management of multiple repositories (see bitbucket and Kiln).

Sam Hart, when he decided to switch from Subversion to Mercurial, discussed this exact phenomenon:

“If you’re like me, when you originally set up SVN you did so in the laziest way possible.

“Setting up SVN repos is more work than it should be. It involves using commands that you normally never have to touch (svnadmin), setting up new entries for those repos in your http server’s configuration files (if you’re using Apache and WebDAV), and setting up user permissions to those repos. Thus, the lazy way to set them up is to make one central SVN repo under which you have multiple sub-repos. This has the advantage of making your repository very easy to maintain. However [it] has a big disadvantage in that a user with write access to any sub-repo will have write access to the entire repo.

“In Hg, on the other hand, setting up a new repository is much easier, and maintaining multiple repositories more manageable. So, if you’re like me, you may be tempted to remedy past sins by splitting your single gargantuan SVN repo into smaller Hg repos.”

Commit often vs commit when “ready”

Another change you’ll need to adjust to is to commit often. You’re probably used to making a bunch of related (or unrelated) changes, then doing some testing. You may build a version of your product and have others do testing. You’ll probably run automated tests, possibly multiple sets of automated tests. And finally, you’ll check in.

If you do this in a team using Mercurial they’ll wonder where you disappeared to while your code was being written, complain about how large the code reviews are, and be frustrated at how slowly you iterate on your code towards a good solution.

On my team at Microsoft, we had a concept of a shippable chunk of software. This helped guide the creation of branches in our centralized VCS. We could work in the branch, possibly with one or two other developers, until we had something we could reasonably ship, then merge the branch back into the main development repository. Depending on the rules for checking in to a VCS, whether centralized or distributed, software teams develop an understanding, either explicit or implicit, of what a “committable chunk” is: what amount of code is worth committing, either for review or for sharing with others.

The key change in mindset for me has been to make my own “committable chunks” much smaller than they used to be. No longer do I make hundreds of changes in tens of files, tying up another developer for hours in code reviews. It’s easy to make frequent commits locally, and push those to a personal branch on Kiln regularly for review.

But DVCS’s don’t just make it easy to have smaller committable chunks. They make it easier to manage committable chunks of all sizes. Because I work against a personal repository and merging is so easy, I can commit almost minute by minute to my personal repository, push multiple times a day to the feature branch I’m working on, push occasionally from the feature branch to the main development branch, and handle multi-feature pushes from development to a stable release branch. Obviously, those are all possible with a CVCS, but they always took so much time and effort to manage the branches, do the merge, and verify that nothing broke. In practice, that meant that steps were left out, and things slipped through the cracks.

Now, my changes to code are clearer, my original intentions more obvious, and I feel far better with my code in source control. I can look at changes at a small granular level, or I can look at the big merges.

Branch always vs. branch rarely

Closely related to a cultural norm of small, frequent commits is a norm of branching. Every clone of a Mercurial repository introduces a new branch once a change has been made. It’s also easy to branch many times within that clone. When I first made the switch, I didn’t really understand this. I knew I could work separately from other developers in my own repository, but I didn’t think of it conceptually as branching. It was more like I had my own sandbox, which I could then merge with the main repository when I checked in. And the idea of easily branching within my own repository still seems new to me.

But I’m learning to embrace the value in branching. As with frequent commits, it’s the ease of merging that makes the benefits of branching so readily available. And I’m beginning to value the power of having branches within my local repository. I can work on bug fixes separately from major refactoring work, and easily (and quickly) switch between the two using a simple “hg up” command – even when I’m offline. That’s great for the times when I’m deep into feature code and a sudden urgent bug pops up that needs to be fixed and released immediately. I can also switch back and forth between work on two different features, which is great when I get stuck and just need a mental break from one of them. Also, it makes it super easy to prototype new ideas without messing with my regular development.

One counterpoint to the ease of branching is that it may isolate developers. John DeRosa registers his concern about this:

“Additionally, I think distributed SCMs like Mercurial have a not-yet-fully-appreciated problem in making it too easy to not [ever] check code back into the main pool.  With a local repository, a developer can feel protected from accidents and continue working happily for quite a long time.  And then, say a year down the road, he/she does a massive check-in and discovers an integration problem.  Branches, or a local repository that is effectively a private branch,  should be easy to make — but not too easy.”

Let me explain why I don’t buy it. First, “a year down the road”?! Seriously? It says something that you have to imagine a scenario so horrible and unlikely in order to envision easy branching as a bad thing. I think that the author likely didn’t realize how easy merging usually is with a DVCS like Mercurial. And he must have totally forgotten that this lone maverick developer could have been merging the main development line into her own repository every day or week. The right solution to this imagined (and barely imaginable) scenario is not to eliminate easy branching, since without it the lone developer will do the same thing, but be much more likely to lose her work because it won’t be stored in a repository. The right solution is to fix a broken culture that enables someone to go a full year with no accountability for their work.

Source code files vs. all code files

Another important difference relates to what files you put in the repository. Because Mercurial and other DVCS’s don’t handle versioning of large files well, it is much more tempting to store them in a different way. This most obviously manifests itself in the storage of built binaries. If they are largish, and you want to keep lots of copies of them (nightly builds backed up for QA purposes, or even just weekly or monthly builds) then your repository becomes quite large and unwieldy very quickly. These types of files typically don’t diff well, making diffs between versions very large, and because the files are very large themselves, it means that downloading the repository takes much longer.

In this area, Subversion currently has a clear advantage. Only the files in the working directory are downloaded to client computers, so storing the history of large binary files only requires storage scaling on the server. Bandwidth is significantly reduced. Because it’s fairly simple, many Subversion installations have used it to track changes to built binaries and other very large files. The challenges to scale and management are limited to one machine, the server.

Naturally, users of Mercurial push handling of these large files to other systems. Their VCS is the location for their source files, typically a bunch of text files, which external tools then build into large binaries that are almost never stored in the same or another Mercurial repository. It is true that efforts are underway to alleviate this weakness in Mercurial, though I’m sure some don’t see it as a problem at all. The bfiles extension is an attempt to provide a more centralized model for certain large files. Of course, it has tradeoffs, but the fact that it’s being actively developed indicates that, at least for many, the tradeoffs are worth it.

For now, I’m happy that this aspect of Mercurial motivates me to automate more, to maintain more of the components of my products as code (in some form) that is compiled (using some method) to create these human-unreadable products.

Conclusion

There are obviously different ways to look at these cultural and philosophical differences between Subversion users and Mercurial users. One might look over the differences and conclude that Subversion seems much more flexible than Mercurial. Therefore, it must be better. Another might see how much better Mercurial handles basic source control features, such as branching, merging, and tags, and conclude that it is therefore a better product. It’s pretty obvious to me that these two views are quite related.

Michael Haggerty makes this point quite well in his post Git, Mercurial, and Bazaar—simplicity through inflexibility. The discussion is about the merging differences between Git and Subversion, but the principles apply to Mercurial as well. He argues that the very flexibility of Subversion is what makes merging more burdensome:

“Starting with release 1.5, Subversion, ironically, supports a much more flexible model of merging than the DAG-based DVCSs. Changes from any commit can be merged to any branch at the single-file level of granularity, enabling all of the operations listed above and some even weirder things (for example, a change that was originally applied to one file can be “merged” onto a completely different file). If your workflow demands this sort of thing, Subversion might hold significant advantages for you.

“But there are also many disadvantage to Subversion’s flexibility:

  • Subversion’s merging model is more complicated than that of DAG-based VCSs, and therefore more complicated to implement and less predictable.
  • It is much harder to visualize the history of a Subversion project (contrast that to DVCSs, whose history can be displayed as a single DAG).
  • Subversion merges are innately slow, because of the large quantities of metadata that have to be manipulated.
  • The bookkeeping of SVN merge info requires more user conscientiousness, and mistakes are not as easy to spot and fix.”

While he doesn’t take a stand on which is better, a CVCS like Subversion or a DVCS like Mercurial or Git, I will. Mercurial (and other DAG-based DVCS’s) provides a level of intrinsic guidance to developers through the limitations it has. Like many other great products, it is defined in part by what is not included. One might easily say that it is defined in large part by that. The original iPod and iPhone share this trait. By focusing on the most important features, and specifically limiting users’ choice in other areas (changing batteries, how to buy and download apps, etc.), Apple created products that are wildly successful. True, they may not be as flexible as an Android, Blackberry, or Windows Phone. But they got the right things right.

And I think Mercurial is a step in that direction. I don’t think it’s there yet, but I don’t think anything else is any closer. Some other DVCS’s (Git, at least) are also heading in the right direction, though they may be coming from a different starting point. Mercurial creates a philosophical and cultural starting point because of the technical choices that define both its strengths and its weaknesses. That philosophical starting point is a fundamentally better starting point for software development. It leads to greater componentization, greater granularity of history, more productive use of development time, and more automation.