I’ve been learning about Mercurial since my first day at Fog Creek, because I’m working on Kiln. It was a big change from my work at Microsoft, where we used a VCS that was much closer to the Subversion model than the Mercurial model. One of my areas of focus in Kiln has been the import tool for teams migrating from Subversion. As I’ve tried to wrap my head around Subversion, Mercurial, and converting between the two, I’ve started to realize that many of the cultural differences between the two communities stem from the basic technical strengths and weaknesses of the two products. Feel free to substitute Git for Mercurial, if that’s your cup of DVCS tea.
You could argue that the cultural differences led to the technical differences between the two camps. I suppose that’s probably true for the earliest contributors to the products, but it’s more likely that the technical strengths and weaknesses of each product appealed to those who naturally thought in certain ways, thus leading to the natural congregation of people with similar outlooks on creating software.
But enough of that. On to the differences.
Single project repositories vs. multiple project repositories
One change you’ll run up against, which was initially quite disconcerting for me, is that each project in Mercurial is contained in its own repository. I was used to one huge repository with different subdirectories for different projects. Consequently, because the code for Kiln is broken up into 5-10 separate repositories, I’ve spent the last few months asking others on my team if it wouldn’t be better to just combine some or all of our repositories. About once a month. I admit that I still think some combining would be good, but I’m beginning to understand more fully the mindset that leads to lots of small repositories.
This difference is one of the easiest to trace to basic differences in how the products work. Mercurial is much more narrowly focused, as a product, than Subversion is. Mercurial is all about tracking changes to a set of files. Subversion is all about tracking changes to each file separately. Mercurial tracks some repository wide information, such as branches, tags, and repository settings. Subversion allows you to branch, tag, and set properties on the whole repository or any subdirectory, or any random unrelated set of files, if you so desire.
Of the two, Subversion is far more general purpose in nature. It tracks changes to each file and directory separately, only keeping an overall revision number that tracks the chronological order of changes. Because of that, there are many features in Subversion that allow you to operate on a portion of the repository. You can check out a specific subdirectory, map files from another subdirectory, keep your working directory files at different revisions per file (called mixed revisions), and set properties on directories that apply to a directory and all of its children, such as which files to ignore when doing an svn status. It’s even possible to set different permissions for different parts of the repository.
In contrast, Mercurial manages a single set of files in a repository. Directories are not first-class objects in Mercurial, as they are in Subversion; they’re just artifacts of file names. Although Mercurial internally tracks changes to each file separately, there is no way to put the working directory into a mixed-revisions state. The DAG cannot handle that type of freedom: every changeset in it records a state of the whole repository, and the working directory is always based on one changeset (or two, in the middle of a merge). Of course, the fact that Mercurial requires you to download the entire repository history to create your own working directory also puts downward pressure on the size of a repository. And because permissions are the same throughout a single repository, if different code needs different permissions, it also needs to be in a different Mercurial repository. The same is true for other settings, such as which files to ignore when doing hg status.
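To make the "directories are just artifacts of file names" point concrete, here is a toy sketch in Python. It is not Mercurial's real manifest format, and the function name and file paths are made up for illustration. The only assumption it relies on is that the repository tracks a flat list of file paths and nothing else.

```python
# Toy model only: this is NOT Mercurial's real manifest format.
# The function name and the file paths are hypothetical.

def implied_directories(tracked_files):
    """Return every directory implied by a flat list of tracked file paths."""
    dirs = set()
    for path in tracked_files:
        parts = path.split("/")[:-1]  # drop the file name itself
        for depth in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:depth]))
    return sorted(dirs)


manifest = ["README.txt", "src/main.py", "src/util/strings.py"]
print(implied_directories(manifest))

# Stop tracking the only file under src/util and, as far as this model
# is concerned, the directory vanishes along with it.
manifest.remove("src/util/strings.py")
print(implied_directories(manifest))
```

In a model like this there is nowhere to hang a per-directory property or permission, which is roughly why those settings end up applying to the whole repository.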
All of these differences naturally lead to smaller repositories that typically contain one project in Mercurial, and larger repositories that typically contain many, if not all, projects in Subversion. If you’re coming from Subversion, you’re going to want to get used to it. Fortunately, it appeals to your innate desire to componentize (that is an innate desire, right?). For me it is, and Mercurial makes it easier to do it at the project level. Of course, because Mercurial does less, it leaves to other systems the management of multiple repositories (see Bitbucket and Kiln).
Sam Hart, when he decided to switch from Subversion to Mercurial, discussed this exact phenomenon:
“If you’re like me, when you originally set up SVN you did so in the laziest way possible.
“Setting up SVN repos is more work than it should be. It involves using commands that you normally never have to touch (svnadmin), setting up new entries for those repos in your http server’s configuration files (if you’re using Apache and WebDAV), and setting up user permissions to those repos. Thus, the lazy way to set them up is to make one central SVN repo under which you have multiple sub-repos. This has the advantage of making your repository very easy to maintain. However [it] has a big disadvantage in that a user with write access to any sub-repo will have write access to the entire repo.
“In Hg, on the other hand, setting up a new repository is much easier, and maintaining multiple repositories more manageable. So, if you’re like me, you may be tempted to remedy past sins by splitting your single gargantuan SVN repo into smaller Hg repos.”
Commit often vs. commit when “ready”
Another change you’ll need to adjust to is to commit often. You’re probably used to making a bunch of related (or unrelated) changes, then doing some testing. You may build a version of your product and have others do testing. You’ll probably run automated tests, possibly multiple sets of automated tests. And finally, you’ll check in.
If you do this in a team using Mercurial, they’ll wonder where you disappeared to while your code was being written, complain about how large the code reviews are, and be frustrated at how slowly you iterate on your code towards a good solution.
On my team at Microsoft, we had a concept of a shippable chunk of software. This helped guide the creation of branches in our centralized VCS. We could work in the branch, possibly with one or two other developers, until we had something we could reasonably ship, then merge the branch back into the main development repository. Depending on the rules for checking in to a VCS, whether centralized or distributed, software teams develop an understanding, either explicit or implicit, of what a “committable chunk” is: what amount of code is worth committing, either for review or for sharing with others.
The key change in mindset for me has been to make my own “committable chunks” much smaller than they used to be. No longer do I make hundreds of changes in tens of files, tying up another developer for hours in code reviews. It’s easy to make frequent commits locally, and push those to a personal branch on Kiln regularly for review.
But DVCS’s don’t just make it easy to have smaller committable chunks. They make it easier to manage committable chunks of all sizes. Because I work against a personal repository and merging is so easy, I can commit almost minute by minute to my personal repository, push multiple times a day to the feature branch I’m working on, push occasionally from the feature branch to the main development branch, and handle multi-feature pushes from development to a stable release branch. Obviously, those are all possible with a CVCS, but they always took so much time and effort to manage the branches, do the merge, and verify that nothing broke. In practice, that meant that steps were left out, and things slipped through the cracks.
Now, my changes to code are clearer, my original intentions more obvious, and I feel far better with my code in source control. I can look at changes at a small granular level, or I can look at the big merges.
Branch always vs. branch rarely
Closely related to a cultural norm of small, frequent commits is a norm of branching. Every clone of a Mercurial repository introduces a new branch once a change has been made. It’s also easy to branch many times within that clone. When I first made the switch, I didn’t really understand this. I knew I could work separately from other developers in my own repository, but I didn’t think of it conceptually as branching. It was more like I had my own sandbox, which I could then merge with the main repository when I checked in. And the idea of easily branching within my own repository still seems new to me.
But I’m learning to embrace the value in branching. As with frequent commits, it’s the ease of merging that makes the benefits of branching so readily available. And I’m beginning to value the power of having branches within my local repository. I can work on bug fixes separate from major refactoring work, and easily (and quickly) switch between the two using a simple “hg up” command, even when I’m offline. That’s great for the times when I’m deep into feature code and a sudden urgent bug pops up that needs to be fixed and released immediately. I can also switch back and forth between work on two different features, which is great when I get stuck and just need a mental break from one of them. Also, it makes it super easy to prototype out new ideas without messing with my regular development.
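It took me a while to see why a clone really is a branch, so here is a toy model of a changeset DAG that helped me think about it. This is a sketch only: the Repo class and its methods are hypothetical names, not Mercurial's API or storage format. It assumes each changeset simply records its parent changesets.

```python
# Toy model of a changeset DAG. Not Mercurial's storage format or API;
# the Repo class and its method names are hypothetical.

class Repo:
    def __init__(self, parents=None):
        # Maps changeset id -> tuple of parent changeset ids.
        self.parents = dict(parents or {})

    def clone(self):
        # A clone starts out with a full copy of the history.
        return Repo(self.parents)

    def commit(self, node, *parent_nodes):
        self.parents[node] = tuple(parent_nodes)

    def pull(self, other):
        # Pulling only adds the changesets we don't already have.
        for node, parent_nodes in other.parents.items():
            self.parents.setdefault(node, parent_nodes)

    def heads(self):
        # A head is a changeset that no other changeset names as a parent.
        has_child = {p for ps in self.parents.values() for p in ps}
        return sorted(n for n in self.parents if n not in has_child)


main = Repo()
main.commit("a")            # shared starting point

mine = main.clone()
yours = main.clone()
mine.commit("b", "a")       # my work
yours.commit("c", "a")      # your work, built on the same parent

mine.pull(yours)
print(mine.heads())         # two heads: the two clones have branched

mine.commit("m", "b", "c")  # a merge changeset has two parents
print(mine.heads())         # back to a single head
```

Nobody in this model ever asked for a branch. Two people committing on top of the same parent was enough to create one, and the merge is just another changeset.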
One counterpoint to the ease of branching is that it may isolate developers. John DeRosa registers his concern about this:
“Additionally, I think distributed SCMs like Mercurial have a not-yet-fully-appreciated problem in making it too easy to not [ever] check code back into the main pool. With a local repository, a developer can feel protected from accidents and continue working happily for quite a long time. And then, say a year down the road, he/she does a massive check-in and discovers an integration problem. Branches, or a local repository that is effectively a private branch, should be easy to make — but not too easy.”
Let me explain why I don’t buy it. First, “a year down the road”?! Seriously? It says something that you have to imagine a scenario so horrible and unlikely in order to envision easy branching as a bad thing. I think that the author likely didn’t realize how easy merging usually is with a DVCS like Mercurial. And he must have totally forgotten that this lone maverick developer could have been merging the main development line into her own repository every day or week. The right solution to this imagined (and barely imaginable) scenario is not to eliminate easy branching, since without it the lone developer will do the same thing, but be much more likely to lose her work because it won’t be stored in a repository. The right solution is to fix a broken culture that enables someone to go a full year with no accountability for their work.
Source code files vs. all code files
Another important difference relates to what files you put in the repository. Because Mercurial and other DVCS’s don’t handle versioning of large files well, it is much more tempting to store them in a different way. This most obviously manifests itself in the storage of built binaries. If they are largish, and you want to keep lots of copies of them (nightly builds backed up for QA purposes, or even just weekly or monthly builds) then your repository becomes quite large and unwieldy very quickly. These types of files typically don’t diff well, making diffs between versions very large, and because the files are very large themselves, it means that downloading the repository takes much longer.
In this area, Subversion currently has a clear advantage. Only the files in the working directory are downloaded to client computers, so storing the history of large binary files only requires storage scaling on the server. Bandwidth is significantly reduced. Because it’s fairly simple, many Subversion installations have used it to track changes to built binaries and other very large files. The challenges to scale and management are limited to one machine, the server.
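A quick back-of-the-envelope calculation shows how lopsided this can get. Every number below is an assumption I made up for illustration, and the sketch deliberately ignores compression and any delta savings, so treat it as a rough worst case rather than a measurement of either tool.

```python
# Back-of-the-envelope only. All numbers are assumptions, and this ignores
# compression and whatever delta savings either tool manages to find.

BUILD_SIZE_MB = 50   # assumed size of one built binary
BUILDS_KEPT = 365    # assumed: one nightly build kept for a year


def full_history_mb(build_size_mb, builds_kept):
    """Worst case for a clone that must fetch every stored version."""
    return build_size_mb * builds_kept


def single_revision_mb(build_size_mb):
    """A checkout that only fetches the current version."""
    return build_size_mb


clone_mb = full_history_mb(BUILD_SIZE_MB, BUILDS_KEPT)
checkout_mb = single_revision_mb(BUILD_SIZE_MB)
print(f"each clone downloads roughly {clone_mb} MB")
print(f"each checkout downloads roughly {checkout_mb} MB")
```

With those made-up numbers, the full history comes to 18,250 MB for every developer who clones, while a checkout of the current revision stays at 50 MB no matter how many builds the server is holding.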
Naturally, users of Mercurial push handling of these large files to other systems. Their VCS is the location for their source files, typically a bunch of text files, which external tools then build into large binaries which are almost never stored in the same or another Mercurial repository. It is true that efforts are underway to alleviate this weakness in Mercurial, though I’m sure some don’t see it as a problem at all. The bfiles extension is an attempt to provide a more centralized model for certain large files. Of course, it has tradeoffs, but the fact that it’s being actively developed indicates that, at least for many, the tradeoffs are worth it.
For now, I’m happy that this aspect of Mercurial motivates me to automate more, to maintain more of the components of my products as code (in some form) that is compiled (using some method) to create these human-unreadable products.
Conclusion
There are obviously different ways to look at these cultural and philosophical differences between Subversion users and Mercurial users. One might look over the differences and conclude that Subversion seems much more flexible than Mercurial. Therefore, it must be better. Another might see how much better Mercurial handles basic source control features, such as branching, merging, and tags, and conclude that it is therefore a better product. It’s pretty obvious to me that these two views are quite related.
Michael Haggerty makes this point quite well in his post Git, Mercurial, and Bazaar—simplicity through inflexibility. The discussion is about the merging differences between Git and Subversion, but the principles apply to Mercurial as well. He argues that the very flexibility of Subversion is what makes merging more burdensome:
“Starting with release 1.5, Subversion, ironically, supports a much more flexible model of merging than the DAG-based DVCSs. Changes from any commit can be merged to any branch at the single-file level of granularity, enabling all of the operations listed above and some even weirder things (for example, a change that was originally applied to one file can be “merged” onto a completely different file). If your workflow demands this sort of thing, Subversion might hold significant advantages for you.
“But there are also many disadvantages to Subversion’s flexibility:
- Subversion’s merging model is more complicated than that of DAG-based VCSs, and therefore more complicated to implement and less predictable.
- It is much harder to visualize the history of a Subversion project (contrast that to DVCSs, whose history can be displayed as a single DAG).
- Subversion merges are innately slow, because of the large quantities of metadata that have to be manipulated.
- The bookkeeping of SVN merge info requires more user conscientiousness, and mistakes are not as easy to spot and fix.”
While he doesn’t take a stand on which is better, a CVCS like Subversion, or a DVCS like Mercurial or Git, I will. Mercurial (and other DAG-based DVCS’s) provides a level of intrinsic guidance to developers through the limitations it has. Like many other great products, it is defined in part by what is not included. One might easily say that it is defined in large part by that. Products like the original iPod and iPhone have this same feature. By focusing on the most important features, and specifically limiting users’ choice in other areas (changing batteries, how to buy and download apps, etc.), Apple created products that are wildly successful. True, they may not be as flexible as an Android, Blackberry, or Windows Phone. But they got the right things right.
And I think Mercurial is a step in that direction. I don’t think it’s there yet, but I don’t think anything else is any closer. Some other DVCS’s (Git, at least) are also heading in the right direction, though they may be coming from a different starting point. Mercurial creates a philosophical and cultural starting point because of the technical choices that define both its strengths and its weaknesses. That philosophical starting point is a fundamentally better starting point for software development. It leads to greater componentization, greater granularity of history, more productive use of development time, and more automation.