20 June 2012

The ideal iteration length, part 2

In my previous post on the ideal iteration length, I looked at how iteration length affected our development of Confluence at Atlassian. I also gave my definition of an iteration:

An iteration is the amount of time required to implement some improvements to the product and make them ready for a customer to use.

When I started at Confluence in 2006, getting improvements ready for customers only happened irregularly, and we were unlikely to have anything release-worthy until close to the end of each multi-month release cycle. From 2008 to 2010, we worked on a system of regular two-week iterations with deployments to our internal wiki, called Extranet. Selected builds were released externally for testing as well. This worked well, but we were still looking to improve the process.

Moving faster

In early 2011, we started looking at how we could get internal deployments available more quickly from our development code stream. There were two main sticking points:

  1. Upgrading the internal server meant taking it offline for up to 10 minutes while the upgrade was done. This was usually done during the day, so the dev team would be around to help out with any problems, but was a bit more inconvenient for everyone else.
  2. The release process still involved a bunch of manual steps which meant that building a release took one or two days of a developer’s time.

The first problem was solved with some ingenuity from our developers. We managed to find a hack where we could disable certain features of the application and take a short-term performance impact in order to do seamless deployments between near-identical versions of the software. We had to intentionally exclude upgrades which included any significant data or schema changes, but that still allowed the majority of our internal micro-upgrades to be done without any downtime.
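To make the exclusion rule concrete, here is a minimal sketch of the kind of check that could sit in front of a deployment. It is not our actual deployment code: the Build fields, the function names and the two strategy labels are all hypothetical, and it assumes each build declares a schema version and whether it carries a data migration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """Hypothetical description of a build; the field names are assumptions."""
    version: str
    schema_version: int
    has_data_migration: bool = False


def can_deploy_seamlessly(current: Build, candidate: Build) -> bool:
    """Return True when the candidate changes neither the schema nor the data."""
    same_schema = candidate.schema_version == current.schema_version
    return same_schema and not candidate.has_data_migration


def choose_strategy(current: Build, candidate: Build) -> str:
    """Pick a zero-downtime swap when it is safe, otherwise a normal offline upgrade."""
    return "seamless" if can_deploy_seamlessly(current, candidate) else "offline"


if __name__ == "__main__":
    running = Build("4.2-build-100", schema_version=31)
    next_commit = Build("4.2-build-101", schema_version=31)
    with_migration = Build("4.2-build-102", schema_version=32, has_data_migration=True)
    print(choose_strategy(running, next_commit))
    print(choose_strategy(running, with_migration))
```

The point of the sketch is that the decision is automatic: most commits take the seamless path, and the occasional build with a schema or data change falls back to a conventional upgrade with downtime.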

The second problem was solved just with more hard work on the automation front. We hired a couple of awesome build engineers, and over the course of a few months, they’d taken most of the hard work out of the release process. In the end, we had a Bamboo build which built a release for us with a single click.

Once these problems were resolved, we moved our team’s Confluence space on to its own server with the seamless deployment infrastructure. We have now been deploying Confluence there with every commit to our main development branch for more than a year.

The ability to have our team’s wiki running on the latest software all the time is incredible. It enables everyone in our team to test out new functionality on the instance, confident that they’re playing around with the latest code. It allows someone to make a quick change, see it deployed immediately in the team’s working area, and judge what kind of improvement it makes.

Bug fixing is transformed by the ability to deploy fixes as quickly as they’re implemented. If a serious problem arises due to a deployment that just went out, it is often simpler and faster to develop a fix and roll that change out to the server. That reduces unnecessary work around rolling back the instance to its previous version, and shortens the feedback loop between deployment of a feature and the team discovering any problems with it. In the long term, we’ve found that this improves the quality of the software and encourages the team to consider deployment issues during development.

Atlassian’s Extranet wiki, used by the entire organisation, has just moved on to our seamless deployment platform. I’ll have to report back later on how that pans out, but we’re optimistic about how it will help us deliver faster improvements to the organisation.

One-week iterations and continuous deployment

Late in 2011, Atlassian launched a new hosted platform called OnDemand. One of the most significant improvements for us internally with the new platform was a great new deployment tool called HAL. HAL supported deploying a new release on to a given instance via a simple web interface, and could roll out upgrades to thousands of customers at a time very easily in the same manner.

The OnDemand team at Atlassian now has a weekly release cycle, which is primarily limited by our customers’ ability to tolerate downtime, rather than any technical limitation.

In the Confluence team, we’re aiming to push out new parcels of functionality to these customers on that same timeframe, reducing our iteration length from two weeks to one, and reducing the time to ship new functionality to customers from a few months down to a week.

We have some problems with moving to this faster iteration model:

  • making sure all the builds are green with the latest code sometimes takes a couple of days, meaning the release process needs to wait until we confirm everything is working
  • our deployment artifact is a big bundle of all our products, so if a bug is identified late in any of the products, deployment of all of them might be delayed
  • we’ll be releasing any code changes we make to thousands of customers every week, rather than just internally.

Each problem requires a distinct solution that we’re working through at the moment.

For the first, we’ll be trying to streamline and simplify our build system. In particular, we want to make the builds required to ensure the functionality is working on OnDemand much more reliable and streamlined.

On the second problem, we’re looking to decouple our deployment artifacts so the products can be deployed independently. We would like to go even further than the product level, so we can update individual plugins in the product or enable specific features through configuration updates as frequently as we like.
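Enabling features through configuration is usually done with some form of feature flag. The following is a minimal sketch of the idea only, not a description of how Confluence or OnDemand implements it. The FeatureFlags class, the JSON format and the "new-editor" flag name are all made up for illustration.

```python
import json


class FeatureFlags:
    """Hypothetical flag store loaded from a JSON configuration document."""

    def __init__(self, config_text: str) -> None:
        self._flags = dict(json.loads(config_text))

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so code can ship before it is switched on.
        return bool(self._flags.get(name, False))

    def reload(self, config_text: str) -> None:
        # A configuration update replaces the flags without redeploying any code.
        self._flags = dict(json.loads(config_text))


if __name__ == "__main__":
    flags = FeatureFlags('{"new-editor": false}')
    print(flags.is_enabled("new-editor"))

    # Later, a configuration-only change turns the feature on.
    flags.reload('{"new-editor": true}')
    print(flags.is_enabled("new-editor"))
```

With something like this in place, shipping the code and switching the feature on become two separate decisions, and the second one no longer has to wait for a deployment.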

The final problem requires us to ensure our automated tests are up to scratch and covering every important area of the application. It’s important that we also continue to extend the coverage as we add new functionality, which is often a challenge with cutting-edge features. The platform provides an extremely good backup and restore system, so we also have a good safety net in place in case there are any problems.

What are the benefits of moving to a faster or continuous deployment model? They’re very similar to the benefits we first saw with the move to a two-week iteration cycle, just bringing them now to our customers:

  • customers will see small improvements to the product appear as soon as they are ready
  • bugs can be identified and fixed sooner, and those fixes made available to customers sooner
  • we can deploy specific chunks of functionality to a limited set of customers or beta-testers to see how it works out
  • releases for installed (also called “behind the firewall”) customers will contain mostly features that have already been deployed in small chunks to all the customers in OnDemand, reducing the risk associated with these big-bang releases.

That sums up the work the team is doing right now, to try to make this all possible.

What is the ideal iteration length?

Back to the original question then: what is the ideal iteration length? Let’s consider the various types of customers we have, and what they might say.

We have some customers who want to be on the bleeding edge, trying out the latest features even if it means occasional inconsistencies or minor bugs. We certainly prefer to run our internal wiki that way. These customers want to have the changes released as soon as they’re implemented: as short an iteration length as possible.

On the other hand, there are customers, particularly those running their own instance of Confluence, who prefer to upgrade on a schedule of months or years. These customers want stability and consistency, and would accept fewer features in exchange for them. For these customers, an iteration length of several months might be too fast.

Most of our customers sit somewhere in the middle of these two extremes.

What we’ve concluded after all this work is that the decision on speed of delivery should be in your customers’ hands. Your job as an engineering team is to ensure there is no technical reason why you can’t deliver the software as often as they’d like, even if that is as fast as you can commit some changes to source control.

That way, when your customers change their minds and want to get that fix or feature right now, there’s no reason why you have to tell them no.

Thanks for reading today’s article. If you’d like to know when I write something next, you can follow me (@mryall) on Twitter.