Build and Deploy

A big red deploy button.

The YPlan build and deploy process is without a doubt the best I've ever worked with, and it has a couple of interesting features I'd never encountered before. Below I walk through an overview of that process.

Background

YPlan is an agile development shop in the fast and responsive sense, and you can see this clearly in our tooling. To be properly agile you need to get your cycle times down, and that's all about tools and processes. My favourite article on this is Doing continuous delivery? Focus first on reducing release cycle times.

To give you a sense of where we're at, here are the current vital stats for our monolithic Python/JS server build.

  • Minimum cycle time when following normal processes: <1 hour
  • Lines of code (including tests): Python ~200k, JavaScript ~100k
  • Average pull requests per developer per work day: 2.4
  • Average deploys to production per work day: 4.1
  • Average database migrations per work day: 1.2
  • Site downtime for releases etc: 0

The build and deploy process itself runs as a series of 8 Jenkins jobs; I will cover each in its own section below.

Test Diff (18 minutes)

Every version of every pull request is tested on Jenkins. This runs the extended version of our normal Python and JavaScript test suites.

The developers can and do run these tests themselves, or more commonly just the parts relevant to what they are working on. However, running them on Jenkins like this means the developers are not required to run the tests themselves and can get on with other tasks while that happens. Waiting for a Test Diff pass is also optional: if you have run the full test suite yourself and it passed, you can skip it.
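For a rough idea of what the Test Diff job itself does, here is a minimal sketch of the kind of wrapper script it could run. The commands and the use of pytest/npm are illustrative assumptions rather than our exact setup.

    #!/usr/bin/env python
    # Illustrative sketch of a Test Diff wrapper script; the commands and
    # test runners (pytest, npm) are assumptions, not the exact YPlan setup.
    import subprocess
    import sys

    def run(cmd):
        print('+ ' + ' '.join(cmd))
        return subprocess.call(cmd)

    def main():
        failed = 0
        # Extended Python suite.
        failed += run(['python', '-m', 'pytest'])
        # JavaScript suite.
        failed += run(['npm', 'test'])
        sys.exit(1 if failed else 0)

    if __name__ == '__main__':
        main()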

If something is going through the cycle at speed then code review will happen in parallel with Test Diff. Code review is where the main difference between normal and minimal cycle times comes up. For smaller things it's not uncommon for code review to be complete before Test Diff finishes; however, in order to minimise interruptions the only requirement I place on the team is that they deal with any code reviews within one working day.

1 Tag Build (seconds)

After code review is complete the pull request is merged into master and this job tags it with a build number and kicks off the main build process.
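In case it helps to picture it, a job like this can be little more than a git tag plus a trigger of the downstream job. The sketch below is hypothetical; the tag format and trigger mechanism are assumptions.

    # Hypothetical sketch of a Tag Build step: stamp the tip of master with
    # the next build number so every later job works from that exact commit.
    import subprocess

    def tag_build(build_number):
        tag = 'build-{0}'.format(build_number)
        subprocess.check_call(['git', 'fetch', 'origin', 'master'])
        subprocess.check_call(['git', 'tag', tag, 'origin/master'])
        subprocess.check_call(['git', 'push', 'origin', tag])
        return tag

In practice the Jenkins build number itself is a natural choice for the tag, and the downstream Build AMI job can be triggered with the new tag as a parameter.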

2 Build AMI (17 minutes)

We deploy into AWS and all our machines are built as AMIs, ready to be instantiated on AWS. The complete configuration of these machines is maintained in the codebase, mostly as Ansible scripts.

Nightly Jenkins jobs build base AMIs from scratch; these have our standard operating system setup and apt packages installed. They also have the Python and JavaScript dependencies pre-cached but not installed.

After Tag Build runs, this job makes a new AMI by taking the base AMI and installing our code and dependencies (using the pre-cached versions where it can). It also sets up any additional config for stuff like Nginx. All of this is orchestrated by Ansible scripts and is largely the same when building or updating a developer VM instead.
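A heavily simplified sketch of the bake step using boto3, assuming the base AMI id is known and that the Ansible provisioning run happens in the middle. The instance type and naming scheme are illustrative, not our actual values.

    # Minimal boto3 sketch of baking a build AMI from the nightly base AMI.
    # The instance type, naming scheme and the elided Ansible run are all
    # assumptions for illustration.
    import boto3

    def bake_ami(base_ami_id, build_tag, instance_type='m3.large'):
        ec2 = boto3.client('ec2')
        # Launch a builder instance from the nightly base AMI.
        instance_id = ec2.run_instances(
            ImageId=base_ami_id, InstanceType=instance_type,
            MinCount=1, MaxCount=1,
        )['Instances'][0]['InstanceId']
        ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])

        # ... run the Ansible playbooks against the instance here to install
        # the tagged code, the pre-cached dependencies, Nginx config, etc ...

        # Snapshot the provisioned instance as the AMI for this build.
        image_id = ec2.create_image(
            InstanceId=instance_id, Name='app-' + build_tag,
        )['ImageId']
        ec2.get_waiter('image_available').wait(ImageIds=[image_id])
        ec2.terminate_instances(InstanceIds=[instance_id])
        return image_id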

3a Test New AMI (12 minutes)

This job runs our extended test suite again; it's similar to Test Diff but with a couple of notable differences.

First, this is the actual final code and AMI, whereas Test Diff runs on review code and so will miss any issues that arise when merging to master. In practice, because of our rapid cycle time, our pull requests are generally within hours of master, so merge issues are rare.

Second, Test Diff builds a fresh database to test on by running the full set of migrations. This job instead takes a snapshot of the current production schema and applies only the new migrations. Doing it this way accounts for any possible discrepancies between the migration history and the actual production schema.
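To make the difference concrete, the database setup for this job amounts to something like the sketch below, assuming a PostgreSQL database, a Django-style project, and a schema dump that also carries the migration history table. The file and database names are made up.

    # Sketch of building the test database from a production schema snapshot
    # and then applying only the migrations production hasn't run yet.
    # Assumes PostgreSQL, a Django-style manage.py, and that the dump carries
    # the migration bookkeeping table; names are illustrative.
    import subprocess

    def build_db_from_prod_schema(schema_dump='prod_schema.sql', db='test_db'):
        # Recreate the test database from the captured production schema...
        subprocess.check_call(['dropdb', '--if-exists', db])
        subprocess.check_call(['createdb', db])
        subprocess.check_call(['psql', '-d', db, '-f', schema_dump])
        # ...then apply only the new, not-yet-applied migrations on top
        # (assuming the project's settings point at this database).
        subprocess.check_call(['python', 'manage.py', 'migrate', '--noinput'])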

3b Test Old AMI New Schema (10 minutes)

Running in parallel with Test New AMI, this job tests the impact of any incoming migrations on the currently deployed production code.

It does this by taking a snapshot of the current production database schema and applying the new migrations, as in Test New AMI, but then running the tests for the currently deployed version of the code rather than the to-be-deployed version.

If this passes it gives us a high level of confidence that those migrations can be run on the production databases immediately without breaking anything.

This approach does place some limitations on what you can do in a migration. Adding or modifying columns is fine, but something like a column rename has to take place across several deployments. However, those cases are rare and the cost is worth it to be able to migrate at will without needing downtime.
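As an example of what "across several deployments" means for a rename: the first deployment just adds the new column alongside the old one, a later one backfills the data and switches the code over, and a final one drops the old column. A Django-style sketch of that first step, with made-up app, model and field names:

    # Hypothetical first step of a multi-deployment column rename, written as
    # a Django-style migration. The app, model and field names are invented.
    from django.db import migrations, models

    class Migration(migrations.Migration):
        dependencies = [('events', '0042_previous')]
        operations = [
            # Add the new column, nullable, while the old one stays in place.
            # Old code keeps using the old column, so this schema change is
            # safe to apply before the new code is deployed.
            migrations.AddField('event', 'starts_at',
                                models.DateTimeField(null=True)),
        ]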

4 Migrate and Run Acceptance Tests (6 minutes)

First, any new migrations are applied to the live production databases.

This is where the rubber really hits the road. Yes, it's a fully automated path from merging to master to migrating the production databases, and no, I've never regretted it. As I noted above, by this point we have a high level of confidence that this isn't going to break production.

Next we spin up an actual production instance, talking to the production database, memcache, etc., and run some basic sanity tests on it. The purpose of this step is to find problems in the environment config that wouldn't be picked up by the earlier tests: broken Nginx config, inability to connect to the database, and so on.
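The checks themselves can be as simple as hitting a couple of URLs on the new instance, along the lines of the sketch below; the endpoints are hypothetical placeholders.

    # Illustrative sanity checks against the freshly booted production
    # instance. The endpoints are hypothetical placeholders.
    import requests

    def sanity_check(host):
        base = 'http://{0}'.format(host)
        # Nginx and the app server are up and serving pages.
        assert requests.get(base + '/', timeout=10).status_code == 200
        # The app can reach its database, memcache and other backing services.
        assert requests.get(base + '/healthcheck/', timeout=10).status_code == 200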

5 Notify Deploy Ready (seconds)

This updates a big red deploy button on the front page of Jenkins with the newly tested build number and a link to the diff between that and the currently deployed production code.

In the past we used to just deploy automatically; however, as we've grown we've found that this interferes with manual operations like online schema changes and data backfills. So instead we have opted for a low-effort, human-triggered process that we run several times a day. This is the other main place where minimal and typical cycle times diverge.

6 Deploy (6 minutes)

The actual deployment uses a blue/green approach. A new CloudFormation stack with a full spread of machines is spun up next to the existing deployment. As the health checks come good on the new machines they are connected to the load balancers. Once all the new machines are live, the old formation is scaled down to zero machines and then garbage collected an hour later.

This is a zero-downtime deployment. We pull the rug out from under the whole site roughly four times a day and staff and customers never notice.

If something does somehow get through the above testing, then an immediate rollback consists of scaling the old formation back up and then scaling the new one down, which is a few clicks in the AWS console.
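For a sense of what that rollback looks like if you script it rather than click it, here is a hedged boto3 sketch, assuming each formation is backed by an Auto Scaling group; the group names and capacity are illustrative.

    # Sketch of the rollback path: bring the previous formation back up, then
    # scale the bad one to zero. Group names and capacity are assumptions.
    import boto3

    def roll_back(old_asg, new_asg, capacity):
        autoscaling = boto3.client('autoscaling')
        # Scale the previous formation back up behind the load balancers...
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=old_asg,
            MinSize=capacity, MaxSize=capacity, DesiredCapacity=capacity,
        )
        # ...then take the newly deployed formation down to zero machines.
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=new_asg,
            MinSize=0, MaxSize=0, DesiredCapacity=0,
        )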

Fin

And that in summary is it.

I wouldn't claim for a moment that this is the process to end all processes. It serves us very well at our current scale, but it has grown with us and I expect it to continue to develop as we do.