This is one of the coolest and most important things we recently built at HackerEarth.
What’s so cool about it? Just have a little patience, you will soon find out. But make sure you read till the end 🙂
I hope to provide valuable insights into the implementation of a Continuous Deployment System (CDS).
At HackerEarth, we iterate over our product quickly and roll out new features as soon as they are production ready. In the last two weeks, we deployed 100+ commits in production, and a major release comprising more than 150 commits is scheduled for launch within a few days. Those commits consist of changes to the backend app, website, static files, database, and so on. We have over a dozen different types of servers running, for example, webserver, code-checker server, log server, wiki server, realtime server, NoSQL server, etc. All of them are running on multiple EC2 instances at any point in time. Our codebase is still tightly integrated as one single project, with many different components required for each server. When the codebase changes, all the related dedicated servers and components must be updated when deploying in production. Doing that manually would have driven us crazy and would have been a total waste of time!
Look at the table of commits deployed on a single day.
At that speed, we needed an automated deployment system along with automated testing. Our implementation of CDS lets the team roll out features in production with just a single command: git push origin master. Another reason to use CDS is that we are trying to automate everything, and I see us going in the right direction.
The process begins with the developer pushing a bunch of commits from their master branch to a remote repository, which in our case is set up on Bitbucket. We have set up a post hook on Bitbucket, so as soon as Bitbucket receives commits from the developer, it generates a payload (containing information about the commits) and sends it to the toolchain server.
The toolchain server backend receives the payload and filters the commits by branch: it ignores any commit that is not on the master branch, as well as any merge commit.
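The filtering step can be sketched roughly as follows. This is a minimal illustration, not the toolchain's actual code: the function name `filter_commits` and the payload field names (`commits`, `branch`, `parents`, `raw_node`) are assumptions made for the example, and a merge commit is identified here as a commit with more than one parent.

```python
def filter_commits(payload, branch="master"):
    """Keep commits on `branch` that are not merge commits.

    The payload layout used here is an assumption for illustration.
    """
    kept = []
    for commit in payload.get("commits", []):
        # Drop commits pushed to any other branch.
        if commit.get("branch") != branch:
            continue
        # A merge commit has more than one parent.
        if len(commit.get("parents", [])) > 1:
            continue
        kept.append(commit)
    return kept


payload = {
    "commits": [
        {"raw_node": "a1", "branch": "master", "parents": ["p0"]},
        {"raw_node": "b2", "branch": "feature-x", "parents": ["a1"]},
        {"raw_node": "c3", "branch": "master", "parents": ["a1", "b2"]},
    ]
}
print([c["raw_node"] for c in filter_commits(payload)])  # prints ['a1']
```

Only `a1` survives: `b2` is on another branch and `c3` is a merge commit.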
Filtered commits are then grouped intelligently using a file dependency algorithm.
The top commit of each group is sent for testing to the integration test server via RabbitMQ. At first, I wrote code that sent every commit for testing, but it was too slow. So Vivek suggested that I group the commits from the payload and run the tests only on the top commit of each group, which drastically reduced the number of times the tests are run.
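The post does not spell out the grouping rule, so the following is only one plausible sketch of it: commits that touch at least one common file end up in the same group (connected components via union-find), and the "top" commit is taken to be the most recent commit of each group. The names `group_commits` and `top_commits` and the sample file paths are hypothetical.

```python
from collections import defaultdict


def group_commits(commits):
    """Group commits that share at least one file.

    `commits` is a list of (commit_hash, files) tuples in push order.
    Returns a list of groups, each a list of hashes in push order.
    """
    parent = list(range(len(commits)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so groups stay in push order.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # file path -> index of the first commit touching it
    for idx, (_, files) in enumerate(commits):
        for path in files:
            if path in first_seen:
                union(idx, first_seen[path])
            else:
                first_seen[path] = idx

    groups = defaultdict(list)
    for idx, (commit_hash, _) in enumerate(commits):
        groups[find(idx)].append(commit_hash)
    return [groups[root] for root in sorted(groups)]


def top_commits(groups):
    """Assumption: the top commit is the most recent commit of each group."""
    return [group[-1] for group in groups]


pushed = [
    ("c1", ["app/views.py"]),
    ("c2", ["static/site.css"]),
    ("c3", ["app/views.py", "app/models.py"]),
    ("c4", ["app/models.py"]),
]
groups = group_commits(pushed)
print(groups)               # prints [['c1', 'c3', 'c4'], ['c2']]
print(top_commits(groups))  # prints ['c4', 'c2']
```

With four pushed commits, the tests run twice instead of four times, which is the saving the grouping is meant to deliver.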
Integration tests are run on the integration test server, a simulated setup that replicates production behavior. The tests run on a separate branch called test, onto which commits are cherry-picked from master. If the tests pass, the commits are put in a release queue, from where they are released in production. Otherwise, the test branch is rolled back to the previous stable commit and clean-up actions are performed, including notifying the developer whose commits failed the tests.
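The pass/fail flow described above can be sketched with in-memory stand-ins. This is an illustration only: a plain list plays the role of the test branch (extending it stands in for `git cherry-pick`, truncating it for a rollback), and `process_group`, `run_tests`, and `notify` are hypothetical names rather than parts of the real toolchain.

```python
from collections import deque


def process_group(group, test_branch, release_queue, run_tests, notify):
    """Apply a group of commits to the test branch, then release or roll back.

    Returns True if the group passed and was queued for release.
    """
    stable_length = len(test_branch)   # remember the last stable state
    test_branch.extend(group)          # stands in for cherry-picking from master
    if run_tests(test_branch):
        release_queue.extend(group)    # ready to be released in production
        return True
    del test_branch[stable_length:]    # roll back to the previous stable commit
    notify(group)                      # tell the developer which commits failed
    return False


test_branch = ["s0"]
release_queue = deque()
failed = []


def fake_tests(branch):
    # Pretend the test suite fails whenever the commit "bad" is present.
    return "bad" not in branch


process_group(["c1", "c2"], test_branch, release_queue, fake_tests, failed.extend)
process_group(["bad"], test_branch, release_queue, fake_tests, failed.extend)
print(test_branch)          # prints ['s0', 'c1', 'c2']
print(list(release_queue))  # prints ['c1', 'c2']
print(failed)               # prints ['bad']
```

The failing group leaves the test branch exactly where it was, which is what keeps the test and release branches stable.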
We have been using three branches: master, test, and release. Developers push their code to master, so this branch can be unstable. The test branch is for the integration test server and the release branch is for the production server. The release and test branches move in parallel, and they are always stable. As we write more tests, the chance of a bad commit being deployed to production will keep decreasing.
Each commit (or revision) is stored in the database. This data is helpful in many circumstances, such as finding previously failed commits, relating commits to each other using the file dependency algorithm, monitoring deployment, etc.
The following Django models are used:
* Revision – commithash, commitauthor, etc.
* Revision Status – revisionid, testpassed, deployedonproduction, etc.
* Revision Files – revisionid, filepath
* Revision Dependencies
When the top commit of each group is passed to the integration test server, we first find its dependencies, that is, the previously failed commits it is related to according to the file dependency algorithm. We save them in the Revision Dependencies model so that we can query them directly from the database the next time.
from collections import deque
from operator import attrgetter


def get_dependencies(revision_obj):
    """Return the failed revisions that revision_obj depends on, oldest first."""
    dependencies = set()
    visited = {revision_obj.id: True}
    queue = deque([revision_obj])
    filter_id = revision_obj.id
    # Breadth-first walk over the dependency graph.
    while queue:
        rev = queue.popleft()
        dependencies.add(rev)
        for dep in get_all_dependent_revs(rev, filter_id):
            # Mark on enqueue so a revision is never queued twice.
            if not visited.get(dep.id):
                visited[dep.id] = True
                queue.append(dep)
    # A revision is not its own dependency.
    dependencies.remove(revision_obj)
    return sorted(dependencies, key=attrgetter('id'))


def get_all_dependent_revs(rev, filter_id):
    """Return saved dependencies if any, else failed older revisions sharing a file."""
    # Dependencies saved on an earlier run (Revision Dependencies model).
    deps = rev.health_dependency.all()
    if len(deps) > 0:
        return deps
    files_in_rev = [f.filepath for f in rev.files.all()]
    # Older revisions that touched the same files and failed the tests.
    return Revision.objects.filter(files__filepath__in=files_in_rev,
                                   id__lt=filter_id,
                                   status__health_status=False)
As we saw earlier in the Overview section, these commits are then cherry-picked onto the test branch from the master branch, and the process continues.
Commits that passed the integration tests are now ready to be deployed. There are a few things to consider when deploying code to production, such as restarting the webserver, deploying static files, running database migrations, etc. The toolchain code decides which servers to deploy on, which servers to restart, and whether to collect static files or run database migrations, based on the changes made in the commits. It does all this on the basis of the types and categories of files added, modified, or deleted in the commits to be released.
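A file-based decision of this kind can be sketched as a small rule table. Everything below is an assumption for illustration: the glob patterns, the server names in `SERVER_PATTERNS`, and the `DeployPlan` and `plan_deployment` names are hypothetical, and the real toolchain rules are not shown in this post.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

# Hypothetical rules mapping file patterns to deployment actions.
STATIC_PATTERNS = ["static/*", "*.css", "*.js"]
MIGRATION_PATTERNS = ["*/migrations/*.py"]
SERVER_PATTERNS = {
    "webserver": ["app/*.py", "templates/*"],
    "code-checker": ["checker/*"],
}


@dataclass
class DeployPlan:
    collect_static: bool = False
    run_migrations: bool = False
    restart_servers: set = field(default_factory=set)


def _matches(path, patterns):
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def plan_deployment(changed_files):
    """Derive the deployment actions from the files changed in the commits."""
    plan = DeployPlan()
    for path in changed_files:
        if _matches(path, STATIC_PATTERNS):
            plan.collect_static = True
            continue  # a static-only change needs no restart
        if _matches(path, MIGRATION_PATTERNS):
            plan.run_migrations = True
        for server, patterns in SERVER_PATTERNS.items():
            if _matches(path, patterns):
                plan.restart_servers.add(server)
    return plan


plan = plan_deployment([
    "static/css/site.css",
    "app/migrations/0002_add_field.py",
    "checker/runner.py",
])
print(plan.collect_static, plan.run_migrations, sorted(plan.restart_servers))
# prints True True ['code-checker', 'webserver']
```

Keeping the rules in data rather than in branching code makes it easy to add a new server type without touching the planning logic.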
You might also have noted that we control deployment to the production and test servers from the toolchain server (that’s the one which receives the payload from Bitbucket). We use Fabric to achieve this. A great tool indeed for executing remote administrative tasks!
from fabric.api import run, env, task, execute, parallel, sudo

import toolchain.deploy_utils
# Revision, init_deploy_static and init_deploy_default are defined elsewhere
# in the toolchain project; their imports are not shown here.


@task
def deploy_prod(config, **kwargs):
    """
    Deploy code on production servers.
    """
    revision = kwargs['revision']
    commits_to_release = kwargs['commits_to_release']
    # Look up the stored Revision for every commit being released.
    revisions = [Revision.objects.get(raw_node=commit)
                 for commit in commits_to_release]
    # Run the static-file step first; continue only if it succeeds.
    result = init_deploy_static(revision, revisions=revisions, config=config,
                                commits_to_release=commits_to_release)
    is_restart_required = toolchain.deploy_utils.is_restart_required(revisions)
    if result is True:
        init_deploy_default(config=config, restart=is_restart_required)
All these processes take about 2 minutes to deploy a group of commits, or a single push, on all machines. Our life is a lot easier; we no longer worry about pushing our code, and we can see our feature or bug fix or anything else live in production in just a few minutes. This also helps us release new features without wasting much time. Now deploying is as simple as writing code and testing on a local machine. We also deployed the hundredth commit to production a few days ago using automated deployment, which is a testament to the robustness of this system.
P.S. I am an undergraduate student at IIT-Roorkee. You can find me @LalitKhattar.
This post was originally written for the HackerEarth Engineering blog by Lalit Khattar, Summer Intern 2013 @HackerEarth