How are things actually created in the Tech industry?
You receive a notification about a new app version. You update the app and, all of a sudden, it runs better or has cool new features. However, what happens before that notification is far more interesting. The development process inside IT companies is not discussed enough in our market. If you’re a programmer, you can hardly know what programming looks like in a company, even after a third-round interview and a couple of conversations with the HR team. That’s why we’ve decided to tell you what it looks like at United Cloud. So, let’s start...
What makes our products specific is that they must always be available. For example, the EON TV platform runs 24/7 throughout the year for a huge number of devices. This means that everything must run smoothly in production, because the platform never sleeps, just like people who binge-watch TV shows at 3 a.m.
The following pages illustrate what preproduction, production, and postproduction look like in our company. But let’s begin with a bit of theory.
Trunk-based development
Trunk-based development (TBD) is a branching model in which every code change is integrated into one central branch called the trunk (main or master). So, whatever we’re developing, we create a branch from the trunk and integrate the changes back into the trunk. The build, i.e., the new app version, is always made from the trunk. Trunk-based development is a precondition for Continuous Integration. Continuous Integration means integrating as frequently as possible to get feedback as soon as possible, whereas Continuous Deployment means updating production constantly. So, the point is to integrate the code as often as possible and update production in small iterations, because a smaller iteration leaves less room for potential code issues.
The complexity of TBD lies in the fact that integrating non-functional code (bugs) into the trunk compromises the integrity of the build. Moreover, the more developers work on a project, the higher the risk of a bug being integrated into the trunk. We handle this complexity by checking the quality of what is integrated, i.e., by automated testing. And now we’re getting to the production phases themselves.
Preproduction
Preproduction includes several critical points. When you finish coding, and before you merge your branch back into the trunk, you run the first test. This first test covers critical items only, and it should be as short as possible. If the coding took you 20 minutes, there’s no point in waiting three hours for the test to complete. In this preproduction phase, we try to reduce the waiting time as much as possible and test only the essential items. For example, when an EON TV video cannot be played, this is a critical bug that must be fixed immediately. Such bugs are called “service disruptions”. But when an icon is displaced three pixels to the right, this is something you can live with and fix later on.
The next round of tests can run at night or during the day, whichever is most convenient. At this point, automated tests are launched independently of the developers and of ongoing development. This is when the most extensive testing is performed, because it does not matter whether it lasts 2 or 7 hours. When you come to work the next day, you’ll have comprehensive feedback and can start fixing bugs. The priority of this round is quality, not time.
Only after these tests are completed can we be sure that our build is good enough to move to production and that it won’t cause any problems when the app is updated. This completes the preproduction phase.
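The two test rounds described above can be sketched in code. The Python below is only an illustration under assumptions: the `Check` class, the `critical` flag, and the stage names `pre-merge` and `nightly` are hypothetical and not taken from any real tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One automated test (hypothetical structure)."""
    name: str
    critical: bool            # True if a failure would be a service disruption
    run: Callable[[], bool]   # returns True when the check passes


def select_checks(checks: List[Check], stage: str) -> List[Check]:
    """Pick the checks to run for a given stage.

    'pre-merge' runs only the critical checks so feedback stays fast;
    'nightly' runs everything, because duration does not matter there.
    """
    if stage == "pre-merge":
        return [c for c in checks if c.critical]
    if stage == "nightly":
        return list(checks)
    raise ValueError(f"unknown stage: {stage}")


def failed_checks(checks: List[Check]) -> List[str]:
    """Run the given checks and return the names of those that failed."""
    return [c.name for c in checks if not c.run()]


checks = [
    Check("video_playback_starts", critical=True, run=lambda: True),
    Check("icon_alignment", critical=False, run=lambda: False),
]

# Before merging into the trunk: only the critical check runs, and it passes.
print(failed_checks(select_checks(checks, "pre-merge")))  # []
# Nightly run: the cosmetic issue is reported too.
print(failed_checks(select_checks(checks, "nightly")))    # ['icon_alignment']
```

The design choice here is that the fast round is a strict subset of the full round, so anything that passes the nightly run has also passed the critical checks.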
Production
When we say production, we mean the final environment where the build is publicly available. As mentioned earlier, we have a product that should be available all the time. So, there is never a moment when the EON TV platform is not in use and we could update it with new features undisturbed. Finding the right balance is important here. On the one hand, a new build brings additional value. It will fix some bugs, add some new features, and improve the user experience. On the other hand, every build poses a risk of something not working properly. Therefore, we use Iterative Deployment in production. This means that one part of the users gets the new product version, and then we monitor how the platform behaves. If everything runs as it should, we proceed. So, we start off with 20% of users, then increase to 50%, and proceed in this manner until 100% of users are on the new version.
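One common way to decide which users belong to the first 20% or 50% is to hash each user ID into a stable bucket. The sketch below assumes that approach. It is not a description of how the EON TV platform actually selects users, and the function names are hypothetical.

```python
import hashlib

# Rollout stages taken from the text: 20%, then 50%, then everyone.
ROLLOUT_STAGES = [20, 50, 100]


def user_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def gets_new_version(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return user_bucket(user_id) < rollout_percent


# Count how many of 1000 sample users receive the new build at each stage.
users = [f"user-{i}" for i in range(1000)]
for percent in ROLLOUT_STAGES:
    count = sum(gets_new_version(u, percent) for u in users)
    print(f"{percent}% stage: {count} of {len(users)} users on the new version")
```

Because the bucket depends only on the user ID, a user who received the new version at 20% keeps it at 50% and 100%, so nobody flips back and forth between versions as the rollout widens.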
Postproduction
However, the process that follows production is more interesting to developers. It is then that we get feedback on what we have coded. Feedback shows us two important things:
1. Unanticipated use cases. Only when people start using a new feature do we realize that we may have failed to cover all use cases. Users use a product in very different ways, and we can’t even imagine everything they will do. So, we discover new use cases too. When this happens, we update our test cases.
2. Production performance degradation. This is the more important information for us. It tells us whether the new build introduced a performance issue or a memory leak that we couldn’t detect with a small number of test cases.
That’s why various techniques are used in postproduction. For example, we use monitoring, alerting, and testing techniques such as content duplication, traffic duplication, etc. Monitoring matters to us because it gives us feedback from production, lets us see what’s going on, and allows us to respond preventively if something is not right. For example, memory usage often grows linearly on the charts. This wouldn’t be a problem if memory were unlimited, but since it is limited, the service fails when usage reaches a critical point. Therefore, we set the alert at 80%, and it tells us that we have a problem somewhere. When this happens, we have to assess the situation. Sometimes the increase is fine: it is expected and unlikely to continue. But sometimes it is unexpected, and then it is a problem we have to resolve.
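The 80% alert and the linear growth described above can be illustrated with a minimal Python sketch. The function names, the megabyte units, and the sample values are assumptions made for the example, not part of any real monitoring system.

```python
from typing import List, Optional, Tuple


def should_alert(used_mb: int, limit_mb: int, threshold_percent: int = 80) -> bool:
    """True once memory usage reaches the alert threshold (80% by default)."""
    return used_mb * 100 >= threshold_percent * limit_mb


def minutes_until_limit(samples: List[Tuple[int, int]], limit_mb: int) -> Optional[float]:
    """Estimate the minutes left before the limit, assuming growth stays linear.

    samples is a list of (minute, used_mb) pairs, oldest first.
    Returns None when usage is flat or shrinking, or when there is too little data.
    """
    if len(samples) < 2:
        return None
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 == t0:
        return None
    rate = (m1 - m0) / (t1 - t0)  # MB per minute
    if rate <= 0:
        return None
    return (limit_mb - m1) / rate


# Hypothetical service with a 1000 MB limit that grows by 10 MB per minute.
samples = [(0, 500), (10, 600)]
print(should_alert(600, 1000))             # False: 60% is below the threshold
print(should_alert(800, 1000))             # True: exactly 80%
print(minutes_until_limit(samples, 1000))  # 40.0
```

The alert only says that the threshold was crossed. Whether the growth is expected or a leak still has to be judged by a person, as described above.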
As an additional level of user protection, we’ve introduced automated rollback. So, if something does not work, the application automatically returns to the previous stable version.
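The idea of automated rollback can be sketched as follows. This is a simplified illustration, not the actual mechanism: the `Deployment` class and the `is_healthy` callback are hypothetical, and a real health check would rely on the monitoring signals rather than a simple function.

```python
from typing import Callable


class Deployment:
    """Tracks the active version and the last version known to be stable."""

    def __init__(self, stable_version: str) -> None:
        self.stable_version = stable_version
        self.active_version = stable_version

    def release(self, new_version: str, is_healthy: Callable[[str], bool]) -> str:
        """Activate new_version, then keep it only if the health check passes."""
        self.active_version = new_version
        if is_healthy(new_version):
            # The new build becomes the version future rollbacks return to.
            self.stable_version = new_version
        else:
            # Automated rollback: return to the previous stable version.
            self.active_version = self.stable_version
        return self.active_version


deployment = Deployment("1.0")
print(deployment.release("1.1", is_healthy=lambda version: False))  # 1.0
print(deployment.release("1.2", is_healthy=lambda version: True))   # 1.2
```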