CODING IN A COMPLEX DOMAIN

We learned how to rise to complex challenges the hard way

Author: Igor Tanacković, Chief System Architect

For a start, we have a question: what is more difficult – building a bridge over the Gulf of Corinth or performing heart surgery? Bridge construction requires a team of a hundred people and days and months of work. Heart surgery requires a few experts and a couple of hours. Yet, bridge construction is easier. Why? Because the difficulty level depends on the domain where the challenges arise.

Bridge (or house, skyscraper) construction occurs in a complicated domain. This means that such a domain or system can be divided into smaller units, each of which can be solved individually. So, if the ground is weak – we’ll install piles. Then we will build the foundations, and afterward go floor by floor, proceeding in this manner down to the finest details.

On the other hand, heart surgery occurs in a complex domain. A complex domain is a network of interactions that cannot be observed (or solved) individually, only collectively, as a system. So, all aspects of the surgery have to function together, and at the same time.

We’re always in a complex domain

United Cloud is a development center. Development entails innovation. This means that there is no established process or ready-made solution to the problem we are dealing with. Even when we know what output we want (e.g., a feature that already exists on some other streaming platform), the process that brings us to the desired output is always another kettle of fish.

When you’re working in a complex domain, you should know that there will be things that won’t run smoothly. The EON launch is one of the stories we at United Cloud retell each other during lunch. Maybe more often than we should.

Every beginning is complex

When we founded United Cloud, we had to develop EON with all its functionalities very quickly. The first challenge was the streaming platform. Although we already had one, we had to rework it substantially so it would suit our product. The second challenge was the Metadata servers, which provided the client application with data such as the EPG, channel list, and event descriptions, and also took care of client packages, policies, credentials, etc. (so, everything required for the platform to function). In addition, the infrastructure had to be designed and set up; however, the biggest challenge regarding EON TV was the fact that the platform had to be available to every single user 24/7 throughout the year!

We spent all of 2017 developing the EON platform with a view to launching it on 5th September.

And that would’ve been great if that day had not been the day when the European Basketball Championship started in our country. And we launched EON not only in Serbia but also in Slovenia, Montenegro, and Bosnia. It seemed that the start of the championship was a great opportunity to attract viewers to EON straight away. Due to short deadlines, we were developing the platform until the very last day. We ran the tests we could run at that moment, and everything seemed to be working, until…

Control room?!

It didn’t start! Complex domains are unpredictable. Something that runs perfectly in the test phase sometimes does not work in reality. So, we faced a few serious problems at the very start. The first problem was a Distributed Denial of Service (DDoS). That is a distributed (hacker) attack intended to overload the infrastructure so that it stops working. We slogged our guts out until we figured out who was doing that to us... In the end, it turned out it was our own iOS application!

The next day, there was another thing. The Serbian team was playing and the number of users increased. The game was scheduled for 8 p.m. and everything ran perfectly... Until 7:58 p.m., when all of a sudden everybody started to log in... At that moment we realized that our landing page carried too much information. It was too “heavy”: serving it took a lot of processing and memory. Besides, it was not optimized, which hurt performance. Whenever a large crowd of users swarmed in, the servers went down, and we brought them back up. And all that happened in real time, while the game was on! We didn’t get much sleep those days.

During games, the platform was under heavier load, but when there was no game, the flow of users decreased and we had more room to maneuver. During the championship, we joked about Serbia and Slovenia playing in the final, as if that was the last thing we needed. A couple of days later, when Serbia and Slovenia qualified for the final, we didn’t think it was that funny!

Now it’s a different story

It was this challenging situation that made us resolve some complex issues at the very start. From 10,000 devices at the very beginning, we’ve come to half a million.

We used to provide a couple of megabits per second – and now, 1.5 TERABITS. We don’t even need anyone to stay up during the night!

Now we do everything differently. We have established a process that is a result of working in a complex domain. We don’t allow ourselves such big deployments, and we are particularly careful about the moment when we go to production. Most importantly, the entire process, from the start to the end of implementation, is different, much different.

The main step in implementing a quality process is codebase quality:

Codebase

We try to build as much quality as possible into every step. So, at the beginning, we take care of the quality of our codebase. This is the first thing to do. There are code standards, and there are code reviews where we actually discuss the implementation complexity. In this phase, we ask ourselves the following questions:

a. Can anything be coded more simply?

b. Is it simple enough for maintenance?

c. Is it a sufficiently robust solution?

d. Have we initially introduced any security issues?

e. Should we update our code standards?

Static analysis is an integral part of codebase quality

We use static analysis to check the quality of our code. These are relatively complex algorithms that check the state of the code: whether we’ve added any “technical debt” which would have to be repaid later on, and whether the application itself is sufficiently robust (resistant to issues). We receive feedback immediately and automatically, so the algorithm itself alerts us. For example, we may be told that some class has to be changed because it is too long (it has 300 lines of code) and because it is often modified by more than one developer. This signals a potential problem. We even get proposed solutions, e.g., to divide that particular class into several smaller ones or to refactor the code.
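To make the "class is too long" example concrete, here is a minimal sketch of such a check. It is not the tool we use; it is an illustration written with Python's standard `ast` module, and the names `Finding`, `find_long_classes`, and `MAX_CLASS_LINES` are hypothetical. The 300-line threshold comes from the example above. The sketch only measures class length and ignores the "modified by more than one developer" signal, which would require version-control history.

```python
import ast
from dataclasses import dataclass

# Threshold taken from the example in the text; a real tool would make it configurable.
MAX_CLASS_LINES = 300


@dataclass
class Finding:
    """One static-analysis alert (hypothetical structure, for illustration only)."""
    class_name: str
    line_count: int
    suggestion: str


def find_long_classes(source: str, max_lines: int = MAX_CLASS_LINES) -> list[Finding]:
    """Return a finding for every class whose definition spans at least max_lines lines."""
    findings = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # lineno/end_lineno are the first and last source lines of the class (Python 3.8+).
            line_count = node.end_lineno - node.lineno + 1
            if line_count >= max_lines:
                findings.append(
                    Finding(
                        class_name=node.name,
                        line_count=line_count,
                        suggestion="consider splitting into smaller classes or refactoring",
                    )
                )
    return findings
```

Calling `find_long_classes(open("some_module.py").read())` on a source file (the file name is a placeholder) returns a list of findings that a pipeline step could print or turn into review comments. Counting physical lines is a deliberately crude measure; production analyzers typically combine several metrics.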

The next crucial step in implementing the process is the deployment pipeline, which we have successfully optimized over the past five years. We’ve written a separate text about it because we are aware that nobody’s attention span is long enough to keep up with a text longer than three pages! Read it at this link.

Lastly, if you are coding in a complex domain too – contact us, and let’s talk (more) about good practices!