Today, we’re sitting down with Daniil Koshelev, a Lead Software Engineer at T-Bank and a lecturer at ITMO University. With extensive experience in developing high-load systems, container technologies, and microservices architecture, Daniil has made significant contributions to both T-Bank and VK. Let’s dive into his journey and insights in the world of software engineering.
How did your journey in software engineering begin, and what led you to specialize in Golang and distributed systems?
My passion for computers and programming began back in school, when I was learning to create my own computer games. Writing programs grew from a hobby into a profession, helped greatly by enrolling at ITMO. After my first job, I realized I wasn't that interested in developing products for end users; I enjoyed building infrastructure and tooling much more, something I discovered while working on high-load infrastructure at VK. I was drawn to whatever sat closer to the network stack, the operating system, and data storage. The Go language is a perfect fit for these tasks, and it's what I use in my daily work, now at T-Bank.
Can you elaborate on the vulnerability scanning system you designed for Docker images at T-Bank? What were the key challenges and how did you overcome them?
Since we are developing our own container registry for T-Bank's internal development platform, we needed a mechanism for scanning images for vulnerabilities, similar to the one Docker Hub provides.
We used open-source tools to generate an SBOM (software bill of materials) and then analyze it for vulnerabilities. The system itself follows a microservice architecture: a cluster of worker services plus separate coordinator instances.
The main difficulties were handling multi-platform images and scanning images that contain a huge number of files or are simply very large.
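The coordinator/worker split described above can be sketched with Go channels. This is a minimal illustration, not the actual T-Bank service: the type names are invented, and `scanImage` stands in for the real SBOM generation and vulnerability matching.

```go
package main

import (
	"fmt"
	"sync"
)

// ScanJob identifies one image to scan. Illustrative only.
type ScanJob struct {
	Image string
}

// ScanResult carries the outcome for one image.
type ScanResult struct {
	Image string
	Vulns int
}

// scanImage stands in for the real work: a worker would generate an
// SBOM with an open-source tool and match its components against a
// vulnerability database.
func scanImage(job ScanJob) ScanResult {
	return ScanResult{Image: job.Image, Vulns: 0} // dummy result
}

// runScans plays the coordinator: it fans jobs out to nWorkers
// and collects every result.
func runScans(jobs []ScanJob, nWorkers int) []ScanResult {
	jobCh := make(chan ScanJob)
	resCh := make(chan ScanResult)
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobCh {
				resCh <- scanImage(j)
			}
		}()
	}
	go func() {
		for _, j := range jobs {
			jobCh <- j
		}
		close(jobCh)
		wg.Wait()
		close(resCh)
	}()
	var out []ScanResult
	for r := range resCh {
		out = append(out, r)
	}
	return out
}

func main() {
	jobs := []ScanJob{{"app:v1"}, {"db:v2"}, {"cache:v3"}}
	results := runScans(jobs, 2)
	fmt.Println(len(results), "images scanned")
}
```

Separating the coordinator from the workers like this is what lets large or file-heavy images be scanned without blocking the rest of the queue: capacity scales by adding workers.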
You led the migration of over a million Docker images across container registries. What strategies did you employ to ensure a smooth transition, and what lessons did you learn from this process?
First, we divided the process into two stages: transferring the images, then switching traffic. Second, we organized the switchover incrementally: we switched in groups, starting with the least critical ones. Most importantly, immediately before each switch we ran automated checks of the state of the two container registries and proceeded only if they matched. We also had a mechanism for switching back in case of force majeure.
The switchover went quite smoothly, although a few images caused problems. The most important lesson is that you can't foresee everything at once, so migrations like this should be carried out in parts, gradually increasing the share of traffic while fixing the problems that arise.
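A pre-switch consistency check of the kind mentioned above can be sketched by comparing image digests between the two registries. This is an assumption about how such a check might look; in the real migration the digest maps would come from the registries' APIs rather than literals.

```go
package main

import "fmt"

// findMismatches returns the images whose digest in dst differs from
// src, or which are missing from dst entirely. A non-empty result
// means the switch must be held back.
func findMismatches(src, dst map[string]string) (mismatched []string) {
	for image, digest := range src {
		if dst[image] != digest {
			mismatched = append(mismatched, image)
		}
	}
	return
}

func main() {
	// Digest maps as a registry API might report them (illustrative).
	src := map[string]string{"app:v1": "sha256:aaa", "db:v2": "sha256:bbb"}
	dst := map[string]string{"app:v1": "sha256:aaa", "db:v2": "sha256:ccc"}
	if bad := findMismatches(src, dst); len(bad) == 0 {
		fmt.Println("registries consistent, safe to switch traffic")
	} else {
		fmt.Println("hold the switch, mismatches:", bad)
	}
}
```

Gating each traffic switch on a check like this is what makes the incremental rollout safe: a group only moves once its images are provably identical in both registries.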
In your role at VK, you designed and implemented a new API access token format. How did this improve security and efficiency, and what considerations went into its design?
The main thing we were guided by was the security of user sessions: we cannot allow them to be compromised, even though additional security mechanisms protect account access.
Accordingly, working closely with information security specialists, we selected encryption algorithms and methods for generating and storing keys, and accounted for all known attacks and vulnerabilities.
The new format made sessions and API tokens more secure, and also let us embed additional information in the token, which product developers can use for their own purposes.
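The actual VK token format is not public, so the following is only a generic sketch of the idea: an authenticated-encryption token (AES-GCM here) that both protects the session and carries extra product data. The claim fields and key handling are illustrative assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// Claims shows how product-specific data can ride inside a token.
// The field set is invented for this sketch.
type Claims struct {
	UserID int64  `json:"uid"`
	Scope  string `json:"scope"`
	Extra  string `json:"extra"` // product-specific payload
}

// seal encrypts and authenticates the claims with AES-GCM.
func seal(key []byte, c Claims) (string, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	plain, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	// Prepend the nonce so the verifier can recover it.
	return base64.RawURLEncoding.EncodeToString(gcm.Seal(nonce, nonce, plain, nil)), nil
}

// open rejects tampered tokens (GCM authentication fails) and
// decodes the claims otherwise.
func open(key []byte, token string) (Claims, error) {
	var c Claims
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return c, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return c, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return c, err
	}
	ns := gcm.NonceSize()
	if len(raw) < ns {
		return c, errors.New("token too short")
	}
	plain, err := gcm.Open(nil, raw[:ns], raw[ns:], nil)
	if err != nil {
		return c, err // tampered or wrong key
	}
	return c, json.Unmarshal(plain, &c)
}

func main() {
	key := make([]byte, 32) // in production: from a key-management service
	token, _ := seal(key, Claims{UserID: 42, Scope: "api", Extra: "ab-test=B"})
	got, err := open(key, token)
	fmt.Println(got.UserID, got.Scope, err)
}
```

The design point the sketch illustrates: because GCM authenticates as well as encrypts, the extra product data cannot be read or modified by the client, which is what makes embedding it in the token safe.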
Could you walk us through the process of transitioning core systems from monolith to microservices at VK? What were the main benefits and challenges of this transformation?
The main difficulties were migrating user data and keeping the system operational during the move to microservices.
User data was accessed in many parts of the social network, so simply moving part of the code and data is not enough: you have to identify every place that data is accessed in order to make the switch transparent. Another big difficulty was the caches, which were used everywhere in the social network; we had to understand their non-trivial logic and account for them when switching.
The main benefits are faster development, the ability to deliver changes independently of the monolith, and simpler development and onboarding of new developers. The whole system also becomes more reliable because the coupling between the social network's components is reduced.
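One common way to make such a switch transparent to callers is a fallback read path during cutover: try the new service first, fall back to the monolith for data that has not moved yet. This is a generic pattern sketched under assumed names, not VK's actual implementation.

```go
package main

import "fmt"

// UserStore abstracts where profile data lives during the migration.
type UserStore interface {
	Get(id int64) (string, bool)
}

// mapStore is a toy in-memory store standing in for a real backend.
type mapStore map[int64]string

func (m mapStore) Get(id int64) (string, bool) {
	v, ok := m[id]
	return v, ok
}

// migratingStore reads from the new microservice first and falls back
// to the monolith, so every caller sees one consistent view while
// users are moved over in batches.
type migratingStore struct {
	newSvc, monolith UserStore
}

func (s migratingStore) Get(id int64) (string, bool) {
	if v, ok := s.newSvc.Get(id); ok {
		return v, true
	}
	return s.monolith.Get(id)
}

func main() {
	monolith := mapStore{1: "alice", 2: "bob"}
	migrated := mapStore{1: "alice"} // user 1 already moved
	store := migratingStore{newSvc: migrated, monolith: monolith}
	v, _ := store.Get(2) // not migrated yet: served by the monolith
	fmt.Println(v)
}
```

The same idea applies to caches: the wrapper is the one place where cache reads and invalidations for both backends can be reconciled, instead of hunting for every access site separately.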
As a Senior Golang Developer & SRE at T-Bank, how do you approach designing systems that are resistant to high loads?
In my design practice, I rely on proven architectural patterns: some I have encountered at work during my career, others I have picked up from open sources. In particular, it is useful to periodically follow what large companies publish about their architectures and design decisions; their engineering blogs and their employees' conference talks are very helpful here.
You’ve developed request rate limiting subsystems for high-load environments. Can you explain the importance of rate limiting and how your implementation has improved system performance?
Let me give one case from practice.
The company used an IAM subsystem to authorize requests between microservices.
At some point, one of the dependent systems failed and began sending a huge number of requests, taking IAM down, after which almost all dependent systems suffered.
After we introduced rate limiting, this never happened again; the worst case is now just an alert from the monitoring system 🙂
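The protection described above can be implemented with a classic token-bucket limiter. This is a minimal stdlib-only sketch of the technique, not the production subsystem; a real deployment would typically key buckets per client and use a library such as golang.org/x/time/rate.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket admits requests at a steady rate with a bounded burst:
// exactly the property that keeps a misbehaving client from taking
// down a shared service like IAM.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64 // maximum burst size
	rate     float64 // tokens refilled per second
	last     time.Time
}

func NewTokenBucket(rate, capacity float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow reports whether one request may proceed right now.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	// Refill proportionally to the time elapsed, capped at capacity.
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// 5 req/s with a burst of 5: in a tight loop of 20 calls,
	// only the initial burst gets through.
	tb := NewTokenBucket(5, 5)
	allowed := 0
	for i := 0; i < 20; i++ {
		if tb.Allow() {
			allowed++
		}
	}
	fmt.Println("allowed:", allowed)
}
```

In the incident above, a limiter like this at the IAM boundary would have rejected the flood from the broken client while continuing to serve everyone else, converting an outage into a monitoring alert.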
What emerging trends or technologies in software engineering and distributed systems are you most excited about, and why?
What excites me most are new approaches to writing code with AI.
For example, you can give a model a repository of code as input, and it can add a new feature to the service.
And that is without mentioning the other developer-assistance capabilities: bug catching, context-aware code completion, help with writing code, and so on.
Given your diverse experience across different companies and roles, what advice would you give to aspiring software engineers looking to specialize in high-load and distributed systems?
It is always important to understand the fundamental principles:
network technologies, computer architecture, operating systems, storage subsystems, distributed systems theory, information systems architecture, databases, and so on.
Technologies change constantly and new practices keep appearing; all of that can be picked up with experience, but fundamental knowledge always stays relevant.
Looking ahead, how do you envision the role of software engineers evolving in the next 5-10 years, particularly in the context of distributed systems and high-load environments?
There is already a trend of the developer's role gradually becoming less central as outdated monolithic architectures are replaced by microservices. Microservices are easy to develop; even a junior developer can handle them. The emphasis is shifting toward the engineers who manage the infrastructure for these microservices: DevOps and SREs. With the development of cloud technologies, SaaS solutions, and virtualization and containerization, infrastructure is becoming central.
It is also worth keeping in mind the rapid development of AI technologies; it is very likely that development as we understand it today will evolve into something quite different, for example, close collaboration between an engineer and AI tools.