Igor Kuzevanov is a senior software engineer at Deutsche Börse, where he develops advanced systems for pricing and settlement of options and futures. He has designed and implemented a hybrid infrastructure that integrates cloud and on-premises environments, ensuring regulatory compliance while maintaining flexibility for development. Igor is known for optimizing core platforms, streamlining developer onboarding, and improving internal workflows, helping teams work more efficiently. With deep expertise in building scalable, resilient financial technology systems, he plays a key role in driving innovation and reliability at Deutsche Börse.
Igor, many enterprises are still balancing between on-prem and cloud — what do you see as the biggest architectural trade-offs when migrating mission-critical databases to the cloud?
The biggest trade-off is predictability versus flexibility. When you run systems on your own hardware, you get full control — performance is consistent, latency is stable, and you can tune every layer of the stack. That’s essential in trading and clearing, where even milliseconds can have a financial impact. Moving to the cloud gives you agility: you can scale faster, recover from failures more easily, and offload some operational work to the provider. But you also introduce unpredictability. You share resources with other tenants, so latency can spike unexpectedly, and vendor-specific features can create lock-in over time. Migrating terabytes or petabytes of financial data to the cloud is slow and expensive, and once you’re there, moving it again is even harder. That’s why we use a hybrid model: mission-critical components, like trade matching, stay close to the exchange on dedicated hardware, while we leverage the cloud for analytics, backup, and workloads that benefit from elasticity.
How do you approach data consistency in distributed systems where strict ACID guarantees may conflict with performance and scalability requirements?
I look at data in two groups: critical data that moves money or carries legal weight, and everything else. For the first group, we don’t compromise — transactions are fully ACID, writes go through a single leader, and replication is backed by strong consensus. It’s slower, but every trade or settlement is guaranteed to either succeed completely or fail cleanly.
For less critical data, like analytics or dashboards, we aim for speed and scale. We send all changes to an event log and build fast, read-optimized views in the background. Most of these systems work fine with slightly outdated data, so we accept a bit of staleness.
We tie this together with patterns that make distributed systems safe: a transactional outbox so every state change and its event are committed atomically and the event is delivered at least once, idempotency keys so retries and redeliveries don’t create duplicates, and workflows with clear compensating “undo” steps. On top of that, we enforce business rules, reconcile systems regularly, and run failure simulations. The idea is simple: be strict and safe where accuracy matters, and design for speed and flexibility everywhere else.
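To make the outbox idea concrete, here is a minimal sketch rather than production code, using SQLite from the Python standard library as a stand-in for the operational database; the table names, the record_trade helper, and the publish callback are all illustrative.

```python
import json
import sqlite3
import uuid

# Stand-in operational database; in practice this is the system of record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (id TEXT PRIMARY KEY, symbol TEXT, qty INTEGER);
    CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
""")

def record_trade(symbol: str, qty: int) -> str:
    """Write the trade and its outbox event in one atomic transaction."""
    trade_id = str(uuid.uuid4())
    with conn:  # commits both inserts together, or neither
        conn.execute("INSERT INTO trades VALUES (?, ?, ?)", (trade_id, symbol, qty))
        conn.execute(
            "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
            (trade_id, json.dumps({"type": "trade_booked", "symbol": symbol, "qty": qty})),
        )
    return trade_id

def publish_pending(publish) -> None:
    """Relay unpublished events; the event_id doubles as an idempotency key downstream."""
    rows = conn.execute("SELECT event_id, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, payload in rows:
        publish(event_id, payload)  # consumers deduplicate on event_id
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))

record_trade("FDAX", 5)
publish_pending(lambda eid, p: print("publish", eid, p))
```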
What role do you think hybrid and multi-cloud strategies play in reducing vendor lock-in, and what technical challenges do they introduce for databases?
Hybrid and multi-cloud setups help us avoid putting all our eggs in one basket. They give us leverage over cloud providers, let us choose the best services from each, and add resilience because we’re not tied to a single vendor’s outages or pricing changes. In finance, where neutrality and uptime are critical, that flexibility is valuable.
But running across multiple environments makes databases much harder to manage. Keeping data in sync between providers adds latency, and transferring large volumes of data can be expensive. Each cloud evolves its services differently, so relying too much on a single provider’s managed features risks lock-in anyway. To cope, we stick to open-source database engines, replicate data through change-data-capture pipelines, and keep deployments as portable as possible.
So multi-cloud solves one kind of dependency problem but introduces complexity. It works best when you design for portability from day one, even if that means building and managing more of the infrastructure yourself.
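As a rough illustration of the change-data-capture replication mentioned above, here is a toy relay in Python: the source's change feed is simulated as a list of row-level changes and the target is an in-memory dict, whereas a real pipeline would sit on a tool such as Debezium or the database's own logical replication; the Change fields and the apply_changes helper are invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Change:
    lsn: int                 # position in the source's change log
    op: str                  # "insert", "update" or "delete"
    key: str
    row: Optional[dict]      # None for deletes

def apply_changes(changes, target: dict, last_applied_lsn: int) -> int:
    """Apply row-level changes to the target store, skipping anything already seen."""
    for change in sorted(changes, key=lambda c: c.lsn):
        if change.lsn <= last_applied_lsn:
            continue  # replays are harmless: log positions make the relay idempotent
        if change.op == "delete":
            target.pop(change.key, None)
        else:
            target[change.key] = change.row
        last_applied_lsn = change.lsn
    return last_applied_lsn

# Simulated change feed from the primary (e.g. decoded from its write-ahead log).
feed = [
    Change(1, "insert", "acct:1", {"balance": 100}),
    Change(2, "update", "acct:1", {"balance": 150}),
    Change(3, "delete", "acct:1", None),
]
replica: dict = {}
cursor = apply_changes(feed, replica, last_applied_lsn=0)
print(replica, cursor)  # {} 3
```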
In high-load systems, latency spikes are often more damaging than average response time — what patterns or tools do you recommend for monitoring and minimizing tail latency?
Don’t chase the average — watch the worst cases. We always track the slowest 1% and 0.1% of requests, not just the median. For that, we use high-resolution histograms and end-to-end tracing (every request carries an ID so we can see its full path). Load tests must mimic real traffic bursts so they don’t hide pauses.
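As a sketch of what tracking the slowest 1% and 0.1% can look like, the snippet below records latencies into fixed buckets so percentile estimates stay cheap at high request rates; the bucket boundaries and the LatencyHistogram class are illustrative, not a production histogram.

```python
import bisect

# Illustrative bucket upper bounds in milliseconds; real systems use many more buckets.
BOUNDS_MS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, float("inf")]

class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * len(BOUNDS_MS)
        self.total = 0

    def record(self, latency_ms: float) -> None:
        self.counts[bisect.bisect_left(BOUNDS_MS, latency_ms)] += 1
        self.total += 1

    def percentile(self, p: float) -> float:
        """Return the bucket bound covering the p-th percentile (e.g. p=0.99)."""
        threshold = p * self.total
        seen = 0
        for bound, count in zip(BOUNDS_MS, self.counts):
            seen += count
            if seen >= threshold:
                return bound
        return float("inf")

hist = LatencyHistogram()
for ms in [3, 4, 4, 5, 6, 7, 8, 9, 250, 900]:  # a couple of slow outliers dominate the tail
    hist.record(ms)
print("p50 <=", hist.percentile(0.50), "ms, p99 <=", hist.percentile(0.99), "ms")
```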
To keep tails short, we design for control: strict timeouts and per-request deadlines, retries with jitter (never hammer the same hot spot), admission control to keep queues from growing, and adaptive concurrency so a service only takes on as much work as it can safely finish. When pressure rises, we shed non-critical traffic first and protect the trading path.
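For illustration, here is a hedged sketch of retries with full jitter under a per-request deadline; the backoff constants and the call_with_deadline helper are made up for the example, not production values.

```python
import random
import time

def call_with_deadline(operation, deadline_s: float, base_backoff_s: float = 0.05):
    """Retry a flaky operation with exponential backoff and full jitter,
    but never beyond the caller's absolute deadline."""
    deadline = time.monotonic() + deadline_s
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("per-request deadline exceeded")
            # Full jitter: sleep a random amount up to the exponential backoff,
            # capped by the time left, so retries never hammer the same hot spot in sync.
            backoff = min(base_backoff_s * (2 ** attempt), remaining)
            time.sleep(random.uniform(0, backoff))

# Usage: a fake downstream call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_deadline(flaky, deadline_s=1.0))
```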
Placement matters too: keep data close to where it’s used, warm caches before peaks, and avoid unnecessary cross-service hops. On the runtime side, use connection pooling and pre-allocated worker threads to avoid spikes from setup costs. The mindset is simple—measure the slow parts precisely, and give the system guardrails so it degrades gracefully instead of stalling.
Speaking of high-load systems, what are the most effective architectural approaches for handling sudden traffic surges, such as those caused by market volatility or large events in centralized finance?
Plan for the worst day, not the average one. We size for “the biggest volatility day ×2” and keep pre-warmed capacity ready so we’re not waiting for instances to boot when the spike hits. Traffic first lands in durable queues so we can smooth bursts, and all consumers are idempotent—they can retry safely without double-processing.
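A toy sketch of the idempotent-consumer idea, with an in-memory queue and a set of processed keys standing in for a durable broker and deduplication store; the message fields and the consume_one helper are invented for illustration.

```python
import queue

incoming: "queue.Queue[dict]" = queue.Queue()
processed_keys: set = set()   # in production this lives in a durable store
positions: dict = {}          # the state the consumer maintains

def consume_one() -> None:
    """Process one message; redeliveries with the same key are acknowledged but skipped."""
    msg = incoming.get()
    key = msg["idempotency_key"]
    if key in processed_keys:
        return  # duplicate delivery after a retry: safe to drop
    positions[msg["account"]] = positions.get(msg["account"], 0) + msg["qty"]
    processed_keys.add(key)

# The broker redelivers the same message after a timeout; the position is not doubled.
order = {"idempotency_key": "ord-42", "account": "A1", "qty": 10}
incoming.put(order)
incoming.put(order)  # duplicate
consume_one()
consume_one()
print(positions)  # {'A1': 10}
```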
We run services in surge mode: strict backpressure and rate limits, adaptive concurrency so each service only takes work it can finish, and priority lanes that always reserve capacity for order intake and confirmations. Less critical features automatically degrade (slower dashboards, delayed enrichment) to protect the trading path.
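One way to picture priority lanes with reserved capacity is a small admission controller like the hypothetical sketch below: it shares one in-flight budget but always leaves part of it for critical traffic, so non-critical work is rejected first.

```python
class AdmissionController:
    """Admit requests against a fixed in-flight budget, reserving part of it for
    critical traffic (order intake, confirmations) so it is never starved."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int):
        self.max_in_flight = max_in_flight
        self.reserved = reserved_for_critical
        self.in_flight = 0

    def try_admit(self, critical: bool) -> bool:
        limit = self.max_in_flight if critical else self.max_in_flight - self.reserved
        if self.in_flight >= limit:
            return False  # shed: non-critical work is rejected before critical work
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1

ctrl = AdmissionController(max_in_flight=100, reserved_for_critical=20)
admitted = [ctrl.try_admit(critical=False) for _ in range(90)]
print(sum(admitted))                  # 80: the last 20 slots stay reserved
print(ctrl.try_admit(critical=True))  # True: critical traffic still gets in
```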
On the data side, we partition aggressively so hot shards can scale out fast. Writes go through an append-only log or short batching stage to protect databases from thundering herds, while reads lean on warm caches and replicas. Circuit breakers and bulkheads isolate trouble so one overloaded component doesn’t sink the rest.
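As a rough sketch of the partitioning point, here is a simple consistent-hash ring in Python: adding a shard remaps only a fraction of the keys instead of reshuffling everything; the virtual-node count and shard names are illustrative.

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: keys map to shards, and adding a shard only
    remaps the keys that land on its new ring positions."""

    def __init__(self, shards, vnodes: int = 64):
        self.ring = []  # sorted list of (hash, shard) pairs
        for shard in shards:
            self.add_shard(shard, vnodes)

    def _hash(self, value: str) -> int:
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def add_shard(self, shard: str, vnodes: int = 64) -> None:
        for i in range(vnodes):
            bisect.insort(self.ring, (self._hash(f"{shard}#{i}"), shard))

    def shard_for(self, key: str) -> str:
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.shard_for(k) for k in (f"order-{i}" for i in range(1000))}
ring.add_shard("shard-d")  # scale out the hot range
after = {k: ring.shard_for(k) for k in before}
moved = sum(1 for k in before if before[k] != after[k])
print(f"{moved / len(before):.0%} of keys moved")  # roughly a quarter, not 100%
```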
Finally, we make it muscle memory: predictive scaling tied to market signals, game-day drills with shadow traffic, and ready kill-switches/feature flags to shed load in seconds. The goal isn’t just to survive the surge—it’s to keep the core flow fast while everything else gracefully steps aside.
How do you design fault-tolerant systems that can not only recover from failures but also maintain service continuity under extreme load?
I assume things will fail at the worst possible time, so the design has to keep working while parts break. Critical services run in more than one data center at once, and each has a clear, automatic failover path. We replicate data continuously and use short, well-defined transactions, so a crash never leaves half-finished work. Every external effect has an idempotency key, so retries don’t double-book a trade. If a component slows down, backpressure kicks in early, queues stay bounded, and we shed non-essential traffic before it harms the core flow.
We also separate roles so no single hotspot can take us down: control data uses quorum replication; high-volume events go through an append-only log; read-heavy features hit replicas and caches. Services have circuit breakers and “bulkheads” to isolate faults. During extreme load, the platform enters a protected mode: priority lanes for order intake and confirmations, delayed analytics and enrichment, and read-only fallbacks where that’s safer than failing.
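To illustrate the circuit-breaker part, a minimal hand-rolled sketch with invented thresholds; in practice a hardened library would be used rather than code like this.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period so it can recover,
    instead of letting every caller pile up behind it."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 10.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # time the breaker opened, or None if closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
                self.failures = 0
            raise
        self.failures = 0
        return result

def failing():
    raise ConnectionError("down")

breaker = CircuitBreaker(failure_threshold=2, cooldown_s=30.0)
for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
try:
    breaker.call(lambda: "ok")
except RuntimeError as exc:
    print(exc)  # circuit open: failing fast
```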
Recovery is rehearsed, not improvised. We run chaos drills, simulate network splits, and measure recovery time and data loss against strict targets. Write-ahead logs and checkpoints make replays fast and deterministic, and every change (deploy, schema, certificates) is reversible. The result is a system that doesn’t just recover after an incident—it keeps serving the critical path while it heals.
In your view, what are the most underrated optimization techniques that can significantly improve throughput in large-scale systems?
The biggest wins often come from unglamorous basics. Batch work whenever you can—processing 100 items at once is dramatically cheaper than 100 single updates, for both databases and networks. Cut round trips between services by bundling requests and keeping hot data close to where it’s used; chatty calls kill throughput. Tune your data model: keep hot tables lean, avoid unnecessary indexes on write-heavy paths, and separate frequently used fields from rarely used ones so you’re not hauling extra bytes on every operation.
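A concrete, if simplistic, illustration of the batching point using SQLite from the Python standard library: the win comes from one transaction and one prepared statement instead of thousands. The exact numbers vary by machine, and the schema is invented.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, price REAL)")
rows = [("FDAX", 18000.0 + i) for i in range(10_000)]

# One statement at a time: each call pays statement and transaction overhead.
start = time.perf_counter()
for row in rows:
    with conn:
        conn.execute("INSERT INTO ticks VALUES (?, ?)", row)
per_row = time.perf_counter() - start

# Batched: a single transaction and a single statement reused for all rows.
start = time.perf_counter()
with conn:
    conn.executemany("INSERT INTO ticks VALUES (?, ?)", rows)
batched = time.perf_counter() - start

print(f"row-by-row: {per_row:.3f}s, batched: {batched:.3f}s")
```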
Reuse expensive things: connection pools, prepared statements, and pre-allocated workers remove per-request setup costs. Cache smartly, but invalidate predictably; a small, well-placed cache in front of a busy endpoint can free entire machines. Compress where it helps—for large payloads or logs, light compression can reduce bandwidth and I/O more than it costs in CPU. And don’t forget idempotent retries and short timeouts: they prevent pileups and keep the system moving under stress. None of this is flashy, but together it can double effective throughput without adding hardware.
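And a sketch of the small, well-placed cache idea, assuming a read-heavy endpoint whose answers may be a couple of seconds stale; the TTL, the TTLCache class, and the loader function are hypothetical.

```python
import time

class TTLCache:
    """Tiny read-through cache: serve recent answers from memory and only hit the
    backend when an entry is missing or older than the allowed staleness."""

    def __init__(self, loader, ttl_s: float = 2.0):
        self.loader = loader
        self.ttl_s = ttl_s
        self.entries = {}  # key -> (timestamp, value)

    def get(self, key: str):
        now = time.monotonic()
        cached = self.entries.get(key)
        if cached and now - cached[0] < self.ttl_s:
            return cached[1]          # fresh enough: no backend round trip
        value = self.loader(key)      # miss or stale: reload and remember
        self.entries[key] = (now, value)
        return value

backend_calls = {"n": 0}
def load_reference_price(symbol: str) -> float:
    backend_calls["n"] += 1
    return 18000.0  # stand-in for an expensive database or service call

cache = TTLCache(load_reference_price, ttl_s=2.0)
for _ in range(1000):
    cache.get("FDAX")
print("backend calls:", backend_calls["n"])  # 1 instead of 1000
```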
When it comes to centralized finance and market infrastructure, traditional financial institutions like Deutsche Börse Group must guarantee near-zero downtime — how does this requirement influence database and system design compared to other industries?
“Near-zero downtime” changes your defaults. You stop planning around maintenance windows and start designing so everything can happen while the system stays live: schema changes, deployments, failovers, certificate renewals, even hardware swaps. That pushes you toward blue/green rollouts with shadow traffic, instant rollback, and compatibility between old and new versions so you can upgrade without pausing trading.
Databases follow the same rule. We avoid long locks and risky migrations; we break changes into small, reversible steps and keep transactions short. Writes are protected by strong replication, and we run in multiple sites at once so a single failure doesn’t take us down. Read traffic can move to replicas if the primary is busy, and caches buy us time when a shard is hot. Every external effect uses idempotency keys so retries won’t duplicate a trade.
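To make the small, reversible steps concrete, here is a hypothetical expand/contract migration sketched in Python with SQLite; the table and column names are invented, and a real system would backfill in small batches and only drop the old column once no reader depends on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, counterparty TEXT)")
conn.execute("INSERT INTO trades (counterparty) VALUES ('Bank A')")

# Step 1 (expand): add the new nullable column; old writers keep working untouched.
with conn:
    conn.execute("ALTER TABLE trades ADD COLUMN counterparty_id TEXT")

# Step 2 (backfill): copy data over (a real system does this in small batches
# to keep locks short and the transaction brief).
with conn:
    conn.execute(
        "UPDATE trades SET counterparty_id = counterparty WHERE counterparty_id IS NULL"
    )

# Step 3 (contract) happens only after every reader uses the new column; until then
# the undo path is trivial: stop writing the new column and drop it again.
def rollback():
    with conn:
        conn.execute("ALTER TABLE trades DROP COLUMN counterparty_id")

print(conn.execute("SELECT id, counterparty_id FROM trades").fetchall())
```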
Operations are treated like product features. Health checks are deep, alerts are tied to clear service-level objectives, and runbooks are automated. We rehearse outages with game days, measure real recovery times, and only ship changes we can undo in seconds. Other industries can live with planned downtime; in market infrastructure, the market clock is the clock—so the system must be built to change while running.
Regulatory compliance (e.g., MiFID II, GDPR) adds complexity to system design — how do you ensure compliance without sacrificing performance in trading or clearing systems?
We bake compliance into the architecture so it doesn’t sit on the hot trading path. First, we minimize and isolate sensitive data. Personal data is tokenized or encrypted and kept in its own stores, with strict access controls and data-residency rules so EU data stays in the EU. Trading flows only see the tokens they need to operate, not the raw identities.
Second, we separate audit from execution. Orders and trades complete fast; detailed, immutable audit records are written asynchronously to an append-only log that satisfies retention and tamper-evidence requirements. Surveillance, reporting, and reconciliation read from that log without touching the order path, so compliance work never slows a match or confirmation.
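A rough sketch of keeping audit writes off the hot path: a background thread drains records to an append-only file, which stands in for the real tamper-evident store; the execute_order helper and the record fields are illustrative.

```python
import json
import queue
import threading
import time

audit_queue: "queue.Queue" = queue.Queue()

def audit_writer(path: str) -> None:
    """Drain audit records to an append-only file; runs off the trading path."""
    with open(path, "a", encoding="utf-8") as log:
        while (record := audit_queue.get()) is not None:
            log.write(json.dumps(record) + "\n")
            log.flush()

writer = threading.Thread(target=audit_writer, args=("audit.log",), daemon=True)
writer.start()

def execute_order(order_id: str, qty: int) -> None:
    """The order completes immediately; the audit record is enqueued asynchronously."""
    # ... matching / confirmation happens here ...
    audit_queue.put({"ts": time.time(), "order_id": order_id, "qty": qty})

execute_order("ord-1", 100)
audit_queue.put(None)  # shut the writer down cleanly in this toy example
writer.join()
```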
Third, we make rules automatic and testable. Policies for retention, deletion (GDPR “right to erasure”), access, and segregation of duties are codified and enforced by pipelines—every build and deploy is checked before it goes live. Data lineage is tracked end-to-end so we can answer “who touched what, where, and when” on demand.
Finally, security is tuned for low latency: encryption everywhere with hardware support, short-lived certificates rotated online, and authorization decisions cached safely at the edge to avoid round-trip delays. We routinely rehearse audits and incident scenarios to prove we can meet MiFID II record-keeping and GDPR obligations without adding milliseconds to the trade path.
Looking at future trends, what technologies or paradigms (e.g., serverless databases, cloud-native transaction systems, new consensus protocols) do you believe will have the biggest impact on large-scale financial platforms in the next 3–5 years?
I think the next few years will bring a real shift in how we design financial systems. Cloud-native transactional databases are maturing quickly, offering regional writes with near-global reads, which will simplify some of the hard problems we currently solve with custom replication and sharding. We’ll also see more adoption of event-driven architectures where an append-only log becomes the system of record, and databases are just projections of that log. This makes auditing, replaying, and building new services far easier.
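A minimal sketch of the idea that databases are just projections of the log: the append-only event list is authoritative, and any read model can be rebuilt by replaying it; the event names and fields are invented for illustration.

```python
from collections import defaultdict

# The append-only log is the system of record; everything below is derived from it.
event_log = [
    {"seq": 1, "type": "OrderAccepted", "order_id": "o1", "qty": 100},
    {"seq": 2, "type": "OrderFilled",   "order_id": "o1", "qty": 60},
    {"seq": 3, "type": "OrderFilled",   "order_id": "o1", "qty": 40},
]

def project_open_quantity(events) -> dict:
    """One possible read model: remaining open quantity per order.
    Rebuilding it is just replaying the log from the start."""
    open_qty = defaultdict(int)
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["type"] == "OrderAccepted":
            open_qty[event["order_id"]] += event["qty"]
        elif event["type"] == "OrderFilled":
            open_qty[event["order_id"]] -= event["qty"]
    return dict(open_qty)

print(project_open_quantity(event_log))  # {'o1': 0}
```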
Hardware will also play a bigger role. Specialized chips like SmartNICs and DPUs are starting to offload encryption, compression, and network processing, freeing CPUs to focus on business logic. At scale, this kind of offload could make trading systems both faster and more energy efficient.
Consensus protocols are evolving too. We’re already experimenting with next-generation versions of Raft and Paxos that reduce cross-region latency. In areas like clearing and settlement, we may even see new hybrid consensus models designed specifically for regulated finance rather than adapting general-purpose algorithms.
And while serverless databases might not replace high-frequency trading cores anytime soon, they’ll become increasingly attractive for analytics, compliance, and auxiliary services where elasticity matters more than raw speed. In short, financial platforms will be built less like giant monoliths and more like highly distributed, hardware-accelerated, log-centric systems that are easier to scale, audit, and evolve.
Sustainability has become a major focus this year — how should high-load financial infrastructures adapt their architectures to balance performance with energy efficiency?
Sustainability is now a design goal, not just a cost-saving measure. In finance, we can’t compromise on latency for trading, but we can architect the broader platform to be far more energy-aware. That starts with right-sizing capacity—instead of permanently running everything at peak scale, we scale non-critical workloads dynamically and consolidate services during quiet periods. Batch jobs like reporting or backtesting can be scheduled for times when electricity is greener, and analytics pipelines can slow slightly if it saves megawatts.
We’re also experimenting with energy-efficient hardware for workloads that don’t need microsecond-level response times. ARM-based servers, DPUs, and SmartNICs handle encryption and packet processing at lower power than CPUs. Storage tiers are optimized too: hot data stays close to compute, while colder data moves to more energy-efficient storage.
Visibility is key: we now measure energy per transaction, not just CPU or latency, and that drives architectural decisions. Compression, caching, and data pruning reduce I/O and network hops, which cuts energy use as well as cost. High-frequency trading will always prioritize speed, but everything around it—from analytics to regulatory reporting—can be designed with sustainability in mind. Over time, the “green footprint” of a trade will become a competitive metric, just like latency.
If you were building a new market infrastructure platform from scratch today, what architectural principles would you make non-negotiable?
I’d start by writing down clear reliability targets and designing everything to meet them. Every request must be traceable end-to-end, and we should know, in real time, if we’re drifting from our goals. Changes must be safe to make while the system is running: rollouts are blue/green with shadow traffic, and every change is reversible in seconds. The core that moves money is strictly consistent, uses short transactions, and replicates strongly; everything else is built on events and fast read models that can be slightly out of date.
I’d make overload safety a first-class feature. Services accept only as much work as they can finish, apply backpressure early, and degrade non-critical features automatically during spikes. Data is partitioned from day one so we can scale by adding shards, not by redesigning. Security follows zero-trust principles with strong identity, encryption everywhere, and least-privilege access. Finally, operability is a product requirement: clean APIs, predictable failure modes, runbooks that are automated, and regular disaster drills. If we hold that line, the platform can evolve quickly without risking the market’s trust.
