Modern book apps make switching between reading and listening feel obvious. You read on the train, tap a button, and your car speakers pick up exactly where you left off.
When I worked in the industry in 2015, that seamless handoff didn’t exist.
That year, the MyBook team built one of the first working systems to sync book text with professional studio narration. What started as a hackathon experiment became an industry standard that now extends into education and media.
The Gap Between Text and Audio
In 2014, ebooks and audiobooks were disconnected products. Different apps, different catalogs, different licensing deals. Publishers and users treated them as entirely separate.
MyBook had launched in 2012 as a subscription service for digital books, combining a reading app with social features where users could share what they were reading. By 2014, they’d added an audiobook section as a standalone platform within the app. The audio service was technically simpler to build: audiobooks have fewer metadata fields and cleaner file structures, while digital books involve more complex formatting and richer metadata.
Testing showed demand for audiobooks. Subscriptions grew steadily as users discovered the audio catalog. Then came unexpected feedback: let us switch between formats without losing our place.
Read during the morning commute when you can concentrate. Listen while driving when you can’t look at a screen. Switch back to reading in a cafe where you want to highlight passages. But the formats didn’t talk to each other. Starting the audiobook meant hunting through chapters to find where you’d stopped reading, often settling for “approximately the right chapter” and losing several minutes of content.
At that time, Amazon, the industry leader, had Whispersync for select Kindle titles, but it worked only within their ecosystem and covered limited books. Most platforms treated digital books and audiobooks as fundamentally different products with no expectation they should connect.
Professional narrators don’t read mechanically. They pause for dramatic effect. They vary pacing between action scenes and descriptive passages. They emphasize different words than a text-to-speech engine. A 300-page book becomes a 12-hour recording split across multiple files. Automatically mapping every sentence to its exact timestamp seemed impossible at scale.
Building the Prototype
During a MyBook hackathon where teams could experiment with risky ideas, two engineers decided to tackle text-to-audio synchronization.
They found an academic paper describing a theoretical approach to audio-text matching. The paper outlined a mathematical concept for creating an audio “fingerprint” for each text segment, then finding where it appears in a professional recording.
The system had three steps.
First, split the book text into processable segments. Run each through a text-to-speech engine to generate synthetic audio. This isn’t the audio users hear. It’s a reference version for comparison purposes only.
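In Python, that first step might look something like the sketch below. The segment size is my guess, and pyttsx3 stands in for whichever TTS engine the team actually used; it’s here only because it has a simple file-output API.

```python
# Sketch of step one: split text into segments, synthesize reference audio.
# max_chars and the pyttsx3 engine are illustrative assumptions.
import re
import pyttsx3

def split_into_segments(book_text: str, max_chars: int = 500) -> list[str]:
    """Split on sentence boundaries, grouping sentences into segments."""
    sentences = re.split(r"(?<=[.!?])\s+", book_text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            segments.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        segments.append(current.strip())
    return segments

def synthesize_reference(segments: list[str]) -> list[str]:
    """Render each segment to a WAV file used for comparison only."""
    engine = pyttsx3.init()
    paths = []
    for i, segment in enumerate(segments):
        path = f"reference_{i:05d}.wav"
        engine.save_to_file(segment, path)
        paths.append(path)
    engine.runAndWait()  # process the queued synthesis jobs
    return paths
```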
Second, build waveform graphs for each synthetic segment. These graphs plot audio energy over time with peaks and valleys representing volume and frequency patterns. Each sentence creates a unique audio signature, like a fingerprint that can be identified in the professional recording.
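A simple version of such a fingerprint is a per-frame energy envelope: a short 1-D curve of loudness over time. The frame size below is an illustrative choice, not a production value.

```python
# Sketch of step two: reduce a WAV file to a normalized RMS energy envelope,
# a coarse "shape" of the audio that can be compared against a recording.
import numpy as np
from scipy.io import wavfile

def energy_envelope(path: str, frame_ms: int = 50) -> np.ndarray:
    """Return per-frame RMS energy, peak-normalized, as a 1-D fingerprint."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:                       # mix stereo down to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)
    frame = int(rate * frame_ms / 1000)        # samples per frame
    n_frames = len(samples) // frame
    frames = samples[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    peak = rms.max()
    return rms / peak if peak > 0 else rms     # scale-invariant shape
```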
Third, compare these synthetic patterns against the actual audiobook recording. Pattern-matching algorithms search for similar shapes. When the synthetic waveform for a paragraph closely matches a section at timestamp 14:23 in the professional recording, you’ve found your sync point.
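Matching then reduces to sliding the short synthetic envelope along the recording’s envelope and keeping the offset with the highest normalized correlation. Here is a brute-force sketch of that idea; a production pipeline would use FFT-based cross-correlation rather than this O(N·M) loop, but the principle is the same.

```python
# Sketch of step three: find where the synthetic fingerprint best matches
# the recording's envelope. Both inputs come from energy_envelope() above.
import numpy as np

def best_offset(synthetic: np.ndarray, recording: np.ndarray,
                frame_ms: int = 50) -> tuple[float, float]:
    """Return (timestamp_seconds, correlation_score) of the best match."""
    n = len(synthetic)
    synth = synthetic - synthetic.mean()
    best_score, best_i = -np.inf, 0
    for i in range(len(recording) - n + 1):
        win = recording[i : i + n]
        win = win - win.mean()
        denom = np.linalg.norm(synth) * np.linalg.norm(win)
        score = float(synth @ win / denom) if denom > 0 else 0.0
        if score > best_score:
            best_score, best_i = score, i
    return best_i * frame_ms / 1000.0, best_score
```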
“We literally aligned the graph of the text with the audio file to let the reader move between formats,” one of the engineers explains. “At the time, it looked crazy; today, it’s just expected.”
Production implementation brought additional challenges. Audiobooks are typically split into chapters or parts, sometimes dozens of separate files for a long book. The system needed to work in stages: first identify which audio file contains the text segment, then pinpoint the exact timestamp within that file.
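A hedged sketch of that two-stage search, reusing best_offset from above; the downsampling factor for the coarse pass is an arbitrary illustration.

```python
# Sketch of the two-stage search: a cheap, downsampled pass picks the most
# likely audio file, then a full-resolution pass pinpoints the timestamp.
def locate_segment(synthetic_fp, audiobook_files):
    """audiobook_files: list of (filename, envelope) pairs for each part."""
    # Stage 1: score every file cheaply on 10x-downsampled fingerprints.
    coarse = synthetic_fp[::10]
    scored = []
    for name, envelope in audiobook_files:
        _, score = best_offset(coarse, envelope[::10])
        scored.append((score, name, envelope))
    _, name, envelope = max(scored, key=lambda t: t[0])

    # Stage 2: full-resolution search only within the winning file.
    timestamp, fine_score = best_offset(synthetic_fp, envelope)
    return name, timestamp, fine_score
```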
They added machine learning models to improve accuracy. The initial pattern matching worked for standard cases, but ML helped handle variations. Some narrators pause significantly longer between sentences. Some read much faster or slower than TTS engines predict. Some add vocal effects or character voices that throw off basic matching. The ML layer learned to recognize and account for these narrator-specific patterns.
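The article doesn’t describe those models, so here is a deliberately simple stand-in for one part of the idea: estimate a narrator’s overall tempo relative to the TTS reference from a few high-confidence anchor matches, then use it to narrow the search window for later segments. This is my illustration, not the team’s actual ML layer.

```python
# Hypothetical stand-in for narrator adaptation: fit a global tempo ratio
# from confidently matched anchor points and predict where the next
# segment should land in the recording.
import numpy as np

def tempo_ratio(anchors: list[tuple[float, float]]) -> float:
    """anchors: (synthetic_seconds, recording_seconds) match points."""
    synth = np.array([a[0] for a in anchors])
    real = np.array([a[1] for a in anchors])
    # Least-squares slope through the origin: how much slower or faster
    # the narrator is than the synthetic reference.
    return float((synth @ real) / (synth @ synth))

def predicted_position(synthetic_seconds: float,
                       anchors: list[tuple[float, float]]) -> float:
    """Center of the narrowed search window for the next segment."""
    return synthetic_seconds * tempo_ratio(anchors)
```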
The hackathon prototype proved the concept worked. The team spent several months rebuilding it for production, processing MyBook’s entire catalog of tens of thousands of books.
User Response and Market Impact
MyBook launched the feature and users adopted it immediately. Usage patterns showed something interesting: people don’t mentally separate content into “text” vs “audio.” They think about the story they’re consuming, and format is just a constraint of their current environment. Reading on the subway and listening in the car aren’t different activities. They’re continuing the same book in different environments.
MyBook had an advantage in coverage. Amazon’s Whispersync worked only for titles available in both Kindle and Audible formats, and only within Amazon’s closed ecosystem. MyBook processed their own entire catalog automatically. If both formats existed on the platform, syncing worked without special setup.
The business model shifted. Users who subscribed for ebooks suddenly had access to audio at no additional cost. A book purchased in one format became effectively available in both. Competing services that didn’t offer format flexibility started feeling limited by comparison.
Within two years, other platforms adopted similar technical approaches. The feature went from experimental to expected. Book apps that couldn’t switch formats felt outdated.
Beyond Books
The same technical approach spread to other content types, solving a consistent problem: people consume information in different contexts with different constraints.
EdTech platforms adopted it for lecture materials. Students take notes during class, then listen to recordings while commuting. Automatic syncing between written notes and audio timeline means tapping a note jumps to that exact moment in the recording.
Podcast apps added transcripts with timestamp syncing. Read an interview at your desk when you can’t wear headphones, then switch to audio at the gym. The app starts playing where you stopped reading.
News organizations tried hybrid formats. Video platforms added it for educational content. The pattern repeats: sometimes people need visual focus for complex information; other times their hands are busy, they’re in noisy environments where audio doesn’t work, or they want to skim quickly, which only text supports.
Before format syncing, switching contexts meant struggling with the wrong format or abandoning the content. Commuters who started reading at home often didn’t finish books because continuing in audio during the drive involved too much friction.
From Innovation to Infrastructure
Ten years later, format switching is standard in book apps. Behind the simple button sits complex technical work: pattern matching algorithms, multi-stage file search, ML models adjusting for narrator variations.
The evolution followed a familiar path. An academic paper described what might be possible. MyBook turned that possibility into working software. Other platforms saw it work and built their own versions. Each iteration refined the approach until format flexibility stopped being a feature and became basic infrastructure.
Users didn’t need convincing that switching between reading and listening was valuable. They already wanted it. What they needed was for someone to solve the hard technical problems that made it possible.
Fedor Nasyrov
Development Team Lead at Exness
Fedor has over 10 years of experience in software development and product engineering. Prior to Exness, he led development teams at MyBook, where he worked on subscription products, international expansion across EU markets (including GDPR compliance and local payments), and contributed to building and scaling a large digital reading platform. His technical background includes Python, Django, PostgreSQL, and JavaScript.