Computer vision is moving from lab demos into everyday operations, but the hardest part is not getting a model to work once. It is getting it to behave predictably in messy, real environments where cameras, customers, store staff, and privacy expectations all collide. That tension is only getting sharper as vision and language systems merge into a single workflow. Gartner predicts that 40% of generative AI solutions will be multimodal by 2027, which means more products will need to reason across text, images, audio, and video without slowing down or overreaching.
Sanyam Mehra is a machine learning engineer at Instagram, an IEEE Senior Member, and an IEEE journal author. Before that, he was one of the first AI technical hires at Focal Systems, where he built and scaled an end to end deep learning stack for automated retail checkout and real time inventory monitoring, presented the technology at NRF 2019, and deployed more than 70,000 edge devices across 750+ retail stores spanning four countries.
Sanyam, thanks for joining us. In simple terms, what does “privacy first computer vision” actually mean in retail, and why does it get harder at scale?
Privacy first means you treat visual data as sensitive by default, not as something you clean up after you have already built the pipeline. In a store, the model is not operating on curated frames. It is operating on real-world data that captures shoppers, real lighting, and real operational constraints. If your system requires broad access to raw video to function, you will end up in constant tradeoffs that slow deployment and weaken trust.
Scale makes it harder because the edge is not one environment. There are hundreds of environments. You are dealing with different store layouts, camera placements, and day to day variance. If you want reliable outcomes, you need a design that keeps the heavy lifting close to where the data is produced, and you need a clean story for what leaves the store, why it leaves, and how it is protected.
You joined Focal Systems as one of the first AI technical hires. What did you have to get right early so the system could actually ship, not just demo?
I had to think in systems, not models. The model mattered, but the surrounding stack mattered more. We built a hybrid cloud and on premises backend that could support scalable training while also handling high throughput video stream processing and reliable inference at the edge. If you cannot move data through a stable pipeline, you end up spending your time chasing broken assumptions instead of improving accuracy.
The other early decision was to design for operations from day one. Retail partners do not care that a model is clever if it is brittle in the field. We focused on making deployment and management robust across partners, because reliability is what turns a pilot into a platform.
Running ML on tens of thousands of edge devices sounds like a reliability problem as much as an ML problem. What made the deployment manageable?
You have to assume the world will drift. Hardware will behave differently, stores will change, and the environment will surprise you. So you need disciplined rollout patterns, clear versioning, and a way to understand what changed when behavior shifts. In practice, that means treating deployment like product infrastructure, not a side task for the ML team.
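To make the versioning point concrete, here is a minimal sketch of a staged, versioned rollout manifest. The field names and stage names are hypothetical illustrations, not Focal Systems' actual deployment format; the idea is simply that every device can report exactly which model, config, and rollout stage it is running, so a behavior shift can be traced to a specific change.

```python
# Hypothetical illustration of a versioned edge rollout manifest.
# Field names and stages are assumptions, not Focal Systems' actual format.
from dataclasses import dataclass

@dataclass(frozen=True)
class Release:
    model_version: str       # exact model artifact the device should run
    config_version: str      # runtime/config bundle paired with that model
    rollout_stage: str       # "canary" -> "pilot_stores" -> "fleet"
    stores: tuple[str, ...]  # which store cohort this stage applies to ("*" = all)

# Staged rollout: each stage is observed before the next one is promoted.
ROLLOUT = (
    Release("detector-2024.06.1", "cfg-14", "canary", ("store-0042",)),
    Release("detector-2024.06.1", "cfg-14", "pilot_stores", ("store-0042", "store-0107", "store-0311")),
    Release("detector-2024.06.1", "cfg-14", "fleet", ("*",)),
)

def release_for(store_id: str, current_stage: str) -> Release:
    """Resolve which release a given store should run at the current rollout stage."""
    for release in ROLLOUT:
        if release.rollout_stage == current_stage and (store_id in release.stores or "*" in release.stores):
            return release
    raise LookupError(f"no release defined for {store_id} at stage {current_stage}")
```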
We also optimized workflows so the system could be managed across very different partners without needing a bespoke process each time. The goal is to reduce the operational surface area. When you do that well, the team can spend its energy improving outcomes instead of constantly re-learning the same lessons at every site.
You built a privacy first computer vision pipeline with over 99% compliant processing for in store images and video. How did you approach PII so privacy stayed a design constraint, not a later patch?
We built the pipeline so PII handling was part of the core flow. The practical approach was to use instance segmentation and system level controls to detect and redact sensitive elements reliably. The point is not just to have a redaction model. The point is to make sure the entire data path respects that boundary consistently.
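To illustrate the kind of step Mehra describes, here is a minimal redaction sketch, assuming a pretrained instance segmentation model (the torchvision Mask R-CNN used here is illustrative, not Focal Systems' actual stack): detect people in a frame, then blur every pixel inside their masks before the frame moves anywhere downstream.

```python
# Minimal sketch of segmentation-based redaction, not Focal Systems' actual pipeline.
# Assumes a pretrained Mask R-CNN from torchvision; any instance segmentation model
# that returns per-person masks would work the same way.
import cv2
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

PERSON_CLASS_ID = 1    # COCO label index for "person"
SCORE_THRESHOLD = 0.5  # ignore low-confidence detections

model = maskrcnn_resnet50_fpn(pretrained=True).eval()

def redact_people(frame_bgr: np.ndarray) -> np.ndarray:
    """Blur every pixel that belongs to a detected person before the frame leaves the device."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        output = model([tensor])[0]

    redacted = frame_bgr.copy()
    blurred = cv2.GaussianBlur(frame_bgr, (51, 51), 0)
    for label, score, mask in zip(output["labels"], output["scores"], output["masks"]):
        if label.item() == PERSON_CLASS_ID and score.item() >= SCORE_THRESHOLD:
            person = (mask[0] > 0.5).numpy()    # soft mask -> boolean per-pixel mask
            redacted[person] = blurred[person]  # replace person pixels with blurred pixels
    return redacted
```

The same pattern extends to other sensitive elements; the system-level point from the answer above is that the rest of the data path has to treat the redacted frame, not the raw one, as the thing that is allowed to move on.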
This matters because the market is pushing more vision into more settings. Recent research projects that the market for AI in computer vision will reach $63.48 billion by 2030, which means the teams that win will be the ones that can scale adoption without turning privacy into a constant exception process.
How did you keep training data useful without turning stores into data collection projects?
You have to be intentional about what you collect and why. In retail, raw video can quickly become a liability if you do not have a clear purpose and a disciplined retention story. So you prioritize what supports training and evaluation, and you design the system to keep exposure minimal while still allowing the models to improve.
Just as importantly, you build feedback loops that respect store operations. If your training pipeline depends on breaking normal workflows, it will not survive contact with real deployments. The best systems are the ones that improve quietly in the background while the store keeps running.
Focal’s work was showcased at NRF 2019 and subsequently led to a strategic investment and partnership with Zebra Ventures. What separates a demo that looks impressive from a platform a retailer will actually adopt?
A demo proves the possibility. Adoption proves reliability. Retail is unforgiving because the system has to work during peak hours, under imperfect conditions, and with minimal tolerance for disruption. The difference usually comes down to whether you engineered the operational details, not whether the model is flashy.
It also comes down to integration reality. Retail partners need something that fits into how they run stores, not something that forces a reinvention of their daily routines. When you build with that respect for operations, partnerships become possible because the value is clear and the risk feels bounded.
Before retail, you built a generative AI simulation workbench at Schlumberger that supported over 8,000 trainees annually, reduced training operational expenditure by 16%, or $4M per year, and cut trainee safety incidents by 34% year over year. You’ve also published scholarly work on “Corporate Strategy for Secure Semiconductor Supply Chains: ML-Driven Risk and Market Intelligence.” What did those experiences teach you about building ML for high-stakes environments?
They taught me that outcomes matter more than sophistication. In training, you can measure whether people learn faster, whether incidents go down, and whether the program scales without compromising safety. The simulation work was valuable because it created repeatable practice without the cost and risk of physical training in dangerous environments.
That experience reinforced a simple lesson: if the system touches real behavior, the bar is higher. You cannot hide behind average metrics. You have to design for the edge cases and prove the system holds up when conditions are imperfect.
Last question. Now that you are at Instagram, what principles from all of this carry over, especially as more AI systems become visual and context aware?
The principle is that trust is built into the architecture. It is not a policy statement you add later. If you treat privacy, safety, and reliability as first order requirements, you can move faster because you are not constantly renegotiating the fundamentals.
That is getting more important as regulation and expectations expand. A World Bank analysis cites 167 countries with data protection legislation, which is a reminder that “we will figure it out later” stops working the moment your system leaves a single market.
I also try to keep a systems mindset even when the work gets specialized. I tend to look for the same things in AI systems that I look for in production engineering: constraint honesty, operational evidence, and predictable behavior when the environment stops being friendly. That lens also shapes how I engage with the field more broadly, including my role as a paper reviewer for SARC Journals, where rigor and real-world applicability matter more than theoretical elegance.
If there is a common thread across retail, training, and social platforms, it is that the system has to earn trust the same way it earns adoption. It has to work under real conditions, it has to respect privacy without constant exceptions, and it has to be operable by the teams who live with it every day. When that foundation is in place, the models get better faster, the rollout gets calmer, and “privacy first” stops being a slogan and becomes a normal way of shipping.