Piyush Tiwari: “The Compliance Isn’t a Layer on Top. It’s in the Reasoning Architecture.”

A senior engineering manager on what breaks when AI systems move from testing to production, and why the data platform underneath is usually the real bottleneck.

Published: April 2026  |  TechBullion

As companies move AI systems from sandbox to production, one failure mode keeps recurring: models that work in test settings fail once they run against real infrastructure. The limiting factor is rarely the model. More often it is the underlying data infrastructure: the ingestion layer, the data streaming platform, or the data governance system, if one exists at all.

Piyush Tiwari has spent much of his career in that layer. He is a Senior Software Engineering Manager at Wayfair, where he has spent almost seven years working across the company’s data and networking stack. Earlier, at Accenture and Atos Syntel, he worked on building SOX-compliant data systems for financial services companies, experience that gave him a governance-oriented view he brings to both enterprise AI infrastructure and a proptech startup he founded. We asked him about the problems that arise when AI goes into production, and what he’s doing to solve them.

Q: You’ve worked in four infrastructure domains at Wayfair. What does that mean?

The domains are different parts of the production stack. First, there is Scribe, Wayfair’s first-party event tracking system. It handles more than 300,000 events per second, more than 20 terabytes per day. Wayfair built it in-house to control data granularity, privacy and segmentation.

The second was the distributed real-time streaming layer – multiple data centres, 99.9% SLO. If this layer is slow or loses data, the downstream models make decisions on incomplete data. You can’t compensate for that with better models.

And then the ML platform – search, recommendations, operational intelligence. And now edge and cloud networking – CDN, load balancing, cloud. The front door to everything else.

Q: When you talk about AI failing because of the infrastructure, can you give an example of that?

Here’s one. A recommendation model trains on the latest behavioral data – views, add-to-cart actions, session abandonment. If the streaming layer is experiencing delays, the model retrains on hours-old signals instead of minutes-old signals. The model hasn’t changed. The training pipeline hasn’t changed. But the recommendations lose quality because the data isn’t refreshed quickly enough.
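The staleness problem described above can be made concrete with a small freshness gate that runs before a retrain. This is only a sketch, not Wayfair’s implementation: the function names and the 15-minute threshold are assumptions chosen for illustration, since the interview only contrasts "minutes-old" with "hours-old" signals.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the interview does not state an actual limit.
MAX_SIGNAL_AGE = timedelta(minutes=15)


def signal_age(latest_event_time: datetime, now: datetime) -> timedelta:
    """Return how old the newest behavioral event is at retrain time."""
    return now - latest_event_time


def is_fresh_enough(
    latest_event_time: datetime,
    now: datetime,
    max_age: timedelta = MAX_SIGNAL_AGE,
) -> bool:
    """True when the newest event is within the allowed age."""
    return signal_age(latest_event_time, now) <= max_age


# Illustrative timestamps, not real data.
now = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
on_time = now - timedelta(minutes=5)   # streaming layer is healthy
delayed = now - timedelta(hours=3)     # streaming layer is lagging

print(is_fresh_enough(on_time, now))
print(is_fresh_enough(delayed, now))
```

A gate like this does not fix the delay, but it turns a silent drop in recommendation quality into an explicit signal that the streaming layer is behind.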

Or consider event tracking. If Scribe loses events during a traffic spike (Black Friday, a flash sale), the downstream AI systems are looking at a fuzzy picture of what a user is doing. They may not weight a hot product category correctly, or they may miss a pattern in conversions. The model is fine. The data is wrong.

Q: You helped create Data Mesh at Wayfair. What did it solve for?

A centralized data model was the problem. One team was responsible for preparing the data for each AI model, for each analytics job, for each report. Requests queued. The data preparers didn’t necessarily have domain knowledge.

Data Mesh distributes ownership. Data owners are responsible for the data as a product – with standards, SLAs and governance. The platform teams offer infrastructure and constraints. The model was discussed at the Shift Left Data Conference, including how it can be done at scale.

Q: You also established a Data Contracts framework. How does that play into Data Mesh?

Data Mesh gives you distributed ownership. But you need enforcement. Data Contracts provide explicit agreements between producers and consumers: which fields exist, what their formats are, and how fresh the data is.

Without them, a typical incident looks like this: your AI model is producing strange results, you chase the problem through three pipelines and find that an upstream team modified a field format three weeks ago. Data Contracts would catch that break in the pipeline. The framework served as a foundation for analytics and AI at Wayfair.
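To illustrate the kind of check a data contract enables, here is a minimal sketch. It is not the Wayfair framework: the `FieldSpec` class, the `CONTRACT` fields and the `violations` function are all hypothetical, and only field presence and type are checked (a fuller contract would also cover freshness).

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    """One field the producer promises to deliver (hypothetical shape)."""
    name: str
    type_: type


# Hypothetical contract for a behavioral event; field names are illustrative.
CONTRACT = [
    FieldSpec("user_id", str),
    FieldSpec("event_type", str),
    FieldSpec("price_cents", int),
]


def violations(record: dict[str, Any], contract: list[FieldSpec]) -> list[str]:
    """Return a description of every way the record breaks the contract."""
    problems = []
    for spec in contract:
        if spec.name not in record:
            problems.append(f"missing field: {spec.name}")
        elif not isinstance(record[spec.name], spec.type_):
            actual = type(record[spec.name]).__name__
            problems.append(
                f"{spec.name}: expected {spec.type_.__name__}, got {actual}"
            )
    return problems


good = {"user_id": "u1", "event_type": "add_to_cart", "price_cents": 1999}
# An upstream team silently switches the price to a decimal string.
changed = {"user_id": "u1", "event_type": "add_to_cart", "price_cents": "19.99"}

print(violations(good, CONTRACT))
print(violations(changed, CONTRACT))
```

Run at the producer boundary, a check like this would surface the format change on the day it ships rather than weeks later in a model’s output.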

Q: You also initiated a cloud migration and cost optimization effort. How did visibility change?

The migration moved critical workloads from on-premise to Google Cloud Platform; not lift-and-shift, but a full modernisation. When you move workloads to the cloud with observability, you can see wasted resources. Compute provisioned year-round for peak demand that arrives twice a year. Storage not accessed for months. Duplicated ML training runs. The optimisation saved millions of dollars each year.

Q: Let’s shift to ResidenceHive. How does the production infrastructure thinking translate to a proptech startup?

Same principles, different scale. ResidenceHive is a first-response layer for real estate brokerages. A lead comes in through WhatsApp and the AI agent responds within seconds—qualifies the buyer, extracts intent, matches against MLS listings, generates a brief.

The infrastructure problem is that this happens in a regulated environment. U.S. fair housing laws prohibit discrimination based on race, religion, family status, disability. So the system has architectural constraints that prevent demographic steering—not a content filter, but a decision logic structure where those pathways don’t exist. Every interaction generates an auditable log. Mandated disclosures are embedded in the conversation flow. The compliance isn’t a layer on top. It’s in the reasoning architecture.
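One way to picture "pathways that don’t exist" is a matching function whose input type has no demographic fields at all, paired with a log entry for every decision. The sketch below is an illustration under stated assumptions, not ResidenceHive’s code: `BuyerCriteria`, `Listing` and `match_listings` are hypothetical names, and a real system would need far more than this to satisfy fair housing requirements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuyerCriteria:
    """Only property-related fields exist; no demographic attribute can be passed."""
    max_price: int
    min_bedrooms: int
    city: str


@dataclass(frozen=True)
class Listing:
    listing_id: str
    price: int
    bedrooms: int
    city: str


def match_listings(
    criteria: BuyerCriteria, listings: list[Listing], audit_log: list[str]
) -> list[Listing]:
    """Filter listings on property criteria and record the decision."""
    matched = [
        listing
        for listing in listings
        if listing.price <= criteria.max_price
        and listing.bedrooms >= criteria.min_bedrooms
        and listing.city == criteria.city
    ]
    # Each decision is logged with its inputs and outputs so it can be audited.
    audit_log.append(
        json.dumps(
            {
                "criteria": asdict(criteria),
                "matched_ids": [listing.listing_id for listing in matched],
            },
            sort_keys=True,
        )
    )
    return matched


# Illustrative data only.
listings = [
    Listing("A1", 450_000, 3, "Boston"),
    Listing("B2", 700_000, 4, "Boston"),
    Listing("C3", 400_000, 2, "Worcester"),
]
audit_log: list[str] = []
criteria = BuyerCriteria(max_price=500_000, min_bedrooms=3, city="Boston")
result = match_listings(criteria, listings, audit_log)
print([listing.listing_id for listing in result])
```

Because `BuyerCriteria` has no field for a protected attribute, passing one is a type error at construction time rather than something a downstream filter has to catch.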

Q: How is that different from what other real estate AI tools are doing?

Most tools optimize purely for speed. The compliance question is either handled after the fact or not at all. The NAR says over 100 million leads flow through U.S. CRMs annually. The average agent takes over 15 hours to respond. So there’s massive pressure to automate. But the tools doing that automation mostly have zero architectural awareness of fair housing requirements. They’re fast. They’re also potentially liable.

The platform is in early investor discussions and pursuing proptech platform integration.

Q: You’ve published research that formalizes this approach. What’s the core framework?

The research—published in the International Journal of Humanities and Information Technology—argues that compliance in regulated industries can’t be applied after an AI system is built. It has to be embedded in how the system reasons and records its decisions. The framework formalizes three principles: auditable reasoning, regulatory guardrails, and decision traceability.

Those principles grew out of earlier work on SOX-compliant systems in financial services. The FTC’s March 2026 policy statement now requires decision logging and audit trails when AI interacts with consumers. Colorado’s AI Act takes effect in June with algorithmic discrimination prevention requirements. The direction is clear—regulators are moving toward exactly what this framework describes.

Q: You also review talks and case studies for TMLS. What patterns do you see in how organizations approach AI deployment?

TMLS has over 16,000 members—one of the largest AI communities in Canada. The pattern that comes up most often in the talks and case studies is strong model work with no production plan. Impressive training results, no governance architecture, no thought about real-time data quality or regulatory constraints. A team presents a model that performs beautifully on a benchmark, and when the question is how it handles stale data in production, there’s no answer. That gap is where most AI projects die.

Q: What’s the most common infrastructure mistake you see companies make when deploying AI?

Treating the data platform as static. They plan for compute scaling—more GPUs, bigger clusters—but assume the data infrastructure will just keep up. It won’t. When you go from thousands of requests in testing to millions in production, every layer gets stressed differently. Streaming hits throughput limits. Storage can’t handle concurrent reads. The governance layer—if it exists—becomes the bottleneck because nobody designed it for real-time operation.

The companies that get this right build the data platform and governance layer before the model. The ones that get it wrong discover the infrastructure problems in production. By then the damage—bad recommendations, compliance failures, unreliable outputs—has already happened.
