Why the AI Systems Defending U.S. Networks Are Failing in the Field

By Miller V

Posted on February 4, 2025

A UNLV researcher argues that the security industry is measuring the wrong things, and that connected critical infrastructure is paying the price.

Every hospital, power grid, and municipal water system in America now runs on a network of connected devices. And according to cybersecurity researcher Oluwapelumi Bankole, almost none of them are adequately protected, not because better tools don’t exist, but because the tools that do exist are being evaluated the wrong way.

Bankole, a graduate researcher in information systems and cybersecurity at the University of Nevada, Las Vegas, has spent the past several years studying whether AI-based intrusion detection systems hold up when they move from the laboratory into the real world. His conclusion is uncomfortable: most of them don’t.

“The numbers you see in research papers look impressive,” he told me during a recent interview. “Ninety-eight percent accuracy, ninety-nine percent accuracy. But those numbers are coming from datasets that have nothing to do with what a real IoT network looks like.”

Bankole’s research focuses specifically on intrusion detection systems for IoT and cloud environments, with a particular emphasis on the gap between how these systems are built and how they actually need to perform once deployed. I sat down with him to understand why that gap exists, and what closing it would require.

On why lab results don’t translate to real deployments

Q: When you look at the performance claims from vendors or research papers, what’s the first thing you notice that concerns you?

Bankole: The metric. Almost everyone optimizes for overall accuracy, and overall accuracy is a terrible measure for intrusion detection. On a real network, normal traffic is the overwhelming majority of what you see. If a system classifies everything as normal, it scores extremely high on accuracy and catches exactly zero attacks. That’s not a security system. That’s a false sense of security.
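
The arithmetic behind that point is easy to reproduce. The sketch below is illustrative only: the traffic volume and the one-in-a-thousand attack share are assumed numbers rather than figures from Bankole's research, and `always_normal` is a hypothetical stand-in for a detector that never raises an alert.

```python
# Illustrative numbers only: the 0.1% attack share is an assumption, not measured data.
TOTAL_FLOWS = 100_000
ATTACK_FLOWS = 100  # 0.1% of all traffic

# Ground truth labels: 1 = attack, 0 = normal.
labels = [1] * ATTACK_FLOWS + [0] * (TOTAL_FLOWS - ATTACK_FLOWS)


def always_normal(_flow):
    """Hypothetical 'detector' that classifies every flow as normal."""
    return 0


predictions = [always_normal(flow) for flow in labels]

# Accuracy counts every correct call; recall counts only the attacks that were caught.
correct = sum(p == y for p, y in zip(predictions, labels))
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

accuracy = correct / TOTAL_FLOWS
recall = caught / ATTACK_FLOWS

print(f"accuracy: {accuracy:.1%}")     # 99.9%
print(f"attack recall: {recall:.1%}")  # 0.0%
```

Under these assumed numbers the do-nothing detector is right 99.9 percent of the time and catches none of the 100 attacks.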

Q: Why does this problem persist if it’s so well understood in the research community?

Bankole: Because fixing it is inconvenient. The benchmark datasets everyone uses to train and test these models are heavily preprocessed. They have relatively balanced distributions of normal traffic and attack traffic. That makes for clean experiments and publishable numbers. But it means the model is being trained on a fictional version of the network it will eventually have to protect. When you actually deploy it on a hospital network or a utility’s industrial control system, the traffic looks completely different.

Q: Can you describe specifically what that difference looks like?

Bankole: On a real IoT network, attacks might represent a tenth of a percent of total traffic, sometimes much less. The model has never learned what genuine rarity looks like. It’s like training a fraud detection system on a dataset where half the transactions are fraudulent, and then deploying it on real consumer data where fraud is one in ten thousand. The model isn’t prepared for how normal normal actually is.
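
One way to see how much rarity changes the picture is to hold a detector's behavior fixed and change only the share of traffic that is malicious. The sketch below makes that simplifying assumption, which is narrower than Bankole's point about training. The 99 percent detection rate and 1 percent false alarm rate are invented for illustration, and `alert_precision` is a hypothetical helper that applies Bayes' rule.

```python
def alert_precision(detection_rate, false_alarm_rate, attack_share):
    """Fraction of alerts that correspond to real attacks (Bayes' rule)."""
    true_alerts = detection_rate * attack_share
    false_alerts = false_alarm_rate * (1.0 - attack_share)
    return true_alerts / (true_alerts + false_alerts)


# Assumed detector quality, chosen for illustration only.
DETECTION_RATE = 0.99    # share of attacks that trigger an alert
FALSE_ALARM_RATE = 0.01  # share of normal flows that trigger an alert

# Same detector, two different traffic mixes.
balanced = alert_precision(DETECTION_RATE, FALSE_ALARM_RATE, 0.5)
deployed = alert_precision(DETECTION_RATE, FALSE_ALARM_RATE, 1 / 10_000)

print(f"balanced benchmark: {balanced:.1%} of alerts are real")
print(f"one attack in ten thousand: {deployed:.1%} of alerts are real")
```

With these assumed rates, about 99 percent of alerts are genuine on the balanced benchmark. At one attack in ten thousand, the same detector produces alerts that are genuine only about 1 percent of the time.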

On the multi-dimensional evaluation problem

Q: You’ve argued that the industry needs a fundamentally different evaluation framework. What does that mean in practice?

Bankole: Right now, most evaluations pick one metric, maybe two, and optimize everything around it. But a real-world deployment has at least four competing demands that all matter simultaneously. You need high detection accuracy. You need low false positive rates, because alert fatigue is a real operational problem. You need the system to run efficiently on constrained hardware, because IoT devices are not powerful machines. And you need the system to adapt when the attack landscape changes, which it constantly does. No single number captures all of that. A vendor can have a phenomenal F1 score on a benchmark dataset and still be completely unusable in production.

Q: What would a better evaluation framework actually look like?

Bankole: It would start with testing on realistic traffic distributions, not balanced benchmarks. It would require vendors to report recall specifically on low-frequency attack categories, because those are the ones that cause the largest damage when missed. It would include computational budget constraints as a first-class evaluation criterion, not an afterthought. And it would involve longitudinal testing, measuring performance not just at deployment, but six months and a year later, when traffic patterns have shifted.
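
The recall requirement he describes can be sketched in a few lines. Everything in the example below is a toy assumption: the category names `dos` and `exfiltration`, the traffic counts, and the predictions are invented for illustration, and `per_category_recall` is a hypothetical helper rather than part of any standard or of Bankole's own tooling.

```python
from collections import Counter


def per_category_recall(y_true, y_pred, normal_label="normal"):
    """Recall for each attack category: correctly labeled / total in that category."""
    totals = Counter(t for t in y_true if t != normal_label)
    hits = Counter(
        t for t, p in zip(y_true, y_pred) if t != normal_label and p == t
    )
    return {category: hits[category] / count for category, count in totals.items()}


# Toy traffic sample (invented): 9,900 normal flows, 90 common attacks, 10 rare ones.
y_true = ["normal"] * 9_900 + ["dos"] * 90 + ["exfiltration"] * 10

# Toy predictions (invented): the detector misses 2 of 90 DoS flows
# and 8 of 10 exfiltration flows, labeling the misses as normal.
y_pred = (
    ["normal"] * 9_900
    + ["dos"] * 88 + ["normal"] * 2
    + ["exfiltration"] * 2 + ["normal"] * 8
)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_category_recall(y_true, y_pred)

print(f"overall accuracy: {accuracy:.1%}")
for category, value in recalls.items():
    print(f"recall on {category}: {value:.1%}")
```

In this toy sample, overall accuracy is 99.9 percent while recall on the rare category is 20 percent. A headline number hides that gap, and a per-category report exposes it.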

On critical infrastructure and national urgency

Q: When you think about the stakes, which sectors concern you most right now?

Bankole: Healthcare and energy, in that order. Hospitals have enormous IoT footprints: connected infusion pumps, patient monitoring systems, imaging equipment, building management. And many of these devices run on legacy firmware that was never designed to be secured. A missed detection in a hospital network is not an IT problem. It has direct patient safety implications. The energy sector is similarly exposed. The U.S. power grid has millions of connected devices, and many of them lack basic patching schedules. If the intrusion detection protecting those systems can’t reliably flag a sophisticated attack because it was trained to favor accuracy over recall on rare threats, that’s a serious national vulnerability.

Q: What would it take for this to actually change at a policy level?

Bankole: Federal agencies like CISA and NIST have done the conceptual work. The frameworks are solid. What is missing is the translation from framework to specific, testable performance criteria for AI-based intrusion detection systems in IoT environments. Right now, an organization can purchase a security product, the vendor can show impressive benchmark numbers, and there is no standard way to verify whether those numbers reflect anything close to real-world performance. Establishing procurement standards that require multi-dimensional performance evidence would change the incentive structure overnight. Vendors would start building for real deployment conditions because that would be what buyers are required to ask for.

Bankole’s research arrives at a moment when federal investment in critical infrastructure cybersecurity is accelerating. Whether the procurement standards he describes follow is, for now, an open question.

What is not open is whether the current evaluation practices are adequate. The research suggests, fairly clearly, that they are not.

Oluwapelumi Bankole is a researcher in information systems and cybersecurity at the University of Nevada, Las Vegas, where his work focuses on AI-driven intrusion detection systems for IoT and cloud infrastructure. He holds a dual master’s degree in Management Information Systems and Cybersecurity from UNLV.