Advanced AI-Driven Malware Detection and Classification Using Multimodal Security Signals

By Rengarajan Jegadeesan

Posted on July 14, 2025

The increasing volume and complexity of cyberattacks utilizing images, URLs, and QR codes have introduced significant challenges for existing security systems. Traditional signature-based and rule-driven methods frequently fail to detect these advanced threats, especially when malicious payloads are hidden within visual elements or embedded through layered redirection techniques. In response, this work proposes an AI-driven multimodal malware detection framework that analyzes a variety of digital artifacts within a unified intelligent pipeline. The framework extracts complementary security signals from each modality—including lexical and domain-level features from URLs, entropy and statistical metrics from images, and embedded content characteristics from QR codes. These multimodal features are then processed by a combination of machine learning and deep learning models, which collaboratively assess the likelihood of malicious intent. Experimental results demonstrate that the proposed system is both reliable and scalable, effectively detecting sophisticated malware campaigns across multiple attack surfaces. The findings confirm that integrating multimodal feature extraction with AI-based classifiers substantially strengthens defenses against modern cyber threats.

Introduction

As digital platforms grow, cyberattacks using malware hidden in images, URLs, and QR codes have become more advanced. Attackers use methods like obfuscation, steganography, dynamic URLs, and multi-stage phishing to bypass traditional security tools. Images may contain embedded scripts or altered metadata, URLs employ fast-flux hosting and redirect chains, and QR codes hide malicious links, complicating threat detection. These tactics expose the limits of single-source analysis and underscore the need for multimodal threat intelligence.

Although AI and machine learning can identify complex threats across different data types, most existing solutions focus on just one kind of content, making coordinated attacks harder to detect. This research introduces an AI-powered multimodal malware detection system that analyzes URLs, images, and QR codes together, extracting features from each and applying several machine learning models (Random Forest, Gradient Boosting, CatBoost, Logistic Regression, and a multilayer perceptron) to improve accuracy.

Major contributions include building a unified attack dataset, designing an integrated AI pipeline for multimodal analysis, and evaluating model performance to identify the best approach. The system delivers reliable detection and offers a scalable foundation for future cyber defense.

Related Work

Traditional Approaches to Malware Detection

Malware detection has focused on analyzing single artifacts like executables or URLs, which limits effectiveness against advanced threats using multiple channels.

Machine Learning and Deep Learning Advances

Machine learning and deep learning, including CNNs for code classification, improve malware analysis but can be inconsistent across different datasets.

QR Code Threats and Detection Techniques

Malicious QR codes are increasingly used in attacks. Dynamic URLs evade simple blacklist systems, so hybrid and lightweight models are employed; algorithms like Random Forest and XGBoost show strong performance on embedded URL analysis.

Limitations of Conventional Systems and the Need for Multimodal Solutions

Signature-based detection fails against polymorphic and zero-day threats. Adaptive, feature-driven models offer better scalability, yet most research is limited to single artifact types. This study proposes a unified AI-driven multimodal system to tackle complex malware campaigns.

Methodology

The methodology introduces an AI-powered multimodal malware detection system that analyzes URLs, images, and QR codes in a unified pipeline. Its modular design supports scalability and interpretability, with each artifact contributing distinct security signals.

Key Components

Data Acquisition: Gather URLs, images, and QR codes from multiple sources to build a diverse dataset.

Multimodal Feature Extraction: Extract relevant features for each artifact type to capture unique security signals.

Model Training: Train machine learning models using the combined features for multimodal analysis.

Decision-Level Fusion: Merge classification results from all modalities to achieve robust malware detection.

This framework delivers effective detection across digital artifact types and provides a scalable base for advancing cyber defense.

System Overview

The system follows a structured, multi-stage workflow:

Figure 1: System overview showing the extraction of features, classification, scoring and output decision flow.

1) Input Acquisition

Raw URLs, images, and QR codes are ingested from datasets or live scanning modules.
QR codes are decoded to extract embedded URLs or redirection paths.

2) Preprocessing & Normalization

URLs are normalized (lowercasing, token cleanup, parameter stripping).
Images are resized, color-normalized, and stripped of corrupt metadata entries.
QR codes undergo error-correction and format validation before decoding.
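
The URL normalization step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function name `normalize_url` and the `TRACKING_PARAMS` list are assumptions introduced here, as is the choice to lowercase only the scheme and host (paths can be case-sensitive) and to treat "parameter stripping" as removing common tracking parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters to strip; the article does not enumerate any.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(raw: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip tracking parameters."""
    parts = urlsplit(raw.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Keep only the query parameters that are not in the tracking list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

Under these assumptions, `normalize_url("HTTP://Example.COM/Login?utm_source=x&id=7#top")` yields `http://example.com/Login?id=7`.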

3) Modality-Specific Feature Extraction

Each modality produces a structured representation of security signals.

4) AI-Based Classification

Independent ML models are trained per-modality.

5) Fusion Layer

Outputs from all classifiers are aggregated to generate a unified malware prediction.

The proposed multimodal malware detection system formalizes every stage (feature extraction, classifier training, and ensemble integration) using mathematical representations. Let each input artifact be represented as:

$$X = \{x_{\mathrm{url}},\ x_{\mathrm{img}},\ x_{\mathrm{qr}}\}$$

where $x_{\mathrm{url}}$, $x_{\mathrm{img}}$, and $x_{\mathrm{qr}}$ denote the URL, image, and QR code modalities.

Malicious URL Detection

Malicious URLs are a major source of various cyber threats, including phishing and malware. The system uses a structured pipeline to extract lexical, domain-level, and content-based indicators from URLs, which are then analyzed by machine-learning classifiers. Its goal is to detect harmful links early, before users interact with them, by learning patterns from large datasets of benign and malicious URLs.

Each URL processed by the system is converted into a fixed-length feature vector built from three distinct categories of features:

Lexical features
Domain-level features
Content-level features

Formally:

$$u = f_{\mathrm{url}}(x_{\mathrm{url}}) \in \mathbb{R}^{d_{\mathrm{url}}}$$

The feature vector is the concatenation $u = [L \,\|\, D \,\|\, C]$, where $L$, $D$, and $C$ correspond to the lexical, domain-level, and content-level features, respectively.

URLs can reveal security risks through their structure and domain data. The system extracts three main types of signals:

(a) Lexical Features

Lexical analysis detects signs of phishing and malware, such as IP addresses instead of domains, excessive dots or directories, special characters (e.g., @, -, %), and suspicious keywords like login or banking.

(b) Domain-Level Features

Domain checks include registration age, expiration patterns, registrar reputation, and DNS anomalies, helping identify freshly registered or unusual domains linked to malicious activity.

(c) Content-Level Features

When possible, content analysis looks for risky JavaScript patterns, hidden forms, external URLs, and metadata like redirect chains.

All signals form a unified feature vector for classification.
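
As an illustration of the lexical part of this vector, the sketch below derives a handful of the indicators named above from a URL string. It is a simplified example rather than the system's actual feature extractor. The function names, the feature ordering, and the keyword list are assumptions made here, and domain-level and content-level features are omitted because they require WHOIS, DNS, or page content.

```python
import ipaddress
from typing import List
from urllib.parse import urlsplit

# Illustrative values only; the article names "login" and "banking" as examples.
SUSPICIOUS_KEYWORDS = ("login", "banking")
SPECIAL_CHARS = "@-%"


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> List[float]:
    """Build a small fixed-length lexical feature vector for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return [
        float(host_is_ip(host)),                                      # IP instead of domain
        float(len(url)),                                              # total URL length
        float(host.count(".")),                                       # dots in the host
        float(len([seg for seg in parts.path.split("/") if seg])),    # directory depth
        float(sum(url.count(ch) for ch in SPECIAL_CHARS)),            # special characters
        float(sum(lowered.count(kw) for kw in SUSPICIOUS_KEYWORDS)),  # keyword hits
    ]
```

Every URL maps to a vector of the same length, which is what allows the downstream classifiers to consume it directly.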

Detection uses two machine-learning models:

(a) Random Forest

This ensemble of decision trees predicts maliciousness by aggregating the votes of its individual trees, excels at spotting non-linear patterns, and highlights important features such as “IP address presence” and “domain age.”

(b) Logistic Regression

Logistic regression classifies URLs by applying weights to features and separating benign from malicious samples with a linear boundary. It calculates the probability of a URL being malicious using a sigmoid function, and predicts malicious if this probability exceeds 0.5, otherwise benign. This approach offers fast, traceable classification of malicious URLs.
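
The decision rule just described can be written out in a few lines. The sketch below is illustrative only: the weights are made-up values rather than trained parameters, and the optional `bias` term is an assumption added here for generality.

```python
import math
from typing import Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_malicious(
    features: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,        # assumption: the article does not mention a bias term
    threshold: float = 0.5,   # the cut-off stated in the text
) -> Tuple[float, str]:
    """Return (probability of malicious, predicted label) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, ("malicious" if p > threshold else "benign")
```

Because each weight multiplies exactly one feature, the contribution of any single indicator to the score can be read off directly, which is the traceability the paragraph above refers to.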

Figure 2: Confusion matrix visualization (annotated heatmap) for the 4-class Logistic Regression matrix.

Malicious Image Detection

Image analysis involves extracting features such as statistical properties, entropy scores, metadata, and Min-Hash signatures to detect malicious payloads. Entropy analysis helps reveal steganography by identifying abnormal patterns, while Min-Hashing enables efficient similarity checks between image features. Changes in entropy or unusual metadata, such as missing camera or location details, may indicate tampering or concealed threats.

An image $x_{\mathrm{img}}$ is processed to extract statistical, entropy-based, metadata-based, and similarity-signature features:

$$v = f_{\mathrm{img}}(x_{\mathrm{img}}) \in \mathbb{R}^{d_{\mathrm{img}}}$$

Let:

$S$ = statistical descriptors (mean, variance, GLCM textures)
$H$ = entropy metrics
$M$ = metadata features
$Z$ = Min-Hash signature vector

$$v = [S \,\|\, H \,\|\, M \,\|\, Z]$$

Images can hide payloads, scripts, or steganographic content. The system reviews:

(a) Statistical and Pixel-Level Properties

Analyses color and intensity histograms, mean, variance, skewness, kurtosis, edge distribution, and texture signatures (LBP, GLCM) to spot irregularities from tampering.

(b) Entropy Analysis

Monitors entropy changes and block-wise deviations that signal embedded data—entropy shifts indicate steganography or hidden modifications.
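
A minimal sketch of block-wise entropy analysis is shown below, operating on raw bytes (for example, decoded pixel data). The block size, the two-standard-deviation outlier rule, and the function names are assumptions chosen for illustration. The article does not specify how deviations are scored.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty or constant input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def blockwise_entropy(data: bytes, block_size: int = 256) -> List[float]:
    """Entropy of each consecutive block of block_size bytes."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def entropy_outliers(entropies: List[float], k: float = 2.0) -> List[int]:
    """Indices of blocks whose entropy lies more than k std devs from the mean."""
    if len(entropies) < 2:
        return []
    mean = sum(entropies) / len(entropies)
    std = math.sqrt(sum((e - mean) ** 2 for e in entropies) / len(entropies))
    if std == 0.0:
        return []
    return [i for i, e in enumerate(entropies) if abs(e - mean) > k * std]
```

A region that carries compressed or encrypted data tends to sit near 8 bits per byte, so a block that stands out from otherwise smooth content is a candidate for closer inspection.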

(c) Metadata and EXIF Forensics

Checks for missing or altered EXIF tags, inconsistencies in camera info, timestamps, and unexpected embedded data, which may suggest image tampering.

(d) Min-Hash Signatures

Uses hashed feature descriptors to efficiently compare images; matching Min-Hash signatures reveal similar patterns, aiding rapid identification of malicious variants.
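
The Min-Hash idea can be illustrated with a short standard-library sketch. It assumes that each image has already been reduced to a set of string tokens (for instance, quantized feature descriptors). That tokenization step is hypothetical and not described in the article, as are the function names and the use of salted SHA-256 as the hash family.

```python
import hashlib
from typing import Iterable, List, Sequence


def minhash_signature(tokens: Iterable[str], num_hashes: int = 64) -> List[int]:
    """Min-Hash signature: the minimum hash value per seeded hash function."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("at least one token is required")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt simulates a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(salt + tok.encode("utf-8")).digest()[:8], "big"
            )
            for tok in token_set
        ))
    return signature


def estimated_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)
```

Two images with identical token sets produce identical signatures, while partially overlapping sets agree on roughly the fraction of positions given by their Jaccard similarity, so variants of a known malicious image can be matched without comparing raw pixels.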

Following feature extraction, the system applies multiple machine-learning models to classify images as benign or malicious.

Each model produces a probability score indicating the likelihood of malicious content.

(a) Random Forest Classifier

(b) Logistic Regression

Figure 3: Sample QR codes which were categorized as malicious vs benign.

Malicious QR Code Detection

QR codes have emerged as a significant attack vector in modern threat campaigns, enabling adversaries to embed malicious URLs, initiate credential-harvesting workflows, or trigger drive-by malware downloads. Because QR codes visually appear identical whether malicious or benign, an AI-driven approach is required to analyze both the QR code’s visual structure and the payload encoded within it.

A QR code is decoded into structural and URL-based features:

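$$q = f_{\mathrm{qr}}(x_{\mathrm{qr}}) \in \mathbb{R}^{d_{\mathrm{qr}}}$$

Let:

$Q_s$ = structural/mask features
$Q_e$ = encoding/error-correction metadata
$Q_u$ = embedded URL features

$$q = [Q_s \,\|\, Q_e \,\|\, Q_u]$$

If the QR contains a URL, then:

$$Q_u = f_{\mathrm{url}}(\text{decoded\_url})$$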

The system processes QR codes through a two-stage extraction pipeline: structural analysis and payload inspection. This multimodal design ensures that threats are detected whether they originate from manipulated QR patterns or malicious encoded URLs.

(a) Structural Features (QR Code Forensics)

Attackers often modify the internal structure of QR codes to embed hidden payloads or obscure detection. The system extracts:

Error-correction level (L/M/Q/H)
Mask and format information
Alignment and timing patterns
Module density and block uniformity

These structural signals help identify tampered or synthetically generated QR codes designed to bypass scanners.

(b) Content & Payload Features (Encoded Data Analysis)

The primary security risk of QR codes lies in the URL or command sequence encoded within them. After decoding the payload, the system performs:

URL normalization and sanitization
Lexical feature extraction (length, characters, suspicious keywords)
Domain-level feature checks (WHOIS, DNS, domain age, hosting reputation)
Redirection-chain modeling to detect multi-stage phishing or malware-delivery paths

By integrating the entire URL-feature pipeline into QR-code analysis, the system inherits all strengths of the URL classifier and applies them seamlessly to QR-derived content.
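
The way the QR feature vector reuses the URL pipeline can be sketched as below. This is a simplified illustration under several assumptions: QR decoding has already happened elsewhere, the structural features arrive as a list of numbers, the error-correction level is one-hot encoded, and `toy_url_features` is a stand-in for the real URL feature extractor. A zero vector for non-URL payloads is also a choice made here, not something the article specifies.

```python
from typing import Callable, List, Sequence

ECC_LEVELS = ("L", "M", "Q", "H")


def toy_url_features(url: str) -> List[float]:
    """Hypothetical stand-in for the URL feature extractor (length, dot count)."""
    return [float(len(url)), float(url.count("."))]


def qr_feature_vector(
    structural: Sequence[float],
    ecc_level: str,
    payload: str,
    url_featurizer: Callable[[str], List[float]] = toy_url_features,
    url_dim: int = 2,
) -> List[float]:
    """Concatenate structural, error-correction, and embedded-URL features."""
    q_s = [float(x) for x in structural]
    q_e = [1.0 if ecc_level == level else 0.0 for level in ECC_LEVELS]
    if payload.lower().startswith(("http://", "https://")):
        q_u = url_featurizer(payload)
    else:
        q_u = [0.0] * url_dim  # assumption: non-URL payloads get a zero vector
    if len(q_u) != url_dim:
        raise ValueError("URL featurizer returned an unexpected length")
    return q_s + q_e + q_u
```

Passing the URL featurizer in as a parameter mirrors the design described above: any improvement to the URL feature pipeline carries over to QR analysis without changes to the QR code path.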

QR-based classification extends naturally from the multimodal design, using machine-learning models to identify malicious patterns in both the QR structure and its embedded payload.

(a) Structural Classifier

Features extracted from the QR image (module shapes, masks, error-correction anomalies, density patterns) are passed into models such as:

Random Forest
Gradient Boosting
Multi-Layer Perceptron (MLP)

These models detect unusual structural behavior indicative of tampering or malicious encoding.

(b) Payload Classifier (URL-Driven)

Since most QR-code attacks rely on URLs, the decoded payload is analyzed using the same classifier suite used in URL detection:

Logistic Regression
Random Forest
CatBoost / Gradient Boosting
MLP for high-dimensional embeddings

This unifies QR detection with the broader multimodal pipeline.

Model Implementation

The implementation of the proposed malware-detection framework is based on a comprehensive AI-driven pipeline. This pipeline encompasses several stages, including data preprocessing, multimodal feature engineering, model training, hyperparameter optimization, and performance evaluation. Although the overall system is designed to support URLs, images, and QR codes, the core machine-learning implementation described here focuses on URL-based models. These models serve as the foundation for all payload-driven components in the system, including those that analyze QR-decoded URLs.

1. Data Preprocessing and Feature Transformation

All datasets undergo standardized preprocessing for reliable performance across data types. For URLs, steps include text normalization (lowercasing, removing escape sequences and extra characters), tokenization (extracting tokens from paths, subdomains, queries), categorical-to-numerical encoding, Min–Max scaling, and optional label balancing via sampling. After preprocessing, data is split into training, validation, and test sets to prevent leakage and support model evaluation.
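
One detail that matters for leakage prevention is that scaling statistics should be computed on the training split only and then reused for validation and test data. The sketch below shows Min–Max scaling done that way. The function names are illustrative, and mapping constant columns to 0.0 is an assumption made here.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute per-column minima and maxima from the training rows only."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def transform_min_max(
    rows: Sequence[Sequence[float]],
    mins: Sequence[float],
    maxs: Sequence[float],
) -> List[List[float]]:
    """Scale each column with the fitted bounds; constant columns map to 0.0."""
    return [
        [
            0.0 if hi == lo else (x - lo) / (hi - lo)
            for x, lo, hi in zip(row, mins, maxs)
        ]
        for row in rows
    ]
```

Values in the validation or test split may fall outside the range [0, 1] when they exceed the bounds seen during training. That is expected behavior and does not indicate an error.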

2. Feature Extraction Pipeline

Feature engineering distinguishes malicious from legitimate activity by creating structured feature vectors from URLs. Key indicators include:

(a) Lexical: Raw IP address usage, hostname/path/query lengths, counts of special characters (@, ?, %, =, -, /), dots, subdomains, directory depth, and frequency of risky keywords like login, verify, authenticate.

(b) Domain-Level: WHOIS status (domain availability, age, expiration), DNS records (MX, NS, TTL consistency), and registrar reputation.

These metrics together produce a numerical feature vector for learning algorithms.

3. Model Training and Optimization

Several machine learning models are used to assess classification over linear and nonlinear boundaries:

(a) Random Forest Classifier

Random Forest aggregates multiple decision trees built from feature subsets, making it robust against overfitting and adept at capturing complex URL patterns.

$$\hat{y} = \operatorname{mode}\big(h_1(u), \ldots, h_T(u)\big)$$
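
The voting rule can be made concrete with a toy example. The three "trees" below are hand-written decision stumps over a hypothetical two-element feature vector (IP-address presence, domain age in days). They are not learned from data, and breaking ties toward the smaller label is an assumption added here.

```python
from collections import Counter
from typing import Callable, Sequence

Tree = Callable[[Sequence[float]], int]


def forest_predict(trees: Sequence[Tree], u: Sequence[float]) -> int:
    """Return the most common class label among the trees' predictions."""
    if not trees:
        raise ValueError("at least one tree is required")
    votes = Counter(tree(u) for tree in trees)
    top = max(votes.values())
    # Assumption: ties are broken in favor of the smaller label (0 = benign).
    return min(label for label, count in votes.items() if count == top)


# Hypothetical stumps; u = [ip_address_present, domain_age_days].
EXAMPLE_TREES = [
    lambda u: int(u[0] == 1.0),
    lambda u: int(u[1] < 30.0),
    lambda u: int(u[0] == 1.0 and u[1] < 365.0),
]
```

For a URL served from a raw IP on a ten-day-old domain, `forest_predict(EXAMPLE_TREES, [1.0, 10.0])` returns 1 (malicious), since all three stumps vote the same way.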

(b) Logistic Regression

This linear, interpretable model estimates the probability of a URL being malicious using a sigmoid function.

$$P(y = 1 \mid u) = \sigma(w^{\top} u)$$

Models are trained with cross-validation for hyperparameter tuning, regularization (L1/L2 for logistic regression), and calibration methods like Platt scaling. Performance is evaluated on independent test data.

4. Classification and Prediction

In the deployed pipeline:

Input features pass through the trained model(s).
The model outputs a probability $p$ that the sample is malicious.
A decision threshold $\tau$ is applied.
The output is labeled as:

Malicious if $p \ge \tau$

Benign if $p < \tau$

These probabilities integrate into the multimodal fusion layer, where URL predictions contribute to the final malware score alongside image and QR classifiers.

5. Performance Evaluation Metrics

To quantify model performance, the system evaluates several standard classification metrics:

(a) Accuracy

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

(b) Precision

The proportion of predicted malicious samples that are genuinely malicious:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

(c) Recall

Measures how effectively the model identifies actual malicious URLs:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

(d) F1-Score

A balanced measure combining precision and recall:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

(e) Confusion Matrix

Provides a structured visualization of all four prediction outcomes (TP, TN, FP, FN), enabling detailed error analysis and threshold adjustments.
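
For reference, the four scalar metrics can be computed directly from the confusion-matrix counts. The helper below is a small sketch for the binary case. Returning 0.0 when a denominator is zero is a convention chosen here, and the counts in the usage note are made-up numbers, not results from this study.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With illustrative counts of TP = 90, TN = 85, FP = 10, and FN = 15, accuracy is 175/200 = 0.875, precision is 90/100 = 0.9, recall is 90/105 ≈ 0.857, and F1 is 180/205 ≈ 0.878.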

6. Integration into Multimodal Framework

The trained URL classifier is a key component of the broader multimodal system:

It directly classifies URLs.
It classifies QR-decoded payload URLs.
Its probability output contributes to the final fused malware score.

In summary, the URL model implementation constitutes a foundational element of the system’s AI-driven, multimodal security architecture.

Results and Evaluation

The performance of the proposed AI-driven multimodal malware detection framework was evaluated across three modalities: URLs, images, and QR codes. In addition, a combined multimodal fusion analysis was performed. To ensure fair comparison and prevent data leakage, experiments utilized stratified train, validation, and test splits. Standard classification metrics were used to assess performance, including Accuracy, Precision, Recall, F1-Score, ROC-AUC, and Confusion Matrices.

1) URL Classification Results

The URL dataset comprised 120,000 labeled URLs, representing various categories such as benign, phishing, malware-hosting, and suspicious redirects.

Best Performing Models

Random Forest (RF)
Logistic Regression (LR)
CatBoost
MLP

Table 1 below presents the average performance of these models on the held-out test set.

Table 1 – URL Classification Results

Metric	Random Forest	Logistic Regression	CatBoost	MLP
Accuracy	96.2%	86.1%	97.4%	94.8%
Precision	95.7%	85.6%	97.9%	94.1%
Recall	95.9%	86.1%	97.2%	94.4%
F1-score	95.8%	84.3%	97.5%	94.2%
ROC-AUC	0.982	0.911	0.991	0.971

Observations

CatBoost demonstrated the best overall performance, primarily due to its robust handling of categorical URL patterns.
Random Forest showed nearly equivalent performance, particularly excelling with irregular lexical URLs.
Logistic Regression lagged behind the other models, largely because of its limitations in modeling non-linear relationships.

2) Image Classification Results

The image dataset comprised 42,000 samples, which included benign images, steganographic payload images, and manipulated malware-carrier images.

Feature sets were constructed to capture relevant characteristics for classification tasks.

Table 2 – Image Classification Results

Metric	Random Forest	MLP
Accuracy	94.3%	95.1%
Precision	93.8%	95.4%
Recall	92.6%	94.8%
F1-score	93.2%	95.1%
ROC-AUC	0.963	0.975

Observations

The MLP classifier outperformed Random Forest by effectively capturing nonlinear relationships in entropy-texture interactions.

Entropy shifts and metadata inconsistencies were identified as the most important features for classification.

3) QR Code Classification Results

Evaluation used 30,000 QR samples including the following categories:

Legitimate website links

Phishing redirect QR codes

Malware-delivery QR images

Synthetically manipulated QR structures

The evaluation considered both structural signals and payload (URL) signals.

Table 3 – QR Code Classification Results

Metric	Structural Classifier	Payload (URL) Classifier	Hybrid (Fusion)
Accuracy	91.4%	96.2%	97.1%
Precision	90.8%	95.9%	97.3%
Recall	89.5%	96.0%	96.8%
F1-score	90.1%	95.9%	97.0%
ROC-AUC	0.942	0.982	0.987

Observations

Payload-based detection (decoded URLs) outperformed pure structural analysis.

Hybrid fusion significantly improved scores, highlighting the importance of multimodal signals.

4) End-to-End Multimodal Fusion Results

To evaluate the full system, predictions from three distinct classifiers were aggregated:

URL classifier

Image classifier

QR classifier

These predictions were combined using late fusion, applying optimized weights determined through validation to maximize performance.
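
A weighted late-fusion rule of this kind can be sketched as follows. The weights shown in the usage note are placeholders, not the validation-optimized values used in the experiments, and renormalizing over the modalities that are actually present is an assumption made here to handle artifacts that lack one of the three inputs.

```python
from typing import Dict, Optional


def late_fusion(
    probabilities: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-modality malicious probabilities.

    Modalities whose probability is None are skipped and the remaining
    weights are renormalized so the result stays in [0, 1].
    """
    present = {m: p for m, p in probabilities.items() if p is not None}
    total_weight = sum(weights[m] for m in present)
    if total_weight <= 0:
        raise ValueError("no weighted modality available for fusion")
    return sum(weights[m] * p for m, p in present.items()) / total_weight
```

For example, with placeholder weights of 0.5 (URL), 0.2 (image), and 0.3 (QR) and per-modality probabilities of 0.9, 0.2, and 0.7, the fused score is 0.45 + 0.04 + 0.21 = 0.70.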

Table 4 – Multimodal Fusion Performance

Metric	Fusion Model
Accuracy	98.3%
Precision	98.1%
Recall	97.9%
F1-score	98.0%
ROC-AUC	0.995

Observations

Fusion improved accuracy by roughly 1 to 3 percentage points over the best single-modality models (98.3% versus 97.4% for URLs, 95.1% for images, and 97.1% for QR codes).

Most notable gains were achieved on hard borderline samples, such as URLs embedded in QR codes and modified images.

A ROC-AUC of 0.995 indicates that the system separates benign from malicious samples almost perfectly on the evaluated data.

Conclusion

This work introduces an AI-based multimodal malware detection system that analyzes URLs, images, and QR codes by integrating diverse security signals such as lexical patterns, metadata, entropy, QR features, and similarity signatures. By leveraging multiple machine-learning classifiers for each modality and combining their outputs through a late-fusion ensemble, the system boosts accuracy and reliability, achieving 98.3% accuracy, 98.1% precision, 97.9% recall, and a ROC-AUC of 0.995—outperforming single-modality approaches. The results show that this multimodal framework effectively detects advanced threats like obfuscated URLs and steganographic attacks, offering robust, scalable, and adaptive malware protection.

Future Directions

Future work will explore the integration of deep learning–based multimodal embeddings, sequence-aware URL models, vision transformers for image-based payload detection, and adversarial-robust training techniques to further enhance detection accuracy under evolving threat landscapes. Expanding the system to include behavioral telemetry and network-flow signals presents an additional opportunity for strengthening next-generation cyber defense. Overall, the results establish that multimodal AI systems provide a strong foundation for defending against the increasingly complex and cross-channel nature of modern malware campaigns.