Finding Duplicates in a 50M+ Image Database: Lessons for Real Estate Platforms

A few years ago, while searching for an apartment with my future wife, I encountered a problem that many renters are all too familiar with: fake listings. The same photos appeared in multiple ads, often tied to properties that didn’t exist. Scammers simply copied images from legitimate sources and reposted them with new descriptions and prices.

What started as personal frustration turned into a technical challenge I couldn’t ignore. I decided to build a system that could automatically recognise when the same photo appeared across different listings – even when scammers tried to disguise it.

At first glance, it seemed like a straightforward task: detect identical images and flag duplicates. But the reality was far more complex. Real estate scammers rarely use exact copies – they crop images to remove watermarks, resize them, or overlay new text or branding. The challenge was to build a system that could see beyond those surface manipulations and identify when two photos were, in essence, the same.

Over time, that idea evolved into a high-performance image search engine capable of processing more than 150 million images across 18 terabytes of storage, with 95% of searches completing in under eight seconds. What began as a side project became an exercise in designing for scale, speed, and real-world reliability.

Seeing What Computers Don’t

Traditional image comparison methods fail when faced with manipulation. A cropped or resized image looks entirely different to an algorithm that compares pixels. The key was to teach the computer to see images the way humans do – not by matching every pixel, but by identifying distinctive features that remain consistent even when the image changes.

Every photo has unique “key points”: corners of walls, edges of furniture, and light transitions that define its structure. These features are what our brains unconsciously use to recognise familiar images. I tested multiple detection methods on a dataset of 10,000 sample images to determine which algorithm could most efficiently capture these features.

Algorithm | Average coverage | Maximum coverage | Processing time
FAST      | 20.4%            | 59.3%            | 0.251 sec
BRISK     | 35.9%            | 91.8%            | 1.274 sec
AKAZE     | 38.8%            | 81.1%            | 0.503 sec

The results revealed that the AKAZE algorithm offered the right balance between speed and precision – identifying key points in just half a second per image while maintaining high accuracy. That balance between efficiency and reliability would become the guiding principle of the project.
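As a rough illustration of that benchmark, here is a minimal sketch using OpenCV's implementations of the three detectors; the filename is a placeholder, and the original test harness may have differed in detail:

```python
import time
import cv2

# Placeholder filename -- substitute any test image from the benchmark set.
image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)

# OpenCV implementations of the three candidate detectors compared above.
detectors = {
    "FAST": cv2.FastFeatureDetector_create(),
    "BRISK": cv2.BRISK_create(),
    "AKAZE": cv2.AKAZE_create(),
}

for name, detector in detectors.items():
    start = time.perf_counter()
    keypoints = detector.detect(image, None)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(keypoints)} keypoints in {elapsed:.3f} sec")
```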

Turning Features into Digital Fingerprints

Once the system could recognise essential features, the next step was to describe them in a compact, comparable form – essentially, creating a digital fingerprint for every image.

Each keypoint was transformed into a 512-bit code that represented a small section of the image. Two images that shared enough of these codes were considered related, even if one was resized, cropped, or branded differently. It was a simple but powerful approach: instead of comparing entire images, the system compared their underlying “fingerprint patterns.”
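In OpenCV terms, this fingerprinting step looks roughly like the sketch below. The exact descriptor length depends on the detector configuration, and the filename is a placeholder; the production pipeline may have packed the codes differently.

```python
import cv2

# Placeholder filename for a listing photo.
image = cv2.imread("listing_photo.jpg", cv2.IMREAD_GRAYSCALE)

# detectAndCompute returns the keypoints plus one compact binary descriptor
# per keypoint -- the "fingerprint" of that small region of the image.
detector = cv2.AKAZE_create()
keypoints, descriptors = detector.detectAndCompute(image, None)

# Descriptors are packed bit strings stored as bytes; visually similar regions
# produce descriptors that differ in only a handful of bit positions.
print(descriptors.shape, descriptors.dtype)  # (n_keypoints, bytes_per_descriptor), uint8
```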

In testing, this method worked remarkably well on clean, high-resolution images. But online photos are rarely perfect. Real-world images come compressed, filtered, and resized, making pixel-perfect matches impossible. The system needed to handle that messiness gracefully.

Finding Balance in Imperfection

To handle imperfect data, I designed the matching to tolerate differences: rather than demanding exact matches, the system recognised when two images were close enough to be considered duplicates.

Instead of requiring identical fingerprints, the algorithm counted the number of bit positions in which two codes differed – their Hamming distance. Through extensive testing on thousands of modified images, I found that the best results came from striking a balance between sensitivity and flexibility: strict enough to avoid false positives, yet tolerant enough to catch genuine duplicates despite minor variations.
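A minimal sketch of that tolerance check, using OpenCV's brute-force Hamming matcher; the threshold value and filenames here are illustrative assumptions, not the tuned values from the production system:

```python
import cv2

detector = cv2.AKAZE_create()

# Placeholder filenames for the two photos being compared.
_, desc_a = detector.detectAndCompute(cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE), None)
_, desc_b = detector.detectAndCompute(cv2.imread("photo_b.jpg", cv2.IMREAD_GRAYSCALE), None)

# NORM_HAMMING counts how many bit positions differ between two binary codes.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = matcher.match(desc_a, desc_b)

# Illustrative tolerance -- the real threshold would be tuned on labelled data.
MAX_BIT_DIFFERENCE = 64
good = [m for m in matches if m.distance <= MAX_BIT_DIFFERENCE]

# If enough fingerprints agree within tolerance, treat the photos as near-duplicates.
print(f"{len(good)} of {len(matches)} fingerprints matched within tolerance")
```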

That sense of balance became the foundation of the project’s success. It was a reminder that in large-scale systems, perfection isn’t the goal – reliability is.

Scaling the Search

The first working version compared every new image against the entire database – an approach that worked for small datasets but collapsed at scale. With a million stored images generating hundreds of millions of feature points, even a single search took minutes. Processing new daily uploads would have required either enormous infrastructure or impossible patience.

The breakthrough came from an old concept in computer science: prefix trees. Instead of searching linearly, the system organises the fingerprints in a tree, where each path from the root represents a sequence of fingerprint bits. As soon as a search exceeds the allowed margin of difference along a path, the system stops exploring that branch, saving enormous computational time.
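The sketch below illustrates the idea with a toy binary trie: fingerprints are inserted bit by bit, and the search abandons any branch once the accumulated bit differences exceed the allowed tolerance. It is a simplified model of the approach, not the production implementation:

```python
# Toy binary prefix tree (trie) over fingerprint bits. Branches are pruned as
# soon as the accumulated bit differences exceed the allowed tolerance, so most
# of the stored data is never visited during a search.

class TrieNode:
    __slots__ = ("children", "image_ids")

    def __init__(self):
        self.children = {}   # bit value (0 or 1) -> child TrieNode
        self.image_ids = []  # images whose fingerprint ends at this node

def insert(root, bits, image_id):
    node = root
    for bit in bits:
        node = node.children.setdefault(bit, TrieNode())
    node.image_ids.append(image_id)

def search(node, bits, max_diff, depth=0, diff=0):
    """Return ids whose stored fingerprint differs from `bits` in at most `max_diff` positions."""
    if diff > max_diff:
        return []                      # prune: this branch can no longer match
    if depth == len(bits):
        return list(node.image_ids)
    results = []
    for bit, child in node.children.items():
        extra = 0 if bit == bits[depth] else 1
        results += search(child, bits, max_diff, depth + 1, diff + extra)
    return results

# Toy 8-bit fingerprints (real descriptors are far longer).
root = TrieNode()
insert(root, [0, 1, 1, 0, 1, 0, 0, 1], "listing_42")
print(search(root, [0, 1, 1, 0, 1, 0, 1, 1], max_diff=2))  # ['listing_42']
```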

That one optimisation transformed performance. What once took 47 seconds per search dropped to just 3.5 milliseconds. The database even became 3GB smaller, since identical data paths were automatically merged. From there, optimisation became a science of precision – reducing latency, compressing memory usage, and fine-tuning algorithms until the system achieved real-time search performance at scale.

Impact in Action

The final system processed 150 million images, performing real-time duplicate detection across multiple servers. At its peak, it ran on nine machines; later optimisations reduced it to one – a single, high-memory server that performed as efficiently as a distributed cluster.

However, beyond its technical achievement, the project addressed a genuine problem for online platforms. It enabled:

  • Fraud prevention: catching duplicate listings before they go live.
  • Content verification: identifying stolen or recycled photos automatically.
  • Quality assurance: filtering out low-quality or misleading images that damage user trust.
  • User experience improvement: ensuring that what users see truly reflects reality.

What makes the system remarkable is the integrity it brings to digital marketplaces. Real estate platforms, in particular, rely on visual cues to build trust. By ensuring image authenticity, they protect users from scams and strengthen confidence in the platform.

The Human Side of Innovation

What began as a personal irritation evolved into an engineering experiment – and ultimately into a scalable, real-world solution that could be applied far beyond the realm of real estate. The same principles can power content verification, copyright protection, or fraud detection in any industry where images play a significant role.

For me, the journey reinforced a simple truth about innovation: technical progress often starts with human frustration. Building something truly impactful requires both curiosity and discipline: the curiosity to question how things work, and the discipline to make them work better at scale.

Ultimately, the project wasn’t just about finding duplicates; it was about recognising a real problem and solving it. Above all, it was about finding balance – between precision and tolerance, between experimentation and structure, and between frustration and creation.
