Chapter 0: The Problem --- India's Scam Epidemic
What We Built
Before we write a single line of code, we need to understand what we are fighting. This prologue sets the stage: the scale of India's scam crisis, the most common attack vectors, why traditional defenses fail, and what a different kind of weapon --- an AI honeypot --- can actually accomplish.
ScamShield AI is not an anti-spam filter. It is not a blacklist. It is an offensive system that wastes scammers' time, extracts their operational details, and feeds actionable intelligence back to authorities. By the end of this series, you will have built one from scratch.
Why This Approach
The Numbers Are Staggering
India's National Cyber Crime Reporting Portal (NCRP) recorded over 31 lakh complaints in 2023 alone. The Indian Cyber Crime Coordination Centre (I4C) estimates that Indians lose more than 1,750 crore rupees annually to cyber fraud --- and that figure captures only the cases that get reported. The real losses are almost certainly multiples higher.
The damage is not evenly distributed. Retirees, homemakers, small-town professionals, and first-generation smartphone users bear the brunt. These are people who grew up trusting institutions --- banks, police, courts --- and scammers exploit exactly that trust.
The Common Scam Playbook
Scammers operating in India have converged on a handful of proven scripts. Understanding these is essential to building a system that can mimic their targets convincingly.
mindmap
  root((Indian Scam Types))
    KYC / Banking Fraud
      "Your KYC has expired"
      Account block threats
      Fake RBI / SBI messages
      OTP / CVV extraction
    Digital Arrest
      Fake CBI / Police calls
      Aadhaar linked to crime
      "Stay on video call"
      Pay to settle case
    Job Scams
      YouTube like tasks
      Daily earning promises
      Upfront investment
      Telegram/WhatsApp groups
    Lottery / Prize
      KBC winner messages
      Lucky draw claims
      Processing fee demands
      Tax advance payment
    Other
      Tech Support fraud
      Crypto investment
      Sextortion
      Custom duty scams
KYC / Banking Fraud is the most prevalent. The script is simple: send an SMS claiming the victim's KYC has expired and their account will be blocked in 24 hours. Include a link. The link leads to a phishing page that harvests UPI PINs, OTPs, and card numbers. Variations impersonate SBI, HDFC, ICICI, and RBI.
Digital Arrest is newer and more psychologically devastating. The scammer calls via WhatsApp video, claims to be from CBI, NIA, or the Cyber Cell, and tells the victim their Aadhaar number has been linked to money laundering or drug trafficking. The victim is told they are "under digital arrest" and must not disconnect. They are pressured into transferring lakhs to "secure accounts" to clear their name. This scam type has exploded since 2023.
Job Scams target unemployed youth and gig workers. The pitch: earn Rs 5,000--10,000 daily by completing simple tasks (liking YouTube videos, rating products). After a few small payouts to build trust, victims are asked to "invest" larger amounts for higher-tier tasks. The money never comes back.
Lottery and Prize Scams are the oldest but still effective. A message arrives congratulating the recipient on winning a KBC jackpot or an international lottery. To claim the prize, pay a processing fee, tax advance, or courier charge. The prize does not exist.
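To make the playbook concrete, here is a minimal sketch of how these four scripts could be told apart with plain keyword matching. The keyword lists, the category names, and the confidence formula are illustrative assumptions drawn from the descriptions above, not the classifier used later in the series.

```python
from dataclasses import dataclass

# Hypothetical keyword lists, loosely based on the scripts described above.
SCAM_KEYWORDS: dict[str, tuple[str, ...]] = {
    "kyc_banking": ("kyc", "account will be blocked", "otp", "cvv", "upi pin"),
    "digital_arrest": ("digital arrest", "cbi", "aadhaar", "money laundering", "video call"),
    "job_scam": ("daily earning", "youtube", "task", "telegram", "work from home"),
    "lottery_prize": ("kbc", "lottery", "lucky draw", "processing fee", "prize"),
}


@dataclass(frozen=True)
class Classification:
    scam_type: str
    confidence: float  # share of all keyword hits that belong to the winning type


def classify(message: str) -> Classification:
    """Pick the scam type whose keywords appear most often in the message."""
    text = message.lower()
    # Naive substring matching: good enough for a sketch, but it can
    # over-match short keywords that happen to sit inside longer words.
    hits = {
        scam_type: sum(keyword in text for keyword in keywords)
        for scam_type, keywords in SCAM_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return Classification("unknown", 0.0)
    best = max(hits, key=hits.get)
    return Classification(best, hits[best] / total)


result = classify("Your KYC has expired. Account will be blocked in 24 hours. Share OTP.")
print(result.scam_type, result.confidence)  # kyc_banking 1.0
```

A message that mixes scripts (say, a fake CBI caller who also asks for an OTP) would split its hits across categories and lower the confidence, which is one reason a keyword pass like this is best treated as a first guess rather than a verdict.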
Why Traditional Defenses Fall Short
The standard defensive toolkit --- spam filters, number blacklists, awareness campaigns --- operates on the assumption that the best strategy is avoidance. Block the scammer's number. Delete the SMS. Educate people not to click links.
This has three problems:
- Scammers rotate infrastructure constantly. A blocked phone number is replaced in minutes. A taken-down phishing domain is cloned by the next morning. Blacklists are always behind.
- Awareness campaigns have diminishing returns. The people most vulnerable --- elderly, rural, first-time internet users --- are the hardest to reach with digital literacy programs. And even tech-savvy people fall for well-crafted social engineering.
- Pure defense generates no intelligence. When you block a scammer, you learn nothing about their operation --- their UPI IDs, bank accounts, phone numbers, command structure, or modus operandi. The scammer simply moves to the next target.
The Honeypot Paradigm
A honeypot inverts the relationship. Instead of avoiding the scammer, you engage them --- deliberately, strategically, and at scale.
flowchart LR
A[Scammer sends\nphishing message] --> B[Honeypot receives\nmessage]
B --> C[AI classifies\nscam type]
C --> D[Persona engages\nscammer]
D --> E[Scammer reveals\nUPI / bank / phone]
E --> F[Evidence extracted\nand reported]
F --> G[Intelligence sent\nto authorities]
D --> H[Scammer wastes\n10-30 minutes]
H --> I[Fewer real victims\nreached]
Every minute a scammer spends talking to a honeypot is a minute they are not talking to a real victim. That alone has value. But the real payoff is intelligence: the UPI IDs they ask victims to pay, the bank accounts they use, the phone numbers they operate from, the fake case numbers and employee IDs they fabricate. Collected at scale, this data can help law enforcement map scam networks and freeze accounts.
What Makes ScamShield AI Different
Previous honeypot systems (where they exist at all) have been manual or semi-automated. A researcher answers scam calls, tries to keep the scammer talking, and manually notes down details. This does not scale.
ScamShield AI automates the entire loop:
| Capability | Manual Honeypot | ScamShield AI |
|---|---|---|
| Response time | Minutes (human typing) | Sub-second (LLM) |
| Hours of operation | When researcher is awake | 24/7 |
| Concurrent conversations | 1 | Scales with load (serverless) |
| Cultural authenticity | Depends on researcher | Engineered personas |
| Evidence extraction | Manual notes | Regex + keyword, 11 evidence types |
| Intelligence reporting | End of day | Real-time callbacks |
Three design decisions make the system work:
Cultural authenticity over generic chatbots. Indian scammers target specific demographics. A KYC scam targets an elderly retired banker. A digital arrest scam targets an anxious young professional. Our personas --- Sharma Uncle, Lakshmi Aunty, Vikram --- are engineered with authentic speech patterns, family references, regional expressions, and domain knowledge that make scammers believe they have found a real victim.
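As a rough illustration of what "engineered" means here, a persona can be treated as structured data that is turned into an LLM system prompt. Only the three persona names come from the text; the backgrounds, speech markers, scam-to-persona mapping, and prompt wording below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    background: str
    speech_markers: tuple[str, ...]


# Names are from the text; every other field is an illustrative assumption.
PERSONAS: dict[str, Persona] = {
    "sharma_uncle": Persona(
        "Sharma Uncle", "retired bank employee who is cautious with money", ("ji", "beta")
    ),
    "lakshmi_aunty": Persona(
        "Lakshmi Aunty", "homemaker who relies on family for phone help", ("kanna",)
    ),
    "vikram": Persona(
        "Vikram", "young professional who is anxious about legal trouble", ("sir",)
    ),
}

# Hypothetical mapping from detected scam type to the most plausible target.
PERSONA_FOR_SCAM: dict[str, str] = {
    "kyc_banking": "sharma_uncle",
    "digital_arrest": "vikram",
    "lottery_prize": "lakshmi_aunty",
}


def pick_persona(scam_type: str) -> Persona:
    """Return the persona for a scam type, defaulting to Sharma Uncle."""
    return PERSONAS[PERSONA_FOR_SCAM.get(scam_type, "sharma_uncle")]


def system_prompt(persona: Persona) -> str:
    """Render a persona into a short system prompt for the LLM."""
    markers = ", ".join(persona.speech_markers)
    return (
        f"You are {persona.name}, a {persona.background}. "
        "Stay in character, sound worried but cooperative, "
        f"and use expressions such as: {markers}."
    )


print(system_prompt(pick_persona("digital_arrest")))
```

Keeping personas as data rather than hard-coded prompt strings makes it easier to add a new character or retarget one at a different scam script without touching the conversation logic.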
Regex-first evidence extraction. Indian financial identifiers have distinctive formats: UPI IDs (user@oksbi), IFSC codes (SBIN0001234), Aadhaar numbers (12 digits with Verhoeff checksum), PAN cards (ABCDE1234F). Regex catches these reliably and instantly. We do not wait for the LLM to extract them.
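A minimal sketch of that idea for the four identifiers named above. The exact patterns are simplified assumptions (real UPI handles and IFSC codes have more edge cases), and the function names are ours for illustration. The Verhoeff tables are the standard ones for the checksum algorithm.

```python
import re

# Simplified, assumed patterns; production rules would be stricter.
PATTERNS = {
    # name@handle, rejecting email-style addresses such as name@gmail.com
    "upi_id": re.compile(r"(?<![\w.-])[A-Za-z0-9._-]{2,}@[A-Za-z]{2,}\b(?!\.[A-Za-z])"),
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),     # e.g. SBIN0001234
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),     # e.g. ABCDE1234F
    "aadhaar": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}

# Verhoeff multiplication table (dihedral group D5).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Verhoeff position-dependent permutation table.
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(number: str) -> bool:
    """True if a digit string (check digit last) passes the Verhoeff checksum."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        checksum = _D[checksum][_P[i % 8][int(ch)]]
    return checksum == 0


def extract_evidence(text: str) -> dict[str, list[str]]:
    """Return every identifier found in text, grouped by evidence type."""
    found: dict[str, list[str]] = {}
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if kind == "aadhaar":
            # Normalise separators, then drop 12-digit strings that fail the checksum.
            matches = [re.sub(r"[ -]", "", m) for m in matches]
            matches = [m for m in matches if verhoeff_valid(m)]
        if matches:
            found[kind] = sorted(set(matches))
    return found


print(extract_evidence("Send Rs 500 to user@oksbi, IFSC SBIN0001234, PAN ABCDE1234F"))
# {'upi_id': ['user@oksbi'], 'ifsc': ['SBIN0001234'], 'pan': ['ABCDE1234F']}
```

The checksum step is what keeps an arbitrary 12-digit number, such as an order ID, from being reported as an Aadhaar number.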
Serverless architecture for zero-cost scaling. Firebase Cloud Functions spin up per request and scale to zero when idle. During a hackathon evaluation, the system might handle dozens of concurrent scam conversations. Between evaluations, with no instances running, it incurs no compute cost.
The Hackathon Context
ScamShield AI was built for the GUVI India AI Impact Buildathon --- a hackathon focused on AI systems that address real problems in India. The evaluation system sends simulated scam messages to our endpoint and scores our responses on:
- Response quality (M1): Does the AI engage convincingly? Does it ask questions? Does it identify red flags?
- Scam detection reliability (M2): Does it correctly classify the scam type with high confidence?
- Conversation quality (M2.5): Does it ask verification questions, extract information, and sustain engagement?
- Evidence extraction (M3): Does it identify UPI IDs, bank accounts, phone numbers, and other financial identifiers?
- Intelligence reporting (M4): Does it send a structured callback with the accumulated evidence?
The evaluator acts as the scammer, sending up to 10 messages per session. Our system must respond to each one in character, extract evidence progressively, and report its findings via a callback API. The constraint is real-time: responses must arrive within seconds, not minutes.
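The session flow described above (up to 10 messages, evidence accumulated turn by turn, a structured callback at the end) might be modelled roughly as follows. This is a sketch under assumptions: the class, the method names, and the JSON field names are hypothetical, and the real callback schema is defined by the evaluator, not by this code.

```python
import json
from dataclasses import dataclass, field

MAX_TURNS = 10  # from the text: the evaluator sends up to 10 messages per session


@dataclass
class Session:
    session_id: str
    turns: int = 0
    evidence: dict[str, set[str]] = field(default_factory=dict)

    def record_turn(self, extracted: dict[str, list[str]]) -> None:
        """Count one scammer message and merge whatever evidence it contained."""
        if self.turns >= MAX_TURNS:
            raise ValueError("session already has the maximum number of turns")
        self.turns += 1
        for kind, values in extracted.items():
            # Sets deduplicate identifiers the scammer repeats across turns.
            self.evidence.setdefault(kind, set()).update(values)

    def callback_payload(self) -> str:
        """Serialise the accumulated evidence (field names are hypothetical)."""
        body = {
            "sessionId": self.session_id,
            "turns": self.turns,
            "evidence": {k: sorted(v) for k, v in sorted(self.evidence.items())},
        }
        return json.dumps(body)


session = Session("demo-1")
session.record_turn({"upi_id": ["user@oksbi"]})
session.record_turn({"upi_id": ["user@oksbi"], "ifsc": ["SBIN0001234"]})
print(session.callback_payload())
# {"sessionId": "demo-1", "turns": 2, "evidence": {"ifsc": ["SBIN0001234"], "upi_id": ["user@oksbi"]}}
```

Building the payload from accumulated state, rather than from the latest message alone, means a callback sent at any point in the session carries everything learned so far.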
This series walks through every component we built to make that work.
What We Learned
Building ScamShield AI taught us lessons that apply far beyond this specific project:
Lesson: Domain immersion before architecture
We spent more time reading FIR reports, watching scam call recordings on YouTube, and cataloguing Indian financial ID formats than we spent choosing our tech stack. That research directly informed our persona design, our regex patterns, and our classification categories. If you are building an AI system for a specific domain, immerse yourself in that domain first. The architecture follows.
Lesson: The attacker's economics matter
Scammers operate on volume. They send thousands of messages hoping for a small hit rate. A honeypot that wastes even 10 minutes per conversation is devastating to that business model. When designing adversarial systems, think about your opponent's unit economics.
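A back-of-the-envelope calculation shows why. The inputs below are made-up illustrative numbers, not measurements, and the model makes the simplifying assumption that every minute spent on a honeypot is a minute not spent on a real target.

```python
def scammer_hours_absorbed(conversations: int, minutes_each: float) -> float:
    """Total scammer time, in hours, consumed by honeypot conversations."""
    return conversations * minutes_each / 60


def pitches_displaced(conversations: int, minutes_each: float,
                      minutes_per_real_pitch: float) -> int:
    """Rough count of real pitches the same time could have paid for."""
    return int(conversations * minutes_each // minutes_per_real_pitch)


# Hypothetical inputs: 300 honeypot conversations of 10 minutes each,
# and 15 minutes for a scammer to work through one real target.
hours = scammer_hours_absorbed(conversations=300, minutes_each=10)
print(f"{hours:.1f} scammer-hours absorbed")  # 50.0 scammer-hours absorbed
print(pitches_displaced(300, 10, 15))         # 200
```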
Lesson: Cultural context is not optional
A generic English chatbot would be immediately identified as fake by any scammer operating in India. The Hinglish mix, the "ji" and "beta" and "kanna", the references to specific banks and government agencies --- these are not cosmetic. They are the difference between a scammer investing 30 minutes in a conversation and hanging up after the first message.
Key Architectural Decision
We chose to build an offensive system, not a defensive one.
The conventional approach to scam detection is classification and blocking: detect the scam, warn the user, block the number. This is valuable work, but it operates in a purely reactive mode.
We deliberately chose to engage scammers rather than avoid them. This means our system must be convincing enough to fool a human adversary --- a much harder bar than simply classifying messages. But the intelligence yield makes it worth the additional complexity.
The rest of this series shows how we met that bar, one component at a time.