Chapter 4: Scouting the Competition --- 3 AM Reconnaissance
The Discovery
It was 3 AM on February 16, and we should have been sleeping. The submission had gone in two days earlier. The slides were done. The system was deployed and passing evaluation. There was nothing productive left to do at that hour.
And then a GitHub link appeared.
Session transcript, Feb 16 ~02:49
"go through this url by someone who submitted this. check what is the difference between their and our projects and what are their limitations vs what are our limitations?"
The URL pointed to a public repository: MeetInCode/AGENTIC-HONEYPOT. Another team's submission for the same GUVI India AI Impact Buildathon, the same problem statement --- Problem 2: AI-Powered Honeypot. Someone had pushed their entire codebase to a public repo. Their approach, their architecture, their prompts, their tests --- all of it visible on GitHub.
The impulse was immediate and honest: are they better than us?
This is one of those moments in a hackathon that shifts your perspective. You have been building in isolation, making decisions based on your own reading of the problem statement, your own instincts about what matters. Suddenly you can see how someone else interpreted the exact same brief. What did they prioritize? What did they skip? Where did their thinking converge with yours, and where did it diverge?
The Ethics of Looking
Before we dove in, there was a brief moment of hesitation. Is it fair to analyze a competitor's public code while the evaluation is still pending?
The answer, after about five seconds of deliberation, was yes. Their code was on a public GitHub repository, created January 26 --- weeks before submission. No private access was required. No authentication was bypassed. They chose to make it visible, likely for the same reason we eventually would: to show their work, to build a portfolio, to demonstrate what they had built. Looking at public code is not espionage. It is open-source culture working as intended.
But the analysis had to be honest. Not "find things to feel superior about," but "understand what they built, where they are strong, where we are strong, and what we can learn." At 3 AM, with post-submission anxiety running high, maintaining that objectivity took deliberate effort. The temptation to cherry-pick weaknesses is real when your ego is on the line.
We asked Claude Code to do the teardown. It had built our entire system --- every module, every regex, every persona prompt --- so it knew exactly what to compare against. And an AI does not have an ego to protect. It reports facts.
The Analysis
Their Architecture
The competitor's system --- called "Agentic Honeypot" --- used a fundamentally different architecture from ours. Where we went with Firebase Cloud Functions and a single Gemini Flash model, they built on FastAPI with a multi-model ensemble, deployed on Railway.
Their defining feature was the Detection Council: five concurrent LLM agents from two different providers (Groq and NVIDIA NIM), each specialized in a different analysis domain --- safety, linguistics, bot patterns, scam strategies. A separate Judge agent (Llama 3.3 70B) aggregated the votes to reach consensus. The idea was clever. A scam message that fools one model might not fool five.
Our Architecture:                          Their Architecture:

Single Gemini Flash call                   5+ LLM agents in parallel
+ keyword detection                        (Groq: Llama Scout, GPT-OSS, Llama 3.3)
+ confidence blending                      (NVIDIA: Nemotron 49B, Minimax M2.1)
           |                                          |
           v                                          v
Scam classification                        Judge agent aggregates votes
(one model, two signals)                   (consensus-based classification)
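The voting idea on the right can be sketched in a few lines. This is our reconstruction of the concept, not their code: the agent names, vote weights, and the mean-confidence aggregation rule are illustrative assumptions.

```python
from collections import Counter

def judge(votes: dict[str, tuple[str, float]]) -> tuple[str, float]:
    """Aggregate per-agent (label, confidence) votes.
    Majority label wins; confidence is the mean confidence
    of the agents on the winning side."""
    tally = Counter(label for label, _ in votes.values())
    winner, _ = tally.most_common(1)[0]
    confs = [c for label, c in votes.values() if label == winner]
    return winner, sum(confs) / len(confs)

# Five specialized agents weigh in on one ambiguous message.
votes = {
    "safety":      ("scam", 0.92),
    "linguistics": ("scam", 0.81),
    "bot_pattern": ("legit", 0.55),
    "strategy":    ("scam", 0.88),
    "urls":        ("scam", 0.70),
}
label, confidence = judge(votes)   # 4-of-5 consensus: "scam"
```

The appeal is visible even in this toy version: the one dissenting agent is outvoted instead of dragging a single blended score toward the threshold.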
Their "Split-Process Architecture" was also different from ours. They returned the persona response immediately (synchronous path, under 2 seconds), then ran the detection council and intelligence extraction asynchronously in a background worker pool. We did everything synchronously in a single Cloud Function invocation. Their approach meant lower latency for the initial response. Ours meant simpler deployment and no background job coordination.
Their orchestrator had a worker pool pattern --- bounded concurrent background tasks that could be interrupted if a new message arrived for the same session. That was a detail we had not considered: what happens when a scammer sends three messages in rapid succession? Their system could abort background analysis for stale messages. Ours would process each message independently.
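The stale-message interruption pattern can be sketched with asyncio. This is our reading of their orchestrator idea, not their implementation; the class and method names are illustrative, and the sleep stands in for the real LLM calls.

```python
import asyncio

class AnalysisPool:
    def __init__(self, max_workers: int = 4):
        self._sem = asyncio.Semaphore(max_workers)   # bounded concurrency
        self._tasks: dict[str, asyncio.Task] = {}    # latest task per session

    async def _analyze(self, session_id: str, message: str) -> str:
        async with self._sem:
            await asyncio.sleep(0.05)  # stand-in for detection/extraction calls
            return f"analysis[{session_id}]: {message}"

    def submit(self, session_id: str, message: str) -> asyncio.Task:
        # A newer message for the same session cancels the stale analysis.
        stale = self._tasks.get(session_id)
        if stale and not stale.done():
            stale.cancel()
        task = asyncio.create_task(self._analyze(session_id, message))
        self._tasks[session_id] = task
        return task

async def demo() -> str:
    pool = AnalysisPool()
    first = pool.submit("s1", "hello")
    second = pool.submit("s1", "send me your UPI id")  # supersedes `first`
    result = await second
    assert first.cancelled()  # stale analysis was aborted, not completed
    return result

outcome = asyncio.run(demo())
```

Three rapid-fire messages would trigger two cancellations here, so only the analysis for the latest message ever finishes.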
Their Persona System
This is where the comparison got most interesting. We had three personas with deep backstories, family members, speech patterns, and scam-type-specific expertise. They had one: Ramesh Kumar, a "Confused Cooperator."
Their single-persona approach was not a limitation --- it was a design choice. Ramesh Kumar was a generic confused Indian person who mixed English and Hindi naturally. He did not have Sharma Uncle's 35 years of SBI banking knowledge or Lakshmi Aunty's Tamil-English code-mixing or Vikram's cyber law awareness. But he did not need to. Their system optimized for something different: short, natural responses (1-2 sentences, under 150 characters) with explicit extraction tactics.
Their prompt was more prescriptive about extraction methods. Where our personas used personality to naturally draw out information ("Beta, which bank you said?"), their Ramesh Kumar used labeled strategies: the Compliance Trick ("ok I'll send... give me your UPI id"), the Technical Barrier ("link not opening... send on WhatsApp? what's your number?"), the Validation Seeker ("just to confirm... your UPI id ends with @paytm right?"), and the Family Stall ("let me ask my son... meanwhile send me everything on SMS").
Both approaches had merit. Theirs was more systematic and reproducible. Ours was more immersive and culturally specific. A scammer targeting elderly banking customers expects different responses than one running a digital arrest operation --- our persona mapping addressed this. Their single persona handled all scam types with the same personality, but with shorter, more texting-authentic responses.
Their Evidence Extraction
Their extraction pipeline combined regex heuristics with an LLM-based entity extractor using Llama-4-Scout. Their regex patterns were broader:
# Their patterns (simpler, broader)
UPI_PATTERN = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z]{2,}', re.IGNORECASE)
BANK_ACCOUNT_PATTERN = re.compile(r'\b\d{9,18}\b')
# Our patterns (specific to Indian financial identifiers)
# 30+ explicit UPI handles: @oksbi, @ybl, @paytm, @okaxis...
# Aadhaar with Verhoeff checksum validation
# PAN with entity code verification
# IFSC with branch format matching
Their UPI regex ([a-zA-Z0-9._-]+@[a-zA-Z]{2,}) would catch more patterns --- but also match email addresses, generating false positives. Our 30+ explicit UPI handle list was more specific. Their bank account regex (\b\d{9,18}\b) would match any 9-to-18-digit number, including phone numbers and random digit sequences. Our patterns used contextual detection to reduce false matches.
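The false-positive difference is easy to demonstrate. BROAD_UPI below is the pattern from their repo; the allow-list is a small illustrative subset of our 30+ handles, not the full production list.

```python
import re

# Their broad pattern: any local part followed by @ and two-plus letters.
BROAD_UPI = re.compile(r'[a-zA-Z0-9._-]+@[a-zA-Z]{2,}', re.IGNORECASE)

# A narrow allow-list pattern in the style of ours (subset of handles).
KNOWN_HANDLES = ("oksbi", "ybl", "paytm", "okaxis")
NARROW_UPI = re.compile(
    r'\b[a-zA-Z0-9._-]+@(?:' + '|'.join(KNOWN_HANDLES) + r')\b')

text = "pay ramesh@ybl now, complaints to support@gmail.com"
broad = BROAD_UPI.findall(text)    # the email is matched as a "UPI ID"
narrow = NARROW_UPI.findall(text)  # the allow-list rejects the email
```

Running this, `broad` picks up `support@gmail` alongside the real UPI ID, while `narrow` returns only `ramesh@ybl`.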
But they compensated with a powerful LLM extraction step. They sent the entire conversation to Llama-4-Scout with a detailed extraction prompt --- complete with worked examples and JSON output format --- that covered UPI IDs, phone numbers, bank accounts, phishing links, emails, scammer identifiers, case IDs, policy numbers, and order numbers. The LLM acted as a second-pass extractor that could catch things regex missed, and their prompt engineering for this extraction was genuinely well-done.
Their Strengths, Our Strengths
After reading through their codebase, the picture was nuanced. This was not a clear "we are better" or "they are better" situation. It was two teams solving the same problem with genuinely different architectures, each with real advantages.
Where They Were Stronger
Honest assessment: where they had the edge
Multi-model consensus. Their Detection Council --- five-plus models voting on scam classification --- was a genuinely clever approach to reducing false positives. If a message is ambiguous, having five models weigh in from different analytical angles gives you more signal than a single model's confidence score. This is defense-in-depth for classification, and we had nothing equivalent.
Provider diversity. Groq plus NVIDIA NIM meant different model families with different training data and different failure modes. A scam pattern that Llama misses might be caught by Nemotron. We used a single Gemini Flash model for everything. Our fallback was Gemini 2.0 Flash --- same provider, same family.
Async response path. Their split-process design returned the persona response in under 2 seconds while heavy analysis ran in the background. Our synchronous pipeline did everything serially. For latency-sensitive evaluation, their architecture had an edge.
Response naturalness. "1-2 sentences, under 150 characters, casual texting style." Real people text in fragments. Their Ramesh Kumar sounded like someone actually typing on a phone. Our personas sometimes generated paragraphs. In a real scam interaction, shorter is more believable.
Where We Were Stronger
Honest assessment: where we had the edge
Cross-session intelligence. Our Firestore-backed evidence_index tracked scammer identifiers across conversations. If the same UPI ID or phone number appeared in a new session, the system recognized a returning scammer and escalated to more aggressive extraction. Their session manager was in-memory (dict-based) --- sessions were lost on server restart, and there was no cross-session linking. For a production honeypot that runs for weeks, this is the difference between catching a scammer once and building a scammer database.
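The cross-session linking idea reduces to a reverse index from identifier to sessions. This sketch uses an in-memory dict as a stand-in for the Firestore-backed store; the function name and return contract are illustrative, not our production API.

```python
# identifier (UPI ID, phone number, ...) -> session ids where it appeared
evidence_index: dict[str, list[str]] = {}

def record(session_id: str, identifier: str) -> bool:
    """Store an extracted identifier. Returns True when the identifier
    links this session to an earlier one, i.e. a returning scammer
    that should trigger escalated extraction."""
    seen = evidence_index.setdefault(identifier, [])
    returning = any(s != session_id for s in seen)
    if session_id not in seen:
        seen.append(session_id)
    return returning

first_sighting = record("sess-1", "fraudster@ybl")   # False: new identifier
returning = record("sess-2", "fraudster@ybl")        # True: seen before
```

The point of the Firestore backing is exactly that this dict survives restarts and accumulates across weeks, which an in-memory version cannot do.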
Cultural depth. Three personas versus one. Sharma Uncle's Delhi Hinglish versus Lakshmi Aunty's Tamil-English versus Vikram's Bangalore tech-bro skepticism. Each engineered for specific scam categories with backstories, family members, and speech patterns that made the engagement feel culturally real.
Strategy state machine. Our system dynamically adjusted its conversation approach --- building trust, then extracting, then direct probing, then pivoting --- with self-correction that backed off when the scammer grew suspicious. Their persona used per-turn extraction tactics. Our approach adapted over the conversation arc. Theirs was more turn-by-turn.
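The conversation-arc adaptation can be pictured as a small transition table. The state names match the phases described above, but the table itself is a simplified sketch, not our exact production logic.

```python
# state -> (scammer signal -> next state); "suspicious" always backs off
TRANSITIONS = {
    "build_trust":  {"engaged": "extract",      "suspicious": "build_trust"},
    "extract":      {"engaged": "direct_probe", "suspicious": "build_trust"},
    "direct_probe": {"engaged": "direct_probe", "suspicious": "pivot"},
    "pivot":        {"engaged": "extract",      "suspicious": "build_trust"},
}

def next_state(state: str, scammer_signal: str) -> str:
    return TRANSITIONS[state][scammer_signal]

state = "build_trust"
for signal in ["engaged", "engaged", "suspicious"]:
    state = next_state(state, signal)
# build_trust -> extract -> direct_probe -> pivot
```

The self-correction lives in the "suspicious" column: whenever the scammer's tone cools, the machine retreats to a safer, trust-building posture instead of pressing on.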
Production hardening. OIDC token verification for Cloud Tasks callbacks. Prompt injection sanitization. Rate limiting. API key validation with constant-time comparison. A full security audit with four critical findings found and fixed. Their system had API key auth but no visible OIDC verification, rate limiting, or input sanitization.
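The constant-time comparison mentioned above is the standard defense against timing side channels on key checks; in Python it is one call to the standard library. The key strings here are made up for illustration.

```python
import hmac

def api_key_valid(presented: str, expected: str) -> bool:
    # hmac.compare_digest takes time independent of where the inputs
    # differ, unlike `==`, which returns early at the first mismatch
    # and can leak how many leading characters an attacker got right.
    return hmac.compare_digest(presented.encode(), expected.encode())

assert api_key_valid("sk-live-abc123", "sk-live-abc123")
assert not api_key_valid("sk-live-abc124", "sk-live-abc123")
```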
Test coverage. 276 structured test cases versus their scattered test scripts. Our tests covered extraction accuracy for Indian-specific patterns (Aadhaar checksums, IFSC formats, PAN entity codes), persona behavior, classification edge cases, and callback formatting.
Dashboard. A 9-page Streamlit dashboard on Cloud Run with live session monitoring, evidence visualization, and analytics. They had no monitoring interface.
CI/CD. GitHub Actions with Workload Identity Federation (keyless auth), automated testing on every push. Their deployment was Railway via Procfile --- simpler but without automated testing in the pipeline.
What Competition Teaches You
The Multiple Valid Architectures Insight
The most valuable takeaway from the analysis was not "we are ahead" or "they are ahead." It was the realization that multiple valid architectures exist for the same problem, and the "best" one depends entirely on what you optimize for.
If you optimize for classification accuracy, their multi-model council is the right architecture. Five models voting is provably more robust than one model scoring.
If you optimize for engagement quality and long-term intelligence, our deep-persona + cross-session approach is the right architecture. A scammer who talks to Sharma Uncle for 15 turns reveals more than one who gets a generic 2-sentence response from Ramesh Kumar.
If you optimize for response latency, their async split-process wins. If you optimize for operational simplicity, our single-function synchronous approach wins.
If you optimize for provider resilience, their multi-provider strategy wins. If you optimize for ecosystem integration (Firestore + Cloud Tasks + Secret Manager + Cloud Run), our Google-native approach wins.
Neither architecture was wrong. They were different bets on what matters most for an AI honeypot. And the fact that two teams, reading the same problem statement, made such different bets is itself a fascinating data point about the solution space.
Lesson: Competition reveals your implicit assumptions
We had assumed --- without examining the assumption --- that deep personas were obviously better than a single generic one. Seeing another team choose the single-persona route forced us to articulate why we believed cultural depth mattered. The answer was about engagement duration: a scammer talks longer to someone who feels culturally real. But we had never explicitly stated that tradeoff until we saw someone make the opposite choice. Your competitors do not just show you what they built. They show you what you believe.
The Blind Spots
The competitor analysis revealed things we had not thought about:
- Model diversity is a real defense. We had put all our classification trust in a single Gemini Flash model. If Gemini has a systematic blind spot for a particular scam pattern, we have no fallback. Their multi-model approach addressed this structurally.
- Response length matters for believability. Their 1-2 sentence responses were closer to how real people text. Our personas sometimes generated 3-4 sentences with parenthetical asides and comma-heavy constructions. Real people --- especially the elderly personas we were simulating --- type short messages on phones. Shorter is more authentic.
- The worker pool pattern. Their async worker pool for background processing was architecturally cleaner than our sequential approach. Separating the fast response path from the heavy analysis path is worth considering for future iterations.
The 3 AM emotional rollercoaster
The analysis went through predictable emotional stages. First: anxiety ("are they better? what if their Detection Council outscores our single model?"). Then: careful reading ("they have five models voting, that is genuinely clever"). Then: comparison ("but they do not have cross-session learning or persistent sessions"). Then: honest reckoning ("we are strong on different axes, and the axes barely overlap"). Finally: gratitude ("we just learned three things we would not have thought of otherwise").
The entire cycle took about 45 minutes. At 3 AM, after a submission, that emotional swing is amplified. Every strength you find in their code feels like a weakness in yours. Every gap in their code feels like temporary relief rather than genuine confidence. Having Claude Code do the objective technical comparison helped. It did not say "do not worry, yours is better." It said "here are the facts, here is what they have, here is what you have." The facts were calming in a way that reassurance would not have been.
The Strategic Response
The analysis did not just satisfy curiosity. It triggered action. Within the same day (February 16), we shipped three commits:
791d598 Add per-turn scoring fields for GUVI finals evaluation compliance
6542d2c Update docs to reflect enriched per-turn response format
3375eb4 Add IFSC, Aadhaar, PAN, crypto wallet extractors and expand test coverage to 276
Seeing their evidence extraction limited to 7 types gave us a clear signal: expanding to 11 types was a differentiator worth investing in that same day. We added IFSC code extraction with branch format validation, Aadhaar number extraction with Verhoeff checksum verification, PAN card extraction with entity code validation, and cryptocurrency wallet extraction for Bitcoin, Ethereum, and TRC-20 addresses. Each new extractor came with parameterized tests. The test suite grew to 276 cases in a single session.
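The Verhoeff check behind the Aadhaar extractor is worth showing, since it is what separates "any 12-digit number" from a plausibly real Aadhaar. The `_D`/`_P` tables are the standard published Verhoeff tables; the extra rules in `aadhaar_valid` (12 digits, first digit not 0 or 1) follow UIDAI's stated number format.

```python
# Dihedral-group multiplication table (d) and permutation table (p)
# for the Verhoeff checksum.
_D = [
    [0,1,2,3,4,5,6,7,8,9], [1,2,3,4,0,6,7,8,9,5],
    [2,3,4,0,1,7,8,9,5,6], [3,4,0,1,2,8,9,5,6,7],
    [4,0,1,2,3,9,5,6,7,8], [5,9,8,7,6,0,4,3,2,1],
    [6,5,9,8,7,1,0,4,3,2], [7,6,5,9,8,2,1,0,4,3],
    [8,7,6,5,9,3,2,1,0,4], [9,8,7,6,5,4,3,2,1,0],
]
_P = [
    [0,1,2,3,4,5,6,7,8,9], [1,5,7,6,2,8,3,0,9,4],
    [5,8,0,3,7,9,6,1,4,2], [8,9,1,6,0,4,3,5,2,7],
    [9,4,5,3,1,2,6,8,7,0], [4,2,8,6,5,7,3,9,0,1],
    [2,7,9,3,8,0,6,4,1,5], [7,0,4,6,9,1,3,2,5,8],
]

def verhoeff_valid(number: str) -> bool:
    """True if the trailing digit is a correct Verhoeff check digit."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(ch)]]
    return c == 0

def aadhaar_valid(number: str) -> bool:
    return (len(number) == 12 and number.isdigit()
            and number[0] not in "01" and verhoeff_valid(number))

assert verhoeff_valid("2363")       # "236" + correct check digit 3
assert not verhoeff_valid("2364")   # wrong check digit
```

A regex alone would flag any 12-digit run as an Aadhaar; the checksum pass quietly discards most random digit sequences a scammer's message happens to contain.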
The per-turn scoring fields (scamLikelihood, confidence, riskLevel on every response) were also a direct response to understanding how their Detection Council would present per-turn classification data. We needed to show the evaluator equivalent per-turn signal from our single-model approach.
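The shape of those per-turn fields is simple. The field names scamLikelihood, confidence, and riskLevel are the ones we shipped; the threshold logic below is a simplified illustration, not the production rule.

```python
def per_turn_fields(scam_likelihood: float, confidence: float) -> dict:
    # Bucket the continuous likelihood into a coarse risk level so the
    # evaluator sees a categorical signal alongside the raw scores.
    risk = ("high" if scam_likelihood >= 0.8
            else "medium" if scam_likelihood >= 0.5 else "low")
    return {
        "scamLikelihood": scam_likelihood,
        "confidence": confidence,
        "riskLevel": risk,
    }

fields = per_turn_fields(0.91, 0.86)  # attached to every response turn
```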
What We Learned
Look at your competitors' code, but honestly
Competitive analysis is only valuable if you are willing to find things the other team did better. If you go in looking for confirmation that you are winning, you will find it and learn nothing. If you go in looking for ideas, you will find those too. The multi-model council idea alone was worth the 45 minutes we spent reading their code.
Multiple architectures can solve the same problem well
The instinct during a hackathon is to believe there is one correct approach and the only question is who executes it better. The reality is that problem spaces --- especially in AI --- have multiple valid architectural solutions. The best response to seeing a different architecture is not "are they right or are we right?" but "what tradeoff did they make that we did not, and what does that reveal about our own assumptions?"
In-memory session storage is a production risk
Their in-memory session manager (dict-based) was the one area where our architectural choice was objectively better for a production system. In a competition evaluation environment, sessions exist for minutes and restarts are rare. In a real production honeypot, sessions can span days or weeks, and a server restart erases everything. Persistence is not optional for intelligence gathering.
Competition is a free architecture review
At 3 AM, we got the equivalent of a senior engineer reviewing our design decisions --- by showing us how someone else solved the same problem. We did not pay for this review. We did not ask for it. A public GitHub repo gave us more useful architectural feedback than any code review tool could have. It showed us our blind spots (model diversity, response length), validated our bets (cultural depth, production hardening), and gave us concrete features to ship before the finals evaluation. If you ever get the chance to see how another team solves the same problem you are solving, take it. Read generously. Learn aggressively. Ship the insights.