Chapter 1: The Blitz --- Zero to Working System¶
The First Message¶
The very first session opened with a question:
"do you have knowledge of what this project is about?"
It was February 5, 2026. We had a hackathon deadline --- the GUVI India AI Impact Buildathon --- a rough idea (an AI honeypot that engages Indian phone scammers), and nothing else. No code. No architecture. No deployed infrastructure.
By the end of that day, we would have 30 commits, three culturally-authentic personas, a working Gemini-powered conversation engine, Firestore session persistence, evidence extraction for seven intelligence types, cross-session scammer detection, self-correcting strategy adjustment, and a Cloud Tasks callback pipeline.
This chapter is about those two days --- Feb 5 and Feb 11 --- that laid down the entire foundation of ScamShield AI.
Part 1: Day One (Feb 5) --- 30 Commits Before Midnight¶
Choosing the Stack¶
The first architectural decision was the most consequential: Firebase Cloud Functions + Google Gemini Flash.
Key Decision: Firebase + Gemini
We chose this stack for three reasons:
1. **Serverless with zero idle cost.** Cloud Functions scale to zero when idle and scale up per-request during evaluation. During a hackathon, you might get 50 concurrent conversations during testing and zero traffic for hours. Pay-per-invocation is the only model that makes sense.
2. **Google ecosystem integration.** Firestore for persistence, Cloud Tasks for scheduled callbacks, Secret Manager for API keys --- all accessible via Application Default Credentials with no additional auth configuration.
3. **Gemini Flash speed.** The GUVI evaluator expects sub-second responses. Gemini Flash delivers. We needed speed over reasoning depth --- a honeypot response does not need to solve differential equations, it needs to sound like a confused uncle asking "beta, which bank you said?"
The alternative was OpenAI GPT-4 on AWS Lambda. We rejected it because cross-provider auth (Lambda calling Firestore, managing OpenAI keys in AWS Secrets Manager) adds complexity that eats hackathon hours. Staying within Google's ecosystem meant one set of credentials, one CLI, one billing account.
The First Working System¶
The second commit of the day tells the story. It contained the entire skeleton: request parsing, scam classification, persona selection, Gemini response generation, evidence extraction, and the GUVI-compliant response format. It was functional but fragile --- a proof-of-concept that could handle a single conversation turn.
The next few commits were the usual early-project cleanup:
c865a3a - Add .gitignore for Python cache and environment files
0958ade - Add API key authentication and environment setup
a22b529 - Add Firebase project configuration
64d447d - Add deployment script and update gitignore
db29286 - Fix relative imports for Firebase Functions deployment
e5358c5 - Add secrets configuration to Cloud Function
Six commits to go from "code works locally" to "code deploys to Firebase." Each one fixing a small thing --- a missing .gitignore, a relative import that breaks in the Cloud Functions runtime, a secret that was not wired through. This is the unglamorous reality of serverless development: half your commits are deployment configuration.
The Model 404 Saga¶
Then came the three commits that every developer who has worked with rapidly-evolving LLM APIs will recognize:
d00c16f - Fix Gemini model 404 errors and relax GUVI input validation
205a17c - Update to Gemini 2.0 Flash experimental model
01e683f - Update to Gemini 3.0 Flash model
The Model Name Comedy of Errors
We started with gemini-3.0-flash. 404. Tried gemini-2.0-flash-experimental. 404. Tried gemini-3-flash. 404. Each attempt was a deploy, a test, a check of the logs, and another commit.
The problem was that Google's model naming conventions were in flux. The documentation said one thing, the API accepted another, and the model that actually existed was gemini-3-flash-preview --- a name that appeared in none of the documentation we had read.
Three commits and 45 minutes just to get the right string for a model name. This is the kind of thing that never appears in tutorials but consumes real engineering time.
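A defensive pattern that would have collapsed those three commits into one is probing candidate model names in order and keeping the first one the API accepts. A minimal sketch, where the probe callable and its `LookupError` are stand-ins for the SDK's model-not-found error, not the actual google.genai API:

```python
def first_working_model(candidates, try_model):
    """Return the first model name that try_model() accepts.

    try_model is any callable that raises LookupError (standing in
    for the SDK's 404 / model-not-found error) for invalid names.
    """
    last_error = None
    for name in candidates:
        try:
            try_model(name)
            return name
        except LookupError as exc:
            last_error = exc
    raise RuntimeError(f"No candidate model name worked: {last_error}")

# Hypothetical probe: only 'gemini-3-flash-preview' exists.
def probe(name):
    if name != "gemini-3-flash-preview":
        raise LookupError(f"404: model {name!r} not found")

candidates = ["gemini-3.0-flash", "gemini-2.0-flash-experimental",
              "gemini-3-flash", "gemini-3-flash-preview"]
print(first_working_model(candidates, probe))  # gemini-3-flash-preview
```

The trade-off is one slower cold start against a deploy-test-commit loop for every naming guess.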
Building the Conversation Engine¶
With a working model connection, we turned to making the honeypot actually convincing. The next cluster of commits shows a rapid iteration cycle --- each one addressing a specific failure mode observed during testing:
f4fec9c - Fix critical honeypot issues: classifier, responses, extraction, detection
82f828b - Major behavioral + technical fixes for honeypot system
bb94d9e - Apply behavioral improvements to all personas
The classifier was misidentifying scam types. The responses were too polished --- they sounded like a chatbot, not a confused victim. The evidence extraction was missing obvious patterns. Each fix came from sending test scam messages and reading the responses with a critical eye: "Would a scammer believe this?"
The Three Personas¶
We built all three personas in a single extended session. Sharma Uncle, Lakshmi Aunty, and Vikram were not arbitrary characters --- each one was designed to be the ideal victim for a specific category of Indian scam.
Sharma Uncle (Rajendra Sharma, 67, retired SBI branch manager, Dwarka, Delhi): The target for KYC and banking scams. He knows banking procedures from 35 years of experience but struggles with technology. He types slowly, makes mistakes, calls everyone "beta," and constantly references his wife Kamla and son Rohit who "handles all the computer things." The paradox that makes him effective: he catches banking inconsistencies naturally ("But beta, SBI never asks for OTP on phone call") while still appearing vulnerable enough to keep the scammer engaged.
Lakshmi Aunty (Lakshmi Venkataraman, 58, homemaker and retired teacher, T. Nagar, Chennai): Deployed for lottery, prize, and insurance scams. Her Tamil-English speech pattern ("Aiyo, kanna, what you are saying?") is instantly recognizable. She goes on tangents about her sons, asks questions that seem naive but extract information ("Which bank I should go to deposit, kanna?"), and has just enough knowledge of LIC insurance to ask awkward questions about fake insurance schemes.
Vikram (Vikram Malhotra, 32, senior software developer, Koramangala, Bangalore): The persona for digital arrest and tech support scams. Unlike the other two, Vikram is not vulnerable --- he is suspicious. He demands documentation, mentions his friend Rahul at the Cyber Cell, and asks pointed questions. Against digital arrest scams ("Your Aadhaar is linked to money laundering"), this skeptical-but-engaged persona keeps scammers talking while they try to overcome his objections.
Research Before Prompts
Every persona detail was informed by actual Indian scam patterns. We watched YouTube compilations of scam calls, read FIR reports, and studied the speech patterns of real victims who shared their experiences online. The "ji" and "beta" and "kanna" are not decoration --- they are signals that tell a scammer "this person is from the demographic I am targeting."
Evidence Extraction and Session Persistence¶
The afternoon was spent on intelligence gathering --- the entire point of a honeypot.
4e9ec16 - Add Firestore session persistence for evidence accumulation
47ad442 - Fix deployment timeout: migrate to google.genai, add lazy Firestore init
f922804 - Fix silent Firestore failures with in-memory fallback
Evidence extraction uses regex patterns tailored to Indian financial identifiers. UPI IDs (user@oksbi), bank account numbers, IFSC codes (SBIN0001234), phone numbers with +91 prefix --- each has a distinct format that regex catches reliably. We do not wait for the LLM to identify these; regex runs on every message and catches them instantly.
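That pattern layer can be sketched as follows; the patterns here are illustrative and looser than production-grade validation, not the project's exact regexes:

```python
import re

# Illustrative patterns for Indian financial identifiers.
PATTERNS = {
    "upi_id":  re.compile(r"\b[\w.\-]{2,}@[a-zA-Z]{2,}\b"),
    "ifsc":    re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "phone":   re.compile(r"(?:\+91[\-\s]?)?[6-9]\d{9}\b"),
    "account": re.compile(r"\b\d{9,18}\b"),
}

def extract_evidence(message: str) -> dict:
    """Run every pattern on the message; return non-empty matches only."""
    found = {}
    for kind, pattern in PATTERNS.items():
        matches = pattern.findall(message)
        if matches:
            found[kind] = matches
    return found

msg = "Send to rahul.k@oksbi, IFSC SBIN0001234, or call +91 9876543210"
print(extract_evidence(msg))
```

Because this runs on every message before any LLM call, evidence capture is deterministic and costs microseconds rather than an extra model round-trip.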
Firestore session persistence was necessary because scam conversations span multiple turns. Evidence from turn 3 needs to be combined with evidence from turn 7. Without persistence, each turn would start fresh with no memory of what had already been extracted.
The In-Memory Fallback
One of the early lessons: Firestore writes can fail silently in Cloud Functions if initialization is lazy and the connection is not yet established. We added an in-memory fallback so that if Firestore is unreachable, evidence still accumulates in the function's memory for the duration of the request. Not ideal for production, but it prevented total data loss during the early development phase.
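The fallback can be sketched as a write-through store with an injectable backend. The Firestore call is stubbed here so the example runs without GCP credentials; in the real system the writer would be a Firestore merge write:

```python
class EvidenceStore:
    """Write-through store: try Firestore first, fall back to memory."""

    def __init__(self, firestore_write):
        # firestore_write stands in for the real Firestore merge call;
        # injectable so the fallback path can be tested offline.
        self._write = firestore_write
        self._memory = {}          # request-scoped fallback buffer
        self.fallback_hits = 0

    def save(self, session_id, evidence):
        try:
            self._write(session_id, evidence)
        except ConnectionError:
            # Firestore unreachable: keep evidence in memory for the
            # duration of the request so it is not silently lost.
            self._memory.setdefault(session_id, {}).update(evidence)
            self.fallback_hits += 1

def flaky_write(session_id, evidence):
    raise ConnectionError("Firestore not initialized yet")

store = EvidenceStore(flaky_write)
store.save("sess-1", {"upi_id": ["rahul.k@oksbi"]})
print(store.fallback_hits)  # 1
```

The key design point is that `save()` never raises to the caller: a persistence hiccup must not abort the conversation turn.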
Confidence Scoring and Self-Correction¶
The evening commits show the system becoming smarter:
ba5d734 - Fix confidence scoring: always classify, proportional boost
c2b7d71 - Fix confidence scoring: blend LLM/keyword, simplify scamDetected
ae72d0d - Add self-correction: LLM-based strategy adjustment
Confidence scoring blends two signals: the LLM's own classification confidence and keyword-based detection (if the message contains "KYC expired" and "account blocked," that is strong evidence of a KYC scam regardless of what the LLM thinks). The blend prevents the LLM from under-classifying messages that contain obvious scam markers.
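The blend can be sketched like this; the keyword lists, weights, and the keyword floor are illustrative assumptions, not the project's exact values:

```python
# Illustrative keyword markers per scam type.
KEYWORD_MARKERS = {
    "kyc_scam": ["kyc expired", "account blocked", "update kyc"],
    "lottery_scam": ["lottery", "prize money", "lucky draw"],
}

def keyword_score(message: str, scam_type: str) -> float:
    text = message.lower()
    hits = sum(1 for kw in KEYWORD_MARKERS.get(scam_type, []) if kw in text)
    return min(1.0, hits * 0.4)   # each hit is strong evidence

def blended_confidence(llm_conf, message, scam_type, weight=0.6):
    """Blend LLM confidence with keyword detection, with a keyword
    floor so obvious markers are never under-scored by a hesitant LLM."""
    kw = keyword_score(message, scam_type)
    blended = weight * llm_conf + (1 - weight) * kw
    return max(blended, kw)

msg = "Your KYC expired and account blocked. Share OTP now."
print(blended_confidence(0.3, msg, "kyc_scam"))  # 0.8
```

Here a timid LLM score of 0.3 is overridden by two unmistakable keyword hits, which is exactly the failure mode the blend guards against.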
Self-correction was a late addition that night. The system reviews its own conversation history and adjusts strategy: "Am I being too aggressive? The scammer might hang up. Pull back and ask more naive questions." This runs as a separate Gemini call after the main response, analyzing the conversation trajectory and adjusting the approach for the next turn.
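A sketch of how that second call's prompt might be assembled; the wording and the KEEP/MORE_NAIVE/MORE_PROBING vocabulary are illustrative reconstructions, not the project's actual prompt:

```python
def build_strategy_review_prompt(history, current_strategy):
    """Compose the follow-up prompt for the second (self-correction)
    LLM call. `history` is a list of (speaker, text) turns."""
    transcript = "\n".join(f"{who}: {text}" for who, text in history)
    return (
        "You are reviewing a honeypot conversation with a scammer.\n"
        f"Current strategy: {current_strategy}\n"
        f"Transcript so far:\n{transcript}\n\n"
        "Is the persona at risk of making the scammer hang up "
        "(too aggressive, too knowledgeable, too slow)? "
        "Reply with one of: KEEP, MORE_NAIVE, MORE_PROBING."
    )

history = [
    ("scammer", "Your KYC is expired, share OTP"),
    ("persona", "Beta, which bank you said? SBI never asks OTP..."),
]
print(build_strategy_review_prompt(history, "naive-engaged"))
```

Constraining the reply to a small closed vocabulary keeps the second call cheap to parse and hard to derail.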
Cross-Session Learning and Callbacks¶
The last commits before midnight:
93fd5ba - Add cross-session learning for known scammer detection
898e228 - Inform agent about known scammers for aggressive probing
0d91855 - Add callback retry logic and improve trigger conditions
Cross-session learning checks Firestore for previous conversations with the same sender. If a number has been seen before, the persona switches to more aggressive evidence extraction --- the scammer has already revealed some information in a previous session, so we can push harder.
The callback pipeline sends accumulated intelligence to the GUVI evaluator's API. Getting the trigger conditions right was tricky: too early and we report incomplete intelligence, too late and the evaluator has already scored the session. We settled on triggering after a configurable message threshold with retry logic for transient failures.
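Both halves of that compromise, the threshold gate and the retry loop, can be sketched like this; the function names, the evidence requirement, and the backoff parameters are assumptions for illustration:

```python
import time

def should_trigger_callback(message_count, evidence, threshold=10):
    """Fire only once enough turns have elapsed AND some evidence
    exists; the threshold is configurable."""
    return message_count >= threshold and bool(evidence)

def send_with_retry(send, payload, attempts=3, base_delay=0.0):
    """Retry transient failures with exponential backoff. `send`
    stands in for the HTTP POST to the evaluator; base_delay is
    zero here so the example runs instantly."""
    for attempt in range(attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky_send(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

assert should_trigger_callback(12, {"upi_id": ["x@oksbi"]})
print(send_with_retry(flaky_send, {"evidence": {}}))  # ok
```

Separating the "when" (trigger condition) from the "how" (retry policy) made it possible to tune each independently as the evaluator's behavior became clearer.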
4dff1d2 - Delay callback until engagement complete (10+ messages)
0d3ba9a - Implement timeout-based callback using Cloud Tasks
0f41ce5 - Fix: Remove non-existent GUVI_API_KEY secret from send_delayed_callback
That last commit at 01:26 AM --- removing a reference to a secret that did not exist --- is the kind of 2 AM fix that tells you it is time to stop.
Part 2: Day Two (Feb 11) --- The Feature Expansion¶
Six days later, we came back for the second major push. The core system worked. Now it needed everything else.
The Opening Session¶
The day started with the developer catching up Claude Code on context:
"i have got this email. check if functionality is working or not?"
The GUVI organizers had sent test cases. The system needed to handle them. But beyond that, the developer had a much larger agenda for the day --- one that would spawn 17 sessions.
CI/CD and Documentation¶
The first four commits were infrastructure:
ef2ead0 - Add GitHub Actions CI/CD and remove hardcoded secrets from deploy.sh
7433e8c - Fix deploy: create Python venv before firebase deploy
21867e5 - Add web/index.html for Firebase Hosting deploy
e0d1dff - Add CI/CD documentation and CLAUDE.md project context
Automated testing and deployment. Remove hardcoded secrets (a lesson from the security audit that would come two days later). Create the Python venv that Firebase CLI demands for Python function deployment.
The WIF Migration¶
Then came a significant security improvement:
9615cd4 - Migrate CI/CD auth from FIREBASE_TOKEN to Workload Identity Federation
c966a46 - Add serviceusage.serviceUsageConsumer role to WIF setup
Key Decision: Workload Identity Federation over Static Tokens
The original CI/CD used FIREBASE_TOKEN --- a static secret stored in GitHub. Static secrets are a liability: they never expire, anyone with repo access can extract them, and rotating them is a manual process.
WIF eliminates stored credentials entirely. GitHub Actions proves its identity via OIDC, and GCP exchanges that token for temporary credentials that expire in one hour. The migration took one session, including writing the setup script and debugging the IAM roles.
The developer's message that triggered this:
"migrate to service account key for CI/CD"
We actually went further than requested --- from static key to keyless authentication.
The Dashboard, Playground, and Production Hardening¶
The middle of the day was the most productive stretch:
09bdfc2 - Add real-time Streamlit dashboard and production utilities
6d3954d - Harden production: atomic Firestore ops, circuit breakers, rate limiting, sanitization
d1c1978 - Add dashboard Cloud Run deployment and fix Firestore connectivity
97b29c2 - Add interactive playground and presentation content for buildathon
Four commits, four major features. The Streamlit dashboard with live session monitoring. Production hardening with rate limiting and input sanitization. Cloud Run deployment. An interactive web playground for demonstrations.
The Plug-and-Play Strategy¶
The afternoon session produced one of the most strategically interesting decisions of the project:
"are these pushed to github, and have we explicitly stated in the readme file so that when the new test cases drop, we can plug n play and merge with the main branch?"
Key Decision: Pre-Built Optimization Branches
The GUVI hackathon would release new test cases at unpredictable times. Instead of scrambling to implement improvements under pressure, we pre-built seven optimization branches:
| Branch | Focus |
|---|---|
| `optimize/multilingual` | Hindi/Hinglish/Tamil responses |
| `optimize/edge-cases` | Empty messages, repeated messages, gibberish |
| `optimize/enhanced-extraction` | More regex patterns, fuzzy matching |
| `optimize/self-correction-v2` | Improved strategy adjustment |
| `optimize/latency` | Response time optimization |
| `optimize/scoring` | GUVI rubric alignment |
| `outside-guvi` | Dashboard, playground, documentation |
When new test cases dropped, we could analyze which areas needed improvement and merge the relevant branch immediately --- no implementation delay. This strategy paid off directly on Feb 21 when the final test cases arrived.
The Evening Push¶
The day ended with OAuth integration and dashboard polish:
4771b3b - Add Google OAuth + interactive chat testing page to dashboard
2a8bc9c - Hide sidebar on login/denied screens
Google OAuth was later replaced with PIN auth (simpler, no OAuth provider configuration needed for a small team), but the testing page survived and became one of the most-used dashboard features.
End of Day 1 vs. End of Day 2¶
End of Feb 5: What We Had
- Working Firebase Cloud Function handling GUVI webhook
- Three culturally-authentic personas (Sharma Uncle, Lakshmi Aunty, Vikram)
- Scam classification with blended LLM + keyword confidence scoring
- Evidence extraction for 7 intelligence types
- Firestore session persistence with in-memory fallback
- Cross-session scammer detection
- Self-correcting strategy adjustment
- Cloud Tasks callback pipeline with retry logic
- 33 commits, all on `main`
- No tests, no CI/CD, no dashboard, no documentation
End of Feb 11: What We Had
- Everything from Day 1, plus:
- GitHub Actions CI/CD with Workload Identity Federation (keyless)
- Production hardening: rate limiting, input sanitization, circuit breakers
- 9-page Streamlit dashboard on Cloud Run
- Interactive web playground for demonstrations
- Google OAuth (later replaced with PIN auth)
- Documentation audit and CLAUDE.md project context
- 7 pre-built optimization branches ready to merge
- Product roadmap for post-hackathon expansion
- 52 total commits
- Still no formal test suite (that came Feb 12)
The gap between those two snapshots tells the story of what happens between "it works on my machine" and "it is ready for other people to use." The core intelligence --- the thing that makes scammers believe they are talking to a real person --- was built in a single day. Everything else --- deployment, monitoring, security, documentation --- took another full day of equally intense work.
Lesson: The 50/50 Split
Roughly half of our total development effort went into the core AI system (personas, classification, extraction, conversation engine). The other half went into everything around it (deployment, CI/CD, dashboard, security, testing). If you are estimating a project and only budgeting for the "interesting" parts, double your estimate.
Next: Chapter 2: Hardening Under Pressure --- Dashboard, Security, and Deployment