Chapter 6: Reflections --- What We Learned About Human-AI Collaboration¶
Looking back on 80 sessions, 87 commits, and three weeks of building with an AI coding partner.
The Division of Labor¶
Somewhere around session 40, a pattern became clear. We were not using Claude Code as a code generator. We were using it as a pair programmer with infinite patience and a very good memory for syntax.
The human brought:
- Product vision. What should a honeypot for Indian scammers actually do? What makes Sharma Uncle believable to a scammer running a KYC scheme from a call center in Jamtara? These are questions that require cultural knowledge, empathy for both the scammer's psychology and the victim's vulnerability, and a judgment about what "good enough" looks like under a hackathon deadline.
- Domain knowledge. The twelve scam categories (KYC_BANKING, DIGITAL_ARREST, SEXTORTION, JOB_SCAM, LOTTERY_PRIZE, TECH_SUPPORT, and six more) came from watching Indian scam call recordings, reading FIR reports, and understanding the specific language patterns that distinguish a KYC scam from a digital arrest intimidation. An LLM can generate scam classification categories, but choosing the right twelve --- the ones that cover 95% of real-world Indian scams without overlapping ambiguously --- required domain expertise that no amount of pretraining provides.
- Strategic decisions. When to stop building features and submit. Which scoring categories to prioritize during the rework. Whether to replace Google OAuth with PIN auth. Whether to pre-build optimization branches. These were bets about the future, informed by experience and intuition, and they could not be delegated.
The AI brought:
- Rapid code generation. A complete Pydantic model hierarchy for GUVI request/response formats, generated in one pass and correct on the first try. A Firestore session management module with batch writes, evidence accumulation, and cross-session lookup. A rate limiter. A circuit breaker. Each of these would take an experienced developer hours to write from scratch; Claude Code produced them in minutes, with tests.
- Architecture patterns. The pipeline context enrichment pattern --- where metadata flows through the processing pipeline and any stage can add prompt sections --- was suggested by Claude Code when we described the problem of "different stages need to inject different instructions into the Gemini prompt." The pattern was clean, extensible, and immediately right.
- Regex complexity. Eleven evidence extractors, each with regex patterns tailored to Indian financial identifiers. IFSC codes (`SBIN0001234` --- four letters, a literal zero, six alphanumerics). PAN numbers (`ABCDE1234F`, with valid fourth-character entity codes). Aadhaar numbers with Verhoeff checksum validation. UPI IDs distinguishing `user@oksbi` from `user@gmail.com`. Writing these patterns from scratch, with all their edge cases and negative lookaheads, would have been the most tedious part of the project. Claude Code generated them quickly and, more importantly, generated comprehensive test cases alongside them.
- Refactoring at scale. When we decided to move from delayed Cloud Tasks callbacks to per-turn synchronous callbacks, the change touched seven files. Claude Code made the changes consistently, updated the tests, and removed dead code --- all in a single session.
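A sketch of what one such extractor looks like (hypothetical code in the style described, not the project's actual module):

```python
import re

# IFSC: four uppercase letters, a literal zero, then six alphanumerics,
# e.g. SBIN0001234. Word boundaries keep the pattern from firing inside
# longer identifiers.
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")

def extract_ifsc(text: str) -> list[str]:
    """Return every IFSC-shaped token found in the text."""
    return [m.group(0) for m in IFSC_RE.finditer(text)]
```

Note that a PAN such as `ABCDE1234F` does not match, because its fifth character is a letter rather than the literal zero the IFSC format requires --- the kind of disambiguation detail that makes these patterns tedious to write by hand.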
Where It Worked Brilliantly¶
Persona Engineering¶
This was the crown jewel of the collaboration. We described what Sharma Uncle should sound like: "a retired SBI banker, 67, Dwarka, Delhi, struggles with technology, calls everyone beta, references his wife Kamla and son Rohit." Claude Code turned that description into a 200-line system prompt with:
- Speech patterns: Hinglish mixing ("Arey beta, yeh kya bol rahe ho?"), deliberate spelling mistakes, slow typing indicators
- Family references as extractive delays: "Ek minute, Rohit ko phone karta hoon, woh sab jaanta hai" (one minute, let me call Rohit, he knows about these things) --- a delay tactic that forces the scammer to wait while also sounding completely natural
- Strategic behaviors: "Ask for their employee ID before giving any information," "Express confusion about UPI but eventually ask for their payment details"
- Emotional ranges: From trusting ("Theek hai beta, aap bolo") to suspicious ("Lekin SBI toh kabhi phone pe OTP nahi maangta") to panicked ("Arey baap re, mera account band ho jayega?")
We could not have written this prompt in reasonable time. But Claude Code could not have known what to write without the cultural direction. The "Kamla ko bolo, Rohit ko phone karo" delays are only effective because they mirror how actual Indian family conversations work. The suspicion about OTP requests specifically references how SBI actually communicates with customers. These details came from the human; the structured, comprehensive prompt came from the AI.
Test Generation¶
276 tests. We did not write most of them manually. The workflow was:
- Implement a feature (e.g., Aadhaar extraction with Verhoeff checksum)
- Tell Claude Code: "Write comprehensive tests, including edge cases"
- Review the generated tests, add any missing cases
- Run them
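The Verhoeff validation mentioned in step 1 uses the standard published dihedral-group tables; a compact sketch (illustrative, not the project's actual module):

```python
# Verhoeff checksum tables: D is the dihedral-group multiplication table,
# P the position-dependent permutation, INV the inverse lookup.
D = [[0,1,2,3,4,5,6,7,8,9],[1,2,3,4,0,6,7,8,9,5],[2,3,4,0,1,7,8,9,5,6],
     [3,4,0,1,2,8,9,5,6,7],[4,0,1,2,3,9,5,6,7,8],[5,9,8,7,6,0,4,3,2,1],
     [6,5,9,8,7,1,0,4,3,2],[7,6,5,9,8,2,1,0,4,3],[8,7,6,5,9,3,2,1,0,4],
     [9,8,7,6,5,4,3,2,1,0]]
P = [[0,1,2,3,4,5,6,7,8,9],[1,5,7,6,2,8,3,0,9,4],[5,8,0,3,7,9,6,1,4,2],
     [8,9,1,6,0,4,3,5,2,7],[9,4,5,3,1,2,6,8,7,0],[4,2,8,6,5,7,3,9,0,1],
     [2,7,9,3,8,0,6,4,1,5],[7,0,4,6,9,1,3,2,5,8]]
INV = [0,4,3,2,1,5,6,7,8,9]

def verhoeff_check_digit(base: str) -> str:
    """Check digit to append so the full number validates."""
    c = 0
    for i, ch in enumerate(reversed(base)):
        c = D[c][P[(i + 1) % 8][int(ch)]]
    return str(INV[c])

def verhoeff_valid(number: str) -> bool:
    """True if the trailing digit is a correct Verhoeff checksum."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0
```

The scheme detects all single-digit errors and most adjacent transpositions, which is why Aadhaar uses it.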
Claude Code was remarkably good at generating adversarial test cases --- the inputs designed to break your code. For phone number extraction: "Does it handle +91 98765 43210 with spaces? What about 0091-9876543210? What about a 10-digit number embedded inside a 14-digit bank account number?" These are the tests that experienced developers write and junior developers miss. Having them generated automatically was like having a thorough QA engineer on call.
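Those adversarial phone cases translate directly into assertions. A hypothetical extractor that handles all three (an illustrative pattern, not the project's code):

```python
import re

PHONE_RE = re.compile(
    r"(?<!\d)"                       # reject a digit immediately before
    r"(?:\+?91[\s-]?|0091[\s-]?)?"   # optional +91 / 0091 country prefix
    r"([6-9]\d{4})[\s-]?(\d{5})"     # Indian mobiles start with 6-9
    r"(?!\d)"                        # reject a digit immediately after
)

def extract_phones(text: str) -> list[str]:
    """Return normalized 10-digit Indian mobile numbers found in text."""
    return ["".join(m.groups()) for m in PHONE_RE.finditer(text)]

# The three adversarial cases from the text:
assert extract_phones("call +91 98765 43210") == ["9876543210"]
assert extract_phones("or 0091-9876543210") == ["9876543210"]
assert extract_phones("a/c 12349876543210") == []  # embedded in 14 digits
```

The lookarounds are what the junior-developer version omits: without them, the ten digits inside a fourteen-digit account number match happily.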
Architecture and Refactoring¶
The session management system went through three iterations: in-memory (fragile, lost on cold starts), Firestore with individual writes (correct but slow), and Firestore with batch writes and local evidence merging (correct and fast). Each refactoring session produced clean diffs, updated tests, and no regressions. The AI's ability to hold the entire codebase context and make consistent changes across files was invaluable.
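Stripped of Firestore specifics, the third iteration's idea --- merge evidence locally, then persist in one batch --- looks roughly like this (a hypothetical shape; the real module sits on Firestore's batch API):

```python
class SessionStore:
    """Sketch of batched writes with local evidence merging."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict] = {}  # local, authoritative copy
        self._pending: list[tuple] = []       # writes queued for one batch

    def add_evidence(self, session_id: str, kind: str, value: str) -> None:
        session = self._sessions.setdefault(session_id, {"evidence": {}})
        session["evidence"].setdefault(kind, set()).add(value)  # merge locally
        self._pending.append((session_id, kind, value))         # defer the I/O

    def flush(self) -> int:
        """Persist all queued writes in one batch; returns batch size."""
        batch, self._pending = self._pending, []
        # ...commit `batch` in a single round-trip instead of len(batch) writes...
        return len(batch)
```

Reads stay fast because they hit the local copy; writes stay correct because nothing is considered flushed until the single batch commit succeeds.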
Where It Broke Down¶
Silent Failures¶
The most dangerous AI-generated bugs were the ones that did not crash. The Firestore initialization pattern is the canonical example: Claude Code generated code that imported the Firebase Admin SDK and called firebase_admin.initialize_app() at module level. This works in Cloud Functions (where the SDK is available) and fails silently in test environments (where it is not configured). The failure mode was not an exception --- it was Firestore operations returning None or empty results, which downstream code treated as "no data found."
We discovered this when tests passed locally but the deployed function behaved differently. The fix --- extracting utilities into separate modules that do not trigger Firebase initialization on import --- came from the security audit, not from the AI.
Watch for Silent Success
The most dangerous line of code is not the one that throws an error. It is the one that returns a plausible default value when it should have failed. AI code generators optimize for "runs without errors," which sometimes means generating code that handles failures by returning empty results instead of raising exceptions.
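The difference can be made concrete with a toy lookup (hypothetical names; the real failure involved Firestore):

```python
def load_session_silent(store: dict, session_id: str) -> dict:
    # "Silent success": a missing session quietly becomes "no data found",
    # and downstream code proceeds on an empty result.
    return store.get(session_id, {})

def load_session_loud(store: dict, session_id: str) -> dict:
    # Fail-loud alternative: a missing session is an error, surfaced
    # immediately where it happened rather than three layers downstream.
    if session_id not in store:
        raise KeyError(f"unknown session: {session_id}")
    return store[session_id]
```

The silent version passes every happy-path test; the loud version is the one that would have exposed the Firestore misconfiguration on the first call.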
Context Window Limitations¶
By session 50, our codebase was large enough that Claude Code could not hold all of it in context simultaneously. We started seeing sessions that began with:
"This session is being continued from a previous conversation. Here's what we've built so far..."
These continuation sessions worked, but they lost nuance. The AI remembered the architecture but forgot the edge cases we had discussed. It remembered the function signatures but forgot why we chose certain parameter orders. Each continuation was a small reset that required re-establishing context.
The practical workaround was aggressive documentation. We maintained a CLAUDE.md file with project conventions, a MEMORY.md with infrastructure details, and inline comments explaining non-obvious decisions. This documentation was written as much for the AI's future sessions as for human readers.
The Security Audit Findings¶
On February 13, the security audit found four critical vulnerabilities --- all in code that Claude Code had generated or contributed to:
- API key comparison using `==`: The `validate_api_key()` function used Python's `==` operator to compare the provided key with the expected key. This is vulnerable to timing attacks --- the comparison short-circuits on the first mismatched byte, and an attacker can measure response time to determine how many leading bytes of the key they have correct. The fix was `hmac.compare_digest()`, which takes constant time regardless of where the mismatch occurs.
- Missing OIDC verification on the callback endpoint: The `send_delayed_callback` Cloud Function accepted any HTTP request without verifying that it came from Cloud Tasks. Anyone who discovered the endpoint could trigger false intelligence reports. This is the kind of security gap that AI code generators reliably produce --- the happy path works, but the authentication boundary is missing.
- Secret not wired to the callback function: `SCAMSHIELD_API_KEY` was configured for the main webhook function but not for the callback function. In production, callbacks failed silently because the API key was empty.
- Callback payload field names mismatched with GUVI spec: `session_id` instead of `sessionId`, `scam_detected` instead of `scamDetected`. The callback was sending data that GUVI's API silently ignored because the field names were wrong.
None of these were exotic vulnerabilities. They were standard security practices (constant-time comparison, endpoint authentication, secret distribution, API contract validation) that the AI did not apply because it was optimizing for "code that works" rather than "code that is secure."
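The first of those fixes is a one-liner from the standard library (a sketch; the function name mirrors the one described above):

```python
import hmac

def validate_api_key(provided: str, expected: str) -> bool:
    # == short-circuits on the first mismatched byte, leaking timing
    # information; hmac.compare_digest takes the same time either way.
    return hmac.compare_digest(provided.encode(), expected.encode())
```

The point is not that the fix is clever --- it is that nothing about `==` looks wrong until someone asks the adversarial question.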
Lesson: AI Does Not Think Adversarially by Default
Claude Code generates code that handles the expected inputs correctly. It does not spontaneously consider what happens when an attacker sends a crafted request, or when a secret is missing in one deployment target but present in another, or when a third-party API silently drops fields with the wrong naming convention. Security review of AI-generated code is not optional --- it is more important than security review of human-written code, because humans at least have the anxiety of "what if someone tries to break this?"
Model Naming Chaos¶
A small but representative failure: during development, Google renamed their Gemini models. gemini-3.0-flash became gemini-3-flash-preview became gemini-3-flash. We spent three commits and 45 minutes debugging 404 errors from the Gemini API because the model name we specified did not match any available model. Claude Code could not help because its training data did not include the latest model naming conventions. We had to discover the correct name through trial and error.
This is a fundamental limitation of AI coding assistants: they know the world as of their training cutoff. When APIs change --- and LLM APIs change constantly --- the AI's knowledge is stale. For rapidly evolving dependencies, the human has to drive.
The Multiplier Effect¶
Here are the numbers:
- 3 weeks of calendar time (Feb 1--21, 2026)
- 80 Claude Code sessions, averaging roughly 30--45 minutes each
- 87 commits with meaningful changes
- ~4,500 lines of Python across functions, extractors, dashboard, and tests
- 276 tests covering classification, extraction, session management, callbacks, and API contracts
- 9 dashboard pages with live session monitoring, evidence browsing, analytics
- Full CI/CD with GitHub Actions, Workload Identity Federation, automated deployment
- Security audit with 4 critical findings fixed and verified
- 3 personas with detailed cultural authenticity
- 11 evidence types with specialized regex patterns
- 12 scam categories with classification heuristics
One person built this. One person plus an AI coding partner.
The honest assessment: without Claude Code, this would have taken a team of 3--4 developers the same three weeks, or one developer 8--10 weeks. The AI did not make us ten times faster at everything --- it made us three to four times faster at the things that are slow for humans (writing boilerplate, generating tests, refactoring across files, writing regex patterns) and zero times faster at the things that are slow for everyone (understanding the problem, making strategic decisions, debugging production issues, and thinking about security).
The multiplier is not uniform. It is concentrated in code generation, test writing, and mechanical refactoring. It is absent in product design, security thinking, and debugging novel problems. Knowing where the multiplier applies --- and where it does not --- is the skill that makes AI-assisted development productive rather than just fast.
Advice for Others¶
If you are building a project with an AI coding partner, here is what we learned:
1. Invest in context documents early. A CLAUDE.md file that describes your project conventions, architecture decisions, and naming patterns pays for itself within three sessions. The AI performs dramatically better when it knows "we use Pydantic models, Firestore for persistence, and camelCase for GUVI-facing fields."
2. Review AI-generated code like you would review a junior developer's PR. It is usually correct. It is sometimes subtly wrong. It almost never considers security implications, race conditions, or failure modes that do not produce exceptions. Read every line. Question every default.
3. Use the AI for the boring parts. Writing 276 tests by hand is soul-crushing. Writing 11 regex patterns for Indian financial identifiers is tedious. Generating Pydantic models from an API spec is mechanical. These are the tasks where AI assistance provides the highest return on investment. Do not waste AI sessions on creative or strategic work --- that is where humans are faster and better.
4. Maintain a clear boundary between "AI generates" and "human decides." The AI should never make strategic decisions (which features to build, when to submit, what scoring categories to prioritize). The human should never hand-write boilerplate that the AI can generate correctly. When these boundaries blur --- when the human starts accepting AI's architectural suggestions without scrutiny, or when the AI is asked to make product decisions --- quality drops.
5. Plan for context loss. Long projects will exceed the AI's context window. Design your workflow so that each session can start from documentation rather than requiring continuity with the previous session. Write decision documents. Comment non-obvious code. Maintain a memory file.
6. Security audit is mandatory, not optional. Every project built with AI assistance needs a dedicated security review pass. Not because the AI writes insecure code deliberately, but because it optimizes for correctness and does not spontaneously worry about adversarial inputs.
Looking Forward¶
We built ScamShield AI in three weeks with 80 AI-assisted sessions. The system works. It engages scammers with culturally-authentic personas, extracts financial intelligence from their messages, classifies scam types with reasonable accuracy, and reports structured intelligence through callbacks. It has a dashboard, CI/CD, security hardening, and 276 tests.
But the more interesting artifact is not the system. It is the 119 MB of session transcripts that document how a human and an AI collaborated to build it. Those transcripts show the real workflow --- not the polished narrative of a blog post, but the messy reality of "why is this returning a 404" at 2 AM and "the regex is matching email addresses" at 8 PM and "we have been given a final test" at 4 AM.
What those transcripts reveal is that human-AI collaboration is not a revolution. It is an amplifier. It makes a fast developer faster and a careful developer more productive. It does not make a careless developer careful or a confused developer clear. The human still needs to know what to build, why to build it, and how to verify that it works. The AI handles the mechanical translation of intent into code.
Three weeks ago, we started with a question:
"do you have knowledge of what this project is about?"
The answer was no. And then we built it together.