
Chapter 2: Hardening Under Pressure --- Dashboard, Security, and Deployment

"It Works" Is Not the Same as "It Is Ready"

By the end of Feb 11, we had a system we were proud of. Three personas. Scam classification. Evidence extraction. CI/CD. A nine-page dashboard. Optimization branches ready to merge. Fifty-two commits. Everything worked.

Then we spent three days discovering how much of it was held together with assumptions. Feb 12 through 14 was the transition from "it works in a demo" to "it will not embarrass us in production." The difference turned out to be four security vulnerabilities, a Cloudflare proxy gotcha, a Streamlit rendering bug, and 1,045 lines of tests that should have existed a week earlier.

This is the chapter about what happens after the sprint --- when you stop building features and start asking "what would break if someone tried to break it?"


Part 1: The Dashboard Marathon (Feb 12)

1,045 Lines of Tests

Feb 12 started at 2:24 AM with pipeline performance optimization. It did not stop until the afternoon. The git log shows four commits, but the third one tells the real story:

a64e649 02:24 AM - Optimize dashboard and functions pipeline performance
6058120 12:36 PM - Dashboard UX improvements: home metrics, evidence filters, session replay
9e9f08c 01:38 PM - Fix bugs, harden security, and add test coverage for core components
6ff8ac9 02:02 PM - Move coming-soon sidebar CSS into require_auth() for global application

That third commit --- 9e9f08c --- added 1,045 lines across 17 files. Almost all of it was test code:

  • tests/firestore/test_sessions.py --- session CRUD, evidence merging, batch updates (276 new lines)
  • tests/guvi/test_callback.py --- callback service, payload serialization, retry logic (278 new lines)
  • tests/guvi/test_handler.py --- full request/response cycle, error handling, edge cases (223 new lines)
  • tests/utils/test_sanitizer.py --- input sanitization, XSS prevention, encoding (146 new lines)

Until this point, we had zero formal tests. The system had been tested by hand --- send a scam message via curl, read the response, check Firestore. That works for proving a concept. It does not work for a system you are about to submit for competitive evaluation, where a regression in the evidence extractor could silently cost you points.

Lesson: The Right Time for Tests

We did not write tests during the blitz because the architecture was changing too fast. The persona prompt format changed three times on Feb 5. The evidence schema changed twice on Feb 11. Writing tests against an unstable interface creates tests that fail for the wrong reasons and slow you down when speed matters.

But the moment the architecture stabilized --- when we stopped changing data models and started polishing UX --- every hour without tests was a growing liability. Feb 12 was the right day for the test push. Not earlier (we would have rewritten them). Not later (we would have shipped bugs).

The Dashboard Takes Shape

The dashboard had been scaffolded on Feb 11 as nine page stubs. Feb 12 turned the stubs into usable interfaces:

6058120 - Dashboard UX improvements: home metrics, evidence filters, session replay

This single commit reshaped the Testing page into a full interactive chat interface (331 new lines in dashboard/pages/00_testing.py), added filtering to the Evidence page, and introduced a session replay view --- you could step through a scam conversation turn by turn, watching evidence accumulate in real time.

The home page got live metrics from Firestore: active sessions, total evidence items, callback success rates. These metrics served double duty --- they were useful for monitoring, but they were also our first real integration test. If the session count showed zero when we knew conversations had happened, something in the pipeline was broken.


Part 2: The 9 PM Audit (Feb 13)

"View audit_codex and fix these things"

At 8:51 PM on February 13, the developer opened a Claude Code session with a message that would define the evening:

"view audit_codex file and see how we can verify these things and fix these things"

We had written an audit_codex.md earlier in the project --- a running list of things that might be wrong, things to verify, potential attack vectors. It sat in the repo as a to-do list. Now it was time to actually work through it.

What followed was the most consequential evening of the entire project. Between 8:51 PM and 9:15 PM, we identified four critical or high-severity vulnerabilities. By 9:15 PM, all four were fixed, tested, and committed. The system had been deployed with these vulnerabilities for eight days.

Four Critical Findings

flowchart TD
    A["Audit Started<br/>8:51 PM"] --> B["P0: API Key Bypass<br/>Missing key = allow all"]
    A --> C["P0: No OIDC Verification<br/>Callback endpoint public"]
    A --> D["P1: Secret Wiring Gap<br/>SCAMSHIELD_API_KEY unreachable"]
    A --> E["P1: Payload Mismatch<br/>snake_case vs camelCase"]
    B --> F["ba3bc09<br/>9:15 PM"]
    C --> F
    D --> F
    E --> F
    F --> G["61ac7b0<br/>9:55 PM<br/>Docs updated"]

Finding 1 (P0): API Key Bypass in Production

The validate_api_key function had a dev-mode convenience that nobody remembered to scope:

# BEFORE the fix — in production for 8 days
def validate_api_key(request):
    expected_key = os.environ.get("SCAMSHIELD_API_KEY")
    if not expected_key:
        # If no key configured, allow all requests (dev mode)
        logger.warning("SCAMSHIELD_API_KEY not set - allowing all requests")
        return True

    provided_key = request.headers.get("x-api-key", "")
    return provided_key == expected_key

Two problems in under a dozen lines. First: if the SCAMSHIELD_API_KEY secret failed to load in production --- a transient GCP error, a misconfigured deployment, a typo in the secrets decorator --- the function would allow all unauthenticated requests. In production. To a system that processes scam conversations.

Second: the == operator for string comparison short-circuits on first mismatch, which means response timing could theoretically leak information about the key's value.
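The standard mitigation is hmac.compare_digest, which takes time independent of where the inputs first differ. A minimal sketch of that comparison (illustrative, not the code that shipped):

```python
import hmac


def keys_match(provided: str, expected: str) -> bool:
    # compare_digest examines every byte regardless of where a mismatch
    # occurs, so response timing no longer leaks a correct key prefix.
    return hmac.compare_digest(provided.encode(), expected.encode())
```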

The fix added production awareness:

# AFTER the fix
def validate_api_key(request):
    expected_key = os.environ.get("SCAMSHIELD_API_KEY")
    if not expected_key:
        # In production (K_SERVICE is set by Cloud Functions), deny ALL
        if os.environ.get("K_SERVICE"):
            logger.error("SCAMSHIELD_API_KEY not set in production - denying request")
            return False
        # Local dev only: allow requests without a key
        logger.warning("SCAMSHIELD_API_KEY not set - allowing all requests (dev mode)")
        return True

    provided_key = request.headers.get("x-api-key", "")
    return provided_key == expected_key

K_SERVICE is an environment variable that Google Cloud Functions sets automatically on every deployed instance. If it is present, you are in production. If the API key is missing in production, deny everything and scream into the logs.

How Security Vulnerabilities Actually Happen

This code was not written by someone who did not care about security. It was written at 1 PM on Day 1, during a sprint to get the first deployment working. "Allow all requests if no key" was a conscious dev-mode convenience. The developer intended to add the production check later. Eight days and fifty commits later, "later" had not arrived.

This is how real vulnerabilities happen --- not through ignorance, but through temporary decisions that outlive their context. The code did exactly what was intended when it was written. It just should not have still existed in production a week later.

Finding 2 (P0): Missing OIDC Verification on Callbacks

The send_delayed_callback Cloud Function --- the one that sends intelligence reports to GUVI --- was publicly invokable. No authentication check. Anyone with the URL could trigger a callback.

This was not hypothetical. The function URL was https://asia-south1-{project-id}.cloudfunctions.net/send_delayed_callback. If an attacker discovered this URL, they could send crafted payloads to GUVI's evaluation API, polluting our results with fabricated intelligence data.

The fix had two parts. First, Cloud Tasks now sends an OIDC token when scheduling callbacks:

# In callback_scheduler.py
CLOUD_TASKS_SA_EMAIL = f"{PROJECT_ID}@appspot.gserviceaccount.com"

task = {
    "http_request": {
        "http_method": "POST",
        "url": callback_url,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"sessionId": session_id}).encode(),
        "oidc_token": {
            "service_account_email": CLOUD_TASKS_SA_EMAIL,
            "audience": callback_url,
        },
    },
}

Second, a new functions/utils/oidc.py module verifies the token on the receiving end:

import os

from google.auth.transport import requests as google_requests
from google.oauth2 import id_token


def verify_cloud_tasks_token(request) -> tuple[bool, str]:
    """Verify the OIDC token sent by Cloud Tasks."""
    if not os.environ.get("K_SERVICE"):
        return True, ""  # Skip in local dev

    auth_header = request.headers.get("Authorization", "")
    if not auth_header.startswith("Bearer "):
        return False, "Unauthorized"

    token = auth_header[len("Bearer "):]
    try:
        claims = id_token.verify_oauth2_token(token, google_requests.Request())
    except Exception:
        return False, "Unauthorized"

    # Verify the token came from our project's service account
    # (project ID resolved from the function's environment)
    project_id = os.environ.get("GCP_PROJECT", "")
    expected_sa = f"{project_id}@appspot.gserviceaccount.com"
    if claims.get("email") != expected_sa:
        return False, "Unauthorized"

    return True, ""

Lesson: Extract Utilities Away from main.py

We deliberately put the OIDC verification in utils/oidc.py rather than directly in main.py. This was not just organizational preference --- it was learned from a failed attempt.

The initial approach put the verification inline in main.py. Every unit test that imported main.py immediately triggered Firebase initialization (firebase_admin.initialize_app()), which requires valid GCP credentials. Tests running in CI or on a machine without a service account key failed at import time, before a single test function ran.

Extracting the verifier to a standalone module broke the import chain. Tests can import and exercise the OIDC logic without initializing Firebase, Firestore, or Gemini. This pattern --- keep main.py as a thin routing layer, put all logic in importable modules --- became a project convention after this experience.

Finding 3 (P1): Secret Never Reaching the Function

The send_delayed_callback function needed SCAMSHIELD_API_KEY to authenticate its callback to GUVI. The secret existed in GCP Secret Manager. The validation code referenced it. But the function decorator was missing one line:

# BEFORE: secret never loaded at runtime
@https_fn.on_request(
    timeout_sec=30,
    memory=256,
    region="asia-south1",
)
def send_delayed_callback(request):
    ...
# AFTER: one line added
@https_fn.on_request(
    timeout_sec=30,
    memory=256,
    region="asia-south1",
    secrets=["SCAMSHIELD_API_KEY"],  # <-- this was missing
)
def send_delayed_callback(request):
    ...

One missing line in a decorator. The callback function had been running for days, making HTTPS requests to GUVI's API with no authentication header. Combined with Finding 1, this meant: if the API key was not loaded (because it was not wired), the validation function allowed all requests anyway. Two vulnerabilities compounding into a system that was both unauthenticated inbound and unauthenticated outbound.

Finding 4 (P1): Callback Payload Mismatch

The dashboard's testing page was constructing payloads with Python-convention field names (session_id, sender_number) instead of the camelCase names that the GUVI API expected (sessionId, senderNumber). The callback appeared to succeed --- HTTP 200, no errors --- but GUVI was silently ignoring the unrecognized fields.

We only caught this by reading the API specification line by line during the audit and comparing it against the actual payload we were sending. The fix was straightforward field name alignment, but the fact that it had been silently failing for days without any error signal was unsettling.
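A small conversion helper makes this class of mismatch harder to reintroduce. A sketch (hypothetical; the actual fix renamed the fields by hand to match the GUVI spec):

```python
def to_camel(name: str) -> str:
    # "session_id" -> "sessionId"; names without underscores pass through.
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_payload(payload: dict) -> dict:
    # Convert every top-level snake_case key to camelCase before sending.
    return {to_camel(key): value for key, value in payload.items()}
```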

One Commit, Four Fixes

All four fixes shipped in a single commit at 9:15 PM:

ba3bc09 - Fix priority 0 audit findings: auth, OIDC, secret wiring, payload shape

11 files changed. 913 insertions. 896 deletions. 19 new tests covering the fixes. Two and a half hours from "view audit_codex" to "deployed and verified."


Part 3: The Gemini 3 Flash Upgrade (Feb 14)

Defensive Model Switching

Feb 14 opened at 7 AM with a model upgrade. After the model naming chaos on Day 1 (three commits to get the right string), we built the Gemini 3 Flash upgrade defensively:

GEMINI_PRIMARY_MODEL = "gemini-3-flash-preview"
GEMINI_FALLBACK_MODEL = "gemini-2.0-flash"

def _call_model(self, contents, config, use_breaker=True):
    """Call generate_content with automatic fallback if model unavailable."""
    for attempt in range(2):
        try:
            return self.client.models.generate_content(
                model=self._active_model,
                contents=contents,
                config=config,
            )
        except Exception as e:
            if attempt == 0 and self._is_model_not_found(e):
                logger.warning(
                    f"Model {self._active_model} unavailable, "
                    f"falling back to {GEMINI_FALLBACK_MODEL}"
                )
                self._active_model = GEMINI_FALLBACK_MODEL
                continue
            raise

If the primary model returns a 404, the client transparently switches to the fallback and caches that decision for the instance lifetime. No more three-commit model name hunts. If Google changes the model name again, the system degrades gracefully instead of crashing.
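The `_is_model_not_found` helper referenced above is not shown in the excerpt. A plausible sketch, assuming Gemini surfaces unknown model names as 404/NOT_FOUND errors in the exception message:

```python
def is_model_not_found(exc: Exception) -> bool:
    # Heuristic: treat 404 / NOT_FOUND errors as "unknown model name"
    # and anything else (rate limits, timeouts) as a real failure.
    message = str(exc).lower()
    return "404" in message or "not found" in message
```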

332a1df - Upgrade to Gemini 3 Flash with automatic fallback to 2.0 Flash

The Temperature Trap

The upgrade came with an undocumented regression. We had been using low temperature values (temperature=0.1) for classification and extraction tasks to get deterministic, structured JSON outputs. Gemini 3 Flash responded to low temperature by looping --- repeating the same JSON fragment until hitting the token limit.

A 500-token classification response would come back as:

{"classification": "KYC_BANKING", "confidence": 0.9, "classification": "KYC_BANKING", "confidence": 0.9, "classification": "KYC_

The fix was to remove explicit temperature settings for structured output tasks and let the model use its default. This behavior was not documented in the Gemini 3 Flash release notes. We found it by staring at responses that were obviously wrong and working backwards to the parameter that changed.

From OAuth to PIN

The same morning brought a deliberate simplification of dashboard authentication:

ded9b10 - Replace Google OAuth with PIN auth, add source tracking, upgrade Gemini

Google OAuth had been implemented on Feb 11. It worked, but it required an OAuth consent screen, authorized domain configuration, token refresh handling, and redirect management through the Cloudflare proxy. For a dashboard used by two people, this was engineering overkill.

We replaced it with a 6-digit PIN backed by HMAC-signed cookies:

def _create_auth_token() -> str:
    """Create an HMAC-signed auth token with the current timestamp."""
    ts = str(int(time.time()))
    key = _derive_signing_key()  # SHA-256(salt + PIN)
    sig = hmac.new(key, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def _validate_auth_token(token: str) -> bool:
    """Validate an auth token's signature and expiry."""
    if not token or ":" not in token:
        return False

    ts_str, provided_sig = token.split(":", 1)

    # Check expiry (1-hour TTL); malformed timestamps fail validation
    try:
        if time.time() - int(ts_str) > _SESSION_TTL_SECONDS:
            return False
    except ValueError:
        return False

    # Verify signature with constant-time comparison
    key = _derive_signing_key()
    expected_sig = hmac.new(key, ts_str.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(provided_sig, expected_sig)

The signing key is derived from SHA-256(salt + PIN). Changing the PIN changes the signing key, which invalidates all existing cookies instantly --- session revocation for free, with no revocation list to maintain.
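A minimal sketch of that derivation, with a hypothetical salt value standing in for the real configuration:

```python
import hashlib

_SALT = b"example-salt"  # hypothetical; the real salt lives in config


def derive_signing_key(pin: str) -> bytes:
    # SHA-256(salt + PIN): a new PIN yields a new key, so every cookie
    # signed under the old PIN fails validation immediately.
    return hashlib.sha256(_SALT + pin.encode()).digest()
```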

Brute-force protection lives in Firestore: 5 failed PIN attempts trigger a 15-minute lockout, tracked in an auth_lockouts/global_lockout document. The lockout runs server-side, so clearing cookies or switching browsers does not reset the counter.
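The lockout bookkeeping can be sketched with a plain dict standing in for the auth_lockouts/global_lockout Firestore document (thresholds from the text; the persistence layer is elided):

```python
MAX_ATTEMPTS = 5
LOCKOUT_SECONDS = 15 * 60


def is_locked(doc: dict, now: float) -> bool:
    # Locked while the stored deadline is still in the future.
    return now < doc.get("locked_until", 0)


def record_failure(doc: dict, now: float) -> dict:
    # Count a failed PIN attempt; the fifth failure arms the lockout
    # and resets the counter for the next window.
    if is_locked(doc, now):
        return doc
    doc["failed_attempts"] = doc.get("failed_attempts", 0) + 1
    if doc["failed_attempts"] >= MAX_ATTEMPTS:
        doc["locked_until"] = now + LOCKOUT_SECONDS
        doc["failed_attempts"] = 0
    return doc
```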

ea26fdf - Add cookie-based session persistence for dashboard PIN auth

The CookieManager Bug

Then came a bug that was trivial in hindsight but baffling in the moment:

4d1a07e - Fix duplicate CookieManager instantiation causing StreamlitDuplicateElementKey

Streamlit's extra-streamlit-components library provides a CookieManager that registers itself as a UI component widget. If you create two instances with the same key in a single render cycle, Streamlit throws StreamlitDuplicateElementKey. Our code was calling _get_cookie_manager() in three different places: once in require_auth(), once in _show_login_screen(), and once in _render_authenticated_ui(). Each call created a new instance. Each instance tried to register the same widget key.

The fix was to create one instance at the top of require_auth() and thread it through as a parameter:

def require_auth():
    # Create once per render cycle --- multiple instantiations would
    # raise StreamlitDuplicateElementKey.
    cookie_mgr = _get_cookie_manager()

    if st.session_state.get("authenticated", False):
        _render_authenticated_ui(cookie_mgr)  # pass, don't recreate
        return

    token = cookie_mgr.get(_COOKIE_NAME)
    if token and _validate_auth_token(token):
        st.session_state.authenticated = True
        _render_authenticated_ui(cookie_mgr)  # pass, don't recreate
        return

    _show_login_screen(cookie_mgr)  # pass, don't recreate
    st.stop()

Streamlit's Execution Model Will Surprise You

If you come from Flask, Django, or React, Streamlit's execution model is unlike anything you have worked with. The entire script reruns on every user interaction. Every function call happens on every render. "Create an object" does not mean "initialize once" --- it means "initialize on every click, every keystroke, every page load."

Component widgets that register themselves --- like CookieManager --- must be instantiated exactly once per render cycle. Not zero times (or you cannot read cookies). Not two times (or you get DuplicateElementKey). Exactly once.

This is the kind of framework-specific constraint that no amount of general programming experience prepares you for. You learn it by hitting the error, reading the traceback, and restructuring your code.
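The fix above threads one instance through as a parameter; another common shape is to memoize the instance in per-session state so repeated lookups return the same object. A dependency-free sketch of that pattern (RenderCycleCache is a hypothetical stand-in for Streamlit's st.session_state):

```python
class RenderCycleCache:
    """Stands in for st.session_state: a store that survives reruns."""

    def __init__(self):
        self._store = {}

    def get_or_create(self, key, factory):
        # The first lookup constructs the widget; every later lookup in
        # the same (or a later) render cycle returns the same instance,
        # so its element key is only ever registered once.
        if key not in self._store:
            self._store[key] = factory()
        return self._store[key]


cache = RenderCycleCache()
mgr_a = cache.get_or_create("cookie_mgr", object)
mgr_b = cache.get_or_create("cookie_mgr", object)
```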


Part 4: The Cloudflare Gotcha

Two Words That Fixed a Two-Day Bug

The dashboard runs on Cloud Run behind a Cloudflare Worker proxy. The proxy gives us a custom domain, SSL termination, and DDoS protection. It also introduced the most frustrating bug of the project.

The symptom: Dashboard loads. PIN screen appears. Enter the correct PIN. Blank page. No error in the browser console. No error in the Cloud Run logs. Just white.

The cause: One parameter in the Cloudflare Worker's fetch call.

// BROKEN: Cloudflare follows 302 redirects server-side
const response = await fetch(backendUrl, { redirect: "follow" });

// WORKING: pass 302 redirects to the browser
const response = await fetch(backendUrl, { redirect: "manual" });

Why redirect: follow Breaks Streamlit

Streamlit uses HTTP 302 redirects for internal page navigation and WebSocket upgrade negotiation. With redirect: "follow", the Cloudflare Worker follows these redirects server-side before returning the response. The browser never sees the 302. Its URL bar does not update. Cookies intended for the redirect target go to the wrong path. Streamlit's client-side JavaScript expects a URL that does not match the actual page.

The result is a perfectly rendered blank page --- the worst kind of failure because there are no error signals to debug. The HTML is valid. The JavaScript loads. The WebSocket never connects. Nothing is visibly wrong.

We found the cause by accessing the Cloud Run URL directly (which worked) and comparing its response headers against the proxied response (which did not). The Location header from the 302 was absent in the proxied response --- because Cloudflare had already followed it.

Three Settings That Must Stay Off

Beyond the redirect issue, three Cloudflare performance features break Streamlit in unrelated ways:

  • Rocket Loader --- defers JavaScript execution for a faster initial paint; breaks Streamlit's WebSocket initialization sequence
  • Email Obfuscation --- rewrites @ characters in HTML to prevent email harvesting; corrupts UPI IDs like user@oksbi in rendered pages
  • Speed Brain --- speculatively preloads linked pages; sends phantom navigation requests that break SPA routing

Each of these was discovered independently, each after a period of "it was working five minutes ago --- what changed?" The Rocket Loader issue was particularly painful because the setting had been enabled by default on the Cloudflare zone. We never explicitly turned it on. It was just there, silently rewriting our JavaScript.

How We Diagnosed the Rocket Loader Issue

The dashboard worked perfectly on Cloud Run's native URL. It showed a blank page on the custom domain. Same Docker container. Same code. Different behavior.

We compared the raw HTML from both responses and found an extra <script> tag in the proxied version --- Rocket Loader injecting its deferred loading wrapper around Streamlit's JavaScript. Disabling the setting in the Cloudflare dashboard fixed the issue immediately.

The lesson: if your app works on the origin but fails behind a CDN, diff the HTML. Content-rewriting CDN features are always the first suspect.


The Emotional Arc

Looking back at these three days, the emotional trajectory is distinct from the blitz:

Feb 12, 2 AM: Confidence. Pipeline optimizations. Dashboard polish. We are adding finishing touches. Things feel good.

Feb 12, 1 PM: Satisfaction. 1,045 lines of tests. The system has a safety net. We can change things without fear.

Feb 13, 9 PM: Alarm. Four vulnerabilities, two of them P0 critical, in a system that has been live for eight days. The API key bypass means anyone could have called our endpoint. The missing OIDC verification means the callback was publicly invokable. We thought we were polishing a finished system. We were patching a leaking one.

Feb 13, 9:30 PM: Focus. The alarm gives way to methodical work. Find the vulnerability. Write the fix. Write the test. Deploy. Move to the next one.

Feb 13, 11 PM: Relief. ba3bc09 is deployed. All four fixes verified. 19 new tests passing. The most satisfying commit message of the project: "Fix priority 0 audit findings."

Feb 14, morning: Measured progress. The Gemini 3 upgrade goes smoothly because we built the fallback mechanism on Day 1's lesson. The auth rework simplifies the system. The CookieManager bug is annoying but contained.

Feb 14, afternoon: Stability. For the first time since Feb 5, we have a system where the security is audited, the tests pass, the deployment is automated, and we can explain every design decision. It took nine days and sixty-two commits to get here.

Lesson: Do the Audit When It Feels Too Early

We did the security audit on Feb 13 because it was on a checklist, not because we thought we needed it. The system felt ready. It was not.

If we had postponed the audit until "after the presentation" or "when things calm down," those four vulnerabilities would have been present during GUVI evaluation. The audit cost us one evening. Not doing the audit could have cost us the competition.

Schedule your security review before you think you need it. You will find things. You always find things.


What Changed Between Feb 12 and Feb 14

Before Hardening (End of Feb 11)
  • API key validation: allowed all requests when secret missing
  • Callback endpoint: publicly invokable, no authentication
  • Secret wiring: SCAMSHIELD_API_KEY not reaching the callback function
  • Callback payloads: snake_case field names, silently rejected by GUVI
  • Dashboard auth: Google OAuth (complex, over-engineered)
  • Model: Gemini 2.0 Flash, no fallback mechanism
  • Cloudflare: redirect: "follow" breaking Streamlit navigation
  • Test coverage: zero formal tests
After Hardening (End of Feb 14)
  • API key validation: production-aware deny logic with K_SERVICE check
  • Callback endpoint: OIDC-verified, service account email validated
  • Secret wiring: end-to-end from Secret Manager through decorator to runtime
  • Callback payloads: camelCase, matching GUVI spec exactly
  • Dashboard auth: 6-digit PIN, HMAC cookies, Firestore lockout (5 attempts / 15 min)
  • Model: Gemini 3 Flash with automatic fallback to 2.0 Flash
  • Cloudflare: redirect: "manual", Rocket Loader / Email Obfuscation / Speed Brain disabled
  • Test coverage: 1,045+ lines across session, callback, handler, and sanitizer tests

The system that existed on Feb 14 was the system we would present and submit. Everything that came after --- competition analysis, new extractors, scoring optimization --- was enhancement. The foundation was solid.


Previous: Chapter 1: The Blitz --- Zero to Working System

Next: Chapter 3: The Presentation