Chapter 8: CI/CD and Dashboard --- From Code to Production

What We Built

Code that only runs on a developer's laptop is not a product. This chapter covers the infrastructure that turns ScamShield AI from a local Python project into a deployed, monitored, and observable system: GitHub Actions CI/CD with keyless authentication, a Streamlit dashboard on Cloud Run with PIN-based auth, and a Cloudflare proxy with a set of configuration gotchas that took days to debug.

Why This Approach

The deployment story for ScamShield AI involves three independent components:

  1. Cloud Functions --- the honeypot API itself, deployed via Firebase CLI
  2. Cloud Run --- the Streamlit dashboard, deployed via gcloud run deploy
  3. Cloudflare Worker --- the proxy that sits in front of Cloud Run and serves the dashboard on a custom domain

Each has its own deployment mechanism, its own authentication model, and its own set of surprises. We needed CI/CD that could deploy all three reliably, without storing any credentials in GitHub.

The Code

GitHub Actions Pipeline

The CI/CD pipeline consists of two workflow files:

flowchart LR
    subgraph "test.yml (branches + PRs)"
        A[Push to branch / PR to main] --> B[Checkout]
        B --> C[Setup Python 3.11]
        C --> D[Install deps]
        D --> E[pytest tests/ -v]
    end

    subgraph "deploy.yml (main only)"
        F[Push to main] --> G[Run Tests]
        G --> H{Tests pass?}
        H -- Yes --> I[Deploy Functions]
        H -- Yes --> J[Deploy Dashboard]
        H -- No --> K[Fail]

        subgraph "Deploy Functions"
            I --> I1[Auth via WIF]
            I1 --> I2[Create venv + install deps]
            I2 --> I3[firebase deploy]
        end

        subgraph "Deploy Dashboard"
            J --> J1[Auth via WIF]
            J1 --> J2[gcloud run deploy]
        end
    end

    E -.->|PR merged| F

test.yml runs on every push to non-main branches and on PRs to main:

name: Run Tests
on:
  push:
    branches-ignore: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: |
          python -m pip install --upgrade pip
          pip install -r functions/requirements.txt
          pip install pytest
      - run: pytest tests/ -v

deploy.yml runs on push to main and deploys both the Cloud Function and the dashboard:

name: Deploy to Firebase
on:
  push:
    branches: [main]
  workflow_dispatch:  # Manual trigger for emergencies

permissions:
  contents: read
  id-token: write   # Required for Workload Identity Federation

The id-token: write Permission

Without id-token: write, the WIF authentication step fails with only a generic "authentication error" message. This is the most common WIF setup mistake, and nothing in the error hints that it is a permissions issue.

Workload Identity Federation (Keyless Deploys)

Traditional CI/CD stores a service account key (a JSON file) as a GitHub secret. This is a security risk: the key never expires, anyone with repo access can extract it, and rotating it requires updating the secret manually.

Workload Identity Federation (WIF) eliminates stored keys entirely. Instead, GitHub Actions proves its identity via an OIDC token, and GCP exchanges that token for temporary credentials:

sequenceDiagram
    participant GH as GitHub Actions
    participant OIDC as GitHub OIDC Provider
    participant WIF as GCP Workload Identity Pool
    participant SA as GCP Service Account
    participant FB as Firebase / Cloud Run

    GH->>OIDC: Request OIDC token
    OIDC-->>GH: JWT (iss=github, sub=repo:your-github-org/scamshield-ai)
    GH->>WIF: Exchange JWT for GCP token
    WIF->>WIF: Verify JWT issuer + subject
    WIF-->>GH: Short-lived access token (1 hour)
    GH->>SA: Impersonate service account
    SA-->>GH: Scoped credentials
    GH->>FB: firebase deploy / gcloud run deploy

The authentication step in the workflow:

- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/github-actions-pool/providers/github-actions-provider
    service_account: firebase-deploy@your-gcp-project-id.iam.gserviceaccount.com

The WIF pool is configured to only accept tokens from the your-github-org GitHub organization:

# From setup-wif.sh
gcloud iam workload-identity-pools providers create-oidc github-actions-provider \
    --workload-identity-pool=github-actions-pool \
    --issuer-uri="https://token.actions.githubusercontent.com" \
    --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
    --attribute-condition="assertion.repository_owner == 'your-github-org'"

The Service Account Needs Minimal Roles

The deploy service account (firebase-deploy@...) has exactly four IAM roles: Firebase Admin (for function deploys), Cloud Run Admin (for dashboard deploys), Service Account User (for impersonation), and Secret Manager Secret Accessor (for runtime secrets). No more, no less.
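
The role grants and the impersonation binding can be captured in the same reproducible style as setup-wif.sh. A sketch, reusing the placeholder project, pool, and account names from the snippets in this chapter:

```shell
#!/usr/bin/env bash
# Placeholder names, matching the examples above
PROJECT_ID="your-gcp-project-id"
SA="firebase-deploy@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant exactly the four deploy roles
for ROLE in roles/firebase.admin roles/run.admin \
            roles/iam.serviceAccountUser roles/secretmanager.secretAccessor; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member="serviceAccount:${SA}" --role="${ROLE}"
done

# Allow the repo (via the pool's mapped repository attribute)
# to impersonate the service account through WIF
gcloud iam service-accounts add-iam-policy-binding "${SA}" \
    --role="roles/iam.workloadIdentityUser" \
    --member="principalSet://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/github-actions-pool/attribute.repository/your-github-org/scamshield-ai"
```

The `principalSet` member is what ties the GitHub OIDC identity to the service account; without it, the token exchange succeeds but impersonation is denied.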

The venv Requirement

Firebase CLI requires a Python virtual environment at functions/venv when deploying Python Cloud Functions. Without it, firebase deploy fails with a cryptic error. The workflow creates this venv as a build step:

- name: Create venv and install dependencies
  run: |
    python -m venv functions/venv
    functions/venv/bin/pip install -r functions/requirements.txt

Why Not Commit the venv?

Python virtual environments are platform-specific (they contain absolute paths and platform binaries). A venv created on macOS will not work in a Linux CI runner. The workflow must create a fresh venv on every deploy.

Streamlit Dashboard on Cloud Run

The dashboard is a Streamlit application deployed to Cloud Run. It provides nine pages for testing, monitoring, and analyzing honeypot sessions.

PIN-based authentication gates every page. The auth system in dashboard/utils/auth.py implements a layered check:

def require_auth():
    cookie_mgr = _get_cookie_manager()

    # Layer 1: In-memory session state (fastest, no I/O)
    if st.session_state.get("authenticated", False):
        _render_authenticated_ui(cookie_mgr)
        return

    # Layer 2: HMAC-signed browser cookie (survives refreshes)
    token = cookie_mgr.get(_COOKIE_NAME)
    if token and _validate_auth_token(token):
        st.session_state.authenticated = True
        _render_authenticated_ui(cookie_mgr)
        return

    # Layer 3: Fall through to PIN login screen
    _show_login_screen(cookie_mgr)
    st.stop()

The cookie token is HMAC-signed with a key derived from the PIN itself:

def _create_auth_token() -> str:
    ts = str(int(time.time()))
    key = _derive_signing_key()  # SHA-256(salt + PIN)
    sig = hmac.new(key, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

This means changing the PIN automatically invalidates all existing sessions --- the signing key changes, so old cookies fail verification. No session revocation logic needed.

Firestore-backed lockout prevents brute-force attacks on the PIN:

| Parameter        | Value                                      |
| ---------------- | ------------------------------------------ |
| Max attempts     | 5                                          |
| Lockout duration | 15 minutes                                 |
| Lockout scope    | Global (shared across all clients)         |
| Storage          | Firestore auth_lockouts/global_lockout doc |

def _record_failed_attempt() -> int:
    ref = _get_lockout_ref()
    state = _get_lockout_state()
    new_count = state["failed_attempts"] + 1

    update = {
        "failed_attempts": new_count,
        "last_failed_at": SERVER_TIMESTAMP,
    }
    if new_count >= _MAX_ATTEMPTS:
        update["locked_until"] = datetime.now(timezone.utc) + timedelta(
            minutes=_LOCKOUT_MINUTES
        )
    ref.set(update, merge=True)
    return new_count

Secrets flow from GCP Secret Manager to the running container through this chain:

GCP Secret Manager
    --> Cloud Run --set-secrets (env vars)
        --> entrypoint.sh (writes secrets.toml)
            --> Streamlit st.secrets

The entrypoint.sh script runs before Streamlit starts:

#!/bin/bash
cat > /app/.streamlit/secrets.toml <<EOF
DASHBOARD_PIN = "${DASHBOARD_PIN}"
SCAMSHIELD_API_KEY = "${SCAMSHIELD_API_KEY}"
EOF

streamlit run app.py --server.port=$PORT --server.headless=true

The Cloudflare Proxy

The dashboard runs on Cloud Run behind a Cloudflare Worker proxy. This gives us a custom domain (your-dashboard-domain.example.com), SSL termination, and DDoS protection. But it also introduced the most frustrating bug in the entire project.

The redirect: "manual" gotcha:

Streamlit uses HTTP 302 redirects internally for page navigation and authentication flows. Cloudflare Workers, by default, follow redirects server-side before returning the response to the browser. This breaks Streamlit completely:

// BROKEN: Cloudflare follows the 302 server-side,
// browser never sees the redirect
const response = await fetch(url, { redirect: "follow" });

// WORKING: Cloudflare passes the 302 through,
// browser handles the redirect
const response = await fetch(url, { redirect: "manual" });

When redirect: "follow" is used, the browser receives the final response after Cloudflare has followed all redirects. But Streamlit expects the browser to follow the redirect so it can set cookies and update the URL bar. With server-side redirect following, the browser's URL does not change, cookies are not set on the right path, and the dashboard appears blank or stuck in a redirect loop.

Three Cloudflare Settings That Must Stay OFF

These settings break Streamlit's JavaScript and WebSocket connections:

| Setting           | Why It Breaks Streamlit                                                                                  |
| ----------------- | -------------------------------------------------------------------------------------------------------- |
| Rocket Loader     | Injects a JavaScript loader that defers script execution, breaking Streamlit's WebSocket initialization   |
| Email Obfuscation | Rewrites @ characters in page content, corrupting Streamlit HTML and any @-containing data (like UPI IDs) |
| Speed Brain       | Speculative preloading breaks Streamlit's single-page application routing                                 |

Dashboard Pages

The dashboard has nine pages, each following a strict structure:

"""Page Title -- Brief description."""

import streamlit as st
import sys, os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from utils.auth import require_auth
from utils.firestore_client import fetch_all_sessions

st.set_page_config(page_title="Page Title", page_icon="...", layout="wide")
require_auth()  # MUST be called immediately after set_page_config

st.title("Page Title")
# Page content...

| Page              | File                          | Purpose                              |
| ----------------- | ----------------------------- | ------------------------------------ |
| Home              | dashboard/app.py              | Navigation hub with live metrics     |
| Testing           | pages/00_testing.py           | Interactive chat with the honeypot   |
| Live Sessions     | pages/01_live_sessions.py     | Real-time session feed               |
| Evidence          | pages/02_evidence.py          | Search and filter extracted evidence |
| Analytics         | pages/03_analytics.py         | Session funnel charts                |
| Intelligence      | pages/04_intelligence.py      | Network graph visualization          |
| Persona Analytics | pages/05_persona_analytics.py | Coming soon                          |
| Campaigns         | pages/06_campaigns.py         | Coming soon                          |
| Engagement        | pages/07_engagement.py        | Coming soon                          |

All Firestore reads go through dashboard/utils/firestore_client.py --- pages never query Firestore directly. All rendering goes through dashboard/utils/display.py for consistent formatting.

Key Architectural Decision

GitHub Pages vs. Cloud Run for the dashboard: we chose Cloud Run.

GitHub Pages would have been simpler: push HTML, done. But the dashboard needs server-side Firestore access to display live sessions, evidence, and analytics. A static site would require a separate API backend, CORS configuration, and client-side authentication --- three additional moving parts.

Cloud Run gives us:

  • Server-side Firestore access via Application Default Credentials (no API key in the browser)
  • Secret Manager integration for PIN and API keys (no secrets in client-side code)
  • Auto-scaling from 0 to 2 instances (pay only for usage)
  • Custom domain via Cloudflare proxy (no Cloud Run domain exposed)

The tradeoff is operational complexity: we now manage Cloud Run deployments, Cloudflare Worker configuration, and entrypoint.sh secrets mapping. But for a dashboard that handles sensitive scam data, server-side rendering with proper secrets management is worth the complexity.

What We Learned

  1. WIF setup is fiddly but worth it. The initial setup took three attempts (wrong project number, missing id-token permission, attribute condition typo). Once working, it eliminates the entire class of "leaked service account key" incidents. The setup-wif.sh script captures the exact commands so the setup can be reproduced.

  2. Streamlit and reverse proxies are an adversarial combination. Streamlit's WebSocket-based architecture interacts poorly with CDN features designed for static content. We learned this empirically: Rocket Loader, Email Obfuscation, and Speed Brain each caused a different mysterious failure mode. The fix was always "turn it off for this zone."

  3. redirect: "manual" is a two-word fix for a two-day bug. The blank dashboard page caused by server-side redirect following was the hardest bug to diagnose because the symptoms (blank page, no errors in console) gave no clue about the cause. The fix is trivial once you know it --- but finding it required reading Cloudflare's fetch API documentation line by line.

  4. PIN auth is simpler than OAuth for internal tools. For a dashboard used by a small team, 6-digit PIN auth with HMAC cookies and Firestore lockout provides adequate security without the complexity of OAuth providers, token refresh, and user management. The key insight: derive the cookie signing key from the PIN so that changing the PIN invalidates all sessions automatically.

  5. Entrypoint scripts bridge the secret management gap. Cloud Run's --set-secrets maps Secret Manager to environment variables, but Streamlit reads from secrets.toml. The entrypoint.sh script bridges this gap by writing secrets.toml from env vars at container startup. This is a common pattern for any framework that expects file-based configuration in a containerized environment.