# Chapter 8: CI/CD and Dashboard --- From Code to Production

## What We Built
Code that only runs on a developer's laptop is not a product. This chapter covers the infrastructure that turns ScamShield AI from a local Python project into a deployed, monitored, and observable system: GitHub Actions CI/CD with keyless authentication, a Streamlit dashboard on Cloud Run with PIN-based auth, and a Cloudflare proxy with a set of configuration gotchas that took days to debug.
## Why This Approach
The deployment story for ScamShield AI involves three independent components:
- Cloud Functions --- the honeypot API itself, deployed via the Firebase CLI
- Cloud Run --- the Streamlit dashboard, deployed via `gcloud run deploy`
- Cloudflare Worker --- the proxy that sits in front of Cloud Run and serves the dashboard on a custom domain
Each has its own deployment mechanism, its own authentication model, and its own set of surprises. We needed CI/CD that could deploy all three reliably, without storing any credentials in GitHub.
## The Code

### GitHub Actions Pipeline
The CI/CD pipeline consists of two workflow files:
```mermaid
flowchart LR
    subgraph "test.yml (branches + PRs)"
        A[Push to branch / PR to main] --> B[Checkout]
        B --> C[Setup Python 3.11]
        C --> D[Install deps]
        D --> E[pytest tests/ -v]
    end

    subgraph "deploy.yml (main only)"
        F[Push to main] --> G[Run Tests]
        G --> H{Tests pass?}
        H -- Yes --> I[Deploy Functions]
        H -- Yes --> J[Deploy Dashboard]
        H -- No --> K[Fail]

        subgraph "Deploy Functions"
            I --> I1[Auth via WIF]
            I1 --> I2[Create venv + install deps]
            I2 --> I3[firebase deploy]
        end

        subgraph "Deploy Dashboard"
            J --> J1[Auth via WIF]
            J1 --> J2[gcloud run deploy]
        end
    end

    E -.->|PR merged| F
```
`test.yml` runs on every push to non-main branches and on PRs to main:
```yaml
name: Run Tests

on:
  push:
    branches-ignore: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: |
          python -m pip install --upgrade pip
          pip install -r functions/requirements.txt
          pip install pytest
      - run: pytest tests/ -v
```
`deploy.yml` runs on every push to main and deploys both the Cloud Function and the dashboard:
```yaml
name: Deploy to Firebase

on:
  push:
    branches: [main]
  workflow_dispatch: # Manual trigger for emergencies

permissions:
  contents: read
  id-token: write # Required for Workload Identity Federation
```
**The `id-token: write` Permission.** Without `id-token: write`, the WIF authentication step fails with nothing but a generic "authentication error" message. This is the most common WIF setup mistake, and the error gives no hint that it is a permissions issue.
### Workload Identity Federation (Keyless Deploys)
Traditional CI/CD stores a service account key (a JSON file) as a GitHub secret. This is a security risk: the key never expires, anyone with repo access can extract it, and rotating it requires updating the secret manually.
Workload Identity Federation (WIF) eliminates stored keys entirely. Instead, GitHub Actions proves its identity via an OIDC token, and GCP exchanges that token for temporary credentials:
```mermaid
sequenceDiagram
    participant GH as GitHub Actions
    participant OIDC as GitHub OIDC Provider
    participant WIF as GCP Workload Identity Pool
    participant SA as GCP Service Account
    participant FB as Firebase / Cloud Run

    GH->>OIDC: Request OIDC token
    OIDC-->>GH: JWT (iss=github, sub=repo:your-github-org/scamshield-ai)
    GH->>WIF: Exchange JWT for GCP token
    WIF->>WIF: Verify JWT issuer + subject
    WIF-->>GH: Short-lived access token (1 hour)
    GH->>SA: Impersonate service account
    SA-->>GH: Scoped credentials
    GH->>FB: firebase deploy / gcloud run deploy
```
The authentication step in the workflow:
```yaml
- name: Authenticate to Google Cloud
  uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: >-
      projects/PROJECT_NUMBER/locations/global/
      workloadIdentityPools/github-actions-pool/
      providers/github-actions-provider
    service_account: firebase-deploy@your-gcp-project-id.iam.gserviceaccount.com
```
The WIF pool is configured to accept tokens only from the `your-github-org` GitHub organization:
```bash
# From setup-wif.sh
gcloud iam workload-identity-pools providers create-oidc github-actions-provider \
  --workload-identity-pool=github-actions-pool \
  --issuer-uri="https://token.actions.githubusercontent.com" \
  --attribute-mapping="google.subject=assertion.sub,attribute.repository=assertion.repository" \
  --attribute-condition="assertion.repository_owner == 'your-github-org'"
```
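The `--attribute-condition` is a CEL expression evaluated against claims in GitHub's OIDC token. In effect it reduces to an equality check on the `repository_owner` claim; a minimal sketch, with an illustrative claims payload (the claim values and helper name here are hypothetical, not the real token):

```python
# Claims from a GitHub Actions OIDC token (values illustrative)
claims = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:your-github-org/scamshield-ai:ref:refs/heads/main",
    "repository": "your-github-org/scamshield-ai",
    "repository_owner": "your-github-org",
}

def passes_attribute_condition(claims: dict) -> bool:
    # Mirrors the CEL condition:
    #   assertion.repository_owner == 'your-github-org'
    return claims.get("repository_owner") == "your-github-org"

print(passes_attribute_condition(claims))  # prints: True
```

A token minted for a fork under a different owner carries a different `repository_owner` claim, so the exchange is rejected before any GCP credentials are issued.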
**The Service Account Needs Minimal Roles.** The deploy service account (`firebase-deploy@...`) has exactly four IAM roles: Firebase Admin (for function deploys), Cloud Run Admin (for dashboard deploys), Service Account User (for impersonation), and Secret Manager Secret Accessor (for runtime secrets). No more, no less.
### The venv Requirement
Firebase CLI requires a Python virtual environment at functions/venv when deploying Python Cloud Functions. Without it, firebase deploy fails with a cryptic error. The workflow creates this venv as a build step:
```yaml
- name: Create venv and install dependencies
  run: |
    python -m venv functions/venv
    functions/venv/bin/pip install -r functions/requirements.txt
```
**Why Not Commit the venv?** Python virtual environments are platform-specific: they contain absolute paths and platform binaries. A venv created on macOS will not work in a Linux CI runner, so the workflow must create a fresh venv on every deploy.
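The platform-specificity is easy to demonstrate with the standard library alone: build a throwaway venv and inspect its `pyvenv.cfg`, whose `home` key records the absolute path of the interpreter that created it.

```python
import tempfile
import venv
from pathlib import Path

# Create a throwaway venv and read its pyvenv.cfg. The `home` key pins
# the absolute path of the Python that built it, which is one reason a
# venv made on a macOS laptop cannot be reused in a Linux CI runner.
with tempfile.TemporaryDirectory() as d:
    venv.create(d, with_pip=False)
    cfg = (Path(d) / "pyvenv.cfg").read_text()

print(cfg)  # includes a line like: home = /usr/local/bin
```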
### Streamlit Dashboard on Cloud Run
The dashboard is a Streamlit application deployed to Cloud Run. It provides nine pages for testing, monitoring, and analyzing honeypot sessions.
PIN-based authentication gates every page. The auth system in `dashboard/utils/auth.py` implements a layered check:
```python
def require_auth():
    cookie_mgr = _get_cookie_manager()

    # Layer 1: In-memory session state (fastest, no I/O)
    if st.session_state.get("authenticated", False):
        _render_authenticated_ui(cookie_mgr)
        return

    # Layer 2: HMAC-signed browser cookie (survives refreshes)
    token = cookie_mgr.get(_COOKIE_NAME)
    if token and _validate_auth_token(token):
        st.session_state.authenticated = True
        _render_authenticated_ui(cookie_mgr)
        return

    # Layer 3: Fall through to PIN login screen
    _show_login_screen(cookie_mgr)
    st.stop()
```
The cookie token is HMAC-signed with a key derived from the PIN itself:
```python
def _create_auth_token() -> str:
    ts = str(int(time.time()))
    key = _derive_signing_key()  # SHA-256(salt + PIN)
    sig = hmac.new(key, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"
```
This means changing the PIN automatically invalidates all existing sessions --- the signing key changes, so old cookies fail verification. No session revocation logic needed.
Firestore-backed lockout prevents brute-force attacks on the PIN:
| Parameter | Value |
|---|---|
| Max attempts | 5 |
| Lockout duration | 15 minutes |
| Lockout scope | Global (shared across all clients) |
| Storage | Firestore `auth_lockouts/global_lockout` doc |
```python
def _record_failed_attempt() -> int:
    ref = _get_lockout_ref()
    state = _get_lockout_state()
    new_count = state["failed_attempts"] + 1
    update = {
        "failed_attempts": new_count,
        "last_failed_at": SERVER_TIMESTAMP,
    }
    if new_count >= _MAX_ATTEMPTS:
        update["locked_until"] = datetime.now(timezone.utc) + timedelta(
            minutes=_LOCKOUT_MINUTES
        )
    ref.set(update, merge=True)
    return new_count
```
Secrets flow from GCP Secret Manager to the running container through this chain:
```text
GCP Secret Manager
  --> Cloud Run --set-secrets (env vars)
  --> entrypoint.sh (writes secrets.toml)
  --> Streamlit st.secrets
```
The `entrypoint.sh` script runs before Streamlit starts:

```bash
#!/bin/bash
cat > /app/.streamlit/secrets.toml <<EOF
DASHBOARD_PIN = "${DASHBOARD_PIN}"
SCAMSHIELD_API_KEY = "${SCAMSHIELD_API_KEY}"
EOF

streamlit run app.py --server.port=$PORT --server.headless=true
```
### The Cloudflare Proxy
The dashboard runs on Cloud Run behind a Cloudflare Worker proxy. This gives us a custom domain (your-dashboard-domain.example.com), SSL termination, and DDoS protection. But it also introduced the most frustrating bug in the entire project.
**The `redirect: "manual"` gotcha.** Streamlit uses HTTP 302 redirects internally for page navigation and authentication flows. By default, Cloudflare Workers follow redirects server-side before returning the response to the browser. This breaks Streamlit completely:
```javascript
// BROKEN: Cloudflare follows the 302 server-side,
// browser never sees the redirect
const response = await fetch(url, { redirect: "follow" });

// WORKING: Cloudflare passes the 302 through,
// browser handles the redirect
const response = await fetch(url, { redirect: "manual" });
```
With `redirect: "follow"`, the browser receives only the final response, after Cloudflare has already chased every redirect. But Streamlit expects the browser itself to follow the 302 so it can set cookies and update the URL bar. With server-side following, the browser's URL never changes, cookies are not set on the right path, and the dashboard appears blank or gets stuck in a redirect loop.
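The failure mode can be reproduced without Cloudflare at all. This standalone sketch (stdlib only; the toy server and handler are illustrative, not ScamShield code) serves a 302 that carries a session cookie, then contrasts what a pass-through client sees with what survives server-side redirect following:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class RedirectingHandler(BaseHTTPRequestHandler):
    """Mimics Streamlit's auth flow: / answers 302 plus a session cookie."""

    def do_GET(self):
        if self.path == "/":
            self.send_response(302)
            self.send_header("Location", "/app")
            self.send_header("Set-Cookie", "session=abc; Path=/")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"final page")

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Pass-through (redirect: "manual"): the client sees the 302 itself,
# with the Location and Set-Cookie headers the browser relies on.
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/")
redirect = conn.getresponse()
print(redirect.status, redirect.getheader("Set-Cookie"))

# Server-side following (redirect: "follow"): the proxy chases the
# Location itself and hands the browser only the final 200 -- the
# cookie set on the intermediate 302 never arrives.
conn2 = http.client.HTTPConnection("127.0.0.1", port)
conn2.request("GET", redirect.getheader("Location"))
final = conn2.getresponse()
print(final.status, final.getheader("Set-Cookie"))

server.shutdown()
```

The second response has no `Set-Cookie` header at all, which is exactly the state a browser ends up in behind a redirect-following proxy.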
**Three Cloudflare Settings That Must Stay OFF.** These settings break Streamlit's JavaScript and WebSocket connections:
| Setting | Why It Breaks Streamlit |
|---|---|
| Rocket Loader | Injects a JavaScript loader that defers script execution, breaking Streamlit's WebSocket initialization |
| Email Obfuscation | Rewrites @ characters in page content, corrupting Streamlit HTML and any @-containing data (like UPI IDs) |
| Speed Brain | Speculative preloading breaks Streamlit's single-page application routing |
### Dashboard Pages
The dashboard has nine pages, each following a strict structure:
"""Page Title -- Brief description."""
import streamlit as st
import sys, os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from utils.auth import require_auth
from utils.firestore_client import fetch_all_sessions
st.set_page_config(page_title="Page Title", page_icon="...", layout="wide")
require_auth() # MUST be called immediately after set_page_config
st.title("Page Title")
# Page content...
| Page | File | Purpose |
|---|---|---|
| Home | `dashboard/app.py` | Navigation hub with live metrics |
| Testing | `pages/00_testing.py` | Interactive chat with the honeypot |
| Live Sessions | `pages/01_live_sessions.py` | Real-time session feed |
| Evidence | `pages/02_evidence.py` | Search and filter extracted evidence |
| Analytics | `pages/03_analytics.py` | Session funnel charts |
| Intelligence | `pages/04_intelligence.py` | Network graph visualization |
| Persona Analytics | `pages/05_persona_analytics.py` | Coming soon |
| Campaigns | `pages/06_campaigns.py` | Coming soon |
| Engagement | `pages/07_engagement.py` | Coming soon |
All Firestore reads go through `dashboard/utils/firestore_client.py` --- pages never query Firestore directly. All rendering goes through `dashboard/utils/display.py` for consistent formatting.
## Key Architectural Decision
GitHub Pages vs. Cloud Run for the dashboard: we chose Cloud Run.
GitHub Pages would have been simpler: push HTML, done. But the dashboard needs server-side Firestore access to display live sessions, evidence, and analytics. A static site would require a separate API backend, CORS configuration, and client-side authentication --- three additional moving parts.
Cloud Run gives us:
- Server-side Firestore access via Application Default Credentials (no API key in the browser)
- Secret Manager integration for PIN and API keys (no secrets in client-side code)
- Auto-scaling from 0 to 2 instances (pay only for usage)
- Custom domain via Cloudflare proxy (no Cloud Run domain exposed)
The tradeoff is operational complexity: we now manage Cloud Run deployments, Cloudflare Worker configuration, and entrypoint.sh secrets mapping. But for a dashboard that handles sensitive scam data, server-side rendering with proper secrets management is worth the complexity.
## What We Learned
- **WIF setup is fiddly but worth it.** The initial setup took three attempts (wrong project number, missing `id-token` permission, attribute condition typo). Once working, it eliminates the entire class of "leaked service account key" incidents. The `setup-wif.sh` script captures the exact commands so the setup can be reproduced.

- **Streamlit and reverse proxies are an adversarial combination.** Streamlit's WebSocket-based architecture interacts poorly with CDN features designed for static content. We learned this empirically: Rocket Loader, Email Obfuscation, and Speed Brain each caused a different mysterious failure mode. The fix was always "turn it off for this zone."

- **`redirect: "manual"` is a two-word fix for a two-day bug.** The blank dashboard page caused by server-side redirect following was the hardest bug to diagnose because the symptoms (blank page, no errors in the console) gave no clue about the cause. The fix is trivial once you know it --- but finding it required reading Cloudflare's fetch API documentation line by line.

- **PIN auth is simpler than OAuth for internal tools.** For a dashboard used by a small team, 6-digit PIN auth with HMAC cookies and Firestore lockout provides adequate security without the complexity of OAuth providers, token refresh, and user management. The key insight: derive the cookie signing key from the PIN so that changing the PIN invalidates all sessions automatically.

- **Entrypoint scripts bridge the secret management gap.** Cloud Run's `--set-secrets` maps Secret Manager to environment variables, but Streamlit reads from `secrets.toml`. The `entrypoint.sh` script bridges this gap by writing `secrets.toml` from env vars at container startup. This is a common pattern for any framework that expects file-based configuration in a containerized environment.