Session Management¶
ScamShield AI uses Firestore for persistent session storage across Firebase cold starts. Sessions accumulate evidence, track conversation state, and enable cross-session intelligence linking. When Firestore is unavailable, the system falls back to in-memory storage transparently.
Firestore Collections¶
erDiagram
honeypot_sessions {
string doc_id PK "sessionId from GUVI"
string persona "Active persona name"
string scam_type "Classified scam type"
float confidence "Detection confidence 0.0-1.0"
string state "Session state machine value"
int message_count "Total messages exchanged"
bool callback_sent "Whether callback was sent"
map extracted_evidence "ExtractedIntelligence (14 fields)"
array conversation_history "Full message history"
string strategy_state "Self-correction state"
int messages_since_evidence "Counter for stalled convos"
bool high_value_extracted "UPI or bank obtained"
string source "guvi or testing"
string started_at "ISO 8601 timestamp"
string created_at "ISO 8601 timestamp"
string updated_at "ISO 8601 timestamp"
}
evidence_index {
string doc_id PK "type:normalized_value"
string type "upi, bank, phone, email"
string value "Original evidence value"
array sessions "Session IDs where found"
array scam_types "Scam types from sessions"
string first_seen "ISO 8601 timestamp"
string last_seen "ISO 8601 timestamp"
int total_occurrences "Number of unique sessions"
string source "guvi or testing"
}
rate_limits {
string doc_id PK "sessionId"
int total_messages "Lifetime message count"
int minute_key "Current minute epoch bucket"
int minute_count "Messages in current minute"
float first_request "Unix timestamp"
float last_request "Unix timestamp"
}
honeypot_sessions ||--o{ evidence_index : "evidence linked via session IDs"
Collection Schemas¶
honeypot_sessions¶
The primary session collection. Document ID is the GUVI sessionId.
| Field | Type | Default | Description |
|---|---|---|---|
| persona | string | "sharma_uncle" | Active persona name |
| scam_type | string | null | Classified scam type (e.g., KYC_BANKING) |
| confidence | float | 0.0 | Scam detection confidence |
| state | string | "INITIAL" | Session state machine value |
| message_count | int | 0 | Total messages exchanged |
| callback_sent | bool | false | Whether callback has been sent |
| extracted_evidence | map | {} | Nested map with 14 evidence arrays (see below) |
| conversation_history | array | [] | Full conversation: {sender, text, timestamp} per message |
| strategy_state | string | "BUILDING_TRUST" | Self-correction strategy state |
| messages_since_evidence | int | 0 | Turns since last new high-value evidence |
| high_value_extracted | bool | false | True if UPI or bank account found |
| source | string | "guvi" | Request source: guvi or testing |
| started_at | string | (auto) | Session start time (ISO 8601) |
| created_at | string | (auto) | Document creation time |
| updated_at | string | (auto) | Last modification time |
extracted_evidence sub-fields (all List[str]):
bankAccounts, upiIds, phishingLinks, phoneNumbers, emailAddresses, suspiciousKeywords (max 15), ifscCodes, cryptoWallets, aadhaarNumbers, panNumbers, amounts, caseIds, policyNumbers, orderNumbers
evidence_index¶
Cross-session evidence index. Document ID is {type}:{normalized_value} (e.g., upi:fraud@oksbi).
| Field | Type | Description |
|---|---|---|
| type | string | Evidence type: upi, bank, phone, email |
| value | string | Original evidence value |
| sessions | array | List of session IDs where this evidence appeared |
| scam_types | array | Scam types associated with sessions |
| first_seen | string | ISO 8601 timestamp of first occurrence |
| last_seen | string | ISO 8601 timestamp of most recent occurrence |
| total_occurrences | int | Count of unique sessions |
| source | string | Request source |
Only high-value evidence types are indexed: UPI IDs, bank accounts, phone numbers, and email addresses.
rate_limits¶
Per-session rate limiting counters. Document ID is the sessionId.
| Field | Type | Description |
|---|---|---|
| total_messages | int | Lifetime message count for this session |
| minute_key | int | Current minute bucket (int(now / 60)) |
| minute_count | int | Messages in the current minute bucket |
| first_request | float | Unix timestamp of first request |
| last_request | float | Unix timestamp of most recent request |
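The counters above can be illustrated with a minimal in-memory sketch. The function name `record_request` and the plain-dict storage are assumptions made for illustration; only the field names and the `int(now / 60)` bucketing come from the schema.

```python
import time
from typing import Optional


def record_request(counters: dict, now: Optional[float] = None) -> dict:
    """Update one session's rate-limit counters for a single incoming message.

    Hypothetical helper: `counters` stands in for a rate_limits document.
    """
    now = time.time() if now is None else now
    minute_key = int(now / 60)  # same bucketing as the minute_key field

    if counters.get("minute_key") != minute_key:
        # A new minute bucket started, so the per-minute counter resets.
        counters["minute_key"] = minute_key
        counters["minute_count"] = 0

    counters["minute_count"] += 1
    counters["total_messages"] = counters.get("total_messages", 0) + 1
    counters.setdefault("first_request", now)  # set once, on the first call
    counters["last_request"] = now
    return counters
```

A caller would compare `minute_count` and `total_messages` against whatever limits the deployment enforces; those thresholds are not stated here, so the sketch leaves them out.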
Session State Machine¶
stateDiagram-v2
[*] --> INITIAL: New session created
INITIAL --> INITIAL: confidence <= 0.7
INITIAL --> ENGAGING: confidence > 0.7
ENGAGING --> ENGAGING: msg_count <= 5, no high-value evidence
ENGAGING --> COMPLIANT: msg_count > 5, no high-value evidence
ENGAGING --> EXTRACTION_SUCCESS: UPI or bank account extracted
COMPLIANT --> COMPLIANT: msg_count <= 10, no high-value evidence
COMPLIANT --> EXTRACTING: msg_count > 10
COMPLIANT --> EXTRACTION_SUCCESS: UPI or bank account extracted
EXTRACTING --> EXTRACTING: continued engagement
EXTRACTING --> EXTRACTION_SUCCESS: UPI or bank account extracted
EXTRACTION_SUCCESS --> EXTRACTION_SUCCESS: [terminal state]
State transitions are computed by Orchestrator._determine_state():
    def _determine_state(self, current_state, scam_type, confidence, message_count, evidence):
        # A new session only leaves INITIAL once the classifier is confident.
        if current_state == "INITIAL":
            return "ENGAGING" if confidence > 0.7 else "INITIAL"
        # High-value evidence takes priority over message-count thresholds.
        if evidence.upiIds or evidence.bankAccounts:
            return "EXTRACTION_SUCCESS"
        if message_count > 10:
            return "EXTRACTING"
        if message_count > 5:
            return "COMPLIANT"
        return "ENGAGING"
| State | Meaning |
|---|---|
| INITIAL | New session, classification in progress |
| ENGAGING | Scam detected, building trust with scammer |
| COMPLIANT | Good engagement (more than 5 messages), continuing extraction |
| EXTRACTING | Long engagement (more than 10 messages), actively extracting |
| EXTRACTION_SUCCESS | High-value evidence (UPI/bank) successfully obtained |
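To see the thresholds in action, the following sketch replays a short conversation through a standalone copy of the transition logic. The `Evidence` dataclass is a hypothetical stand-in for ExtractedIntelligence that carries only the two fields the function inspects, and the turn data is invented for the trace.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical stand-in: only the fields the state logic reads."""
    upiIds: list = field(default_factory=list)
    bankAccounts: list = field(default_factory=list)


def determine_state(current_state, confidence, message_count, evidence):
    # Standalone copy of the documented transition rules.
    if current_state == "INITIAL":
        return "ENGAGING" if confidence > 0.7 else "INITIAL"
    if evidence.upiIds or evidence.bankAccounts:
        return "EXTRACTION_SUCCESS"
    if message_count > 10:
        return "EXTRACTING"
    if message_count > 5:
        return "COMPLIANT"
    return "ENGAGING"


# (message_count, confidence, evidence) for each simulated turn
turns = [
    (1, 0.5, Evidence()),
    (2, 0.9, Evidence()),
    (5, 0.9, Evidence()),   # exactly 5 messages is still ENGAGING
    (6, 0.9, Evidence()),
    (11, 0.9, Evidence()),
    (12, 0.9, Evidence(upiIds=["fraud@oksbi"])),
]

state = "INITIAL"
trace = []
for message_count, confidence, evidence in turns:
    state = determine_state(state, confidence, message_count, evidence)
    trace.append(state)
```

Because the comparisons are strict, the session moves to COMPLIANT on the sixth message and to EXTRACTING on the eleventh.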
Strategy State Machine¶
Separate from the session state, the strategy state drives self-correction of the extraction approach:
stateDiagram-v2
[*] --> BUILDING_TRUST
BUILDING_TRUST --> EXTRACTING: msg_count >= 3 AND confidence > 0.6
BUILDING_TRUST --> BUILDING_TRUST: still building rapport
EXTRACTING --> DIRECT_PROBE: 4+ msgs without evidence, no high-value
EXTRACTING --> PIVOTING: high-value evidence obtained
DIRECT_PROBE --> BUILDING_TRUST: scammer disengaging (short responses)
DIRECT_PROBE --> DIRECT_PROBE: 3+ msgs, vary tactics
PIVOTING --> PIVOTING: continue extracting scammer identity
| Strategy State | Behavior |
|---|---|
| BUILDING_TRUST | Cooperative, confused victim. Ask basic questions. |
| EXTRACTING | Start requesting payment details. Express willingness but demand verification. |
| DIRECT_PROBE | Direct approach: "Send your UPI ID so I can pay." Express urgency. |
| PIVOTING | Payment details obtained. Now extract scammer's personal info: name, ID, address, email. |
NOT_SCAM bypass
When scam_type == "NOT_SCAM", the strategy stays at BUILDING_TRUST with natural conversation guidance. No extraction tactics are applied.
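The strategy transitions, including the NOT_SCAM bypass, could be expressed roughly as follows. This is a sketch under assumptions: the function name `next_strategy`, its parameters, and the boolean `scammer_disengaging` flag (standing in for the short-response check) are hypothetical, and staying in EXTRACTING when neither exit condition holds is assumed rather than shown in the diagram.

```python
def next_strategy(strategy, scam_type, message_count, confidence,
                  messages_since_evidence, high_value_extracted,
                  scammer_disengaging=False):
    """Return the next strategy state (illustrative, not the project's code)."""
    if scam_type == "NOT_SCAM":
        # Bypass: no extraction tactics for benign conversations.
        return "BUILDING_TRUST"

    if strategy == "BUILDING_TRUST":
        if message_count >= 3 and confidence > 0.6:
            return "EXTRACTING"
        return "BUILDING_TRUST"

    if strategy == "EXTRACTING":
        if high_value_extracted:
            return "PIVOTING"
        if messages_since_evidence >= 4:
            return "DIRECT_PROBE"
        return "EXTRACTING"  # assumed: keep extracting otherwise

    if strategy == "DIRECT_PROBE":
        # Back off to rapport building if the scammer is losing interest.
        return "BUILDING_TRUST" if scammer_disengaging else "DIRECT_PROBE"

    return strategy  # PIVOTING loops on itself
```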
Session Lifecycle¶
sequenceDiagram
participant H as Handler
participant FS as Firestore
participant EI as Evidence Index
Note over H,EI: Turn 1
H->>FS: get_session(id) → None
H->>FS: save_session(id, initial_state)
H->>H: Extract evidence, classify, generate response
H->>FS: batch_update_session(id, {<br/>evidence, scam_type, confidence,<br/>state, persona, strategy, msg_count})
H->>EI: store_evidence_index(id, evidence)
Note over H,EI: Turn 2+
H->>FS: get_session(id) → SessionState
H->>H: Extract evidence (full conversation)
H->>EI: find_matching_evidence(evidence)
EI-->>H: cross_session_match
H->>H: Classify, select persona, generate response
H->>FS: batch_update_session(id, merged_updates)
H->>EI: store_evidence_index(id, merged_evidence)
Key Implementation Details¶
Lazy initialization: The Firestore client is not created at import time. It is lazily initialized on first use to prevent timeout during Firebase code loading:
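    def _get_db():
        """Return the shared Firestore client, or None if Firestore is unavailable."""
        global _db, _firestore_available
        if _firestore_available is False:
            return None
        if _db is None:
            with _db_lock:
                # Double-checked locking: re-test after acquiring the lock.
                if _db is None and _firestore_available is not False:
                    _db = firestore.client()
        return _db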
In-memory fallback: If Firestore initialization fails, all operations fall back to a module-level _memory_sessions dict. This is logged as a warning but the system continues operating.
Batch writes: All session updates are combined into a single set(merge=True) call via batch_update_session(), replacing what was previously 4 separate writes (conversation, evidence, message count, session state).
Evidence accumulation: merge_evidence_locally() performs set union on all 14 evidence fields. The only field with a cap is suspiciousKeywords (limited to 15). For atomic server-side operations, accumulate_evidence() uses Firestore's ArrayUnion.
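A local merge along these lines might look like the sketch below. The function name `merge_evidence` and the choice to keep the first 15 keywords in insertion order are assumptions; the source only states that the merge is a set union and that suspiciousKeywords is capped at 15.

```python
KEYWORD_CAP = 15  # cap on suspiciousKeywords, per the schema


def merge_evidence(existing: dict, new: dict) -> dict:
    """Union two evidence maps field by field, preserving first-seen order."""
    merged = {}
    for field_name in list(existing) + [k for k in new if k not in existing]:
        combined = existing.get(field_name, []) + new.get(field_name, [])
        # dict.fromkeys removes duplicates while keeping insertion order.
        unique = list(dict.fromkeys(combined))
        if field_name == "suspiciousKeywords":
            unique = unique[:KEYWORD_CAP]  # assumed: oldest keywords win
        merged[field_name] = unique
    return merged
```

For example, merging `{"upiIds": ["fraud@oksbi"]}` with `{"upiIds": ["fraud@oksbi", "pay@okaxis"]}` yields both IDs exactly once.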
Callback Trigger Conditions¶
Callbacks are sent to GUVI from turn 1 onward (CALLBACK_MIN_TURN = 1). The rationale:
- GUVI's evaluator may stop at any turn (up to 10).
- The callback endpoint uses updateHoneyPotFinalResult with overwrite semantics.
- Sending every turn ensures the latest intelligence is always submitted.
- A failed callback on one turn will be retried on the next turn with fresher data.
The original should_send_callback() function in orchestrator.py defines the legacy conditions (used for delayed callbacks):
| Condition | Threshold |
|---|---|
| High-value evidence (UPI or bank) + engagement | message_count >= 10 |
| High confidence + keywords | confidence > 0.85 AND keywords >= 3 AND messages >= 8 |
| Long engagement | message_count >= 10 |
| Already sent | Skip (idempotency) |
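Read as code, the legacy conditions in the table amount to something like the following. The signature is hypothetical (the real `should_send_callback()` parameters are not shown here); only the thresholds are taken from the table.

```python
def should_send_callback(callback_sent: bool, message_count: int,
                         confidence: float, keyword_count: int,
                         has_high_value: bool) -> bool:
    """Legacy delayed-callback check (sketch of the table's conditions)."""
    if callback_sent:
        return False  # idempotency: never send twice
    if has_high_value and message_count >= 10:
        return True   # UPI or bank account plus engagement
    if confidence > 0.85 and keyword_count >= 3 and message_count >= 8:
        return True   # high confidence backed by keywords
    return message_count >= 10  # long engagement
```

With the thresholds as listed, the high-confidence rule is the only one that can fire before the tenth message.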
Cross-Session Evidence Linking¶
flowchart TD
subgraph "Session A (past)"
A_EV["UPI: fraud@oksbi<br/>Phone: 9876543210"]
end
subgraph "Evidence Index"
IDX_UPI["upi:fraud@oksbi<br/>sessions: [A]<br/>scam_types: [KYC_BANKING]"]
IDX_PHONE["phone:9876543210<br/>sessions: [A]<br/>scam_types: [KYC_BANKING]"]
end
subgraph "Session B (current)"
B_MSG["Scammer sends:<br/>'Send to fraud@oksbi'"]
B_EXTRACT["Extract: fraud@oksbi"]
B_LOOKUP["find_matching_evidence()"]
B_RESULT["is_known_scammer: true<br/>total_matching_sessions: 1<br/>known_scam_types: [KYC_BANKING]"]
end
A_EV -->|"store_evidence_index()"| IDX_UPI
A_EV -->|"store_evidence_index()"| IDX_PHONE
B_MSG --> B_EXTRACT
B_EXTRACT --> B_LOOKUP
B_LOOKUP --> IDX_UPI
IDX_UPI --> B_RESULT
Indexing¶
When evidence is stored via store_evidence_index(), each high-value item (UPI, bank, phone, email) gets its own document in evidence_index:
- Document ID: {type}:{normalized_value} (lowercased, spaces removed)
- Sessions array: Updated via ArrayUnion (race-safe)
- Scam types array: Updated via ArrayUnion
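The indexing rules above can be mimicked with a plain dict in place of the evidence_index collection. Both helper names (`evidence_doc_id`, `index_evidence`) are hypothetical, and the list-membership checks only imitate what ArrayUnion guarantees server-side; this is not the Firestore code path.

```python
def evidence_doc_id(evidence_type: str, value: str) -> str:
    """Build the {type}:{normalized_value} document ID."""
    normalized = value.lower().replace(" ", "")
    return f"{evidence_type}:{normalized}"


def index_evidence(index: dict, evidence_type: str, value: str,
                   session_id: str, scam_type: str) -> dict:
    """Upsert one evidence item into an in-memory stand-in for evidence_index."""
    doc = index.setdefault(
        evidence_doc_id(evidence_type, value),
        {"type": evidence_type, "value": value, "sessions": [], "scam_types": []},
    )
    # Add-if-absent, mirroring ArrayUnion's no-duplicates behavior.
    if session_id not in doc["sessions"]:
        doc["sessions"].append(session_id)
    if scam_type and scam_type not in doc["scam_types"]:
        doc["scam_types"].append(scam_type)
    doc["total_occurrences"] = len(doc["sessions"])
    return doc
```

Normalizing before building the ID is what lets "Fraud@OKSBI" and "fraud@oksbi" land in the same document.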
Lookup¶
find_matching_evidence() takes the current session's evidence and queries the index for each UPI, bank, phone, and email. Results are aggregated:
- matches: List of matching items with previous_sessions, occurrence_count, scam_types
- total_matching_sessions: Count of unique previous sessions
- known_scam_types: Union of all scam types from matching sessions
- is_known_scammer: True if any UPI or bank account match has occurrence_count > 0
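The aggregation can be sketched against an in-memory index shaped like the evidence_index documents. Everything about the signature is an assumption, as is excluding the current session from the match counts; the result keys follow the list above.

```python
FIELD_TO_TYPE = {
    "upiIds": "upi",
    "bankAccounts": "bank",
    "phoneNumbers": "phone",
    "emailAddresses": "email",
}


def find_matches(index: dict, evidence: dict, current_session_id: str) -> dict:
    """Look up each high-value item and aggregate matches (illustrative)."""
    matches, sessions, scam_types = [], set(), set()
    is_known_scammer = False
    for field_name, ev_type in FIELD_TO_TYPE.items():
        for value in evidence.get(field_name, []):
            doc = index.get(f"{ev_type}:{value.lower().replace(' ', '')}")
            if doc is None:
                continue
            # Assumed: only sessions other than the current one count.
            previous = [s for s in doc["sessions"] if s != current_session_id]
            if not previous:
                continue
            matches.append({
                "type": ev_type,
                "value": value,
                "previous_sessions": previous,
                "occurrence_count": len(previous),
                "scam_types": list(doc["scam_types"]),
            })
            sessions.update(previous)
            scam_types.update(doc["scam_types"])
            if ev_type in ("upi", "bank"):
                is_known_scammer = True  # only UPI/bank matches flag a scammer
    return {
        "matches": matches,
        "total_matching_sessions": len(sessions),
        "known_scam_types": sorted(scam_types),
        "is_known_scammer": is_known_scammer,
    }
```

A phone or email match still appears in `matches` but does not set `is_known_scammer` on its own.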
Impact on Processing¶
When is_known_scammer is True:
- Confidence boost: +0.1 * min(match_count, 3), capped at 0.95
- Aggressive prompt injection: A known-scammer alert with match details is added to the prompt, directing the persona to demand the scammer's employee ID, supervisor name, office address, email, and callback number
- Logging: Cross-session match details logged for audit
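As a worked example of the boost arithmetic, the sketch below applies the formula literally. The function name `boost_confidence` is hypothetical, and applying the 0.95 cap to the final value (rather than only to the increment) is an assumption about how "capped" is meant.

```python
def boost_confidence(confidence: float, match_count: int) -> float:
    """Raise confidence by 0.1 per matching item (at most 3), capped at 0.95."""
    boosted = confidence + 0.1 * min(match_count, 3)
    return min(boosted, 0.95)
```

So a session at 0.6 with one match rises to about 0.7, with five matches to about 0.9 (only three count), and a session at 0.8 with three matches stops at the 0.95 cap.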