Evidence Extraction

ScamShield AI extracts 14 types of evidence from scammer messages using regex patterns and keyword detection. Extraction runs over all scammer messages in the conversation, not just the current turn, so evidence from earlier turns is still captured even if those messages are not repeated in subsequent requests.

Extraction Architecture

flowchart LR
    subgraph "Input"
        MSG["All Scammer Messages<br/>(concatenated)"]
    end

    subgraph "Regex Extractors (regex_patterns.py)"
        UPI["UPI IDs"]
        BANK["Bank Accounts"]
        PHONE["Phone Numbers"]
        EMAIL["Emails"]
        URL["URLs / Phishing Links"]
        AMT["Amounts"]
        IFSC["IFSC Codes"]
        AADHAAR["Aadhaar Numbers"]
        PAN["PAN Numbers"]
        CRYPTO["Crypto Wallets"]
        CASE["Case IDs"]
        POLICY["Policy Numbers"]
        ORDER["Order Numbers"]
    end

    subgraph "Keyword Extractor (keywords.py)"
        KW["11 Keyword Categories<br/>(pre-compiled regex)"]
    end

    subgraph "Output"
        EV["ExtractedIntelligence<br/>(14 fields, camelCase)"]
    end

    MSG --> UPI & BANK & PHONE & EMAIL & URL & AMT & IFSC & AADHAAR & PAN & CRYPTO & CASE & POLICY & ORDER
    MSG --> KW
    UPI & BANK & PHONE & EMAIL & URL & AMT & IFSC & AADHAAR & PAN & CRYPTO & CASE & POLICY & ORDER --> EV
    KW --> EV

Evidence Types

1. UPI IDs (`upiIds`)

File: extractors/regex_patterns.py -- extract_upi_ids()

UPI (Unified Payments Interface) IDs follow the format username@bankhandle. The extractor uses two patterns:

Known-handle pattern (high confidence):

\b([a-zA-Z0-9][a-zA-Z0-9._-]{1,255}@(?:oksbi|okaxis|okicici|okhdfcbank|ybl|ibl|axl|sbi|icici|hdfc|paytm|gpay|phonepe|amazonpay|upi|apl|rapl|yapl|kotak|bob|pnb|boi|citi|freecharge|mobikwik|airtel))\b

Covers 25+ known UPI handles including OK-prefixed handles, bank codes, and payment apps.

Generic pattern (broader catch):

\b([a-zA-Z0-9][a-zA-Z0-9._-]{0,49}@[a-zA-Z0-9]{2,30})(?![-.])\b

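The negative lookahead (?![-.]), together with the trailing \b, prevents matching truncated email domains (e.g., offers@fake from offers@fake-amazon.com): the lookahead rejects a match that stops right before a hyphen or dot, and the word boundary rejects any shorter match that would end mid-word.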

Email exclusion: Results are filtered against a separate email pattern matching common providers (gmail, yahoo, outlook, hotmail) and common TLDs (.com, .org, .in).

Examples:

Input	Extracted
`Send money to fraud@oksbi`	`fraud@oksbi`
`Pay to scammer123@paytm`	`scammer123@paytm`
`UPI: user.name@ybl`	`user.name@ybl`
`contact@gmail.com`	(excluded -- email)
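The two-pattern approach above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `extract_upi_ids_sketch` is hypothetical, and the email filter is approximated by a provider-name check only (an assumption; the real filter also looks at TLDs).

```python
import re

# Both patterns are copied from the text above.
KNOWN_HANDLE_RE = re.compile(
    r"\b([a-zA-Z0-9][a-zA-Z0-9._-]{1,255}@(?:oksbi|okaxis|okicici|okhdfcbank|ybl|ibl"
    r"|axl|sbi|icici|hdfc|paytm|gpay|phonepe|amazonpay|upi|apl|rapl|yapl|kotak|bob"
    r"|pnb|boi|citi|freecharge|mobikwik|airtel))\b"
)
GENERIC_RE = re.compile(r"\b([a-zA-Z0-9][a-zA-Z0-9._-]{0,49}@[a-zA-Z0-9]{2,30})(?![-.])\b")

# Assumption: candidates whose handle is a common mail provider are treated as emails.
EMAIL_PROVIDERS = {"gmail", "yahoo", "outlook", "hotmail"}


def extract_upi_ids_sketch(text: str) -> list:
    """Run the known-handle pattern, then the generic one, keeping first-seen order."""
    found = []
    for pattern in (KNOWN_HANDLE_RE, GENERIC_RE):
        for match in pattern.finditer(text):
            candidate = match.group(1)
            handle = candidate.split("@", 1)[1].lower()
            if handle in EMAIL_PROVIDERS:
                continue  # looks like an email address, not a UPI ID
            if candidate not in found:
                found.append(candidate)
    return found


print(extract_upi_ids_sketch("Send money to fraud@oksbi or UPI: user.name@ybl"))
```

Running both patterns and deduplicating lets the generic pattern catch handles that are not in the known list without double-reporting the ones that are.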

2. Bank Accounts (`bankAccounts`)

File: extractors/regex_patterns.py -- extract_bank_accounts()

Two complementary patterns:

Contextual pattern (with keyword anchor):

(?:account\s*(?:number|no|num)?|a/c|ac|acct)[:\s#.-]*(\d{9,18})

Matches account numbers preceded by keywords like "account number", "a/c", "acct". Captures 9-18 digit numbers.

Standalone pattern (digit length only):

\b(\d{11,18})\b

Matches 11-18 digit numbers without context. The higher minimum (11 vs 9) reduces false positives from phone numbers.

Filtering rules:

Length must be 9-18 digits
Exactly 10 digits starting with 6-9 are excluded (phone numbers)
OTPs (4-6 digits) are excluded by minimum length
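A minimal sketch of how the two patterns and the filtering rules could fit together. The function and helper names are hypothetical, and the case-insensitive flag on the contextual pattern is an assumption rather than something the text states.

```python
import re

# Patterns copied from the text above.
CONTEXTUAL_RE = re.compile(
    r"(?:account\s*(?:number|no|num)?|a/c|ac|acct)[:\s#.-]*(\d{9,18})",
    re.IGNORECASE,  # assumption: keywords are matched case-insensitively
)
STANDALONE_RE = re.compile(r"\b(\d{11,18})\b")


def looks_like_mobile(digits: str) -> bool:
    """Exactly 10 digits starting with 6-9 is treated as a phone number."""
    return len(digits) == 10 and digits[0] in "6789"


def extract_bank_accounts_sketch(text: str) -> list:
    candidates = CONTEXTUAL_RE.findall(text) + STANDALONE_RE.findall(text)
    accounts = []
    for digits in candidates:
        if not 9 <= len(digits) <= 18:
            continue  # also drops 4-6 digit OTPs
        if looks_like_mobile(digits):
            continue
        if digits not in accounts:
            accounts.append(digits)
    return accounts


print(extract_bank_accounts_sketch("Account number: 123456789012"))
```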

3. Phone Numbers (`phoneNumbers`)

File: extractors/regex_patterns.py -- extract_phone_numbers()

Mobile pattern:

(?<!\d)(?:\+91[\s.-]?|91[\s.-]?|0)?([6-9]\d{9})(?!\d)

Lookbehind (?<!\d) prevents matching inside longer digit sequences (bank accounts)
Lookahead (?!\d) prevents partial matches
Captures 10-digit numbers starting with 6-9 (Indian mobile range)
Optional +91/91/0 prefix

Landline pattern:

(?<!\d)(0\d{2,4})[\s.-]?(\d{6,8})(?!\d)

Captures STD code (0XX to 0XXXX) + number (6-8 digits). Examples: 011-23456789, 0120-1234567.

Two-pass extraction: The extractor runs twice -- once on original text (preserves +91 prefix context) and once on cleaned text (spaces/dashes removed) to catch numbers with embedded formatting.

Output format: When +91 prefix is present in source, returns +91-XXXXXXXXXX format. Plain numbers return as XXXXXXXXXX. This maximizes substring matching against GUVI's fake values.
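The mobile pattern and the output-format rule can be illustrated with a short sketch. It covers only the first pass over the original text; the landline pattern and the second pass over cleaned text are omitted. The function name is hypothetical, and detecting the prefix via `match.group(0)` is an assumption about how the prefix context might be preserved.

```python
import re

# Mobile pattern copied from the text above.
MOBILE_RE = re.compile(r"(?<!\d)(?:\+91[\s.-]?|91[\s.-]?|0)?([6-9]\d{9})(?!\d)")


def extract_mobile_numbers_sketch(text: str) -> list:
    results = []
    for match in MOBILE_RE.finditer(text):
        number = match.group(1)
        # Keep the +91 prefix only when it was present in the source text.
        formatted = f"+91-{number}" if match.group(0).startswith("+91") else number
        if formatted not in results:
            results.append(formatted)
    return results


print(extract_mobile_numbers_sketch("Call +91 9876543210 or 9123456780"))
```

The lookbehind is what keeps the pattern from firing inside an 18-digit account number: every candidate start position after the first digit is preceded by another digit.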

4. Email Addresses (`emailAddresses`)

File: extractors/regex_patterns.py -- extract_emails()

\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b

Standard RFC-like email pattern. Results are lowercased and deduplicated.
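As a small illustration of the lowercase-and-deduplicate step, assuming first-seen order is kept (the text does not say how results are ordered, and the function name is hypothetical):

```python
import re

# Pattern copied from the text above.
EMAIL_RE = re.compile(r"\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b")


def extract_emails_sketch(text: str) -> list:
    emails = []
    for match in EMAIL_RE.finditer(text):
        email = match.group(1).lower()
        if email not in emails:
            emails.append(email)
    return emails


print(extract_emails_sketch("Mail Support@Example.com or support@example.com"))
```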

UPI Deduplication

In extract_all_evidence(), emails that were also captured as UPI IDs are removed from the email list. UPI IDs take precedence for scam-related @handle patterns.

5. URLs / Phishing Links (`phishingLinks`)

File: extractors/regex_patterns.py -- extract_urls() + is_suspicious_url()

URL patterns:

Full URLs: https?://[^\s<>"{}|\\^`\[\]]+
www-prefixed (no protocol): (?<!//)www\.[a-zA-Z0-9][-a-zA-Z0-9]*\.[a-zA-Z]{2,}[^\s...]*

Trailing punctuation (., ,, ;, :, !, ?, )) is stripped.

Suspicion scoring (is_suspicious_url()):

A URL is flagged as suspicious if any of these conditions match:

Check	Examples
Shortened domains	`bit.ly`, `tinyurl.com`, `goo.gl`, `t.co`, `ow.ly`, `is.gd`, `buff.ly`, `adf.ly`, `bit.do`, `mcaf.ee`
Suspicious TLDs	`.xyz`, `.top`, `.tk`, `.ml`, `.ga`, `.cf`, `.gq`
Phishing keywords in URL	`login`, `verify`, `secure`, `update`, `confirm`, `account`, `bank`, `kyc`, `invest`, `mining`, `crypto`, `bitcoin`, `payment`, `refund`, `reward`, `prize`, `lottery`, `job`, `earn`, `apply`, `register`, `offer`, `bonus`, `win`, `claim`, `wallet`, `trading`, `fake`, `free`
Brand impersonation	`sbi`, `hdfc`, `icici`, `axis`, `paytm`, `phonepe`, `gpay`, `amazon`, `flipkart`, `google`, `microsoft`, `apple`, `rbi`

In the GUVI output format (transform_to_guvi_format), both suspicious URLs and all URLs are merged into phishingLinks via set union, ensuring complete URL coverage.
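The four checks in the table can be sketched as a single boolean function. The lists below are shortened subsets of the ones in the table, the function name is hypothetical, and matching keywords and brands as substrings of the whole URL is an assumption about how the checks are applied.

```python
from urllib.parse import urlparse

# Shortened subsets of the lists in the table above.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co"}
SUSPICIOUS_TLDS = (".xyz", ".top", ".tk", ".ml")
PHISHING_KEYWORDS = ("login", "verify", "kyc", "refund", "prize")
BRANDS = ("sbi", "hdfc", "paytm", "amazon")


def is_suspicious_url_sketch(url: str) -> bool:
    url = url.rstrip(".,;:!?)")  # strip trailing punctuation, as described above
    lowered = url.lower()
    # www-prefixed URLs have no scheme, so add one before parsing.
    parsed = urlparse(lowered if "://" in lowered else "http://" + lowered)
    host = parsed.hostname or ""
    return (
        host in SHORTENERS
        or host.endswith(SUSPICIOUS_TLDS)
        or any(keyword in lowered for keyword in PHISHING_KEYWORDS)
        or any(brand in lowered for brand in BRANDS)
    )


print(is_suspicious_url_sketch("https://bit.ly/abc"))
```

Under the substring assumption, a genuine brand URL would also be flagged by the brand check; that is harmless here because all URLs end up in phishingLinks anyway.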

6. Amounts (`amounts`)

File: extractors/regex_patterns.py -- extract_amounts()

(?:Rs\.?|₹|INR)\s*(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)
|
(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)\s*(?:rupees?|lakhs?|lacs?|crores?)

Two branches: a prefix pattern (Rs., ₹, or INR before the number) and a suffix pattern (rupees, lakhs, lacs, or crores after the number).

Normalization: Commas are removed. Amounts below 100 are filtered out as non-scam-relevant.

Examples:

Input	Extracted
`Rs. 10,000`	`10000`
`50000 rupees`	`50000`
`INR 5,00,000.50`	`500000.50`
`Rs. 50`	(filtered -- below 100)
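The examples in the table can be reproduced with a short sketch. The function name is hypothetical and no regex flags are used, which is an assumption; the text does not say whether matching is case-insensitive.

```python
import re

# Both branches copied from the text above, joined with "|".
AMOUNT_RE = re.compile(
    r"(?:Rs\.?|₹|INR)\s*(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)"
    r"|"
    r"(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)\s*(?:rupees?|lakhs?|lacs?|crores?)"
)


def extract_amounts_sketch(text: str) -> list:
    amounts = []
    for match in AMOUNT_RE.finditer(text):
        raw = match.group(1) or match.group(2)  # whichever branch matched
        value = raw.replace(",", "")            # normalize: remove commas
        if float(value) < 100:
            continue                            # below 100: not scam-relevant
        if value not in amounts:
            amounts.append(value)
    return amounts


print(extract_amounts_sketch("Pay Rs. 10,000 or 50000 rupees, not Rs. 50"))
```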

7. IFSC Codes (`ifscCodes`)

File: extractors/regex_patterns.py -- extract_ifsc_codes()

\b([A-Z]{4}0[A-Z0-9]{6})\b

Format: 4 uppercase letters (bank code) + literal 0 + 6 alphanumeric characters. Input is uppercased before matching.

Examples: SBIN0001234, HDFC0000001, ICIC0006543
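A minimal sketch of the uppercase-then-match step (the function name is hypothetical, and keeping first-seen order is an assumption):

```python
import re

# Pattern copied from the text above.
IFSC_RE = re.compile(r"\b([A-Z]{4}0[A-Z0-9]{6})\b")


def extract_ifsc_codes_sketch(text: str) -> list:
    # Uppercase first so that lowercase input such as "sbin0001234" still matches.
    return list(dict.fromkeys(IFSC_RE.findall(text.upper())))


print(extract_ifsc_codes_sketch("ifsc: sbin0001234 and HDFC0000001"))
```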

8. Aadhaar Numbers (`aadhaarNumbers`)

File: extractors/regex_patterns.py -- extract_aadhaar_numbers()

(?<!\d)([2-9]\d{3})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)

12 digits starting with 2-9, optionally separated by spaces or hyphens in groups of 4.

Verhoeff checksum validation: Every candidate is validated using the Verhoeff algorithm, which is the official checksum used by UIDAI for Aadhaar numbers. This eliminates most false positives from random 12-digit sequences.

The implementation uses two precomputed lookup tables (_VERHOEFF_D and _VERHOEFF_P) for the dihedral group D5 multiplication and permutation operations:

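def _verhoeff_checksum(num: str) -> bool:
    """Return True if the digit string passes the Verhoeff check."""
    c = 0
    # Process digits right to left; the position selects the permutation row.
    for i, digit in enumerate(reversed(num)):
        c = _VERHOEFF_D[c][_VERHOEFF_P[i % 8][int(digit)]]
    # A valid number reduces to the identity element, 0.
    return c == 0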
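For readers who want to run the check, here is a self-contained sketch with the standard Verhoeff tables filled in. The table values are the published ones for the algorithm; the names `verhoeff_is_valid`, `verhoeff_check_digit`, and `_VERHOEFF_INV` are illustrative and not taken from the project. The check-digit helper is included only so a test value can be built without quoting a real Aadhaar number.

```python
# Multiplication table of the dihedral group D5.
_VERHOEFF_D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6), (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8), (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2), (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4), (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)
# Position-dependent permutation table (period 8).
_VERHOEFF_P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2), (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0), (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5), (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)
# Inverse of each element in D5 (hypothetical name, used for generation only).
_VERHOEFF_INV = (0, 4, 3, 2, 1, 5, 6, 7, 8, 9)


def verhoeff_is_valid(num: str) -> bool:
    """Same loop as the function above: valid if the running value ends at 0."""
    c = 0
    for i, digit in enumerate(reversed(num)):
        c = _VERHOEFF_D[c][_VERHOEFF_P[i % 8][int(digit)]]
    return c == 0


def verhoeff_check_digit(payload: str) -> str:
    """Compute the digit that makes payload + digit valid."""
    c = 0
    for i, digit in enumerate(reversed(payload)):
        c = _VERHOEFF_D[c][_VERHOEFF_P[(i + 1) % 8][int(digit)]]
    return str(_VERHOEFF_INV[c])


print(verhoeff_is_valid("2363"), verhoeff_is_valid("2364"))
```

The short value 2363 is the commonly cited worked example for the algorithm (payload 236, check digit 3).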

9. PAN Numbers (`panNumbers`)

File: extractors/regex_patterns.py -- extract_pan_numbers()

\b([A-Z]{5}\d{4}[A-Z])\b

Format: 5 uppercase letters + 4 digits + 1 uppercase letter.

Entity code validation: The 4th character must be a valid entity type code from the set {A, B, C, F, G, H, J, L, P, T}:

Code	Entity Type
A	Association of Persons
B	Body of Individuals
C	Company
F	Firm
G	Government
H	Hindu Undivided Family
J	Artificial Juridical Person
L	Local Authority
P	Individual (Person)
T	Trust
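The pattern and the entity-code check combine as in the sketch below. The function name is hypothetical, and the PAN values in the example are made up for illustration.

```python
import re

# Pattern copied from the text above.
PAN_RE = re.compile(r"\b([A-Z]{5}\d{4}[A-Z])\b")
# Valid entity type codes for the 4th character, from the table above.
VALID_ENTITY_CODES = set("ABCFGHJLPT")


def extract_pan_numbers_sketch(text: str) -> list:
    candidates = dict.fromkeys(PAN_RE.findall(text))  # dedupe, keep order
    # Index 3 is the 4th character, which encodes the entity type.
    return [pan for pan in candidates if pan[3] in VALID_ENTITY_CODES]


print(extract_pan_numbers_sketch("PAN ABCPE1234F and ABCDE1234F"))
```

The second value is rejected because its 4th character, D, is not a valid entity code.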

10. Crypto Wallets (`cryptoWallets`)

File: extractors/regex_patterns.py -- extract_crypto_wallets()

Four patterns for major blockchain networks:

Network	Pattern	Example
Bitcoin Legacy	`[13][a-km-zA-HJ-NP-Z1-9]{25,34}`	`1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa`
Bitcoin Bech32	`bc1[a-zA-HJ-NP-Z0-9]{25,89}`	`bc1qar0srrr7xfkvy5l643lydnw9re59gt...`
Ethereum/EVM	`0x[0-9a-fA-F]{40}`	`0x742d35Cc6634C0532925a3b844Bc9e7595...`
Tron	`T[a-zA-HJ-NP-Z0-9]{33}`	`TJYs9MZqLzJf9b8YTcQo9Q9JqBPeJPwHJV`

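The Bitcoin Legacy pattern uses the Base58 alphabet, which excludes 0, O, I, and l to avoid visual ambiguity (standard for Bitcoin). The Bech32 and Tron patterns shown above are looser: they exclude only uppercase I and O.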
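A minimal sketch of running the four patterns from the table over a message. Wrapping each pattern in word boundaries is an assumption added here so that one network's pattern does not fire in the middle of another network's address; the network labels and function name are hypothetical.

```python
import re

# Pattern bodies copied from the table above.
CRYPTO_PATTERNS = {
    "bitcoin_legacy": r"[13][a-km-zA-HJ-NP-Z1-9]{25,34}",
    "bitcoin_bech32": r"bc1[a-zA-HJ-NP-Z0-9]{25,89}",
    "ethereum": r"0x[0-9a-fA-F]{40}",
    "tron": r"T[a-zA-HJ-NP-Z0-9]{33}",
}
# Assumption: anchor each pattern with \b on both sides.
COMPILED = {name: re.compile(r"\b(" + body + r")\b") for name, body in CRYPTO_PATTERNS.items()}


def extract_crypto_wallets_sketch(text: str) -> list:
    """Return (network, address) pairs in pattern order."""
    hits = []
    for name, pattern in COMPILED.items():
        for address in pattern.findall(text):
            if (name, address) not in hits:
                hits.append((name, address))
    return hits


print(extract_crypto_wallets_sketch("Send BTC to 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa today"))
```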

11. Case IDs (`caseIds`)

File: extractors/regex_patterns.py -- extract_case_ids()

Three patterns for law enforcement and administrative references:

FIR numbers:

\bFIR[\s./-]*(?:No\.?\s*)?(\d{4}[\s./-]+\d{3,7}|\d{3,12})\b

Captures FIR-2025-12345, FIR No. 12345, FIR 12345. Normalized to FIR-DIGITS format.

Agency case references:

\b((?:CBI|ED|NCB|NIA|CFSL|SFIO)[\s./-]*\d{4}[\s./-]*\d{3,7})\b

Captures references from CBI, Enforcement Directorate, NCB, NIA, CFSL, SFIO.

Generic case/complaint/ticket references:

(?:case|complaint|ticket|reference|ref|CR|CC)[\s.#:/-]*(?:no\.?|number|id)?\s*([A-Z0-9]{2,5}[\s./-]*\d{4,12}|\d{5,12})

Deduplication: Short matches that are prefixes of longer matches are removed (e.g., CBI-2025 is dropped if CBI-2025-1234 exists).
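The prefix-based deduplication can be expressed in a few lines. This is a sketch of the rule as described, with a hypothetical function name; it assumes the IDs have already been normalized to a common separator.

```python
def drop_prefix_matches(case_ids: list) -> list:
    """Remove any ID that is a strict prefix of another ID in the list."""
    return [
        cid
        for cid in case_ids
        if not any(other != cid and other.startswith(cid) for other in case_ids)
    ]


print(drop_prefix_matches(["CBI-2025", "CBI-2025-1234", "FIR-12345"]))
```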

12. Policy Numbers (`policyNumbers`)

File: extractors/regex_patterns.py -- extract_policy_numbers()

(?:policy|insurance|plan|lic)[\s.#:/-]*(?:no\.?|number|id)?\s*([A-Z0-9][-A-Z0-9]{3,24})

Captures insurance/policy references anchored by keywords. Both hyphenated and dehyphenated forms are returned for matching flexibility.

Filtering: Trivially generic matches (all same digit, too short) are excluded. Minimum 5 characters after dehyphenation.

13. Order Numbers (`orderNumbers`)

File: extractors/regex_patterns.py -- extract_order_numbers()

Two patterns:

Keyword-anchored:

(?:order|tracking|shipment|delivery|parcel|consignment)[\s.#:/-]*(?:no\.?|number|id)?\s*([A-Z]{0,4}\d{3,15})

Prefix-based (explicit order prefixes):

\b((?:OD|ORD|AWB|TRK|SHP|PKG|INV|DLV)[A-Z0-9]{5,15})\b

Recognizes standard e-commerce and logistics prefixes: OD (order), ORD, AWB (airway bill), TRK (tracking), SHP (shipment), PKG (package), INV (invoice), DLV (delivery).

14. Suspicious Keywords (`suspiciousKeywords`)

File: extractors/keywords.py -- extract_suspicious_keywords()

11 keyword categories with 90+ terms, pre-compiled to regex patterns at module load time:

Category	Weight	Example Keywords
`authority`	0.15	police, CBI, cyber cell, court, ED, NCB, customs, income tax
`threat`	0.15	arrest, jail, prison, drugs, parcel, money laundering, hawala, terrorism
`urgency`	0.12	urgent, immediately, 24 hours, today only, last warning, final notice
`otp`	0.12	OTP, one time password, verification code, CVV, PIN, secret code
`banking`	0.10	account blocked, account suspended, RBI, NPCI, bank verification
`kyc`	0.10	KYC, kyc update, kyc expired, PAN, Aadhaar
`crypto`	0.10	bitcoin, BTC, ethereum, USDT, blockchain, mining, guaranteed returns
`money`	0.08	transfer, payment, fee, charges, refund, cashback, deposit, penalty
`action`	0.08	click here, download, install, send money, pay now
`job`	0.05	work from home, daily earning, easy money, commission, youtube likes
`lottery`	0.05	winner, prize, lottery, lucky draw, KBC, jackpot

Matching: Uses word boundary regex (\b) to prevent false positives (e.g., "ED" does not match "compromised", "FIR" does not match "confirm").

Scoring (get_keyword_score()): Returns 0.0-1.0 by summing weights for each category with at least one match. Capped at 1.0.

Output: Sorted by length (shorter = more specific), limited to 15 keywords.
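The scoring rule can be sketched as below. The weights are taken from the table, but each category carries only two of its terms, and the names `CATEGORIES` and `get_keyword_score_sketch` are illustrative. Case-insensitive matching is an assumption.

```python
import re

# Weights from the table above; term lists are shortened to two entries each.
CATEGORIES = {
    "authority": (0.15, ["police", "ED"]),
    "threat": (0.15, ["arrest", "jail"]),
    "urgency": (0.12, ["urgent", "immediately"]),
    "otp": (0.12, ["OTP", "CVV"]),
    "banking": (0.10, ["account blocked", "RBI"]),
    "kyc": (0.10, ["KYC", "kyc expired"]),
    "crypto": (0.10, ["bitcoin", "USDT"]),
    "money": (0.08, ["refund", "penalty"]),
    "action": (0.08, ["click here", "pay now"]),
    "job": (0.05, ["work from home", "easy money"]),
    "lottery": (0.05, ["lottery", "jackpot"]),
}

# Pre-compile one word-boundary pattern per category at import time.
_COMPILED = {
    name: (weight, re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b", re.IGNORECASE))
    for name, (weight, terms) in CATEGORIES.items()
}


def get_keyword_score_sketch(text: str) -> float:
    """Sum the weight of every category with at least one match, capped at 1.0."""
    total = sum((weight for weight, pattern in _COMPILED.values() if pattern.search(text)), 0.0)
    return min(total, 1.0)


print(get_keyword_score_sketch("Police will arrest you immediately"))
```

The table's weights add up to 1.10, so the cap only takes effect when every category matches. The word boundaries keep "ED" from matching the end of "compromised".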

Evidence Accumulation

Evidence is accumulated across the entire session via merge_evidence_locally() in firestore/sessions.py:

def merge_evidence_locally(existing_evidence, new_evidence):
    merged = {}
    for field in EVIDENCE_FIELDS:
        existing = existing_evidence.get(field, [])
        new = new_evidence.get(field, [])
        combined = list(set(existing) | set(new))
        # Apply field-specific limits
        limit = _FIELD_LIMITS.get(field)
        if limit is not None:
            combined = combined[:limit]
        merged[field] = combined
    return merged

The only field with a limit is suspiciousKeywords (capped at 15).
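One detail of the merge function above is worth noting: a Python set has no defined order, so truncating list(set(...)) to a limit keeps an arbitrary subset. The sketch below shows one possible order-preserving variant using dict.fromkeys. It is not the project's implementation; `merge_evidence_ordered` is a hypothetical name, and `EVIDENCE_FIELDS` here is a two-field stand-in for the real list.

```python
# Hypothetical stand-ins for the module-level constants used above.
EVIDENCE_FIELDS = ["upiIds", "suspiciousKeywords"]
_FIELD_LIMITS = {"suspiciousKeywords": 15}


def merge_evidence_ordered(existing_evidence: dict, new_evidence: dict) -> dict:
    merged = {}
    for field in EVIDENCE_FIELDS:
        existing = existing_evidence.get(field, [])
        new = new_evidence.get(field, [])
        # dict.fromkeys deduplicates while keeping first-seen order.
        combined = list(dict.fromkeys([*existing, *new]))
        limit = _FIELD_LIMITS.get(field)
        merged[field] = combined if limit is None else combined[:limit]
    return merged


old = {"upiIds": ["a@ybl"]}
new = {"upiIds": ["a@ybl", "b@paytm"]}
print(merge_evidence_ordered(old, new))
```

With this variant, the earliest keywords survive the 15-item cap, and repeated merges of the same input give the same result.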

For atomic Firestore updates (e.g., in accumulate_evidence()), the system uses ArrayUnion to avoid read-compute-write race conditions.

Format Conversion

Evidence is extracted internally using snake_case keys and converted to camelCase for the GUVI API via transform_to_guvi_format():

Internal Key	GUVI Key
`upi_ids`	`upiIds`
`bank_accounts`	`bankAccounts`
`phone_numbers`	`phoneNumbers`
`email_addresses`	`emailAddresses`
`phishing_links` + `urls`	`phishingLinks` (merged via set union)
`amounts`	`amounts`
`ifsc_codes`	`ifscCodes`
`aadhaar_numbers`	`aadhaarNumbers`
`pan_numbers`	`panNumbers`
`crypto_wallets`	`cryptoWallets`
`keywords`	`suspiciousKeywords`
`case_ids`	`caseIds`
`policy_numbers`	`policyNumbers`
`order_numbers`	`orderNumbers`
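The mapping in the table can be sketched as a dictionary plus one special case for the merged URL field. The function name `to_guvi_format_sketch` is hypothetical, and sorting the merged links is an assumption made only to give deterministic output.

```python
# Direct one-to-one renames from the table above.
KEY_MAP = {
    "upi_ids": "upiIds",
    "bank_accounts": "bankAccounts",
    "phone_numbers": "phoneNumbers",
    "email_addresses": "emailAddresses",
    "amounts": "amounts",
    "ifsc_codes": "ifscCodes",
    "aadhaar_numbers": "aadhaarNumbers",
    "pan_numbers": "panNumbers",
    "crypto_wallets": "cryptoWallets",
    "keywords": "suspiciousKeywords",
    "case_ids": "caseIds",
    "policy_numbers": "policyNumbers",
    "order_numbers": "orderNumbers",
}


def to_guvi_format_sketch(evidence: dict) -> dict:
    out = {guvi: list(evidence.get(internal, [])) for internal, guvi in KEY_MAP.items()}
    # phishing_links and urls are merged into one field via set union.
    links = set(evidence.get("phishing_links", [])) | set(evidence.get("urls", []))
    out["phishingLinks"] = sorted(links)
    return out


sample = {"upi_ids": ["a@ybl"], "phishing_links": ["http://x.tk"], "urls": ["http://x.tk", "http://y.com"]}
print(to_guvi_format_sketch(sample)["phishingLinks"])
```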
