Evidence Extraction

ScamShield AI extracts 14 types of evidence from scammer messages using a combination of regex patterns and keyword detection. Evidence is extracted from all scammer messages in the conversation (not just the current turn), ensuring nothing is missed even if earlier messages are not repeated in subsequent requests.


Extraction Architecture

flowchart LR
    subgraph "Input"
        MSG["All Scammer Messages<br/>(concatenated)"]
    end

    subgraph "Regex Extractors (regex_patterns.py)"
        UPI["UPI IDs"]
        BANK["Bank Accounts"]
        PHONE["Phone Numbers"]
        EMAIL["Emails"]
        URL["URLs / Phishing Links"]
        AMT["Amounts"]
        IFSC["IFSC Codes"]
        AADHAAR["Aadhaar Numbers"]
        PAN["PAN Numbers"]
        CRYPTO["Crypto Wallets"]
        CASE["Case IDs"]
        POLICY["Policy Numbers"]
        ORDER["Order Numbers"]
    end

    subgraph "Keyword Extractor (keywords.py)"
        KW["11 Keyword Categories<br/>(pre-compiled regex)"]
    end

    subgraph "Output"
        EV["ExtractedIntelligence<br/>(14 fields, camelCase)"]
    end

    MSG --> UPI & BANK & PHONE & EMAIL & URL & AMT & IFSC & AADHAAR & PAN & CRYPTO & CASE & POLICY & ORDER
    MSG --> KW
    UPI & BANK & PHONE & EMAIL & URL & AMT & IFSC & AADHAAR & PAN & CRYPTO & CASE & POLICY & ORDER --> EV
    KW --> EV

Evidence Types

1. UPI IDs (upiIds)

File: extractors/regex_patterns.py -- extract_upi_ids()

UPI (Unified Payments Interface) IDs follow the format username@bankhandle. The extractor uses two patterns:

Known-handle pattern (high confidence):

\b([a-zA-Z0-9][a-zA-Z0-9._-]{1,255}@(?:oksbi|okaxis|okicici|okhdfcbank|ybl|ibl|axl|sbi|icici|hdfc|paytm|gpay|phonepe|amazonpay|upi|apl|rapl|yapl|kotak|bob|pnb|boi|citi|freecharge|mobikwik|airtel))\b

Covers 25+ known UPI handles including OK-prefixed handles, bank codes, and payment apps.

Generic pattern (broader catch):

\b([a-zA-Z0-9][a-zA-Z0-9._-]{0,49}@[a-zA-Z0-9]{2,30})(?![-.])\b

The negative lookahead (?![-.]), followed by the word boundary \b, prevents the pattern from matching a truncated email domain (e.g., extracting offers@fake from offers@fake-amazon.com).

Email exclusion: Results are filtered against a separate email pattern matching common providers (gmail, yahoo, outlook, hotmail) and common TLDs (.com, .org, .in).

Examples:

| Input | Extracted |
|---|---|
| Send money to fraud@oksbi | fraud@oksbi |
| Pay to scammer123@paytm | scammer123@paytm |
| UPI: user.name@ybl | user.name@ybl |
| contact@gmail.com | (excluded -- email) |
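
As a rough illustration of how the two patterns and the email exclusion could fit together, here is a minimal sketch. The two UPI patterns are copied from this section; `EMAIL_LIKE_RE` is a hypothetical stand-in for the separate email pattern, which is not shown here, and the first-seen ordering is an assumption.

```python
import re

# Both UPI patterns are copied from this section.
KNOWN_HANDLE_RE = re.compile(
    r"\b([a-zA-Z0-9][a-zA-Z0-9._-]{1,255}@(?:oksbi|okaxis|okicici|okhdfcbank|ybl|ibl|axl"
    r"|sbi|icici|hdfc|paytm|gpay|phonepe|amazonpay|upi|apl|rapl|yapl|kotak|bob|pnb|boi"
    r"|citi|freecharge|mobikwik|airtel))\b"
)
GENERIC_RE = re.compile(r"\b([a-zA-Z0-9][a-zA-Z0-9._-]{0,49}@[a-zA-Z0-9]{2,30})(?![-.])\b")

# Assumption: a stand-in for the real email exclusion pattern.
EMAIL_LIKE_RE = re.compile(
    r"@(?:gmail|yahoo|outlook|hotmail)\b|\.(?:com|org|in)$", re.IGNORECASE
)


def extract_upi_ids(text: str) -> list[str]:
    """Return UPI-looking IDs in first-seen order, skipping email-like matches."""
    candidates = KNOWN_HANDLE_RE.findall(text) + GENERIC_RE.findall(text)
    kept: dict[str, None] = {}
    for candidate in candidates:
        if not EMAIL_LIKE_RE.search(candidate):
            kept.setdefault(candidate, None)  # dict keys deduplicate, keep order
    return list(kept)
```

Note that for `contact@gmail.com` the generic pattern already fails on its own because of the lookahead; the email filter matters for inputs such as a bare `contact@gmail`.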

2. Bank Accounts (bankAccounts)

File: extractors/regex_patterns.py -- extract_bank_accounts()

Two complementary patterns:

Contextual pattern (with keyword anchor):

(?:account\s*(?:number|no|num)?|a/c|ac|acct)[:\s#.-]*(\d{9,18})

Matches account numbers preceded by keywords like "account number", "a/c", "acct". Captures 9-18 digit numbers.

Standalone pattern (digit length only):

\b(\d{11,18})\b

Matches 11-18 digit numbers without context. The higher minimum (11 vs 9) reduces false positives from phone numbers.

Filtering rules:

  • Length must be 9-18 digits
  • Exactly 10 digits starting with 6-9 are excluded (phone numbers)
  • OTPs (4-6 digits) are excluded by minimum length
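
The two patterns and the filtering rules can be sketched as follows. The regexes are taken from this section; the case-insensitive flag on the contextual pattern and the first-seen ordering are assumptions of this sketch, not confirmed details of the implementation.

```python
import re

# Patterns from this section. IGNORECASE on the contextual one is an assumption.
CONTEXT_RE = re.compile(
    r"(?:account\s*(?:number|no|num)?|a/c|ac|acct)[:\s#.-]*(\d{9,18})", re.IGNORECASE
)
STANDALONE_RE = re.compile(r"\b(\d{11,18})\b")


def extract_bank_accounts(text: str) -> list[str]:
    """Collect candidate account numbers and apply the filtering rules."""
    found: dict[str, None] = {}
    for number in CONTEXT_RE.findall(text) + STANDALONE_RE.findall(text):
        if not 9 <= len(number) <= 18:
            continue  # outside the accepted length range
        if len(number) == 10 and number[0] in "6789":
            continue  # looks like an Indian mobile number
        found.setdefault(number, None)
    return list(found)
```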

3. Phone Numbers (phoneNumbers)

File: extractors/regex_patterns.py -- extract_phone_numbers()

Mobile pattern:

(?<!\d)(?:\+91[\s.-]?|91[\s.-]?|0)?([6-9]\d{9})(?!\d)

  • Lookbehind (?<!\d) prevents matching inside longer digit sequences (bank accounts)
  • Lookahead (?!\d) prevents partial matches
  • Captures 10-digit numbers starting with 6-9 (Indian mobile range)
  • Optional +91/91/0 prefix

Landline pattern:

(?<!\d)(0\d{2,4})[\s.-]?(\d{6,8})(?!\d)

Captures STD code (0XX to 0XXXX) + number (6-8 digits). Examples: 011-23456789, 0120-1234567.

Two-pass extraction: The extractor runs twice -- once on original text (preserves +91 prefix context) and once on cleaned text (spaces/dashes removed) to catch numbers with embedded formatting.

Output format: When +91 prefix is present in source, returns +91-XXXXXXXXXX format. Plain numbers return as XXXXXXXXXX. This maximizes substring matching against GUVI's fake values.
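
A minimal sketch of the two-pass idea for mobile numbers is shown below. It adapts the mobile pattern by capturing the optional prefix so the +91 case can be detected; that adaptation, the helper name `_scan`, and the exact cleaning rule are assumptions. The landline pattern is left out for brevity.

```python
import re

# Mobile pattern from this section, with the optional prefix captured
# (an adaptation for this sketch) so a +91 prefix can be detected.
MOBILE_RE = re.compile(r"(?<!\d)(\+91[\s.-]?|91[\s.-]?|0)?([6-9]\d{9})(?!\d)")


def _scan(text: str) -> list[str]:
    """Hypothetical helper: one pass of the mobile pattern over a string."""
    results = []
    for prefix, number in MOBILE_RE.findall(text):
        results.append(f"+91-{number}" if prefix.startswith("+91") else number)
    return results


def extract_phone_numbers(text: str) -> list[str]:
    # Pass 1 uses the original text; pass 2 removes whitespace and dashes
    # so that numbers such as "98765 43210" become contiguous.
    cleaned = re.sub(r"[\s-]", "", text)
    return list(dict.fromkeys(_scan(text) + _scan(cleaned)))
```

The lookbehind keeps the pattern from picking a 10-digit slice out of a longer digit run, which is why a 15-digit account number yields no phone number in either pass.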


4. Email Addresses (emailAddresses)

File: extractors/regex_patterns.py -- extract_emails()

\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b

Standard RFC-like email pattern. Results are lowercased and deduplicated.
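
For illustration, the pattern plus the lowercase and deduplicate step might look like this. It is a sketch only; preserving first-seen order is an assumption.

```python
import re

# Email pattern from this section.
EMAIL_RE = re.compile(r"\b([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b")


def extract_emails(text: str) -> list[str]:
    """Lowercase every match and drop duplicates, keeping first-seen order."""
    lowered = (match.lower() for match in EMAIL_RE.findall(text))
    return list(dict.fromkeys(lowered))
```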

UPI Deduplication

In extract_all_evidence(), emails that were also captured as UPI IDs are removed from the email list. UPI IDs take precedence for scam-related @handle patterns.


5. URLs / Phishing Links (phishingLinks)

File: extractors/regex_patterns.py -- extract_urls() + is_suspicious_url()

URL patterns:

  • Full URLs: https?://[^\s<>"{}|\\^`\[\]]+
  • www-prefixed (no protocol): (?<!//)www\.[a-zA-Z0-9][-a-zA-Z0-9]*\.[a-zA-Z]{2,}[^\s...]*

Trailing punctuation (., ,, ;, :, !, ?, )) is stripped.

Suspicion scoring (is_suspicious_url()):

A URL is flagged as suspicious if any of these conditions match:

| Check | Examples |
|---|---|
| Shortened domains | bit.ly, tinyurl.com, goo.gl, t.co, ow.ly, is.gd, buff.ly, adf.ly, bit.do, mcaf.ee |
| Suspicious TLDs | .xyz, .top, .tk, .ml, .ga, .cf, .gq |
| Phishing keywords in URL | login, verify, secure, update, confirm, account, bank, kyc, invest, mining, crypto, bitcoin, payment, refund, reward, prize, lottery, job, earn, apply, register, offer, bonus, win, claim, wallet, trading, fake, free |
| Brand impersonation | sbi, hdfc, icici, axis, paytm, phonepe, gpay, amazon, flipkart, google, microsoft, apple, rbi |

In the GUVI output format (transform_to_guvi_format), both suspicious URLs and all URLs are merged into phishingLinks via set union, ensuring complete URL coverage.
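
The four checks could be combined roughly as in the sketch below. The lists come from the table above; how each check is actually matched (exact host comparison for shorteners, plain substring tests for keywords and brands) is an assumption here and may differ from the real is_suspicious_url().

```python
from urllib.parse import urlparse

SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly",
              "is.gd", "buff.ly", "adf.ly", "bit.do", "mcaf.ee"}
SUSPICIOUS_TLDS = (".xyz", ".top", ".tk", ".ml", ".ga", ".cf", ".gq")
PHISHING_KEYWORDS = ("login", "verify", "secure", "update", "confirm", "account",
                     "bank", "kyc", "invest", "mining", "crypto", "bitcoin",
                     "payment", "refund", "reward", "prize", "lottery", "job",
                     "earn", "apply", "register", "offer", "bonus", "win",
                     "claim", "wallet", "trading", "fake", "free")
BRANDS = ("sbi", "hdfc", "icici", "axis", "paytm", "phonepe", "gpay", "amazon",
          "flipkart", "google", "microsoft", "apple", "rbi")


def is_suspicious_url(url: str) -> bool:
    """Return True if any one of the four checks fires (assumed matching rules)."""
    lowered = url.lower()
    # www-prefixed URLs have no scheme, so add one before parsing.
    parsed = urlparse(lowered if "://" in lowered else "http://" + lowered)
    host = parsed.hostname or ""
    if host.removeprefix("www.") in SHORTENERS:
        return True
    if host.endswith(SUSPICIOUS_TLDS):
        return True
    if any(keyword in lowered for keyword in PHISHING_KEYWORDS):
        return True
    return any(brand in host for brand in BRANDS)
```

Substring matching is deliberately broad: under these assumed rules a genuine brand domain would also be flagged, which is consistent with the "any condition" wording but trades precision for recall.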


6. Amounts (amounts)

File: extractors/regex_patterns.py -- extract_amounts()

(?:Rs\.?|₹|INR)\s*(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)
|
(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)\s*(?:rupees?|lakhs?|lacs?|crores?)

Two branches: prefix pattern (Rs./INR) and suffix pattern (rupees/lakhs/crores).

Normalization: Commas are removed. Amounts below 100 are filtered out as non-scam-relevant.

Examples:

| Input | Extracted |
|---|---|
| Rs. 10,000 | 10000 |
| 50000 rupees | 50000 |
| INR 5,00,000.50 | 500000.50 |
| Rs. 50 | (filtered -- below 100) |
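
The two-branch pattern and the normalization step can be sketched like this. The regex is the one shown above; the case-insensitive flag and the choice to return strings are assumptions.

```python
import re

# Two-branch pattern from this section. IGNORECASE is an assumption.
AMOUNT_RE = re.compile(
    r"(?:Rs\.?|₹|INR)\s*(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)"
    r"|"
    r"(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)\s*(?:rupees?|lakhs?|lacs?|crores?)",
    re.IGNORECASE,
)


def extract_amounts(text: str) -> list[str]:
    """Strip commas and drop amounts below 100."""
    amounts: dict[str, None] = {}
    for prefix_amount, suffix_amount in AMOUNT_RE.findall(text):
        raw = (prefix_amount or suffix_amount).replace(",", "")
        if float(raw) < 100:
            continue  # treated as not scam-relevant
        amounts.setdefault(raw, None)
    return list(amounts)
```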

7. IFSC Codes (ifscCodes)

File: extractors/regex_patterns.py -- extract_ifsc_codes()

\b([A-Z]{4}0[A-Z0-9]{6})\b

Format: 4 uppercase letters (bank code) + literal 0 + 6 alphanumeric characters. Input is uppercased before matching.

Examples: SBIN0001234, HDFC0000001, ICIC0006543
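
A short sketch of the uppercase-then-match step (deduplication order is an assumption):

```python
import re

# IFSC pattern from this section.
IFSC_RE = re.compile(r"\b([A-Z]{4}0[A-Z0-9]{6})\b")


def extract_ifsc_codes(text: str) -> list[str]:
    """Uppercase the input first so lowercase codes are still found."""
    return list(dict.fromkeys(IFSC_RE.findall(text.upper())))
```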


8. Aadhaar Numbers (aadhaarNumbers)

File: extractors/regex_patterns.py -- extract_aadhaar_numbers()

(?<!\d)([2-9]\d{3})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)

12 digits starting with 2-9, optionally separated by spaces or hyphens in groups of 4.

Verhoeff checksum validation: Every candidate is validated using the Verhoeff algorithm, which is the official checksum used by UIDAI for Aadhaar numbers. This eliminates most false positives from random 12-digit sequences.

The implementation uses two precomputed lookup tables (_VERHOEFF_D and _VERHOEFF_P) for the dihedral group D5 multiplication and permutation operations:

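def _verhoeff_checksum(num: str) -> bool:
    """Return True if the digit string passes the Verhoeff check."""
    c = 0
    # Walk the digits right to left; the permutation row cycles with period 8.
    for i, digit in enumerate(reversed(num)):
        c = _VERHOEFF_D[c][_VERHOEFF_P[i % 8][int(digit)]]
    # A valid number (check digit included) reduces to 0.
    return c == 0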
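
To make the excerpt above runnable on its own, the sketch below includes the standard Verhoeff multiplication and permutation tables and wires the check into the Aadhaar pattern. The names `VERHOEFF_D`, `VERHOEFF_P`, and `verhoeff_is_valid` are illustrative, and returning the number without separators is an assumption.

```python
import re

# Standard Verhoeff tables: D is the D5 multiplication table, P the permutations.
VERHOEFF_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
VERHOEFF_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

# Aadhaar pattern from this section.
AADHAAR_RE = re.compile(r"(?<!\d)([2-9]\d{3})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)")


def verhoeff_is_valid(num: str) -> bool:
    """Same logic as the excerpt above, using the tables defined here."""
    c = 0
    for i, digit in enumerate(reversed(num)):
        c = VERHOEFF_D[c][VERHOEFF_P[i % 8][int(digit)]]
    return c == 0


def extract_aadhaar_numbers(text: str) -> list[str]:
    """Join the three 4-digit groups and keep only checksum-valid candidates."""
    valid: dict[str, None] = {}
    for groups in AADHAAR_RE.findall(text):
        candidate = "".join(groups)
        if verhoeff_is_valid(candidate):
            valid.setdefault(candidate, None)
    return list(valid)
```

Because exactly one of the ten possible final digits passes the check for any given prefix, a random 12-digit sequence survives validation only about one time in ten.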

9. PAN Numbers (panNumbers)

File: extractors/regex_patterns.py -- extract_pan_numbers()

\b([A-Z]{5}\d{4}[A-Z])\b

Format: 5 uppercase letters + 4 digits + 1 uppercase letter.

Entity code validation: The 4th character must be a valid entity type code from the set {A, B, C, F, G, H, J, L, P, T}:

| Code | Entity Type |
|---|---|
| A | Association of Persons |
| B | Body of Individuals |
| C | Company |
| F | Firm |
| G | Government |
| H | Hindu Undivided Family |
| J | Artificial Juridical Person |
| L | Local Authority |
| P | Individual (Person) |
| T | Trust |
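
The format match followed by the entity-code check can be sketched as below. Whether the input is uppercased first is not stated for PAN, so this sketch matches the text as given.

```python
import re

# PAN pattern and entity codes from this section.
PAN_RE = re.compile(r"\b([A-Z]{5}\d{4}[A-Z])\b")
ENTITY_CODES = set("ABCFGHJLPT")


def extract_pan_numbers(text: str) -> list[str]:
    """Keep only matches whose 4th character is a valid entity type code."""
    matches = (pan for pan in PAN_RE.findall(text) if pan[3] in ENTITY_CODES)
    return list(dict.fromkeys(matches))
```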

10. Crypto Wallets (cryptoWallets)

File: extractors/regex_patterns.py -- extract_crypto_wallets()

Four patterns for major blockchain networks:

| Network | Pattern | Example |
|---|---|---|
| Bitcoin Legacy | [13][a-km-zA-HJ-NP-Z1-9]{25,34} | 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa |
| Bitcoin Bech32 | bc1[a-zA-HJ-NP-Z0-9]{25,89} | bc1qar0srrr7xfkvy5l643lydnw9re59gt... |
| Ethereum/EVM | 0x[0-9a-fA-F]{40} | 0x742d35Cc6634C0532925a3b844Bc9e7595... |
| Tron | T[a-zA-HJ-NP-Z0-9]{33} | TJYs9MZqLzJf9b8YTcQo9Q9JqBPeJPwHJV |

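The Bitcoin Legacy character class is Base58, which excludes 0, O, I, and l to avoid visual ambiguity (standard for Bitcoin). The Bech32 and Tron classes shown above exclude only uppercase I and O.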
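
As a sketch, the four patterns can be run in sequence over the text. The surrounding `\b` anchors and the dictionary keys are assumptions added for this example; the character classes and lengths are the ones listed above.

```python
import re

# Character classes and lengths from the table above; \b anchors are assumed.
WALLET_PATTERNS = {
    "bitcoin_legacy": re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b"),
    "bitcoin_bech32": re.compile(r"\bbc1[a-zA-HJ-NP-Z0-9]{25,89}\b"),
    "ethereum_evm": re.compile(r"\b0x[0-9a-fA-F]{40}\b"),
    "tron": re.compile(r"\bT[a-zA-HJ-NP-Z0-9]{33}\b"),
}


def extract_crypto_wallets(text: str) -> list[str]:
    """Run each network pattern in turn and deduplicate the matches."""
    wallets: dict[str, None] = {}
    for pattern in WALLET_PATTERNS.values():
        for match in pattern.findall(text):
            wallets.setdefault(match, None)
    return list(wallets)
```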


11. Case IDs (caseIds)

File: extractors/regex_patterns.py -- extract_case_ids()

Three patterns for law enforcement and administrative references:

FIR numbers:

\bFIR[\s./-]*(?:No\.?\s*)?(\d{4}[\s./-]+\d{3,7}|\d{3,12})\b

Captures FIR-2025-12345, FIR No. 12345, FIR 12345. Normalized to FIR-DIGITS format.

Agency case references:

\b((?:CBI|ED|NCB|NIA|CFSL|SFIO)[\s./-]*\d{4}[\s./-]*\d{3,7})\b

Captures references from CBI, Enforcement Directorate, NCB, NIA, CFSL, SFIO.

Generic case/complaint/ticket references:

(?:case|complaint|ticket|reference|ref|CR|CC)[\s.#:/-]*(?:no\.?|number|id)?\s*([A-Z0-9]{2,5}[\s./-]*\d{4,12}|\d{5,12})

Deduplication: Short matches that are prefixes of longer matches are removed (e.g., CBI-2025 is dropped if CBI-2025-1234 exists).
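
The prefix-based deduplication could be written as the short sketch below. The function name `drop_prefix_matches` is hypothetical, and comparing normalized strings with `startswith` is an assumption about how "prefix" is defined.

```python
def drop_prefix_matches(case_ids: list[str]) -> list[str]:
    """Remove any ID that is a strict prefix of another, longer ID."""
    unique = list(dict.fromkeys(case_ids))
    return [
        case_id
        for case_id in unique
        if not any(other != case_id and other.startswith(case_id) for other in unique)
    ]
```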


12. Policy Numbers (policyNumbers)

File: extractors/regex_patterns.py -- extract_policy_numbers()

(?:policy|insurance|plan|lic)[\s.#:/-]*(?:no\.?|number|id)?\s*([A-Z0-9][-A-Z0-9]{3,24})

Captures insurance/policy references anchored by keywords. Both hyphenated and dehyphenated forms are returned for matching flexibility.

Filtering: Trivially generic matches (all same digit, too short) are excluded. Minimum 5 characters after dehyphenation.
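
A possible shape for the match, filter, and dual-form output is sketched here. The scoped `(?i:...)` groups make only the anchor keywords case-insensitive while the captured ID stays uppercase; that flag handling is an assumption, as is the exact definition of "trivially generic".

```python
import re

# Pattern from this section; the scoped (?i:...) flags are an assumption.
POLICY_RE = re.compile(
    r"(?i:policy|insurance|plan|lic)[\s.#:/-]*(?i:no\.?|number|id)?\s*"
    r"([A-Z0-9][-A-Z0-9]{3,24})"
)


def extract_policy_numbers(text: str) -> list[str]:
    """Return both the hyphenated and the dehyphenated form of each match."""
    results: dict[str, None] = {}
    for match in POLICY_RE.findall(text):
        plain = match.replace("-", "")
        if len(plain) < 5 or len(set(plain)) == 1:
            continue  # too short, or a single repeated character
        results.setdefault(match, None)
        results.setdefault(plain, None)
    return list(results)
```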


13. Order Numbers (orderNumbers)

File: extractors/regex_patterns.py -- extract_order_numbers()

Two patterns:

Keyword-anchored:

(?:order|tracking|shipment|delivery|parcel|consignment)[\s.#:/-]*(?:no\.?|number|id)?\s*([A-Z]{0,4}\d{3,15})

Prefix-based (explicit order prefixes):

\b((?:OD|ORD|AWB|TRK|SHP|PKG|INV|DLV)[A-Z0-9]{5,15})\b

Recognizes standard e-commerce and logistics prefixes: OD (order), ORD, AWB (airway bill), TRK (tracking), SHP (shipment), PKG (package), INV (invoice), DLV (delivery).
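
Both order patterns can be applied and merged as in this sketch. As with the other keyword-anchored extractors, limiting case-insensitivity to the anchor words via `(?i:...)` is an assumption.

```python
import re

# Patterns from this section; the scoped (?i:...) flags are an assumption.
KEYWORD_RE = re.compile(
    r"(?i:order|tracking|shipment|delivery|parcel|consignment)[\s.#:/-]*"
    r"(?i:no\.?|number|id)?\s*([A-Z]{0,4}\d{3,15})"
)
PREFIX_RE = re.compile(r"\b((?:OD|ORD|AWB|TRK|SHP|PKG|INV|DLV)[A-Z0-9]{5,15})\b")


def extract_order_numbers(text: str) -> list[str]:
    """Merge keyword-anchored and prefix-based matches, dropping duplicates."""
    matches = KEYWORD_RE.findall(text) + PREFIX_RE.findall(text)
    return list(dict.fromkeys(matches))
```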


14. Suspicious Keywords (suspiciousKeywords)

File: extractors/keywords.py -- extract_suspicious_keywords()

11 keyword categories with 90+ terms, pre-compiled to regex patterns at module load time:

| Category | Weight | Example Keywords |
|---|---|---|
| authority | 0.15 | police, CBI, cyber cell, court, ED, NCB, customs, income tax |
| threat | 0.15 | arrest, jail, prison, drugs, parcel, money laundering, hawala, terrorism |
| urgency | 0.12 | urgent, immediately, 24 hours, today only, last warning, final notice |
| otp | 0.12 | OTP, one time password, verification code, CVV, PIN, secret code |
| banking | 0.10 | account blocked, account suspended, RBI, NPCI, bank verification |
| kyc | 0.10 | KYC, kyc update, kyc expired, PAN, Aadhaar |
| crypto | 0.10 | bitcoin, BTC, ethereum, USDT, blockchain, mining, guaranteed returns |
| money | 0.08 | transfer, payment, fee, charges, refund, cashback, deposit, penalty |
| action | 0.08 | click here, download, install, send money, pay now |
| job | 0.05 | work from home, daily earning, easy money, commission, youtube likes |
| lottery | 0.05 | winner, prize, lottery, lucky draw, KBC, jackpot |

Matching: Uses word boundary regex (\b) to prevent false positives (e.g., "ED" does not match "compromised", "FIR" does not match "confirm").

Scoring (get_keyword_score()): Returns 0.0-1.0 by summing weights for each category with at least one match. Capped at 1.0.

Output: Sorted by length (shorter = more specific), limited to 15 keywords.
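
The matching, scoring, and output rules above can be sketched as follows. The weights come from the table; the keyword lists are a small subset of the table's examples, and the case-insensitive flag, longest-first alternation, and alphabetical tie-break are assumptions.

```python
import re

# Weights from the table above; keyword lists are an illustrative subset.
CATEGORIES = {
    "authority": (0.15, ["police", "CBI", "cyber cell", "court", "ED", "NCB"]),
    "threat": (0.15, ["arrest", "jail", "prison", "money laundering"]),
    "urgency": (0.12, ["urgent", "immediately", "24 hours", "last warning"]),
    "otp": (0.12, ["OTP", "one time password", "verification code", "CVV"]),
    "banking": (0.10, ["account blocked", "account suspended", "RBI", "NPCI"]),
    "kyc": (0.10, ["KYC", "kyc update", "kyc expired"]),
    "crypto": (0.10, ["bitcoin", "BTC", "ethereum", "USDT"]),
    "money": (0.08, ["transfer", "payment", "refund", "cashback"]),
    "action": (0.08, ["click here", "download", "install", "pay now"]),
    "job": (0.05, ["work from home", "daily earning", "easy money"]),
    "lottery": (0.05, ["winner", "prize", "lottery", "jackpot"]),
}

# Compile one word-bounded alternation per category at import time.
# Longer phrases go first so "kyc update" wins over the bare "KYC".
PATTERNS = {
    name: re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in sorted(kws, key=len, reverse=True)) + r")\b",
        re.IGNORECASE,
    )
    for name, (_, kws) in CATEGORIES.items()
}


def get_keyword_score(text: str) -> float:
    """Sum the weight of every category with at least one hit, capped at 1.0."""
    total = sum(w for name, (w, _) in CATEGORIES.items() if PATTERNS[name].search(text))
    return min(1.0, float(total))


def extract_suspicious_keywords(text: str, limit: int = 15) -> list[str]:
    """Return matched keywords, shortest first, truncated to the limit."""
    found = {m.group(0).lower() for p in PATTERNS.values() for m in p.finditer(text)}
    return sorted(found, key=lambda k: (len(k), k))[:limit]
```

The eleven weights sum to 1.10, so the cap only takes effect when every category matches.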


Evidence Accumulation

Evidence is accumulated across the entire session via merge_evidence_locally() in firestore/sessions.py:

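def merge_evidence_locally(existing_evidence, new_evidence):
    """Union old and new evidence per field, applying any per-field cap."""
    merged = {}
    for field in EVIDENCE_FIELDS:
        existing = existing_evidence.get(field, [])
        new = new_evidence.get(field, [])
        # Set union removes duplicates. Set ordering is arbitrary, so the
        # order of the combined list is not guaranteed.
        combined = list(set(existing) | set(new))
        # Apply field-specific limits
        limit = _FIELD_LIMITS.get(field)
        if limit is not None:
            # Truncation keeps whichever items happen to come first.
            combined = combined[:limit]
        merged[field] = combined
    return merged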

The only field with a limit is suspiciousKeywords (capped at 15).

For atomic Firestore updates (e.g., in accumulate_evidence()), the system uses ArrayUnion to avoid read-compute-write race conditions.


Format Conversion

Evidence is extracted internally using snake_case keys and converted to camelCase for the GUVI API via transform_to_guvi_format():

| Internal Key | GUVI Key |
|---|---|
| upi_ids | upiIds |
| bank_accounts | bankAccounts |
| phone_numbers | phoneNumbers |
| email_addresses | emailAddresses |
| phishing_links + urls | phishingLinks (merged via set union) |
| amounts | amounts |
| ifsc_codes | ifscCodes |
| aadhaar_numbers | aadhaarNumbers |
| pan_numbers | panNumbers |
| crypto_wallets | cryptoWallets |
| keywords | suspiciousKeywords |
| case_ids | caseIds |
| policy_numbers | policyNumbers |
| order_numbers | orderNumbers |
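
The mapping above can be expressed as a small lookup table. This is a sketch of the conversion only; the name `KEY_MAP`, the handling of missing keys, and sorting the merged links are assumptions.

```python
# Internal snake_case key -> GUVI camelCase key, as listed in the table above.
KEY_MAP = {
    "upi_ids": "upiIds",
    "bank_accounts": "bankAccounts",
    "phone_numbers": "phoneNumbers",
    "email_addresses": "emailAddresses",
    "phishing_links": "phishingLinks",
    "amounts": "amounts",
    "ifsc_codes": "ifscCodes",
    "aadhaar_numbers": "aadhaarNumbers",
    "pan_numbers": "panNumbers",
    "crypto_wallets": "cryptoWallets",
    "keywords": "suspiciousKeywords",
    "case_ids": "caseIds",
    "policy_numbers": "policyNumbers",
    "order_numbers": "orderNumbers",
}


def transform_to_guvi_format(evidence: dict) -> dict:
    """Rename keys to camelCase and merge urls into phishingLinks."""
    output = {guvi: list(evidence.get(internal, [])) for internal, guvi in KEY_MAP.items()}
    # Set union of suspicious URLs and all URLs; sorted for a stable order.
    links = set(evidence.get("phishing_links", [])) | set(evidence.get("urls", []))
    output["phishingLinks"] = sorted(links)
    return output
```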