Chapter 4: Evidence Extraction --- Regex for Indian Financial Identifiers

What We Built

This chapter covers the evidence extraction pipeline: the regex patterns, keyword detection, and combination logic that turn raw scam conversation text into structured intelligence. We built extractors for 11 distinct evidence types --- UPI IDs, bank accounts, phone numbers, IFSC codes, Aadhaar numbers, PAN cards, cryptocurrency wallets, monetary amounts, case IDs, policy numbers, and order numbers --- plus a keyword detection system with 11 categories and weighted scoring. This chapter breaks down every pattern, explains the Indian-specific design considerations, and shows how the pieces compose.

Why This Approach

The Extraction Challenge

A scammer's message might contain: "Send Rs. 50,000 to account 50421234567890 or UPI fraud.dept@oksbi. My badge number is CBI-2025-4567. Call me on +91-9876543210." Embedded in that single message are a monetary amount, a bank account number, a UPI ID, a fake case reference, and a phone number. Each has a distinct format. Each must be extracted separately.

Indian financial identifiers have formats that are globally unique. A UPI ID (name@oksbi) looks superficially like an email address but uses banking handles instead of email domains. An IFSC code (SBIN0001234) looks like a random alphanumeric string but follows a rigid structure. An Aadhaar number (12 digits) looks like a long number but includes a Verhoeff checksum. Generic entity extraction models miss these. We needed domain-specific regex.

Why Regex-First, Not LLM-First

We use regex as the primary extraction method and the LLM as a backup. This is a deliberate architectural choice:

flowchart LR
    A[Scammer message text] --> B[Regex extraction\n~1ms]
    A --> C[Keyword detection\n~2ms]
    B --> D[Merge results]
    C --> D
    D --> E[Structured evidence\n11 types + keywords]

    style B fill:#2d6a4f,color:#fff
    style C fill:#2d6a4f,color:#fff
Factor              Regex                     LLM
Latency             ~1ms                      ~1-2 seconds
Cost                Free                      API tokens
Reliability         Deterministic             Probabilistic
UPI ID accuracy     99%+ (known handles)      ~85% (may hallucinate)
Aadhaar validation  Verhoeff checksum         Cannot validate
Runs per request    Every request             Optional backup

The LLM is unreliable for structured extraction. It might return "9876543210" as a bank account instead of a phone number. It might miss a UPI ID because the handle is uncommon. It might hallucinate evidence that does not exist in the text. Regex is deterministic: either the pattern matches or it does not. For financial identifiers with known formats, determinism is more valuable than flexibility.

The Code

UPI ID Extraction

UPI IDs are the single most valuable evidence type. A UPI ID directly identifies the scammer's payment account and can be used to freeze funds.

The format is username@handle, where the handle is one of ~40 Indian payment service providers:

# Common UPI bank handles in India
UPI_HANDLES = [
    "oksbi", "okaxis", "okicici", "okhdfcbank",  # OK handles
    "ybl", "ibl", "axl", "sbi", "icici", "hdfc",  # Bank codes
    "paytm", "gpay", "phonepe", "amazonpay",      # Payment apps
    "upi", "apl", "rapl", "yapl",                  # Generic
    "kotak", "bob", "pnb", "boi", "citi",          # More banks
    "freecharge", "mobikwik", "airtel",            # Wallets
]

# Known-handle UPI pattern
UPI_PATTERN = re.compile(
    r'\b([a-zA-Z0-9][a-zA-Z0-9._-]{1,255}@(?:'
    + '|'.join(UPI_HANDLES) + r'))\b',
    re.IGNORECASE
)

# Generic UPI pattern for unknown handles (more permissive)
UPI_GENERIC_PATTERN = re.compile(
    r'\b([a-zA-Z0-9][a-zA-Z0-9._-]{0,49}@[a-zA-Z0-9]{2,30})(?![-.])\b'
)

The two-pattern approach matters. UPI_PATTERN matches with high confidence when the handle is a known payment provider. UPI_GENERIC_PATTERN catches UPI IDs with novel handles (scammers sometimes create handles on newer or less common banks). The generic pattern's negative lookahead (?![-.]) prevents matching truncated email domains --- without it, offers@fake-amazon-deals.com would incorrectly yield offers@fake as a UPI ID.
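The lookahead's effect is easy to verify in isolation. A minimal standalone sketch, reusing the generic pattern above (the sample strings are made up):

```python
import re

# Generic UPI pattern from above, reproduced for a standalone check
UPI_GENERIC_PATTERN = re.compile(
    r'\b([a-zA-Z0-9][a-zA-Z0-9._-]{0,49}@[a-zA-Z0-9]{2,30})(?![-.])\b'
)

# A novel handle is matched...
print(UPI_GENERIC_PATTERN.findall("Pay user123@novobank now"))
# → ['user123@novobank']

# ...but a truncated email domain is not: after "fake" the next character
# is "-", so the (?![-.]) lookahead rejects the candidate
print(UPI_GENERIC_PATTERN.findall("Mail offers@fake-amazon-deals.com"))
# → []
```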

We also need to exclude email addresses:

EMAIL_PATTERN = re.compile(
    r'\b[a-zA-Z0-9._%+-]+@(?:gmail|yahoo|outlook|hotmail|email|mail)\.[a-zA-Z]{2,}\b',
    re.IGNORECASE
)

def extract_upi_ids(text: str) -> List[str]:
    """Extract UPI IDs from text, excluding email addresses."""
    known_upis = UPI_PATTERN.findall(text)
    generic_upis = UPI_GENERIC_PATTERN.findall(text)

    all_potential = set(known_upis + generic_upis)

    # Remove email addresses
    emails = set(EMAIL_PATTERN.findall(text))
    upis = [upi.lower() for upi in all_potential
            if upi.lower() not in {e.lower() for e in emails}]

    # Filter common email providers and bare TLDs that might slip through
    email_domains = ['gmail', 'yahoo', 'outlook', 'hotmail', 'com', 'org', 'in']
    upis = [upi for upi in upis
            if not any(upi.endswith(f'@{d}') for d in email_domains)]

    return list(set(upis))

Why lowercase?

UPI IDs are case-insensitive. Fraud@okSBI and fraud@oksbi are the same account. We normalize to lowercase before deduplication. This also prevents false negatives when the GUVI evaluator checks extracted evidence against expected values.

Bank Account Extraction

Indian bank accounts range from 9 to 18 digits. The challenge is distinguishing them from phone numbers (10 digits) and OTPs (4--6 digits):

# Standalone 11-18 digit numbers (excludes 10-digit phones)
BANK_ACCOUNT_PATTERN = re.compile(r'\b(\d{11,18})\b')

# Context-aware: "account number 1234567890"
BANK_ACCOUNT_WITH_CONTEXT = re.compile(
    r'(?:account\s*(?:number|no|num)?|a/c|ac|acct)[:\s#.-]*(\d{9,18})',
    re.IGNORECASE
)

def extract_bank_accounts(text: str) -> List[str]:
    accounts = set()

    # Pattern 1: Numbers with account context (higher confidence)
    contextual = BANK_ACCOUNT_WITH_CONTEXT.findall(text)
    accounts.update(contextual)

    # Pattern 2: Standalone 11-18 digit numbers
    standalone = BANK_ACCOUNT_PATTERN.findall(text)
    accounts.update(standalone)

    # Filter out phone numbers
    valid = []
    for acc in accounts:
        if len(acc) < 9 or len(acc) > 18:
            continue
        if len(acc) == 10 and acc[0] in '6789':
            continue  # Phone number, not account
        valid.append(acc)

    return list(set(valid))

The context-aware pattern (account number X, a/c X, acct X) catches shorter account numbers (9--10 digits) that would be ambiguous without context. The standalone pattern catches longer numbers (11--18 digits) that are unambiguously not phone numbers.
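The interplay between the two patterns can be checked in a standalone sketch (the account numbers below are invented):

```python
import re

# Both bank-account patterns from above, reproduced for a standalone check
BANK_ACCOUNT_PATTERN = re.compile(r'\b(\d{11,18})\b')
BANK_ACCOUNT_WITH_CONTEXT = re.compile(
    r'(?:account\s*(?:number|no|num)?|a/c|ac|acct)[:\s#.-]*(\d{9,18})',
    re.IGNORECASE
)

text = "Transfer to a/c 504212345 today, or account number 50421234567890"

# The 9-digit number is caught only because "a/c" provides context;
# the standalone pattern picks up just the unambiguous 14-digit number
print(BANK_ACCOUNT_WITH_CONTEXT.findall(text))  # → ['504212345', '50421234567890']
print(BANK_ACCOUNT_PATTERN.findall(text))       # → ['50421234567890']
```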

Phone Number Extraction

Indian mobile numbers are 10 digits starting with 6, 7, 8, or 9. They may appear with various prefixes and separators:

# Mobile: optional +91/91/0 prefix, then 10 digits starting 6-9
PHONE_PATTERN = re.compile(
    r'(?<!\d)(?:\+91[\s.-]?|91[\s.-]?|0)?([6-9]\d{9})(?!\d)'
)

# Landline: STD code (0 plus 2-4 digits) + 6-8 digit local number
LANDLINE_PATTERN = re.compile(
    r'(?<!\d)(0\d{2,4})[\s.-]?(\d{6,8})(?!\d)'
)

The lookbehind (?<!\d) and lookahead (?!\d) prevent matching phone numbers inside longer digit sequences like bank account numbers. Without these anchors, the 10-digit suffix of a 15-digit bank account number would be incorrectly extracted as a phone number.
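The digit lookaround behavior can be demonstrated in a standalone sketch (sample numbers are made up):

```python
import re

# Phone pattern from above, reproduced for a standalone check
PHONE_PATTERN = re.compile(
    r'(?<!\d)(?:\+91[\s.-]?|91[\s.-]?|0)?([6-9]\d{9})(?!\d)'
)

# A prefixed mobile number is captured by its 10-digit core...
print(PHONE_PATTERN.findall("Call +91-9876543210 now"))  # → ['9876543210']

# ...but digits embedded in a longer account number are not a phone:
# every interior position is preceded or followed by another digit
print(PHONE_PATTERN.findall("Account 50429876543210"))   # → []
```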

The +91 prefix handling is important for GUVI scoring:

def extract_phone_numbers(text: str) -> List[str]:
    phones = []
    seen_digits: set = set()  # deduplicate by 10-digit core

    for match in PHONE_PATTERN.finditer(text):
        core_digits = match.group(1)
        if core_digits in seen_digits:
            continue
        seen_digits.add(core_digits)
        full_match = match.group(0)
        if full_match.startswith('+91'):
            phones.append(f"+91-{core_digits}")
        else:
            phones.append(core_digits)
    ...

Why preserve the +91 prefix?

The GUVI evaluator uses substring matching for phone number scoring. If the expected value is +91-9876543210, then our extracted +91-9876543210 is an exact match, and plain 9876543210 is also a substring match. But if the expected value is 9876543210 and we return +91-9876543210, the evaluator still finds 9876543210 as a substring. Returning the longer form with +91- prefix covers both scoring formats.

IFSC Code Extraction

IFSC codes identify bank branches. The format is strictly defined: 4 uppercase letters (bank code) + literal 0 + 6 alphanumeric characters.

IFSC_PATTERN = re.compile(r'\b([A-Z]{4}0[A-Z0-9]{6})\b')

def extract_ifsc_codes(text: str) -> List[str]:
    matches = IFSC_PATTERN.findall(text.upper())
    return sorted(set(matches))

The literal 0 in the fifth position is a distinguishing feature. Without it, the pattern would match many false positives (any 11-character alphanumeric string). The text.upper() normalization ensures we catch lowercase IFSC codes in informal messages.
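Both properties (uppercase normalization, the literal 0) show up in a quick standalone check (sample codes are invented):

```python
import re

# IFSC pattern from above, reproduced for a standalone check
IFSC_PATTERN = re.compile(r'\b([A-Z]{4}0[A-Z0-9]{6})\b')

# Lowercase input is caught after .upper() normalization
print(IFSC_PATTERN.findall("transfer via sbin0001234".upper()))  # → ['SBIN0001234']

# An 11-character alphanumeric string without the literal 0 is rejected
print(IFSC_PATTERN.findall("REF ABCD1234567"))  # → []
```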

Aadhaar Number Extraction with Verhoeff Checksum

Aadhaar numbers are 12 digits, with the first digit always 2--9 (not 0 or 1). What makes our implementation unique is Verhoeff checksum validation:

AADHAAR_PATTERN = re.compile(
    r'(?<!\d)([2-9]\d{3})[\s-]?(\d{4})[\s-]?(\d{4})(?!\d)'
)

# Verhoeff checksum lookup tables
_VERHOEFF_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    # ... (full 10x10 Dihedral group table)
]

_VERHOEFF_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    # ... (full 8x10 permutation table)
]

def _verhoeff_checksum(num: str) -> bool:
    """Validate a number string using the Verhoeff algorithm."""
    c = 0
    for i, digit in enumerate(reversed(num)):
        c = _VERHOEFF_D[c][_VERHOEFF_P[i % 8][int(digit)]]
    return c == 0

def extract_aadhaar_numbers(text: str) -> List[str]:
    results = set()
    for g1, g2, g3 in AADHAAR_PATTERN.findall(text):
        digits = g1 + g2 + g3
        if _verhoeff_checksum(digits):
            results.add(digits)
    return sorted(results)

The Verhoeff algorithm detects all single-digit errors and all transposition errors of adjacent digits. This eliminates most false positives: a random 12-digit number has only a 10% chance of passing the checksum. Without Verhoeff validation, every 12-digit number in the conversation (timestamps, reference numbers, amounts) would be flagged as a potential Aadhaar number.

PAN Card Extraction

PAN (Permanent Account Number) follows the format: 5 uppercase letters + 4 digits + 1 uppercase letter. The 4th letter encodes the entity type:

PAN_PATTERN = re.compile(r'\b([A-Z]{5}\d{4}[A-Z])\b')

# Valid entity type codes (4th character)
_PAN_ENTITY_CODES = set('ABCFGHJLPT')

def extract_pan_numbers(text: str) -> List[str]:
    matches = PAN_PATTERN.findall(text.upper())
    valid = [m for m in matches if m[3] in _PAN_ENTITY_CODES]
    return sorted(set(valid))

The 4th character must be one of: A (Association of Persons), B (Body of Individuals), C (Company), F (Firm), G (Government), H (Hindu Undivided Family), J (Artificial Juridical Person), L (Local Authority), P (Person/Individual), or T (Trust). Most scam-related PAN numbers use P (individual). The entity code check rejects strings that happen to match the positional pattern but have invalid entity codes.
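A standalone check of the entity-code filter (the PAN strings below are invented):

```python
import re

# PAN pattern and filter from above, reproduced for a standalone check
PAN_PATTERN = re.compile(r'\b([A-Z]{5}\d{4}[A-Z])\b')
_PAN_ENTITY_CODES = set('ABCFGHJLPT')

def extract_pan_numbers(text):
    matches = PAN_PATTERN.findall(text.upper())
    return sorted({m for m in matches if m[3] in _PAN_ENTITY_CODES})

# 'P' in the 4th position = individual, so this passes the filter
print(extract_pan_numbers("My PAN is abcpe1234f"))  # → ['ABCPE1234F']

# Positionally valid, but 'X' is not a recognized entity code
print(extract_pan_numbers("Ref ABCXE1234F"))        # → []
```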

Amount Extraction

Indian monetary amounts use distinctive patterns: the rupee symbol (Rs. or the Unicode rupee sign), Indian numbering words (lakh, crore), and the Indian comma system (1,00,000 instead of 100,000):

AMOUNT_PATTERN = re.compile(
    r'(?:'
    r'(?:Rs\.?|₹|INR)\s*(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)'
    r'|'
    r'(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)\s*(?:rupees?|lakhs?|lacs?|crores?)'
    r')',
    re.IGNORECASE
)

def extract_amounts(text: str) -> List[str]:
    amounts = set()
    for match in AMOUNT_PATTERN.finditer(text):
        raw = match.group(1) or match.group(2)
        if raw:
            normalized = raw.replace(',', '')
            try:
                value = float(normalized)
                if value >= 100:  # Skip trivially small amounts
                    amounts.add(normalized)
            except ValueError:
                continue
    return sorted(amounts)

The threshold of Rs. 100 filters out noise. Amounts below Rs. 100 are rarely relevant in scam contexts (scammers demand thousands or lakhs, not small change). The comma pattern (?:,\d{2,3})* handles both Western commas (1,000) and Indian-style commas (1,00,000).
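Putting the pattern and the threshold together in a standalone check (the amounts are made up):

```python
import re

# Amount pattern and extractor from above, reproduced for a standalone check
AMOUNT_PATTERN = re.compile(
    r'(?:'
    r'(?:Rs\.?|₹|INR)\s*(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)'
    r'|'
    r'(\d+(?:,\d{2,3})*(?:\.\d{1,2})?)\s*(?:rupees?|lakhs?|lacs?|crores?)'
    r')',
    re.IGNORECASE
)

def extract_amounts(text):
    amounts = set()
    for match in AMOUNT_PATTERN.finditer(text):
        raw = match.group(1) or match.group(2)
        if raw:
            normalized = raw.replace(',', '')
            try:
                if float(normalized) >= 100:  # skip trivially small amounts
                    amounts.add(normalized)
            except ValueError:
                continue
    return sorted(amounts)

# Indian-style commas are stripped; the Rs 20 fee falls below the threshold
print(extract_amounts("Pay Rs. 50,000 now or face a ₹ 1,00,000 fine. Fee: Rs 20"))
# → ['100000', '50000']
```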

Keyword Detection with Weighted Scoring

Beyond structured identifiers, scam messages contain telltale keywords. We detect 11 categories:

KEYWORD_CATEGORIES = {
    "kyc":       ["KYC", "kyc update", "kyc expired", ...],
    "banking":   ["account blocked", "account suspended", "RBI", ...],
    "otp":       ["OTP", "one time password", "CVV", "PIN", ...],
    "urgency":   ["urgent", "immediately", "24 hours", ...],
    "authority":  ["police", "CBI", "cyber cell", "court", ...],
    "money":     ["transfer", "payment", "fee", "refund", ...],
    "job":       ["work from home", "daily earning", "easy money", ...],
    "lottery":   ["winner", "prize", "lottery", "KBC", ...],
    "threat":    ["arrest", "jail", "drugs", "money laundering", ...],
    "crypto":    ["bitcoin", "BTC", "ethereum", "guaranteed returns", ...],
    "action":    ["click here", "download", "send money", ...],
}

Keywords are matched using pre-compiled word-boundary regex to avoid false positives:

# Pre-compile at module level (avoids ~80 compilations per call)
_KEYWORD_PATTERNS: Dict[str, List[Tuple[str, re.Pattern]]] = {}
for _cat, _kws in KEYWORD_CATEGORIES.items():
    _KEYWORD_PATTERNS[_cat] = [
        (kw, re.compile(r'\b' + re.escape(kw.lower()) + r'\b'))
        for kw in _kws
    ]

Why word boundaries matter

Without \b anchors, "ED" (Enforcement Directorate) would match inside "compromised", "confirmed", and "education". "FIR" would match inside "confirm" and "first". Word boundary matching eliminates these false positives while still catching standalone uses. Pre-compilation at module level means we pay the regex compilation cost once at import time, not on every request.
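The boundary behavior is easy to see with the "FIR" keyword (sample sentences are illustrative):

```python
import re

# Word-boundary keyword pattern, built the same way as above
fir = re.compile(r'\b' + re.escape("fir") + r'\b')

# "fir" inside "confirm" and "first" does not match...
print(bool(fir.search("please confirm the first step")))  # False

# ...but a standalone "fir" does
print(bool(fir.search("an fir was filed against you")))   # True
```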

The keyword scorer assigns category weights for confidence estimation:

def get_keyword_score(text: str) -> float:
    """Calculate suspicion score (0.0 to 1.0) based on keyword presence."""
    category_weights = {
        "authority": 0.15,   # Police/CBI impersonation
        "threat":    0.15,   # Arrest/jail threats
        "urgency":   0.12,   # Time pressure
        "otp":       0.12,   # OTP/PIN demands
        "banking":   0.10,   # Banking terminology
        "kyc":       0.10,   # KYC-specific terms
        "money":     0.08,   # Payment demands
        "action":    0.08,   # Click/download demands
        "crypto":    0.10,   # Crypto terminology
        "job":       0.05,   # Job scam terms
        "lottery":   0.05,   # Lottery terms
    }

    text_lower = text.lower()
    score = 0.0
    for category, weight in category_weights.items():
        patterns = _KEYWORD_PATTERNS.get(category, [])
        if any(pattern.search(text_lower) for _, pattern in patterns):
            score += weight

    return min(score, 1.0)

The weights reflect severity. Authority impersonation and threats are weighted highest (0.15 each) because they are the strongest scam indicators --- legitimate organizations do not threaten arrest over phone calls. Job and lottery terms are weighted lowest (0.05) because they have more legitimate uses.

Combining Everything: extract_all_evidence

The master extraction function runs all extractors and combines results:

def extract_all_evidence(text: str) -> Dict[str, Any]:
    """Extract all evidence types from text."""
    urls = extract_urls(text)
    emails = extract_emails(text)
    upi_ids = extract_upi_ids(text)

    # Remove emails that were also captured as UPI IDs
    upi_set = {u.lower() for u in upi_ids}
    emails = [e for e in emails if e.lower() not in upi_set]

    return {
        "upi_ids": upi_ids,
        "bank_accounts": extract_bank_accounts(text),
        "phone_numbers": extract_phone_numbers(text),
        "email_addresses": emails,
        "urls": urls,
        "phishing_links": [url for url in urls if is_suspicious_url(url)],
        "amounts": extract_amounts(text),
        "ifsc_codes": extract_ifsc_codes(text),
        "aadhaar_numbers": extract_aadhaar_numbers(text),
        "pan_numbers": extract_pan_numbers(text),
        "crypto_wallets": extract_crypto_wallets(text),
        "case_ids": extract_case_ids(text),
        "policy_numbers": extract_policy_numbers(text),
        "order_numbers": extract_order_numbers(text),
    }

The deduplication between emails and UPI IDs is important. A string like fraud@oksbi matches both the email pattern and the UPI pattern. We prioritize UPI classification because @oksbi is a payment handle, not an email domain. Removing it from the email list prevents double-counting.
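The priority rule reduces to a set difference, sketched here with toy lists:

```python
# UPI-over-email priority: anything classified as a UPI ID is dropped
# from the email list (toy data, same logic as extract_all_evidence)
upi_ids = ["fraud@oksbi"]
emails = ["victim@gmail.com", "fraud@oksbi"]

upi_set = {u.lower() for u in upi_ids}
emails = [e for e in emails if e.lower() not in upi_set]
print(emails)  # → ['victim@gmail.com']
```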

The handler then transforms the internal snake_case format to the GUVI camelCase format:

def transform_to_guvi_format(evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Transform internal snake_case to GUVI camelCase."""
    phishing = evidence.get("phishing_links", [])
    all_urls = evidence.get("urls", [])
    merged_links = sorted(set(phishing) | set(all_urls))

    return {
        "bankAccounts": evidence.get("bank_accounts", []),
        "upiIds": evidence.get("upi_ids", []),
        "phishingLinks": merged_links,
        "phoneNumbers": evidence.get("phone_numbers", []),
        ...
    }

The phishingLinks field merges all URLs (suspicious and non-suspicious) because the GUVI evaluator scores on all extracted links, not just confirmed phishing ones.

Full-Conversation Extraction

Evidence extraction runs over all scammer messages in the conversation, not just the latest one:

def _extract_evidence_from_full_conversation(guvi_request: GuviRequest) -> Dict:
    """Extract evidence from ALL scammer messages."""
    honeypot_labels = {"honeypot", "bot", "agent", "assistant", "ai"}
    scammer_texts = []

    # Collect all non-honeypot messages
    for msg in guvi_request.conversationHistory:
        if msg.sender.lower() not in honeypot_labels:
            scammer_texts.append(msg.text)

    if guvi_request.message.sender.lower() not in honeypot_labels:
        scammer_texts.append(guvi_request.message.text)

    full_scammer_text = "\n".join(scammer_texts)
    internal_evidence = extract_all_evidence(full_scammer_text)
    keywords = extract_suspicious_keywords(full_scammer_text)
    internal_evidence["keywords"] = keywords

    return transform_to_guvi_format(internal_evidence)

This is crucial because scammers often reveal evidence progressively: a UPI ID in message 3, a phone number in message 5, a bank account in message 7. Running extraction only on the latest message would miss earlier evidence. Running on the full conversation catches everything.

Key Architectural Decision

Regex-first vs. LLM extraction.

We chose regex as the primary extraction method with the LLM available as an optional backup. This was a controversial decision within the team. The argument for LLM-first extraction: the LLM can understand context ("my UPI is..."), handle variations we did not anticipate, and extract entities from natural language phrasing.

We chose regex-first for three reasons:

  1. Determinism. When the evaluator sends fraud@oksbi in a message, regex will extract it 100% of the time. The LLM might extract it 95% of the time, and the 5% failure rate is unpredictable --- sometimes it misses obvious patterns, sometimes it hallucinates patterns that do not exist.

  2. Speed. Regex extraction takes ~1 millisecond. LLM extraction takes 1--2 seconds (an additional API call). In a system where every response must feel like real-time texting, 1--2 seconds of additional latency is noticeable. We already make two Gemini calls per request (classify + generate). A third would push total response time past the believability threshold.

  3. Validation. Regex allows structural validation that LLMs cannot perform. The Verhoeff checksum on Aadhaar numbers, the entity code check on PAN numbers, the known-handle verification on UPI IDs --- these are mathematical checks that eliminate false positives. An LLM cannot compute a Verhoeff checksum.

The tradeoff: regex cannot extract evidence phrased in natural language without structural markers. "Send the money to my SBI account, the number starts with five-oh-four-two" would stump our regex (digits spelled out as words). The LLM backup exists for these edge cases, but in practice, scammers almost always share financial identifiers in their raw alphanumeric form --- they need the victim to type them accurately.

What We Learned

Lesson: Pre-compile regex at module level

Our first implementation compiled keyword patterns inside the extract_suspicious_keywords function. With 80+ keywords across 11 categories, this meant ~80 re.compile() calls per request. Profiling showed this consumed 15ms per call --- insignificant for a single request, but it adds up across concurrent requests and cold starts. Moving compilation to module level (happens once at import time) reduced per-request keyword extraction to ~2ms.

Lesson: Negative lookaheads prevent cascading false positives

The generic UPI pattern without (?![-.]) was our most prolific false-positive source. It matched email fragments, URL components, and even parts of longer identifiers. The negative lookahead --- asserting that the match is not followed by a hyphen or dot --- eliminated 90% of false positives with a single small addition to the pattern. When building regex for financial identifiers, always think about what the match should NOT be followed by.

Lesson: Test with parameterized data, not individual cases

We used @pytest.mark.parametrize to test extractors with dozens of input/output pairs:

@pytest.mark.parametrize(
    "text,expected_upis",
    [
        ("Send money to fraud@oksbi", ["fraud@oksbi"]),
        ("Pay to scammer123@paytm", ["scammer123@paytm"]),
        ("Email me at test@gmail.com", []),
        ("UPI: user.name@ybl today", ["user.name@ybl"]),
    ],
)
def test_upi_extraction(self, text, expected_upis):
    result = extract_upi_ids(text)
    assert sorted(result) == sorted(expected_upis)

Every time we encountered a false positive or false negative in production, we added it to the parameterized test data. The test file grew to 100+ cases and became our regression safety net. Adding a new regex pattern and running the test suite immediately showed if we had broken an existing extraction.

Lesson: Evidence accumulation is more important than per-turn extraction

Our initial implementation only extracted evidence from the latest message. The GUVI evaluator scores the callback based on all evidence accumulated across the entire conversation. Switching to full-conversation extraction (all scammer messages concatenated) immediately improved our evidence extraction scores because we stopped "forgetting" evidence from earlier turns.


Previous: Chapter 3 -- Persona Engineering | Next: Chapter 5 -- Intelligence Reporting and Callbacks (coming soon)