Adding a New Evidence Extractor
Evidence extractors use regex patterns to identify scam-related data in conversation text: UPI IDs, bank accounts, phone numbers, and more. This tutorial walks through adding a new extractor type end-to-end.
Extractor Architecture
functions/extractors/
    regex_patterns.py    # All regex-based extractors + extract_all_evidence()
    keywords.py          # Keyword category detection
Evidence flows through the system as follows:
Scammer message text
→ extract_all_evidence(text) # Returns dict with snake_case keys
→ transform_to_guvi_format(dict) # Converts to camelCase for GUVI API
→ ExtractedIntelligence(**dict) # Pydantic model
→ Stored in Firestore + sent in response/callback
Step 1: Add the Regex Pattern
Open functions/extractors/regex_patterns.py and add your pattern. Follow the existing section structure.
Example: Adding GSTIN (GST Identification Number) Extraction
GSTIN format: 2-digit state code + 10-char PAN + 1 entity number + Z + 1 check digit.
Example: 27AADCB2230M1ZR
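To make the layout concrete, the following sketch slices a 15-character candidate into the five segments described above. The names split_gstin and GstinParts are hypothetical helpers for illustration only, not part of the codebase, and the function checks nothing beyond length.

```python
from typing import NamedTuple


class GstinParts(NamedTuple):
    """Hypothetical container for the five GSTIN segments."""
    state_code: str  # characters 1-2
    pan: str         # characters 3-12
    entity: str      # character 13
    default_z: str   # character 14, the fixed letter Z
    check: str       # character 15


def split_gstin(gstin: str) -> GstinParts:
    """Slice a 15-character GSTIN candidate into its segments."""
    if len(gstin) != 15:
        raise ValueError(f"expected 15 characters, got {len(gstin)}")
    return GstinParts(gstin[:2], gstin[2:12], gstin[12], gstin[13], gstin[14])


parts = split_gstin("27AADCB2230M1ZR")
print(parts)
```

Seeing the segments separately makes it easier to decide which ones the regex should constrain and which ones need post-match validation.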
# ============== GSTIN EXTRACTION ==============
# GSTIN: 2-digit state code (01-37) + PAN (10 chars) + entity number + Z + check digit
# Assumes `re` and `List` are already imported at the top of regex_patterns.py.
GSTIN_PATTERN = re.compile(
    r'\b(\d{2}[A-Z]{5}\d{4}[A-Z]\dZ[A-Z0-9])\b'  # 14th character is the literal Z
)

# Valid state codes (01-37, covering all Indian states/UTs)
_VALID_STATE_CODES = {str(i).zfill(2) for i in range(1, 38)}


def extract_gstin(text: str) -> List[str]:
    """Extract GST Identification Numbers from text.

    Valid format: 27AADCB2230M1ZR (2-digit state + PAN + entity + Z + check).
    State code must be between 01 and 37.

    Args:
        text: Conversation text to search.

    Returns:
        List of valid GSTIN strings found.
    """
    matches = GSTIN_PATTERN.findall(text.upper())
    valid = [m for m in matches if m[:2] in _VALID_STATE_CODES]
    return sorted(set(valid))
Pattern Design Guidelines
- Use word boundaries (\b) to prevent matching inside larger strings
- Validate beyond regex -- Use post-match validation (state codes, checksums, length) to reduce false positives
- Document the format in the docstring with examples
- Handle case insensitivity -- Either use re.IGNORECASE or normalize with .upper()/.lower()
- Return sorted, deduplicated lists -- Use sorted(set(...)) for consistency
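A short sketch of two of these guidelines in action. The first part shows what \b buys you. The second part shows a post-match checksum, using the commonly described GSTIN check-character scheme (a Luhn-style mod-36 sum over the first 14 characters). Treat the checksum as an assumption: nothing in this tutorial says the existing extractors implement it, and all names below are hypothetical.

```python
import re

_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# The same 15-character shape, with and without word boundaries.
BOUNDED = re.compile(r'\b\d{2}[A-Z]{5}\d{4}[A-Z]\dZ[A-Z0-9]\b')
UNBOUNDED = re.compile(r'\d{2}[A-Z]{5}\d{4}[A-Z]\dZ[A-Z0-9]')


def gstin_check_char(first14: str) -> str:
    """Compute the check character for the first 14 characters.

    Expects only 0-9 and A-Z (true for anything the regex matched).
    """
    total = 0
    for index, ch in enumerate(first14):
        weight = 1 if index % 2 == 0 else 2   # weights alternate 1, 2, 1, 2, ...
        product = _ALPHABET.index(ch) * weight
        total += product // 36 + product % 36  # fold the product back into base 36
    return _ALPHABET[(36 - total % 36) % 36]


def has_valid_check_char(gstin: str) -> bool:
    """Return True when the 15th character matches the computed one."""
    return len(gstin) == 15 and gstin_check_char(gstin[:14]) == gstin[14]


# A GSTIN-shaped run buried inside a longer token.
embedded = "REF27AADCB2230M1ZR99"
bounded_hits = BOUNDED.findall(embedded)      # boundaries reject it
unbounded_hits = UNBOUNDED.findall(embedded)  # no boundaries: false positive

print(bounded_hits, unbounded_hits)
print(has_valid_check_char("27AAPFU0939F1ZV"))
```

If a checksum like this were enabled in an extractor, the test fixtures would need check characters that satisfy it, so weigh the stricter filtering against that maintenance cost.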
Step 2: Wire into extract_all_evidence()
Add your extractor to the extract_all_evidence() function in the same file:
def extract_all_evidence(text: str) -> Dict[str, Any]:
    """Extract all evidence types from text."""
    urls = extract_urls(text)
    emails = extract_emails(text)
    upi_ids = extract_upi_ids(text)
    upi_set = {u.lower() for u in upi_ids}
    emails = [e for e in emails if e.lower() not in upi_set]
    return {
        "upi_ids": upi_ids,
        "bank_accounts": extract_bank_accounts(text),
        "phone_numbers": extract_phone_numbers(text),
        "email_addresses": emails,
        "urls": urls,
        "phishing_links": [url for url in urls if is_suspicious_url(url)],
        "amounts": extract_amounts(text),
        "ifsc_codes": extract_ifsc_codes(text),
        "aadhaar_numbers": extract_aadhaar_numbers(text),
        "pan_numbers": extract_pan_numbers(text),
        "crypto_wallets": extract_crypto_wallets(text),
        "case_ids": extract_case_ids(text),
        "policy_numbers": extract_policy_numbers(text),
        "order_numbers": extract_order_numbers(text),
        "gstin_numbers": extract_gstin(text),  # Add here
    }
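The upi_set filtering in this function is worth understanding before you add a new key, because your extractor may overlap with an existing one in the same way. The sketch below isolates that step with hard-coded lists standing in for the extractor outputs. The values are illustrative only, and it is an assumption that both extractors can return the same string.

```python
# Hard-coded stand-ins for what extract_upi_ids() and extract_emails()
# might return for the same message (illustrative values only).
upi_ids = ["Fraud@oksbi"]
emails = ["fraud@oksbi", "helpdesk@example.com"]

# Build a lowercase lookup set once so the comparison is case-insensitive.
upi_set = {u.lower() for u in upi_ids}

# Keep only the emails that were not already captured as UPI IDs.
filtered_emails = [e for e in emails if e.lower() not in upi_set]

print(filtered_emails)
```

If your new evidence type can overlap with an existing one, a similar set-based filter is one way to keep a value from being reported under two keys.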
Step 3: Update transform_to_guvi_format()
Add the camelCase mapping:
def transform_to_guvi_format(evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Transform internal snake_case evidence to GUVI camelCase format."""
    phishing = evidence.get("phishing_links", [])
    all_urls = evidence.get("urls", [])
    merged_links = sorted(set(phishing) | set(all_urls))
    return {
        "bankAccounts": evidence.get("bank_accounts", []),
        "upiIds": evidence.get("upi_ids", []),
        # ... existing fields ...
        "orderNumbers": evidence.get("order_numbers", []),
        "gstinNumbers": evidence.get("gstin_numbers", []),  # Add here
    }
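A typo in either key silently produces an empty list, so it can help to sanity-check the naming. The helper below is a hypothetical sketch, not part of the codebase: it derives the camelCase name mechanically so you can compare it with the key you typed. Not every field in transform_to_guvi_format() necessarily follows a mechanical mapping (the URL fields are merged, for instance), so use it only as a naming check.

```python
from typing import Dict, List


def snake_to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase (hypothetical helper)."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


# Keys taken from the mapping shown in this step.
internal_keys: List[str] = ["bank_accounts", "upi_ids", "order_numbers", "gstin_numbers"]
expected_names: Dict[str, str] = {key: snake_to_camel(key) for key in internal_keys}

for snake, camel in expected_names.items():
    print(f"{snake} -> {camel}")
```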
Step 4: Add Field to ExtractedIntelligence Model
If this is a new evidence type (not fitting into an existing field), add it to the Pydantic model in functions/guvi/models.py:
class ExtractedIntelligence(BaseModel):
    """Intelligence extracted during the honeypot conversation."""
    bankAccounts: List[str] = Field(default_factory=list, ...)
    upiIds: List[str] = Field(default_factory=list, ...)
    # ... existing fields ...
    orderNumbers: List[str] = Field(default_factory=list, ...)
    gstinNumbers: List[str] = Field(  # Add here
        default_factory=list,
        description="GST Identification Numbers",
    )
Step 5: Add Keywords (If Relevant)
If your evidence type is associated with specific scam keywords, add them to functions/extractors/keywords.py:
KEYWORD_CATEGORIES = {
    # ... existing categories ...
    "tax_fraud": [
        "GST", "GSTIN", "GST registration", "GST refund",
        "tax return", "income tax refund", "TDS refund",
        "GST verification", "GST expired",
    ],
}
If the new category should affect the suspicion score, also update category_weights in get_keyword_score().
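The actual shape of category_weights is not shown in this tutorial, so check keywords.py before editing it. Purely as an illustration, the sketch below assumes a dict of per-category multipliers. The "urgency" category, the weight values, and the score_from_categories function are all hypothetical.

```python
from typing import Dict, List

# Hypothetical weights; the real names and values live in keywords.py.
category_weights: Dict[str, float] = {
    "urgency": 1.0,    # placeholder for an existing category
    "tax_fraud": 1.5,  # the category added in this step
}


def score_from_categories(matched: Dict[str, List[str]]) -> float:
    """Sum weight * number of matched keywords per category.

    Categories without an explicit weight fall back to 1.0.
    """
    return sum(
        category_weights.get(category, 1.0) * len(keywords)
        for category, keywords in matched.items()
    )


score = score_from_categories(
    {"tax_fraud": ["GST refund", "GSTIN"], "urgency": ["immediately"]}
)
print(score)
```

Whatever the real structure looks like, the point is the same: a new category contributes nothing extra to the score until it has an entry of its own.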
Step 6: Write Parametrized Tests
Create comprehensive tests in tests/guvi/test_extractors.py:
import pytest

from extractors.regex_patterns import extract_gstin


class TestGSTINExtraction:
    @pytest.mark.parametrize(
        "text,expected",
        [
            # Valid GSTIN
            ("GST number: 27AADCB2230M1ZR", ["27AADCB2230M1ZR"]),
            ("GSTIN: 07AAECR4756Q1Z2", ["07AAECR4756Q1Z2"]),
            # Multiple GSTINs (results come back sorted)
            (
                "Seller: 27AADCB2230M1ZR, Buyer: 09AAECI3721N1Z8",
                ["09AAECI3721N1Z8", "27AADCB2230M1ZR"],
            ),
            # Invalid state code (99 is not a valid state)
            ("GSTIN 99AADCB2230M1ZR", []),
            # Too short / too long
            ("27AADCB2230M1Z", []),  # Missing check digit
            ("27AADCB2230M1ZRX", []),  # Extra character
            # No GSTIN present
            ("No tax info here", []),
            ("Random numbers 1234567890", []),
            # Case insensitivity (should normalize to uppercase)
            ("gstin: 27aadcb2230m1zr", ["27AADCB2230M1ZR"]),
            # Embedded in longer text
            (
                "Please verify your GST registration 27AADCB2230M1ZR before proceeding",
                ["27AADCB2230M1ZR"],
            ),
        ],
    )
    def test_gstin_extraction(self, text, expected):
        result = extract_gstin(text)
        assert result == expected

    def test_gstin_not_confused_with_pan(self):
        """PAN is 10 characters; GSTIN is 15. Ensure no cross-contamination."""
        text = "PAN: ABCPD1234F is different from GSTIN: 27ABCPD1234F1Z5"
        result = extract_gstin(text)
        assert "ABCPD1234F" not in result
        # Only the full 15-character GSTIN should be returned
        assert result == ["27ABCPD1234F1Z5"]
Test Coverage Checklist
- [ ] Standard valid patterns (at least 3 examples)
- [ ] Multiple matches in one text
- [ ] No matches when pattern is absent
- [ ] Invalid format variations (too short, too long, wrong characters)
- [ ] False positive prevention (similar patterns that should NOT match)
- [ ] Case handling (uppercase, lowercase, mixed)
- [ ] Pattern embedded in surrounding text
- [ ] Edge cases specific to your pattern (e.g., boundary digits for state codes)
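For the last checklist item, a boundary sweep over the state-code range might look like the sketch below. To keep it runnable on its own, it uses a simplified local stand-in for the Step 1 extractor rather than importing from the project. In the real test file you would import extract_gstin and could express the cases with pytest.mark.parametrize instead of a loop.

```python
import re
from typing import Dict, List

# Simplified local stand-in for the Step 1 extractor (assumption: same rules).
_GSTIN_PATTERN = re.compile(r'\b(\d{2}[A-Z]{5}\d{4}[A-Z]\dZ[A-Z0-9])\b')
_VALID_STATE_CODES = {str(i).zfill(2) for i in range(1, 38)}


def extract_gstin(text: str) -> List[str]:
    matches = _GSTIN_PATTERN.findall(text.upper())
    return sorted({m for m in matches if m[:2] in _VALID_STATE_CODES})


# (state code, should the extractor accept it?)
BOUNDARY_CASES = [
    ("00", False),  # just below the valid range
    ("01", True),   # lowest valid code
    ("37", True),   # highest valid code
    ("38", False),  # just above the valid range
]

results: Dict[str, List[str]] = {}
for code, should_match in BOUNDARY_CASES:
    candidate = f"{code}AADCB2230M1ZR"
    found = extract_gstin(f"GSTIN {candidate}")
    results[code] = found
    assert (found == [candidate]) is should_match, code

print(results)
```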
Step 7: Test Edge Cases
Pay special attention to false positives. Common traps:
Overlapping Patterns
def test_gstin_does_not_match_random_alphanumeric(self):
    """15-char alphanumeric strings that do not follow the GSTIN layout are ignored."""
    text = "Reference: AB12345678901CD"  # Not a real GSTIN
    result = extract_gstin(text)
    assert result == []
Interaction with Other Extractors
def test_no_cross_contamination_with_bank_accounts(self):
    """Ensure GSTIN digits aren't extracted as bank accounts."""
    from extractors.regex_patterns import extract_bank_accounts

    text = "GSTIN: 27AADCB2230M1ZR"
    banks = extract_bank_accounts(text)
    assert banks == []  # The digits in GSTIN should not match
Complete Example: Evidence Flow
Here is how a new extractor integrates with the full pipeline:
1. Scammer sends: "Verify your GST: 27AADCB2230M1ZR immediately"
2. Handler calls _extract_evidence_from_full_conversation()
   → extract_all_evidence(text)
   → extract_gstin(text) returns ["27AADCB2230M1ZR"]
   → transform_to_guvi_format(evidence)
   → {"gstinNumbers": ["27AADCB2230M1ZR"], ...}
3. ExtractedIntelligence model created:
   → evidence.gstinNumbers == ["27AADCB2230M1ZR"]
4. Evidence stored in Firestore session + evidence_index
5. Response includes:
   → "extractedIntelligence": {"gstinNumbers": ["27AADCB2230M1ZR"], ...}
6. Callback includes cumulative evidence
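The first three steps of this flow can be approximated in a few lines. The sketch below uses minimal stand-ins that share names with the real functions but handle only the GSTIN field. It substitutes a dataclass for the Pydantic model to avoid a third-party dependency, and it leaves out state-code validation, Firestore, and the callback entirely.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List

_GSTIN = re.compile(r'\b(\d{2}[A-Z]{5}\d{4}[A-Z]\dZ[A-Z0-9])\b')


def extract_all_evidence(text: str) -> Dict[str, Any]:
    """Stand-in that only extracts GSTINs (snake_case keys)."""
    return {"gstin_numbers": sorted(set(_GSTIN.findall(text.upper())))}


def transform_to_guvi_format(evidence: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in mapping for the single field used here (camelCase keys)."""
    return {"gstinNumbers": evidence.get("gstin_numbers", [])}


@dataclass
class ExtractedIntelligence:
    """Dataclass substitute for the Pydantic model of the same name."""
    gstinNumbers: List[str] = field(default_factory=list)


message = "Verify your GST: 27AADCB2230M1ZR immediately"
evidence = extract_all_evidence(message)
intel = ExtractedIntelligence(**transform_to_guvi_format(evidence))
print(intel.gstinNumbers)
```

Running a miniature version like this before touching the handler is a quick way to confirm that the snake_case key, the camelCase key, and the model field name all line up.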
Existing Extractors Reference
| Extractor | Function | Pattern Examples |
|---|---|---|
| UPI IDs | extract_upi_ids() | fraud@oksbi, scammer@paytm |
| Bank Accounts | extract_bank_accounts() | 9-18 digit numbers with context |
| Phone Numbers | extract_phone_numbers() | 9876543210, +91-8765432109, 011-23456789 |
| Emails | extract_emails() | fake@domain.com |
| URLs | extract_urls() | http://..., www.scam.xyz |
| Amounts | extract_amounts() | Rs. 50,000, 5 lakh rupees |
| IFSC Codes | extract_ifsc_codes() | SBIN0001234 |
| Aadhaar Numbers | extract_aadhaar_numbers() | 12 digits, Verhoeff-validated |
| PAN Numbers | extract_pan_numbers() | ABCPD1234F |
| Crypto Wallets | extract_crypto_wallets() | BTC, ETH, Tron addresses |
| Case IDs | extract_case_ids() | FIR-2025-12345, CBI-2025-1234 |
| Policy Numbers | extract_policy_numbers() | POL12345678, LIC policy 12345678 |
| Order Numbers | extract_order_numbers() | OD123456789, AWB1234567890 |