Pattern Extraction
Matchy includes a high-performance pattern extractor for finding domains, IP addresses (IPv4 and IPv6), and email addresses in unstructured text like log files.
Overview
The PatternExtractor
uses SIMD-accelerated algorithms to scan text and extract patterns at 200-500 MB/sec. This is useful for:
- Log scanning: Find domains/IPs in access logs, firewall logs, etc.
- Threat detection: Extract indicators from security logs
- Analytics: Count unique domains/IPs in large datasets
- Compliance: Find email addresses or PII in audit logs
- Forensics: Extract patterns from binary logs
Quick Start
#![allow(unused)] fn main() { use matchy::extractor::PatternExtractor; let extractor = PatternExtractor::new()?; let log_line = b"2024-01-15 GET /api evil.example.com 192.168.1.1"; for match_item in extractor.extract_from_line(log_line) { println!("Found: {}", match_item.as_str(log_line)); } // Output: // Found: evil.example.com // Found: 192.168.1.1 }
Supported Patterns
Domains
Extracts fully qualified domain names with TLD validation:
#![allow(unused)] fn main() { let line = b"Visit api.example.com or https://www.github.com/path"; for match_item in extractor.extract_from_line(line) { if let ExtractedItem::Domain(domain) = match_item.item { println!("Domain: {}", domain); } } // Output: // Domain: api.example.com // Domain: www.github.com }
Features:
- TLD validation: 3.6M+ real TLDs from Public Suffix List
- Unicode support: Handles münchen.de, café.fr (with punycode)
- Subdomain extraction: Extracts full domain from URLs
- Word boundaries: Avoids false positives in non-domain text
IPv4 Addresses
Extracts all valid IPv4 addresses:
#![allow(unused)] fn main() { let line = b"Traffic from 10.0.0.5 to 172.16.0.10"; for match_item in extractor.extract_from_line(line) { if let ExtractedItem::Ipv4(ip) = match_item.item { println!("IP: {}", ip); } } // Output: // IP: 10.0.0.5 // IP: 172.16.0.10 }
Features:
- SIMD-accelerated: Uses
memchr
for fast dot detection - Validation: Rejects invalid IPs (256.1.1.1, 999.0.0.1)
- Word boundaries: Avoids false matches in version numbers
IPv6 Addresses
Extracts all valid IPv6 addresses:
#![allow(unused)] fn main() { let line = b"Server at 2001:db8::1 responded from fe80::1"; for match_item in extractor.extract_from_line(line) { if let ExtractedItem::Ipv6(ip) = match_item.item { println!("IPv6: {}", ip); } } // Output: // IPv6: 2001:db8::1 // IPv6: fe80::1 }
Features:
- SIMD-accelerated: Uses
memchr
for fast colon detection - Compressed notation: Handles
::
and full addresses - Validation: Full RFC 4291 compliance via Rust's
Ipv6Addr
- Mixed notation: Supports
::ffff:127.0.0.1
format
Email Addresses
Extracts RFC 5322-compliant email addresses:
#![allow(unused)] fn main() { let line = b"Contact alice@example.com or bob+tag@company.org"; for match_item in extractor.extract_from_line(line) { if let ExtractedItem::Email(email) = match_item.item { println!("Email: {}", email); } } // Output: // Email: alice@example.com // Email: bob+tag@company.org }
Features:
- Plus addressing: Supports user+tag@example.com
- Subdomain validation: Checks domain part for valid TLD
Configuration
Customize extraction behavior using the builder pattern:
#![allow(unused)] fn main() { use matchy::extractor::PatternExtractor; let extractor = PatternExtractor::builder() .extract_domains(true) // Enable domain extraction .extract_ipv4(true) // Enable IPv4 extraction .extract_ipv6(true) // Enable IPv6 extraction .extract_emails(false) // Disable email extraction .min_domain_labels(3) // Require 3+ labels (api.test.com) .require_word_boundaries(true) // Enforce word boundaries .build()?; }
Configuration Options
Option | Default | Description |
---|---|---|
extract_domains | true | Extract domain names |
extract_ipv4 | true | Extract IPv4 addresses |
extract_ipv6 | true | Extract IPv6 addresses |
extract_emails | true | Extract email addresses |
min_domain_labels | 2 | Minimum labels (2 = example.com, 3 = api.example.com) |
require_word_boundaries | true | Ensure patterns have word boundaries |
Unicode and IDN Support
The extractor handles Unicode domains automatically:
#![allow(unused)] fn main() { let line = "Visit münchen.de or café.fr".as_bytes(); for match_item in extractor.extract_from_line(line) { if let ExtractedItem::Domain(domain) = match_item.item { println!("Unicode domain: {}", domain); } } // Output: // Unicode domain: münchen.de // Unicode domain: café.fr }
How it works:
- Extracts Unicode text as-is
- Validates TLD using punycode conversion internally
- Returns original Unicode form (not punycode)
Binary Log Support
The extractor can find ASCII patterns in binary data:
#![allow(unused)] fn main() { let mut binary_log = Vec::new(); binary_log.extend_from_slice(b"Log: "); binary_log.push(0xFF); // Invalid UTF-8 binary_log.extend_from_slice(b" evil.com "); for match_item in extractor.extract_from_line(&binary_log) { println!("Found in binary: {}", match_item.as_str(&binary_log)); } // Output: // Found in binary: evil.com }
This is useful for scanning:
- Binary protocol logs
- Corrupted text files
- Mixed encoding logs
Performance
The extractor is highly optimized:
- Throughput: 200-500 MB/sec on typical log files
- SIMD acceleration: Uses
memchr
for byte scanning - Zero-copy: No string allocation until match
- Lazy UTF-8 validation: Only validates matched patterns
Performance Tips
-
Disable unused extractors to reduce overhead:
#![allow(unused)] fn main() { let extractor = PatternExtractor::builder() .extract_ipv4(true) // Only extract IPv4 .extract_ipv6(true) // Only extract IPv6 .extract_domains(false) .extract_emails(false) .build()?; }
-
Process line-by-line for better memory usage:
#![allow(unused)] fn main() { for line in BufReader::new(file).lines() { for match_item in extractor.extract_from_line(line?.as_bytes()) { // Process match } } }
-
Use byte slices to avoid UTF-8 conversion:
#![allow(unused)] fn main() { // Fast: no UTF-8 validation on whole line extractor.extract_from_line(line_bytes) // Slower: validates entire line as UTF-8 first extractor.extract_from_line(line_str.as_bytes()) }
CLI Integration
The matchy match
command uses the extractor internally:
# Scan logs for threats (outputs JSON to stdout)
matchy match threats.mxy access.log
# Each match is a JSON line:
# {"timestamp":"123.456","line_number":1,"matched_text":"evil.com","match_type":"pattern",...}
# {"timestamp":"123.789","line_number":2,"matched_text":"1.2.3.4","match_type":"ip",...}
# Show statistics (to stderr)
matchy match threats.mxy access.log --stats
# Statistics output (stderr):
# [INFO] Lines processed: 15,234
# [INFO] Lines with matches: 127 (0.8%)
# [INFO] Throughput: 450.23 MB/s
See matchy match for CLI details.
Examples
Complete working examples:
examples/extractor_demo.rs
: Demonstrates all extraction featuressrc/bin/matchy.rs
: Seecmd_match()
for CLI implementation
Run the demo:
cargo run --release --example extractor_demo
Summary
- High performance: 200-500 MB/sec throughput
- SIMD-accelerated: Fast pattern finding
- Unicode support: Handles international domains
- Binary logs: Extracts ASCII from non-UTF-8
- Zero-copy: Efficient memory usage
- Configurable: Customize extraction behavior
Pattern extraction makes it easy to scan large log files and find security indicators.