Input Formats Reference
Technical specification of supported input formats for building Matchy databases.
Overview
Matchy supports four input formats:
- Text - Simple line-based
- CSV - Comma-separated with metadata
- JSON - Structured data
- MISP - Threat intelligence format
All formats support mixing IPs, patterns, and exact strings.
Text Format
Specification
file = (entry | comment | blank)* ;
entry = ip | cidr | pattern | exact ;
comment = "#" .* "\n" ;
blank = "\n" ;
ip = ipv4 | ipv6 ;
ipv4 = digit{1,3} "." digit{1,3} "." digit{1,3} "." digit{1,3} ;
ipv6 = /* RFC 4291 IPv6 address */ ;
cidr = ip "/" digit{1,3} ;
pattern = .* ( "*" | "?" | "[" ) .* ;
exact = .* ;
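As a rough illustration of the grammar above, the following Python sketch splits text-format input into entries, skipping comment lines and blank or whitespace-only lines. The name `iter_entries` is hypothetical. Trimming surrounding whitespace from each line (and therefore accepting indented comments) is an assumption, since the grammar does not say how Matchy treats it.

```python
def iter_entries(text):
    """Yield entry lines from text-format input (illustrative sketch only)."""
    for line in text.splitlines():
        # Assumption: leading/trailing whitespace is not significant.
        stripped = line.strip()
        if not stripped:
            continue  # blank or whitespace-only line
        if stripped.startswith("#"):
            continue  # comment line
        yield stripped


sample = "# IPv4 addresses\n192.0.2.1\n\n   \n*.example.com\n"
entries = list(iter_entries(sample))
```

Here `entries` holds only the two non-comment, non-blank lines.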
Entry Classification
Entries are automatically classified:
- Contains / → CIDR range
- Valid IPv4/IPv6 → IP address
- Contains *, ?, or [ → Glob pattern
- Otherwise → Exact string
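The classification rules above can be sketched in Python with the standard `ipaddress` module. This is not Matchy's implementation. The function name `classify` and the returned labels are hypothetical, and one behavior is assumed: an entry that contains a slash but does not parse as a network falls through to the remaining checks, so that a URL pattern such as http://*/admin/* is treated as a glob, as in the examples in this section.

```python
import ipaddress

GLOB_CHARS = ("*", "?", "[")


def classify(entry):
    """Return 'cidr', 'ip', 'glob', or 'exact' for one entry (sketch)."""
    if "/" in entry:
        try:
            ipaddress.ip_network(entry, strict=False)
            return "cidr"
        except ValueError:
            pass  # assumption: not a network, keep checking (e.g. URLs)
    try:
        ipaddress.ip_address(entry)
        return "ip"
    except ValueError:
        pass  # not a plain IPv4/IPv6 address
    if any(ch in entry for ch in GLOB_CHARS):
        return "glob"
    return "exact"
```

Note that Python's `ipaddress` parser may be stricter or looser than Matchy's own IP parsing on edge cases; the sketch only shows the order of the checks.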
Type Prefixes
Override auto-detection with explicit type prefixes:
Prefix | Type | Example |
---|---|---|
literal: | Exact string | literal:*.txt |
glob: | Pattern | glob:test.com |
ip: | IP/CIDR | ip:10.0.0.1 |
The prefix is automatically stripped before storage:
literal:file*.txt # Stored as exact string "file*.txt"
glob:simple.com # Stored as pattern "simple.com"
ip:192.168.1.1 # Stored as IP address 192.168.1.1
See Entry Types - Prefix Technique for details.
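A minimal Python sketch of prefix handling, assuming the three prefixes in the table are matched literally at the start of the entry. The names `PREFIXES` and `split_prefix` are hypothetical. Matching only the known prefixes matters because IPv6 addresses also contain colons.

```python
# Hypothetical mapping from prefix to the type it forces.
PREFIXES = {"literal:": "exact", "glob:": "glob", "ip:": "ip"}


def split_prefix(entry):
    """Return (forced_type, value); forced_type is None without a prefix."""
    for prefix, forced_type in PREFIXES.items():
        if entry.startswith(prefix):
            # Strip the prefix so only the value itself is stored.
            return forced_type, entry[len(prefix):]
    return None, entry
```

For example, `split_prefix("literal:file*.txt")` yields the exact string `file*.txt`, while `split_prefix("2001:db8::1")` leaves the address untouched for auto-detection.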
Examples
# IPv4 addresses
192.0.2.1
10.0.0.1
# IPv6 addresses
2001:db8::1
::1
# CIDR ranges
10.0.0.0/8
192.168.0.0/16
2001:db8::/32
# Glob patterns
*.example.com
test-*.domain.com
http://*/admin/*
[a-z]*.evil.com
# Exact strings
exact.match.com
specific-domain.com
Limitations
- No metadata support
- No per-entry JSON data
- Whitespace-only lines ignored
- UTF-8 encoding required
CLI Usage
matchy build -o output.mxy input.txt
CSV Format
Specification
file = header row* ;
header = ("entry" | "key") ("," column_name)* "\n" ;
row = entry_value ("," value)* "\n" ;
Required Columns
Column | Required | Description |
---|---|---|
entry or key | Yes | IP, pattern, or exact string |
Other columns | No | Converted to JSON metadata |
Data Type Mapping
CSV Value | JSON Type |
---|---|
"text" | String |
123 | Number |
true / false | Boolean |
Empty | Null |
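The mapping in this table could be approximated as follows. This is a sketch under stated assumptions, not Matchy's parser: the function name `convert_value` is hypothetical, `true` and `false` are assumed to be case-sensitive, and a number is assumed to be an optional minus sign, digits, and an optional decimal fraction.

```python
import re

# Assumption: what counts as a number is not specified in the table.
NUMBER = re.compile(r"-?\d+(\.\d+)?")


def convert_value(raw):
    """Map one CSV cell (already unquoted) to a JSON-compatible value."""
    if raw == "":
        return None  # Empty -> Null
    if raw in ("true", "false"):
        return raw == "true"  # Boolean
    if NUMBER.fullmatch(raw):
        return float(raw) if "." in raw else int(raw)  # Number
    return raw  # everything else stays a String
```

Whether a quoted cell such as "123" stays a string in Matchy is not stated here; the sketch works on unquoted cell text and cannot tell the difference.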
Examples
Simple CSV
entry,category,threat_level
192.0.2.1,malware,high
*.phishing.com,phishing,medium
exact.com,suspicious,low
Generates (first row shown):
{
"192.0.2.1": {
"category": "malware",
"threat_level": "high"
}
}
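To make the row-to-metadata step concrete, here is a small Python sketch using the standard `csv` module. The function name `csv_to_entries` is hypothetical, cell values are left as strings (type conversion is omitted), and preferring the entry column over key when both exist is an assumption.

```python
import csv
import io


def csv_to_entries(text):
    """Map each CSV row to {entry: metadata} (illustrative sketch)."""
    reader = csv.DictReader(io.StringIO(text))
    fields = reader.fieldnames or []
    if "entry" in fields:
        key_column = "entry"
    elif "key" in fields:
        key_column = "key"
    else:
        raise ValueError("CSV needs an 'entry' or 'key' column")
    result = {}
    for row in reader:
        entry = row.pop(key_column)
        result[entry] = row  # remaining columns become the metadata object
    return result


sample = (
    "entry,category,threat_level\n"
    "192.0.2.1,malware,high\n"
    "*.phishing.com,phishing,medium\n"
    "exact.com,suspicious,low\n"
)
entries = csv_to_entries(sample)
```

With the Simple CSV input, `entries["192.0.2.1"]` matches the metadata object shown above, and the other two rows are mapped the same way.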
Complex CSV
entry,type,score,tags,verified
10.0.0.1,botnet,95,"c2,trojan",true
*.evil.com,phishing,87,spam,false
CSV with Type Prefixes
entry,category,note
literal:test[1].txt,filesystem,Filename with brackets
glob:*.example.com,domain,Pattern match
ip:192.168.1.0/24,network,Private range
Quoting Rules
- Values with commas must be quoted: "value,with,comma"
- Quotes inside values are doubled: "value with ""quote"""
- Empty values allowed: entry,,value
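When generating CSV input programmatically, it is usually safer to let a CSV library apply these quoting rules than to build lines by hand. The sketch below uses Python's standard `csv.writer`, whose default minimal quoting produces the same conventions. The column names and values are made up for illustration, and it is assumed that Matchy accepts this output like any other CSV that follows the rules above.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["entry", "tags", "note"])
# A comma forces quoting; an embedded quote is doubled.
writer.writerow(["10.0.0.1", "c2,trojan", 'say "hi"'])
# An empty value is written as an empty field.
writer.writerow(["*.evil.com", "", "spam"])

output = buf.getvalue()
print(output)
```

Reading `output` back with `csv.reader` returns the original values, which is a quick way to check a generated file before building.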
CLI Usage
matchy build -i csv -o output.mxy input.csv
JSON Format
Specification
// Object format (recommended)
{
"entry1": { /* metadata */ },
"entry2": { /* metadata */ },
...
}
// Array format
[
{ "entry": "entry1", /* metadata */ },
{ "entry": "entry2", /* metadata */ },
...
]
Object Format (Recommended)
- Keys are entries (IPs, patterns, strings)
- Values are metadata objects
{
"192.0.2.1": {
"category": "malware",
"threat_level": "high",
"first_seen": "2024-01-15",
"tags": ["botnet", "c2"]
},
"*.phishing.com": {
"category": "phishing",
"threat_level": "medium",
"verified": true
},
"10.0.0.0/8": {
"category": "internal",
"allow": true
}
}
Array Format
Each object must have an entry or key field:
[
{
"entry": "192.0.2.1",
"category": "malware",
"score": 95
},
{
"entry": "*.evil.com",
"category": "phishing",
"score": 87
}
]
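The array format carries the same information as the object format, with the entry moved into a field. A Python sketch of converting one to the other follows. The function name `array_to_object` is hypothetical, and both the equivalence of the two layouts and the precedence of entry over key are assumptions.

```python
def array_to_object(items):
    """Convert array-format items to an object-format dict (sketch)."""
    result = {}
    for item in items:
        metadata = dict(item)  # copy so the input is not modified
        if "entry" in metadata:
            entry = metadata.pop("entry")
        elif "key" in metadata:
            entry = metadata.pop("key")
        else:
            raise ValueError("array item has neither 'entry' nor 'key'")
        result[entry] = metadata  # remaining fields are the metadata
    return result


items = [
    {"entry": "192.0.2.1", "category": "malware", "score": 95},
    {"entry": "*.evil.com", "category": "phishing", "score": 87},
]
converted = array_to_object(items)
```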
Array Format with Type Prefixes
[
{
"entry": "literal:file*.backup",
"category": "filesystem",
"note": "Match literal asterisk"
},
{
"entry": "glob:example.com",
"category": "domain",
"note": "Force pattern matching"
},
{
"entry": "ip:10.0.0.0/8",
"category": "network",
"note": "Explicit IP range"
}
]
Supported Types
JSON Type | Stored As | Notes |
---|---|---|
string | UTF-8 string | Max 64KB |
number | Float64 or Int32 | Depends on value |
boolean | Boolean | 1 byte |
null | Null marker | 1 byte |
array | Array | Nested arrays supported |
object | Map | Nested objects supported |
Nested Structures
{
"192.0.2.1": {
"threat": {
"category": "malware",
"subcategory": "trojan",
"details": {
"variant": "emotet",
"version": "3.2"
}
},
"tags": ["c2", "botnet", "high-confidence"],
"scores": {
"static": 95,
"dynamic": 87,
"reputation": 92
}
}
}
CLI Usage
matchy build -i json -o output.mxy input.json
MISP Format
Specification
Subset of MISP (Malware Information Sharing Platform) JSON format.
{
"Event": {
"Attribute": [
{
"type": "ip-dst" | "domain" | "url" | /* ... */,
"value": string,
"category": string,
"comment": string,
/* ... additional MISP fields */
}
]
}
}
Supported Attribute Types
MISP Type | Matchy Classification |
---|---|
ip-src, ip-dst | IP address |
ip-src\|port, ip-dst\|port | IP address (port ignored) |
domain, hostname | Exact string or pattern |
url | Pattern if contains wildcards |
email | Pattern if contains wildcards |
other | Auto-detect |
Example
{
"Event": {
"info": "Malware Campaign 2024-01",
"Attribute": [
{
"type": "ip-dst",
"value": "192.0.2.1",
"category": "Network activity",
"comment": "C2 server",
"to_ids": true
},
{
"type": "domain",
"value": "evil.example.com",
"category": "Network activity",
"comment": "Phishing domain"
},
{
"type": "url",
"value": "http://*/admin/config.php",
"category": "Payload delivery",
"comment": "Malicious URL pattern"
}
]
}
}
Metadata Extraction
MISP attributes are converted to Matchy metadata:
{
"misp_type": "ip-dst",
"misp_category": "Network activity",
"misp_comment": "C2 server",
"misp_to_ids": true
}
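A sketch of this conversion in Python, assuming that only the four fields shown above are carried over, that fields missing from an attribute are simply omitted, and that composite ip|port values use the usual MISP "address|port" form. The function name `misp_to_entries` is hypothetical.

```python
def misp_to_entries(document):
    """Build {entry: metadata} from a MISP event document (sketch)."""
    entries = {}
    for attr in document["Event"].get("Attribute", []):
        value = attr["value"]
        if attr.get("type") in ("ip-src|port", "ip-dst|port"):
            value = value.split("|", 1)[0]  # port is ignored
        metadata = {}
        for field in ("type", "category", "comment", "to_ids"):
            if field in attr:
                metadata["misp_" + field] = attr[field]
        entries[value] = metadata
    return entries


event = {
    "Event": {
        "Attribute": [
            {"type": "ip-dst", "value": "192.0.2.1",
             "category": "Network activity", "comment": "C2 server",
             "to_ids": True},
            {"type": "ip-dst|port", "value": "198.51.100.7|443",
             "category": "Network activity"},
        ]
    }
}
entries = misp_to_entries(event)
```

For the first attribute, `entries["192.0.2.1"]` equals the metadata object shown above; the second attribute is stored under the address alone.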
CLI Usage
matchy build -i misp -o output.mxy threat-feed.json
Format Comparison
Feature | Text | CSV | JSON | MISP |
---|---|---|---|---|
Metadata | ❌ | ✅ Simple | ✅ Rich | ✅ Structured |
Nested data | ❌ | ❌ | ✅ | ✅ |
Arrays | ❌ | ❌ | ✅ | ✅ |
Auto-type | ✅ | ✅ | ✅ | Partial |
Size | Smallest | Small | Medium | Large |
Readability | High | High | Medium | Low |
Standard | No | RFC 4180 | RFC 8259 | MISP spec |
Auto-Detection
By Extension
Extension | Format |
---|---|
.txt | Text |
.csv | CSV |
.json | JSON (auto-detect object vs. array) |
.misp | MISP |
By Content
If the extension is unknown, Matchy inspects the content:
- Starts with { → JSON or MISP
- Starts with [ → JSON array
- Contains , → CSV
- Otherwise → Text
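The two detection steps can be sketched together in Python. This is an illustration rather than Matchy's logic: the function name `detect_format` is hypothetical, leading whitespace is assumed to be ignored, and the way JSON and MISP are told apart (a top-level Event key) is an assumption based on the MISP structure shown earlier.

```python
import json
from pathlib import Path

EXTENSIONS = {".txt": "text", ".csv": "csv", ".json": "json", ".misp": "misp"}


def detect_format(filename, content):
    """Guess the input format from the extension, then the content."""
    ext = Path(filename).suffix.lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    stripped = content.lstrip()
    if stripped.startswith("{"):
        try:
            doc = json.loads(stripped)
        except ValueError:
            return "json"  # let the JSON parser report the real error
        # Assumption: a top-level "Event" key marks a MISP document.
        return "misp" if isinstance(doc, dict) and "Event" in doc else "json"
    if stripped.startswith("["):
        return "json"
    if "," in content:
        return "csv"
    return "text"
```

Passing -i explicitly, as in the CLI Usage examples, avoids relying on detection at all.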
Character Encoding
Requirement
All formats must be UTF-8 encoded.
Validation
- Automatic UTF-8 validation during build
- Invalid UTF-8 → build error
- Use --trusted to skip validation (unsafe)
BOM Handling
UTF-8 BOM (Byte Order Mark) is:
- Detected and skipped
- Not required
- Not preserved in database
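A pre-check along these lines can be run before building. The Python sketch below skips a leading UTF-8 BOM and reports the byte offset of the first invalid sequence. The function name `decode_input` is hypothetical, and the error wording only imitates the message shown under Error Handling; it is not produced by Matchy.

```python
import codecs


def decode_input(raw):
    """Decode raw bytes as UTF-8, skipping a leading BOM (sketch)."""
    offset = 0
    if raw.startswith(codecs.BOM_UTF8):
        offset = len(codecs.BOM_UTF8)  # BOM is detected and skipped
    try:
        return raw[offset:].decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is relative to the slice, so add the skipped BOM length.
        position = offset + err.start
        raise ValueError(f"Invalid UTF-8 at byte offset {position}") from err
```

For a file on disk, `decode_input(open(path, "rb").read())` either returns the text or raises with the offending offset.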
Size Limits
Component | Limit | Notes |
---|---|---|
File size | 4GB | Total input file |
Entry key | 64KB | Single IP/pattern/string |
JSON value | 16MB | Per-entry metadata |
Entries | 4 billion | Total entries in database |
Error Handling
Parse Errors
$ matchy build -i csv bad.csv
Error: Parse error at line 42: Unclosed quote
Encoding Errors
$ matchy build input.txt
Error: Invalid UTF-8 at byte offset 1234
Format Errors
$ matchy build -i json bad.json
Error: Expected object or array at root
Best Practices
Choose the Right Format
- Text: Simple lists without metadata
- CSV: Tabular data with simple metadata
- JSON: Rich structured metadata
- MISP: Threat intelligence feeds
Optimize for Size
- Use text format when no metadata needed
- Avoid deeply nested JSON
- Keep metadata minimal
- Compress input files (gzip)
Validate Before Building
# Validate CSV
csv-validator input.csv
# Validate JSON
jq empty input.json
# Test build
matchy build --dry-run input.json
See Also
- Input Formats Guide - User-friendly examples
- matchy build command - Build command reference
- Database Builder API - Programmatic building
- Data Types Reference - Supported data types