A smart log processing pipeline where logs, regardless of source, structure, or format, are:
- Automatically analyzed and understood
- Matched against known or discovered structures
- Converted into clean JSON for downstream use (RAG, dashboards, alerts)
- Continuously improved by learning from what it fails to parse
Status: Implemented
- Uses manually defined regex patterns for known formats (Apache, Syslog, SSH, etc.)
- Converts matching log lines into JSONL
- Logs that do not match are skipped and stored separately
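A minimal sketch of this stage, assuming a hypothetical pattern bank with named capture groups (the `ssh_failed_login` pattern and file paths below are illustrative, not the repo's actual contents):

```python
import json
import re

# Illustrative pattern bank; the real one lives in Patterns/live_parser_patterns.json.
PATTERNS = {
    "ssh_failed_login": re.compile(
        r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ sshd\[\d+\]: "
        r"Failed password for (?P<user>\S+) from (?P<ip>\S+)"
    ),
}

def parse_line(line: str) -> dict | None:
    """Try every known pattern; return the named groups of the first match."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return {"pattern": name, **match.groupdict()}
    return None  # caller routes the line to SkippedLogs/

with open("logs/auth.log") as src, open("ParsedLogs/auth.jsonl", "w") as dst:
    for line in src:
        record = parse_line(line)
        if record:
            dst.write(json.dumps(record) + "\n")
```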
Goal: Track all unmatched lines for improvement
Features:
- Saves unparsed lines to `SkippedLogs/`
- Records file name and line number for traceability
- Enables continuous learning and correction
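A possible shape for this tracking step (the JSONL schema and the `skipped.jsonl` filename are assumptions for illustration):

```python
import json

def record_skip(line: str, source_file: str, line_number: int) -> None:
    """Append an unmatched line, plus trace info, for later reprocessing."""
    entry = {
        "file": source_file,         # which input file the line came from
        "line_number": line_number,  # 1-based position for traceability
        "raw": line.rstrip("\n"),    # the original text, untouched
    }
    with open("SkippedLogs/skipped.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
```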
Goal: Dynamically extract structure from unknown log formats using open-source LLMs like Mistral, Gemma, or LLaMA3.
Steps:
- Pass skipped lines to an LLM with a prompt like:
  ```
  You are a log analysis assistant. Given the following log line, extract:
  - timestamp
  - level
  - message
  Return the output as JSON.
  ```
- Cache and validate LLM outputs
- Add validated outputs to a training set or the deployable pattern bank
Benefits:
- Removes the need to hand-write new regexes
- Handles unstructured, unknown, or mixed-format logs
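As a concrete sketch of the first step, here is one way to run the prompt above against a local model served by Ollama (the `/api/generate` endpoint and payload follow Ollama's documented API; the model name and timeout are assumptions, and output validation is elided):

```python
import json
import requests

PROMPT_TEMPLATE = (
    "You are a log analysis assistant. Given the following log line, extract:\n"
    "- timestamp\n- level\n- message\n"
    "Return the output as JSON.\n\nLog line: {line}"
)

def extract_schema(line: str, model: str = "mistral") -> dict:
    """Ask a local LLM (via Ollama) to structure one skipped log line."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": PROMPT_TEMPLATE.format(line=line),
            "format": "json",  # constrain the model to emit valid JSON
            "stream": False,
        },
        timeout=60,
    )
    response.raise_for_status()
    # Ollama wraps the model's text in a "response" field; parse it as JSON.
    return json.loads(response.json()["response"])

print(extract_schema("Jul 14 02:11:09 host sshd[411]: Failed password for root"))
```

Caching the `(line, model)` → output mapping avoids re-querying the LLM for duplicate skipped lines.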
Goal: Automatically learn templates and clusters from logs
Features:
- Use Drain3 to:
  - Discover static and dynamic fields
  - Group logs into clusters
  - Mine templates like `User * logged in from *`
- Store mined templates for downstream use or learning
- Use clustering insights to guide new pattern or anomaly detection
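A minimal sketch of this stage using Drain3's standard `TemplateMiner` API, persisting the cluster tree to the repo's `drain3_snapshot.json` (the input path is illustrative; note that Drain3 renders wildcards as `<*>` rather than `*`):

```python
from drain3 import TemplateMiner
from drain3.file_persistence import FilePersistence

# Persist the cluster tree so mined templates survive restarts.
persistence = FilePersistence("drain3_snapshot.json")
miner = TemplateMiner(persistence)

with open("SkippedLogs/skipped.txt") as f:
    for line in f:
        result = miner.add_log_message(line.strip())
        # result reports cluster_id, change_type, template_mined, etc.
        if result["change_type"] != "none":
            print(f'cluster {result["cluster_id"]}: {result["template_mined"]}')

# Dump every mined template, e.g. "User <*> logged in from <*>".
for cluster in miner.drain.clusters:
    print(cluster.get_template())
```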
Goal: Build a self-improving parser system
How:
- Reprocess skipped lines periodically
- Generate new patterns from LLM or Drain3
- Validate outputs with scoring or confidence thresholds
- Add verified patterns to `live_parser_patterns.json`
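One way the validate-and-promote step could look (the 0.9 threshold, the helper names, and the flat `{name: regex}` layout of `live_parser_patterns.json` are all assumptions):

```python
import json
import re

def validate_pattern(regex: str, sample_lines: list[str], threshold: float = 0.9) -> bool:
    """Accept a candidate pattern only if it matches enough of its source lines."""
    try:
        pattern = re.compile(regex)
    except re.error:
        return False  # the LLM/Drain3 candidate was not a valid regex
    hits = sum(1 for line in sample_lines if pattern.search(line))
    return hits / len(sample_lines) >= threshold

def promote_pattern(name: str, regex: str,
                    path: str = "Patterns/live_parser_patterns.json") -> None:
    """Append a verified pattern to the live pattern bank."""
    with open(path) as f:
        patterns = json.load(f)
    patterns[name] = regex
    with open(path, "w") as f:
        json.dump(patterns, f, indent=2)
```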
| Feature | Description |
|---|---|
| Accuracy scoring | Manual or LLM-assisted evaluation of parsed output |
| Confidence thresholds | Auto-accept LLM outputs above a set threshold |
| Parsing dashboard | Visualize logs parsed, templates learned, and anomalies |
| Secure fine-tuning | Handle PII-sensitive logs privately |
| RAG-based querying | Ask questions over logs via an embedded vector DB |
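For the RAG row, a minimal indexing-and-query sketch with ChromaDB's persistent client (the collection name, input file, and metadata fields are assumptions; ChromaDB embeds documents with its built-in default model unless one is supplied):

```python
import json
import chromadb

client = chromadb.PersistentClient(path="./vectorstore")
collection = client.get_or_create_collection("parsed_logs")

# Index parsed JSONL records; the log message is the embedded document.
with open("ParsedLogs/auth.jsonl") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        collection.add(
            ids=[f"auth-{i}"],
            documents=[record.get("message", line.strip())],
            metadatas=[{"file": "auth.jsonl", "line": i}],
        )

# Semantic query over the logs; top hits can be fed to an LLM for RAG answers.
results = collection.query(query_texts=["failed ssh logins from unknown IPs"], n_results=5)
print(results["documents"][0])
```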
```mermaid
graph TD
    A[Raw Logs] --> B[Regex-based Parser]
    B -->|Parsed| C[JSONL Logs]
    B -->|Skipped| D[SkippedLogs/]
    D --> E[LLM Analysis & Labeling]
    D --> F[Drain3 Template Mining]
    E --> G[Auto-Generated Patterns]
    F --> G
    G --> H[Updated Parser Patterns]
    H --> B
    C --> I[RAG / Vector DB]
```
```
log-parser-intelligent/
├── logs/                     # Raw input logs
├── ParsedLogs/               # Parsed JSONL files
├── SkippedLogs/              # Unmatched logs with trace info
├── Anomalies/                # Drain3-flagged anomalies
├── Patterns/
│   ├── live_parser_patterns.json
│   └── learned_templates.json
├── llm_prompts/
│   └── log_schema_extraction.txt
├── vectorstore/              # For RAG embeddings
├── drain3_snapshot.json      # Template cluster snapshot
└── README.md                 # This file
```
- Clone this repo
- Install dependencies:
  ```bash
  pip install drain3 openai chromadb
  ```
- Run the multi-parser:
  ```bash
  python parse_logs.py --input ./logs --output ./ParsedLogs
  ```
- Run LLM-assist:
  ```bash
  python enrich_with_llm.py --input ./SkippedLogs --output ./ParsedLogs
  ```
Want to add new patterns, LLM prompt styles, or vector search capabilities?
Feel free to fork and raise a PR.
- Drain3
- ChromaDB
- Open-source LLMs: Mistral / Gemma / LLaMA3 via Ollama
- Inspired by real-world log intelligence & observability challenges
Feel free to connect for ideas, issues or collaborations:
- Maintainer: @mrsahiljaiswal
- Email: sahiljaiswal757@gmail.com (replace with your real contact)