Entity Disambiguation with POTION Multilingual Model

This project implements and evaluates an entity disambiguation system using the POTION multilingual-128M static embedding model. The goal is to accurately match person name queries to a list of entities, handling various real-world challenges like typos, case variations, partial names, and middle names.

Overview
Implementation Approaches
Key Features
Performance Results
Installation
Usage
Technical Details

Overview

Entity disambiguation is the task of determining which entity a query refers to from a list of candidates. For example:

Query: "John" -> Should match all people named John
Query: "John Smith" -> Should match all John Smiths (including "John Michael Smith")
Query: "Jhon Smith" -> Should handle the typo and match John Smith

Implementation Approaches

1. Baseline: Original POTION Model

Uses semantic embeddings directly
Good for semantic queries ("Software Engineer" -> finds engineers)
Poor for exact string matching and case sensitivity
No typo handling

2. Hybrid Approach (First Attempt)

Added exact string matching and fuzzy matching
Improved case handling
Failed on partial name queries (e.g., "John" returned only 1 result instead of all Johns)

3. Improved Approach (Final Solution)

Query type detection (partial name vs full name vs semantic)
First/last name extraction and matching
Middle name and initial support
Dynamic thresholds based on query type
Maintains semantic search capabilities

Key Features

1. Partial Name Matching

# Query: "John" returns all Johns
John Smith - Software Engineer (0.950)
John Michael Smith - Professor (0.950)
John Doe - Data Scientist (0.950)
John Williams - Composer (0.950)

2. Middle Name Handling

# Query: "John Smith" matches all John Smiths
John Smith - Professor (1.000)
John Michael Smith - Engineer (0.950)  # Matches despite middle name
John M. Smith - Scientist (0.950)      # Matches with initial

3. Initial Support

# Query: "J" returns all J names
# Query: "John M. Smith" matches "John Michael Smith" (0.960)
# Query: "M" finds middle names/initials with M

4. Typo Tolerance

# Query: "Jhon Smith" -> John Smith (0.892)
# Query: "Micheal Johnson" -> Michael Johnson (0.943)

5. Case Insensitivity

# All return the same results:
"John Smith" / "john smith" / "JOHN SMITH" -> 0.950

Performance Results

Overall Metrics Comparison (With Middle Name Test Cases)

Metric	Original POTION	Improved Approach	Improvement
Overall Precision	0.507	0.726	+43.3%
Overall Recall	0.704	0.981	+39.5%
Overall F1 Score	0.589	0.835	+41.7%
Macro Precision	0.642	0.783	+22.0%
Macro Recall	0.829	0.942	+13.6%
Macro F1 Score	0.631	0.833	+32.0%
Perfect F1 Queries	4/20	11/20	+7 queries (175% increase)

Speed Performance

Model Load Time: ~3 seconds
Average Search Time: 0.15-0.26 ms per query
Throughput: ~6,700 searches per second

Key Improvements by Query Type

Query Type	Example	Original F1	Improved F1	Change
First name only	"John"	0.308	1.000	+224%
Last name only	"Smith"	0.600	1.000	+67%
Full name without middle	"John Smith"	0.600	1.000	+67%
Full name with middle	"John Michael Smith"	0.286	0.667	+133%
Initial query	"J"	0.556	0.870	+56%
Middle name query	"Michael"	0.800	0.800	No change
Name with typo	"Jhon Smith"	0.667	0.462	-31%*
Lowercase	"john smith"	0.400	1.000	+150%
Semantic	"CEO startup"	1.000	1.000	No change

*Note: Typo handling shows lower F1 due to stricter threshold preventing false positives

Middle Name Handling Performance

Query	Expected Behavior	Original	Improved
"John Smith"	Match all John Smiths (with/without middle)	7 results, F1=0.600	3 results, F1=1.000
"Jane Smith"	Match all Jane Smiths	5 results, F1=0.571	2 results, F1=1.000
"John M. Smith"	Match exact and "John Michael Smith"	6 results, F1=0.286	2 results, F1=0.667
"J"	Match all names starting with J	7 results, F1=0.556	10 results, F1=0.870

Installation

# Clone the repository
git clone <repository-url>
cd embeddings

# Install dependencies using uv
uv pip install -r pyproject.toml

Dependencies:

model2vec>=0.6.0
scikit-learn>=1.3.0
tabulate>=0.9.0
colorama>=0.4.6

Usage

Basic Usage

from entity_disambiguation_improved import ImprovedEntityDisambiguator

# Initialize the disambiguator
disambiguator = ImprovedEntityDisambiguator()

# Define your entities
entities = [
    {"id": "1", "descriptor": "John Smith - Software Engineer at Google"},
    {"id": "2", "descriptor": "John Michael Smith - Professor at MIT"},
    {"id": "3", "descriptor": "Jane Smith - Product Manager at Apple"},
]

# Create embeddings
entity_embeddings = disambiguator.create_entity_embeddings(entities)

# Search for entities
query = "John Smith"
results, search_time, match_type = disambiguator.search(
    query, entities, entity_embeddings, threshold=0.5
)

# Display results
for result in results:
    print(f"{result['descriptor']} (score: {result['similarity']:.3f})")

Running Tests

# Test basic functionality
uv run python entity_disambiguation_improved.py

# Test middle name handling
uv run python test_improved_middle_names.py

# Run full evaluation
uv run python evaluate_improved.py

# Generate comparison tables
uv run python final_comparison_table.py

Technical Details

Scoring System

The improved approach uses a hierarchical scoring system:

1.000 - Exact full name match
0.980 - Full name with matching middle names
0.960 - Name with middle initial matching
0.950 - First+Last match (ignoring middle) or exact first/last name
0.900 - Exact middle name match
0.850 - First/last initial match
0.800 - Middle initial match
< 0.800 - Semantic similarity (with penalties for partial names)

Query Type Detection

The system automatically detects query types:

Single word -> Partial name search (first/last name or initial)
Two capitalized words -> Full name search
Multiple words with non-names -> Semantic search

Thresholds

Partial names (single word): threshold = 0.5
Full names and semantic: threshold = 0.4

These thresholds help reduce false positives while maintaining high recall.

Architecture

graph TD
    A[ImprovedEntityDisambiguator] --> B[Query Type Detection]
    A --> C[Name Part Extraction]
    A --> D[Matching Strategies]
    A --> E[Score Combination]
    
    C --> C1[First name]
    C --> C2[Middle names]
    C --> C3[Last name]
    C --> C4[Initials]
    
    D --> D1[Exact matching]
    D --> D2[Initial matching]
    D --> D3[Fuzzy matching for typos]
    D --> D4[Semantic matching]

Future Improvements

Nickname handling: "Bob" -> "Robert"
Phonetic matching: Handle similar-sounding names
Cultural name variations: Handle different name orders
Compound last names: Better handling of hyphenated names
Title extraction: Separate titles (Dr., Prof.) from names

Final Results Summary

Comprehensive Evaluation Results (20 test cases including middle names)

The improved approach shows significant performance gains over the baseline POTION model:

Metric	Baseline	Improved	Gain
Overall F1	0.589	0.835	+41.7%
Macro F1	0.631	0.833	+32.0%
Perfect Queries	4/20 (20%)	11/20 (55%)	+175%

Key Achievements

Partial Name Matching: "John" query improved from F1=0.308 to F1=1.000 (+224%)
Middle Name Support: "John Smith" now correctly matches all John Smiths (with/without middle names)
Initial Handling: Single letter queries like "J" work as initial searches
Case Insensitivity: Perfect handling of case variations
Speed: Maintains fast search times (~0.2ms per query)

Where the Improved Approach Excels

✅ Perfect F1 (1.0) achieved for:

All partial name queries (first/last names)
Full names without middle names
Case variations
Most semantic queries

✅ Major improvements for:

Names with middle names/initials
Single letter (initial) queries
Complex entity lists with name variations

Model Comparison: POTION vs MiniLM

Overall Performance Comparison

Model	Approach	F1 Score	Precision	Recall	Perfect Queries	Load Time	Search Speed
POTION	Baseline	0.649	0.558	0.774	4/12 (33%)	3.68s	0.16ms
POTION	Improved	0.870	0.789	0.968	8/12 (67%)	2.25s	0.01ms
MiniLM	Baseline	0.678	0.714	0.645	5/12 (42%)	3.96s	6.64ms
MiniLM	Improved	0.923	0.882	0.968	8/12 (67%)	3.20s	0.01ms

Key Findings

The improved approach benefits both models equally - Both achieve ~35% F1 improvement
POTION is surprisingly competitive - Only 5.3% lower F1 than MiniLM while being a static model
Speed advantage of POTION - 664x faster baseline search than MiniLM
MiniLM excels at semantic queries - Better for complex semantic searches
Both models achieve identical recall (0.968) with the improved approach

Performance by Query Type

Query Type	POTION Baseline	POTION Improved	MiniLM Baseline	MiniLM Improved
Partial names (John)	Poor	Perfect	Moderate	Perfect
Full names	Moderate	Perfect	Good	Perfect
Middle names	Poor	Good	Poor	Good
Typos	Poor	Good	Moderate	Good
Case variations	Poor	Perfect	Good	Perfect
Semantic search	Good	Good	Excellent	Excellent

Recommendation

Use POTION Improved for production systems where speed and resource usage matter
Use MiniLM Improved when maximum accuracy is critical and resources are available
The improved approach is essential regardless of the embedding model chosen

Conclusion

The improved entity disambiguation system achieves significant improvements over baseline semantic-only approaches:

POTION: 41.7% improvement in F1 score (0.589 → 0.835)
MiniLM: 36.2% improvement in F1 score (0.678 → 0.923)

With 55-67% of queries achieving perfect F1 scores (vs 20-42% for baselines), the improved approach successfully handles:

Partial name queries with perfect recall
Middle names and initials
Typos and case variations
Semantic queries

The combination of exact matching, fuzzy matching, and semantic embeddings provides robust entity disambiguation while maintaining sub-millisecond search times. POTION offers an excellent balance of performance and efficiency, while MiniLM provides the highest accuracy when resources permit.

reactivetype/potion-matching