A TypeScript/Bun application that compares two datasets using OpenAI embedding and chat models to identify semantically similar entries and summarize their differences.
- Semantic Similarity: Uses OpenAI's text-embedding-3-small model to find the most similar entries between datasets
- Diff Summaries: Generates human-readable difference summaries using an OpenAI chat model (gpt-4.1-mini, per the configuration)
- Batching & Rate Limiting: Processes entries in fixed-size batches with a delay between batches to stay within API rate limits
- Detailed Reports: Outputs comprehensive JSON reports with similarity scores and diff summaries
- TypeScript: Full type safety with proper interfaces and type checking
- Bun Runtime: Fast execution with built-in TypeScript support
- Bun runtime (v1.0.0 or higher)
- OpenAI API key
- Install Bun (if not already installed):
  curl -fsSL https://bun.sh/install | bash
- Install dependencies:
  bun install
- Set up OpenAI API key: Create a .env file in the project root:
  OPENAI_API_KEY=your_openai_api_key_here
- Ensure datasets are present: Make sure datasetA.json and datasetB.json are in the project root directory.
Run the comparison:
bun start
The application will:
- Load both datasets
- Generate embeddings for all entries
- Find the best semantic matches
- Generate diff summaries using LLM
- Save results to comparison_report.json
Each dataset should be a JSON array of objects with the following structure:
[
{
"id": 12345,
"name": "John Doe",
"title": "Software Engineer",
"summary": "Experienced developer with...",
"skills": ["JavaScript", "Python", "React"]
}
]
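The structure above can be checked at load time before any API calls are made. The following is a minimal sketch, not the project's actual loader: `isDatasetEntry` and `parseDataset` are hypothetical names, and the field types are inferred from the example entry.

```typescript
// Shape inferred from the example entry above.
interface DatasetEntry {
  id: number;
  name: string;
  title: string;
  summary: string;
  skills: string[];
}

// Hypothetical type guard: true only if every field is present with the expected type.
function isDatasetEntry(value: unknown): value is DatasetEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const skills = v["skills"];
  return (
    typeof v["id"] === "number" &&
    typeof v["name"] === "string" &&
    typeof v["title"] === "string" &&
    typeof v["summary"] === "string" &&
    Array.isArray(skills) &&
    skills.every((s: unknown) => typeof s === "string")
  );
}

// Hypothetical helper: parse JSON text and reject anything that is not an array of entries.
function parseDataset(jsonText: string): DatasetEntry[] {
  const parsed: unknown = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) {
    throw new Error("Dataset must be a JSON array");
  }
  parsed.forEach((item: unknown, index: number) => {
    if (!isDatasetEntry(item)) {
      throw new Error(`Entry at index ${index} does not match the expected structure`);
    }
  });
  return parsed as DatasetEntry[];
}
```

Failing early on a malformed entry avoids spending API calls on a dataset that cannot be compared.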
The generated comparison_report.json contains:
{
"comparisonDate": "2024-01-15T10:30:00.000Z",
"totalComparisons": 100,
"results": [
{
"entryA": {
/* Original entry from datasetA */
},
"matchedEntryB": {
/* Best match from datasetB */
},
"similarityScore": 0.8745,
"diffSummary": "Skills updated, title changed from 'Dev' to 'Sr. Dev'"
}
]
}
The application uses TypeScript interfaces for type safety:
- DatasetEntry: Core dataset entry structure
- DatasetEntryWithEmbedding: Entry with embedding vector
- ComparisonResult: Result of comparing two entries
- ComparisonReport: Complete comparison report structure
- MatchResult: Result of finding best match
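As a rough illustration, these interfaces might look like the sketch below. The field names of `DatasetEntry`, `ComparisonResult`, and `ComparisonReport` follow the JSON examples in this README; the `embedding` field, the fields of `MatchResult`, and the `buildReport` helper are assumptions made for the example and may differ from the actual source.

```typescript
interface DatasetEntry {
  id: number;
  name: string;
  title: string;
  summary: string;
  skills: string[];
}

// Assumption: the embedding is kept next to the entry as a plain number array.
interface DatasetEntryWithEmbedding extends DatasetEntry {
  embedding: number[];
}

// Assumption: hypothetical field names; the README only describes this
// as the result of finding the best match.
interface MatchResult {
  match: DatasetEntryWithEmbedding;
  score: number;
}

// Field names follow the comparison_report.json example.
interface ComparisonResult {
  entryA: DatasetEntry;
  matchedEntryB: DatasetEntry;
  similarityScore: number;
  diffSummary: string;
}

interface ComparisonReport {
  comparisonDate: string; // ISO 8601 timestamp
  totalComparisons: number;
  results: ComparisonResult[];
}

// Hypothetical helper showing how the report is assembled from individual results.
function buildReport(results: ComparisonResult[], now: Date = new Date()): ComparisonReport {
  return {
    comparisonDate: now.toISOString(),
    totalComparisons: results.length,
    results,
  };
}
```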
- BATCH_SIZE: Number of entries processed in parallel (default: 10)
- RATE_LIMIT_DELAY: Delay between batches in milliseconds (default: 1000)
- Embedding Model: text-embedding-3-small (OpenAI)
- Chat Model: gpt-4.1-mini (OpenAI)
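To show how BATCH_SIZE and RATE_LIMIT_DELAY could interact, here is a minimal sketch of a batch runner. `processInBatches` and `sleep` are hypothetical names, and the `worker` callback stands in for whatever API call is made per entry; the real implementation may be organized differently.

```typescript
// Defaults mirror the configuration values listed above.
const BATCH_SIZE = 10;
const RATE_LIMIT_DELAY = 1000; // milliseconds

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: items within a batch run in parallel,
// batches run one after another with a pause in between.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize: number = BATCH_SIZE,
  delayMs: number = RATE_LIMIT_DELAY,
): Promise<R[]> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
    // Skip the pause after the final batch.
    const hasMore = start + batchSize < items.length;
    if (hasMore && delayMs > 0) await sleep(delayMs);
  }
  return results;
}
```

With the defaults, 100 entries would be sent as 10 batches with 9 one-second pauses, so the delay alone adds about 9 seconds to a run.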
The application includes comprehensive error handling for:
- Missing API keys
- Dataset loading failures
- API rate limit issues
- Embedding generation errors
- File system operations
- Uses cosine similarity for efficient vector comparison
- Implements batching to respect OpenAI's rate limits
- Progress tracking for long-running operations
- Memory-efficient processing of large datasets
- Bun's fast runtime for improved performance
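The cosine-similarity matching mentioned above can be sketched as follows. This is an illustrative version under stated assumptions: `cosineSimilarity` and `findBestMatch` are hypothetical names, embeddings are assumed to be plain number arrays, and matching is a simple linear scan over all candidates.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: scan every candidate and keep the highest-scoring one.
// Returns null when there are no candidates.
function findBestMatch<T extends { embedding: number[] }>(
  query: number[],
  candidates: T[],
): { match: T; score: number } | null {
  let best: { match: T; score: number } | null = null;
  for (const candidate of candidates) {
    const score = cosineSimilarity(query, candidate.embedding);
    if (best === null || score > best.score) {
      best = { match: candidate, score };
    }
  }
  return best;
}
```

Each lookup costs one pass over datasetB, so comparing every entry of datasetA is proportional to the product of the two dataset sizes.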
For development with TypeScript type checking:
bun run dev
The project includes a tsconfig.json optimized for Bun with strict type checking enabled.