Dataset Comparison App

A TypeScript/Bun application that compares two datasets using OpenAI embeddings and a chat model: it matches each entry to its most semantically similar counterpart and summarizes the differences between them.

Features

Semantic Similarity: Uses OpenAI's text-embedding-3-small model to find the most similar entries between datasets
Diff Summaries: Generates human-readable difference summaries using the OpenAI chat model listed under Configuration
Batching & Rate Limiting: Processes entries in batches with a delay between batches to stay within API rate limits
Detailed Reports: Outputs comprehensive JSON reports with similarity scores and diff summaries
TypeScript: Full type safety with proper interfaces and type checking
Bun Runtime: Fast execution with built-in TypeScript support

Requirements

Bun runtime (v1.0.0 or higher)
OpenAI API key

Setup

Install Bun (if not already installed):

```
curl -fsSL https://bun.sh/install | bash
```

Install dependencies:
```
bun install
```
Set up OpenAI API key: Create a .env file in the project root:
```
OPENAI_API_KEY=your_openai_api_key_here
```
Ensure datasets are present: Make sure datasetA.json and datasetB.json are in the project root directory.

Usage

Run the comparison:

```
bun start
```

The application will:

Load both datasets
Generate embeddings for all entries
Find the best semantic matches
Generate a diff summary for each matched pair using the chat model
Save results to comparison_report.json

Dataset Format

Each dataset should be a JSON array of objects with the following structure:

```
[
  {
    "id": 12345,
    "name": "John Doe",
    "title": "Software Engineer",
    "summary": "Experienced developer with...",
    "skills": ["JavaScript", "Python", "React"]
  }
]
```
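
The structure above can be checked at load time. The following is a minimal sketch, not necessarily how the application validates its input: the `DatasetEntry` fields are taken from the example entry, while `isDatasetEntry` and `parseDataset` are hypothetical helper names introduced for illustration.

```typescript
// Field names and types follow the example entry shown above.
interface DatasetEntry {
  id: number;
  name: string;
  title: string;
  summary: string;
  skills: string[];
}

// Hypothetical runtime guard: true only if every expected field has the expected type.
function isDatasetEntry(value: unknown): value is DatasetEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const skills = v["skills"];
  return (
    typeof v["id"] === "number" &&
    typeof v["name"] === "string" &&
    typeof v["title"] === "string" &&
    typeof v["summary"] === "string" &&
    Array.isArray(skills) &&
    skills.every((s: unknown) => typeof s === "string")
  );
}

// Hypothetical loader step: parses JSON text and rejects anything
// that is not an array of well-formed entries.
function parseDataset(json: string): DatasetEntry[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("Dataset must be a JSON array");
  }
  const entries: DatasetEntry[] = [];
  parsed.forEach((item: unknown, index: number) => {
    if (!isDatasetEntry(item)) {
      throw new Error(`Entry at index ${index} does not match the expected structure`);
    }
    entries.push(item);
  });
  return entries;
}
```

Failing early with the index of the offending entry keeps malformed records from reaching the embedding step, where they would cost an API call before surfacing.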

Output Format

The generated comparison_report.json contains:

```
{
  "comparisonDate": "2024-01-15T10:30:00.000Z",
  "totalComparisons": 100,
  "results": [
    {
      "entryA": {
        /* Original entry from datasetA */
      },
      "matchedEntryB": {
        /* Best match from datasetB */
      },
      "similarityScore": 0.8745,
      "diffSummary": "Skills updated, title changed from 'Dev' to 'Sr. Dev'"
    }
  ]
}
```

Type Definitions

The application uses TypeScript interfaces for type safety:

DatasetEntry: Core dataset entry structure
DatasetEntryWithEmbedding: Entry with embedding vector
ComparisonResult: Result of comparing two entries
ComparisonReport: Complete comparison report structure
MatchResult: Result of finding best match
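
As a rough sketch of how these interfaces might fit together: the field names of `DatasetEntry`, `ComparisonResult`, and `ComparisonReport` are inferred from the Dataset Format and Output Format sections. The `embedding` field, the shape of `MatchResult`, the helpers `toComparisonResult` and `buildReport`, and the rounding to four decimals are assumptions made for illustration, not the project's actual definitions.

```typescript
interface DatasetEntry {
  id: number;
  name: string;
  title: string;
  summary: string;
  skills: string[];
}

// Assumption: the embedding is stored alongside the entry as a plain number array.
interface DatasetEntryWithEmbedding extends DatasetEntry {
  embedding: number[];
}

// Assumption: a match carries the chosen entry and its similarity score.
interface MatchResult {
  entry: DatasetEntry;
  score: number;
}

interface ComparisonResult {
  entryA: DatasetEntry;
  matchedEntryB: DatasetEntry;
  similarityScore: number;
  diffSummary: string;
}

interface ComparisonReport {
  comparisonDate: string; // ISO 8601 timestamp
  totalComparisons: number;
  results: ComparisonResult[];
}

// Hypothetical helper: combines an entry, its best match, and the generated summary.
function toComparisonResult(
  entryA: DatasetEntry,
  match: MatchResult,
  diffSummary: string,
): ComparisonResult {
  return {
    entryA,
    matchedEntryB: match.entry,
    // Rounded to four decimals to mirror the sample output (an assumption).
    similarityScore: Number(match.score.toFixed(4)),
    diffSummary,
  };
}

// Hypothetical helper: wraps the results in the report envelope.
function buildReport(results: ComparisonResult[], now: Date = new Date()): ComparisonReport {
  return {
    comparisonDate: now.toISOString(),
    totalComparisons: results.length,
    results,
  };
}
```

Passing the date in as a parameter keeps `buildReport` deterministic, which makes it straightforward to test.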

Configuration

BATCH_SIZE: Number of entries processed in parallel (default: 10)
RATE_LIMIT_DELAY: Delay between batches in milliseconds (default: 1000)
Embedding Model: text-embedding-3-small (OpenAI)
Chat Model: gpt-4.1-mini (OpenAI)
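
To illustrate how `BATCH_SIZE` and `RATE_LIMIT_DELAY` could interact, here is a minimal sketch of batched processing. The constants use the defaults listed above; `processInBatches`, `sleep`, and the `worker` callback are hypothetical names, and the real implementation may be structured differently.

```typescript
const BATCH_SIZE = 10; // entries processed in parallel
const RATE_LIMIT_DELAY = 1000; // milliseconds between batches

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `worker` over `items` one batch at a time, preserving input order.
async function processInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  batchSize: number = BATCH_SIZE,
  delayMs: number = RATE_LIMIT_DELAY,
): Promise<R[]> {
  const results: R[] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Entries within a single batch run in parallel.
    const batchResults = await Promise.all(batch.map((item) => worker(item)));
    results.push(...batchResults);
    // Pause between batches, but not after the final one.
    if (start + batchSize < items.length) {
      await sleep(delayMs);
    }
  }
  return results;
}
```

With the defaults, 100 entries would be sent as 10 batches with 9 one-second pauses, so the delay alone adds roughly 9 seconds to a run.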

Error Handling

The application includes comprehensive error handling for:

Missing API keys
Dataset loading failures
API rate limit issues
Embedding generation errors
File system operations
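
Two of these cases can be sketched briefly. The code below is an illustrative assumption rather than the application's actual logic: `requireApiKey` and `withRetry` are hypothetical helpers, and retrying with exponential backoff is one common way to handle rate-limit errors, not a technique the project is confirmed to use.

```typescript
// Fails fast with a clear message when the key is missing or blank.
function requireApiKey(env: Record<string, string | undefined>): string {
  const key = env["OPENAI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    throw new Error("OPENAI_API_KEY is not set. Add it to the .env file in the project root.");
  }
  return key;
}

// Retries a failing async operation, doubling the wait after each attempt.
// Note: this simple version retries on any error, not only rate-limit errors.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 1000,
): Promise<T> {
  let lastError: unknown = new Error("withRetry: maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const waitMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  throw lastError;
}
```

In a Bun project, `requireApiKey(process.env)` would typically be called once at startup, before any dataset is loaded, so a missing key is reported before any work is done.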

Performance Notes

Uses cosine similarity for efficient vector comparison
Implements batching to respect OpenAI's rate limits
Progress tracking for long-running operations
Memory-efficient processing of large datasets
Bun's fast runtime for improved performance
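
For reference, cosine similarity between two embedding vectors and a linear scan for the best match can be sketched as follows. This is a minimal illustration; the function names and the return shape of `findBestMatch` are assumptions, not the project's actual code.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scans all candidates and returns the index and score of the most similar one.
// Returns index -1 when there are no candidates.
function findBestMatch(
  query: number[],
  candidates: number[][],
): { index: number; score: number } {
  let bestIndex = -1;
  let bestScore = -Infinity;
  for (let i = 0; i < candidates.length; i++) {
    const candidate = candidates[i];
    if (candidate === undefined) continue;
    const score = cosineSimilarity(query, candidate);
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  }
  return { index: bestIndex, score: bestScore };
}
```

Matching every entry in one dataset against every entry in the other this way takes a number of comparisons proportional to the product of the two dataset sizes, which is generally acceptable for datasets of a few thousand entries.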

Development

For development with TypeScript type checking:

```
bun run dev
```

The project includes a tsconfig.json optimized for Bun with strict type checking enabled.
