There are 4 main folders in this directory:
- inputs
  - Contains 2 subdirectories:
    - question_embeddings
    - answer_embeddings
  - Each subdirectory holds the corresponding embeddings in NumPy (.npy) format.
  - The file names in both subdirectories must be consistent for the code to run correctly. Refer to the sample files provided for the naming convention.
- outputs
  - Stores all the generated results as Excel files.
- plots
  - Stores all the generated plots and histograms.
- src
  - Contains the code files. There are 4 main Python scripts, each of which can be executed individually to get the desired results:
    - EmbeddingComparison.py
    - BootstrappedAccuracies.py
    - BootstrappedSimilarities.py
    - IsotropyScores.py

The Python scripts are self-contained: each one automatically reads the input embeddings from the inputs folder and saves its results to the outputs and plots folders.
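Because the scripts depend on matching file names in question_embeddings and answer_embeddings, it can help to check the inputs folder before running anything. The following is a minimal sketch, assuming that "consistent" means identical file names in both subdirectories (the sample files remain the authority on the actual convention). `load_embedding_pairs` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

import numpy as np


def load_embedding_pairs(inputs_dir):
    """Load question/answer embeddings that share a file name.

    Assumes the layout <inputs_dir>/question_embeddings/*.npy and
    <inputs_dir>/answer_embeddings/*.npy, with identical file names
    on both sides (an assumption, not a documented rule).
    """
    q_dir = Path(inputs_dir) / "question_embeddings"
    a_dir = Path(inputs_dir) / "answer_embeddings"
    q_names = {p.name for p in q_dir.glob("*.npy")}
    a_names = {p.name for p in a_dir.glob("*.npy")}

    # Files present on only one side break the question/answer pairing.
    unmatched = sorted(q_names ^ a_names)
    if unmatched:
        raise FileNotFoundError(f"No matching counterpart for: {unmatched}")

    # Map each shared file name to its (questions, answers) arrays.
    return {
        name: (np.load(q_dir / name), np.load(a_dir / name))
        for name in sorted(q_names)
    }
```

Calling `load_embedding_pairs("inputs")` from the repository root would then return one `(questions, answers)` pair per embedding file, or raise if a file has no counterpart.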

EmbeddingComparison.py:
- Calculates the correct, random, top-k, and bottom-k similarities for the embeddings.
- Computes the CDF scores and saves the similarity and CDF plots in the plots folder.
- Calculates P(Correct | Top-K) and P(Random | Top-K) at the 10th and 50th percentiles, with confidence intervals, for k = 5 to 100.
- Stores the results in Excel files in the outputs folder.
- Plots swimlanes, which are saved in the plots folder.
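The similarity quantities listed above can be illustrated with a short sketch. It assumes cosine similarity, assumes that row i of the answer matrix is the correct answer to row i of the question matrix, and summarises the top-k and bottom-k similarities by their mean. None of these choices is confirmed by the script itself, and the function names are hypothetical. The last returned value is one possible reading of P(Correct | Top-K): the share of questions whose correct answer is among the k most similar answers.

```python
import numpy as np


def cosine_similarity_matrix(questions, answers):
    """Cosine similarity between every question row and every answer row."""
    q = questions / np.linalg.norm(questions, axis=1, keepdims=True)
    a = answers / np.linalg.norm(answers, axis=1, keepdims=True)
    return q @ a.T


def similarity_summary(questions, answers, k, seed=0):
    """Per-question correct, random, top-k and bottom-k similarities.

    Assumes row i of `answers` is the correct answer to row i of `questions`.
    """
    sims = cosine_similarity_matrix(questions, answers)
    n = sims.shape[0]
    rng = np.random.default_rng(seed)

    correct_sims = np.diag(sims)
    # Random baseline: each question paired with a uniformly drawn answer
    # (the draw may occasionally pick the correct answer).
    random_sims = sims[np.arange(n), rng.integers(0, n, size=n)]

    ordered = np.sort(sims, axis=1)          # ascending within each row
    top_k = ordered[:, -k:].mean(axis=1)     # mean of the k largest
    bottom_k = ordered[:, :k].mean(axis=1)   # mean of the k smallest

    # Number of answers strictly more similar than the correct one;
    # ties are resolved in favour of the correct answer.
    rank_of_correct = (sims > correct_sims[:, None]).sum(axis=1)
    p_correct_in_top_k = float((rank_of_correct < k).mean())
    return correct_sims, random_sims, top_k, bottom_k, p_correct_in_top_k
```

Running this for each k from 5 to 100 would give one hit rate per k, which is the kind of table the script is described as writing to the outputs folder.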

BootstrappedAccuracies.py:
- Calculates the bootstrapped accuracies and the full-data accuracies (accuracy_df) for k = 5 to 100.
- Stores the results in Excel files in the outputs folder.
- Saves the swimlane plot for the bootstrapped accuracies in the plots folder.
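As a rough illustration of the difference between a full-data accuracy and a bootstrapped accuracy, the sketch below resamples questions with replacement and reports a percentile confidence interval. It assumes a 0/1 hit vector has already been computed for a given k. The function name, the number of resamples, and the percentile interval are assumptions and may not match what the script does.

```python
import numpy as np


def bootstrap_top_k_accuracy(hits, n_resamples=1000, confidence=0.95, seed=0):
    """Bootstrap the mean of a 0/1 hit vector.

    `hits[i]` is 1 if the correct answer for question i was in the top k.
    Returns (full-data accuracy, bootstrapped mean, CI lower, CI upper).
    """
    hits = np.asarray(hits, dtype=float)
    rng = np.random.default_rng(seed)
    n = hits.size

    # Each row of `idx` is one resample of the questions, with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled = hits[idx].mean(axis=1)

    # Percentile interval over the resampled accuracies.
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.quantile(resampled, [alpha, 1.0 - alpha])
    return hits.mean(), resampled.mean(), lower, upper
```

Calling the function once per k (5 to 100) and collecting the tuples in a table would mirror the per-k layout described above.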

BootstrappedSimilarities.py:
- Calculates the similarity threshold and the corresponding accuracy.
- Stores the results in Excel files in the outputs folder, including the optimal threshold, baseline accuracy, accuracy with threshold, and percentile.
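The README does not say how the optimal threshold is chosen, so the sketch below is only one plausible interpretation. It assumes each question has a top-1 similarity and a flag saying whether that top-1 answer was correct, scans percentile cut-offs on the similarity, and keeps the cut-off with the highest accuracy among the retained predictions. The `min_coverage` constraint and the function name are hypothetical.

```python
import numpy as np


def best_similarity_threshold(top_sims, is_correct, min_coverage=0.5):
    """Scan percentile cut-offs on the top-1 similarity.

    Hypothetical criterion: keep predictions with similarity >= threshold and
    pick the cut-off with the highest accuracy among the kept predictions,
    subject to keeping at least `min_coverage` of the questions.
    """
    top_sims = np.asarray(top_sims, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    baseline = float(is_correct.mean())

    # Start from "no threshold": every prediction is kept.
    best = {"percentile": 0, "threshold": float(top_sims.min()),
            "accuracy_with_threshold": baseline}
    for pct in range(1, 100):
        threshold = np.percentile(top_sims, pct)
        kept = top_sims >= threshold
        if kept.mean() < min_coverage:
            break  # higher cut-offs keep even fewer questions
        accuracy = float(is_correct[kept].mean())
        if accuracy > best["accuracy_with_threshold"]:
            best = {"percentile": pct, "threshold": float(threshold),
                    "accuracy_with_threshold": accuracy}
    best["baseline_accuracy"] = baseline
    return best
```

The returned dictionary carries the same four fields the script is described as exporting: optimal threshold, baseline accuracy, accuracy with threshold, and percentile.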

IsotropyScores.py:
- Calculates isotropy scores (IsoScores), first-order isotropy scores, and second-order isotropy scores for all embeddings (both questions and answers).
- Stores the results as Excel files in the outputs folder.

The following libraries are required and should be imported in each of the Python scripts:

    import numpy as np
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from statsmodels.distributions.empirical_distribution import ECDF
    from scipy import stats
    from scipy.stats import sem, t
    from scipy.stats import percentileofscore
    from numpy.random import default_rng
    import re
    import warnings
    from tqdm import tqdm
    import datetime
    from IsoScore import IsoScore
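Before running the scripts, it may be useful to confirm that the third-party libraries are available. The sketch below is a hypothetical helper, not part of the repository. It only checks the top-level import names taken from the list above and does not install anything. Note that the name used to install a package can differ from its import name.

```python
import importlib.util

# Top-level import names of the third-party libraries listed above.
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "seaborn", "statsmodels",
    "scipy", "tqdm", "IsoScore",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```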