zouharvi/pwesuite

File location and inconsistency in data attribute order

JohnPFL opened this issue · 7 comments

Hello,

I trust this message finds you well. I've encountered a couple of challenges while working with the provided code, and I wanted to bring them to your attention. Here are the main issues:

  1. File Location: I'm having difficulty locating the script main/prepare_data.sh, which is needed to generate the dataset, within the repository. Could you provide guidance on where to find this file?

  2. Data Loading and Task Specification: In an effort to contribute, I took the initiative to implement data loading from Hugging Face (for the evaluation phase). However, I observed a discrepancy in the task specification in eval_all.py:

    [
        (*x, y) for x, y in zip(data_multi_all, data_embd)
        if x[3] == "human_similarity"
    ]

    Here, the purpose information is specified in the 4th position. Upon further analysis of the evaluate_human_similarity function:

    def evaluate_human_similarity(data_multi_hs):
        tok_to_embd = {}
        for (token_ort, token_ipa, lang, pronunciation, purpose, embd) in data_multi_hs:
            tok_to_embd[token_ort] = embd

    I noticed that the purpose information is expected in the 5th position (index 4). This caused a bug for me, and I had to change x[3] to x[4] to resolve it. I'd appreciate your insights on this matter, as I want to ensure I'm not overlooking any crucial details.

    Another problem I noticed is that the order these functions expect differs from the order given by the Hugging Face dataset:
    Hugging Face order:
    ['token_ort', 'token_ipa', 'token_arp', 'lang', 'purpose']
    evaluation order:
    (token_ort, token_ipa, lang, pronunciation, purpose, embd)
    To address this, I've adjusted my preprocessing steps to align with the expected order. However, this solution is inconvenient and introduces inconsistency in the codebase.
    I wanted to bring this to your attention and seek your opinion on a more sustainable resolution.
    Thank you for your time and assistance.
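
    For what it's worth, the reordering I ended up with looks roughly like this (a minimal sketch; the helper name and the mapping of the ARPAbet field onto "pronunciation" are my own assumptions, not taken from the repo):

    ```python
    # Hypothetical helper (name and mapping are mine, not from the repo):
    # reorder a Hugging Face row into the tuple shape the evaluation expects.
    #   HF order:   (token_ort, token_ipa, token_arp, lang, purpose)
    #   eval order: (token_ort, token_ipa, lang, pronunciation, purpose, embd)
    def hf_row_to_eval_order(row, embd):
        token_ort, token_ipa, token_arp, lang, purpose = row
        # Assumption: "pronunciation" in the eval tuple corresponds to the
        # ARPAbet field; adjust if the codebase means something else.
        return (token_ort, token_ipa, lang, token_arp, purpose, embd)

    row = ("cat", "kæt", "K AE1 T", "en", "human_similarity")
    print(hf_row_to_eval_order(row, [0.1, 0.2]))
    ```

    With this ordering, the purpose field lands at index 4, which matches the x[4] change I described above.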

Best regards

  1. Another thing that is not entirely clear to me is the following. Should the embeddings be provided at the word level or at the character/phoneme level? If I have a separate embedding for each individual phoneme, should I take an average (a mean-pooling layer) to obtain a single vector for each word, or is it more correct to leave everything as it is?
    I am using PanPhon features as embeddings to have a simple baseline for testing your suite, but even with other, more complex methods, I find this question interesting.
    Best regards.
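
    To make the phoneme-vs-word question concrete, the mean-pooling option I had in mind can be sketched as follows (plain Python, with made-up feature values rather than real PanPhon output):

    ```python
    def mean_pool(phoneme_vectors):
        """Average per-phoneme feature vectors into one word-level vector."""
        n = len(phoneme_vectors)
        return [sum(dim) / n for dim in zip(*phoneme_vectors)]

    # Toy 3-phoneme word with 4-dimensional feature vectors (illustrative
    # values, not real PanPhon features).
    vectors = [
        [1, 0, -1, 1],
        [1, 0,  1, 1],
        [1, 0,  0, 1],
    ]
    print(mean_pool(vectors))  # -> [1.0, 0.0, 0.0, 1.0]
    ```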
  1. Similar to the first issue, I cannot find "data/vocab/ipa_multi.txt" anywhere in this repository. Am I missing something?
    Best regards.
    This is the part of the code with which I am having issues:
import os
import pickle

def get_analogies(data, lang):
    os.makedirs("data/cache/", exist_ok=True)

    # Return the cached result if we have already computed it for this language
    CACHE_PATH = f"data/cache/analogies_{lang}.pkl"
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)

    with open("data/vocab/ipa_multi.txt") as f:
        vocab_ipa_multi = f.read().split()

Hi! Thank you again for your interest. Since you raised several problems, it would be best to open separate issues on GitHub. I'll try my best to answer them in sequence.

  1. The file main/prepare_data.sh was refactored into create_dataset/all.sh; I've since fixed the README. Ideally, though, you should use the version that's on Hugging Face.
  1. I just pushed the vocab files into the repository. However, they could also be reconstructed by just taking all the unique characters from the public dataset.
  1. Primarily we are interested in word-level phonetic embeddings. However, in the paper we also evaluate a phoneme-level baseline whose word embedding is essentially the average of its phoneme embeddings.
  1. I agree that it's confusing that the order changes. I added a new script that downloads and formats the Hugging Face version so that it is compatible with the codebase; there is no need to recreate it yourself.
    The following should replicate data/multi.tsv for the evaluation. Let me know if that resolves your issue.

    python3 create_dataset/download_huggingface.py
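
    In case it helps others, reconstructing the vocab from the dataset's IPA strings, as suggested in point 2, could look roughly like this (a sketch only; the example strings are stand-ins for real dataset rows):

    ```python
    # Sketch: rebuild an IPA character vocabulary from the dataset's IPA
    # column. The strings below are stand-ins for real dataset rows.
    def build_ipa_vocab(ipa_strings):
        return sorted(set("".join(ipa_strings)))

    ipa_strings = ["kæt", "dɔɡ", "bɜːd"]
    vocab = build_ipa_vocab(ipa_strings)
    print(vocab)

    # The codebase reads the vocab with f.read().split(), so writing the
    # characters whitespace-separated keeps the file format compatible:
    # open("data/vocab/ipa_multi.txt", "w").write("\n".join(vocab))
    ```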
    

I'm closing this big issue but let's follow-up by opening new individual ones if there are still some persisting problems! 🙂