File location and inconsistency in data attribute order
JohnPFL opened this issue · 7 comments
Hello,
I trust this message finds you well. I've encountered a couple of challenges while working with the provided code, and I wanted to bring them to your attention. Here are the main issues:
- File location: I'm having difficulty locating `main/prepare_data.sh`, the script needed to generate the dataset, anywhere in the repository. Could you provide guidance on where to find it?
- Data loading and task specification: In an effort to contribute, I implemented data loading from Hugging Face (for the evaluation phase). However, I observed a discrepancy in the task specification within `eval_all.py`:

```python
[(*x, y) for x, y in zip(data_multi_all, data_embd) if x[3] == "human_similarity"]
```

Here, the purpose information is read from the 4th position. However, the `evaluate_human_similarity` function expects it in the 5th position:

```python
def evaluate_human_similarity(data_multi_hs):
    tok_to_embd = {}
    for (token_ort, token_ipa, lang, pronunciation, purpose, embd) in data_multi_hs:
        tok_to_embd[token_ort] = embd
```

This caused a bug for me, and I had to change `x[3]` to `x[4]` to resolve it. I'd appreciate your insights on this, as I want to make sure I'm not overlooking any crucial details.

Another problem I noticed is that the order expected by these functions differs from the order given by the Hugging Face dataset:
Hugging Face order: `['token_ort', 'token_ipa', 'token_arp', 'lang', 'purpose']`
Evaluation order: `(token_ort, token_ipa, lang, pronunciation, purpose, embd)`
To address this, I've adjusted my preprocessing steps to match the expected order. However, this workaround is inconvenient and introduces inconsistency into the codebase.
I wanted to bring this to your attention and seek your opinion on a more sustainable resolution.
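For reference, the preprocessing workaround I describe above can be sketched roughly as follows. Note that mapping the Hugging Face column `token_arp` to the `pronunciation` slot is my own assumption (the names come from the two orderings quoted above, not from any confirmed documentation), and `embd` is whatever embedding you compute separately:

```python
def to_eval_order(record, embd):
    """Map a Hugging Face row dict into the tuple order the evaluation
    functions expect: (token_ort, token_ipa, lang, pronunciation, purpose, embd).

    ASSUMPTION: 'token_arp' is treated as the 'pronunciation' field.
    """
    return (
        record["token_ort"],
        record["token_ipa"],
        record["lang"],
        record["token_arp"],  # assumed to play the role of "pronunciation"
        record["purpose"],
        embd,
    )

# Illustrative row (made-up values, real schema names from the HF dataset)
row = {"token_ort": "cat", "token_ipa": "kæt", "token_arp": "K AE T",
       "lang": "en", "purpose": "human_similarity"}
print(to_eval_order(row, [0.1, 0.2]))
```

With this mapping, `x[4]` is the purpose field, consistent with what `evaluate_human_similarity` unpacks.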
Thank you for your time and assistance.
Best regards
- Another thing that is not entirely clear to me: should the embeddings be provided at the word level or at the character/phoneme level? If I have a separate embedding for each individual phoneme, should I average them (i.e., apply mean pooling) to obtain a single embedding per word, or is it more correct to leave them as they are?
I am using PanPhon embeddings as a simple baseline to test your suite, but the question applies to other, more complex methods as well, and I find it interesting.
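To make the mean-pooling option concrete, here is a minimal sketch. The feature vectors are illustrative stand-ins for per-phoneme vectors such as those PanPhon produces; this is my own example, not code from the repository:

```python
import numpy as np

def pool_word_embedding(phoneme_vectors):
    """Average a list of per-phoneme feature vectors into one word-level vector."""
    return np.mean(np.asarray(phoneme_vectors, dtype=float), axis=0)

# e.g. a word with three phonemes, each with a 4-dimensional feature vector
vecs = [[1, 0, 0, 1],
        [0, 1, 0, 1],
        [0, 0, 1, 1]]
word_vec = pool_word_embedding(vecs)
print(word_vec.shape)  # one vector of the same dimensionality, here (4,)
```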
Best regards.
- Similar to the first issue, I cannot find `data/vocab/ipa_multi.txt` anywhere in this repository. Am I getting something wrong?
Best regards.
This is the part of the code with which I am having issues:
```python
import os
import pickle

def get_analogies(data, lang):
    # Reuse the cached analogies for this language if they exist
    os.makedirs("data/cache/", exist_ok=True)
    CACHE_PATH = f"data/cache/analogies_{lang}.pkl"
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    # Load the multilingual IPA vocabulary (whitespace-separated tokens)
    with open("data/vocab/ipa_multi.txt") as f:
        vocab_ipa_multi = f.read().split()
```
Hi! Thank you again for your interest. Since you raised several problems, it would be best to open a separate GitHub issue for each one. I'll try my best to answer them in sequence.
- The file `main/prepare_data.sh` was refactored into `create_dataset/all.sh`. I have fixed the README since then. Ideally, though, you should use the version that's on Hugging Face.
- I just pushed the vocab files to the repository. However, they can also be reconstructed by taking all the unique characters from the public dataset.
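The reconstruction suggested above can be sketched as follows. This is my own illustration, not the repository's script, and it assumes the IPA strings come from the `token_ipa` column quoted earlier in this thread:

```python
def build_ipa_vocab(ipa_tokens):
    """Collect the sorted set of unique characters across all IPA tokens."""
    return sorted({ch for tok in ipa_tokens for ch in tok})

# Illustrative input; in practice, pass the dataset's IPA column
vocab = build_ipa_vocab(["kæt", "dɔɡ"])
print(vocab)
```

The resulting list can then be written one character per token to recreate a file like `data/vocab/ipa_multi.txt`.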
- Primarily we are interested in word-level phonetic embeddings. However, in the paper we also evaluate a phoneme-level phonetic embedding baseline, which is essentially an average of the individual phoneme embeddings.
- I agree that the change in order is confusing. I added a new script that downloads and formats the Hugging Face version so that it is compatible with the codebase; there is no need to recreate it yourself. The following should replicate `data/multi.tsv` for the evaluation. Let me know if that resolves your issue.

```shell
python3 create_dataset/download_huggingface.py
```
I'm closing this big issue but let's follow-up by opening new individual ones if there are still some persisting problems! 🙂