obss/jury

ValueError: unmarshallable object

AI-14 opened this issue · 5 comments

Describe the bug
While running experiments for my project, I got this error when evaluating the results with jury.

Expected behavior
I expected it to produce the scores, but I don't know why this happened. I tried on several virtual machines and the same error pops up.

Exception Traceback (if available)
The output from the terminal is as follows:

Traceback (most recent call last):
  File "/workspace/cr2gllm-off/train_test_sft.py", line 158, in <module>
    main()
  File "/workspace/cr2gllm-off/train_test_sft.py", line 154, in main
    test(args)
  File "/workspace/cr2gllm-off/train_test_sft.py", line 144, in test
    calculate_metrics([df["prediction_sft"].to_list()], [df["findings"].to_list()])
  File "/workspace/cr2gllm-off/utils.py", line 183, in calculate_metrics
    bleu1 = bleu.compute(predictions=predictions, references=references, max_order=1)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/evaluate/module.py", line 467, in compute
    output = self._compute(**inputs, **compute_kwargs)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/_core/base.py", line 322, in _compute
    result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 277, in evaluate
    return super().evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/_core/base.py", line 276, in evaluate
    return eval_fn(predictions=predictions, references=references, **kwargs)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 237, in _compute_multi_pred_multi_ref
    score = self._compute_single_pred_multi_ref(
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 198, in _compute_single_pred_multi_ref
    return self._compute_bleu_score(
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 158, in _compute_bleu_score
    evaluation_fn = self._get_external_resource("nmt_bleu", attr="compute_bleu")
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/_core/base.py", line 190, in _get_external_resource
    external_module = import_module(module_name, self.external_module_path)
  File "/opt/conda/envs/env/lib/python3.10/site-packages/jury/metrics/_core/utils.py", line 52, in import_module
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 879, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1026, in get_code
  File "<frozen importlib._bootstrap_external>", line 689, in _code_to_timestamp_pyc
ValueError: unmarshallable object

Environment Information:

  • OS: Ubuntu 22.04
  • jury version: jury==2.3.1
  • python: 3.10
  • a conda environment was used

Hi @AI-14, can you please provide a minimal code example to reproduce the error? For the data part, you can use some toy references and predictions.

Code snippet I used:

from jury import metrics
import pandas as pd

def calculate_metrics(
    predictions: list[list[str]], references: list[list[str]]
) -> None:
    """Computes NLG metrics.

    Args:
        predictions (list[list[str]]): Containing all the predicted sentences.
        references (list[list[str]]): Containing all the ground truth sentences.
    """

    bleu = metrics.Bleu.construct()
    bleu1 = bleu.compute(predictions=predictions, references=references, max_order=1)
    bleu2 = bleu.compute(predictions=predictions, references=references, max_order=2)
    bleu3 = bleu.compute(predictions=predictions, references=references, max_order=3)
    bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)
    meteor = metrics.Meteor.construct().compute(
        predictions=predictions, references=references
    )
    rouge = metrics.Rouge.construct().compute(
        predictions=predictions, references=references
    )

    print(f"BLEU-1: {bleu1['bleu']['score']:.3f}")
    print(f"BLEU-2: {bleu2['bleu']['score']:.3f}")
    print(f"BLEU-3: {bleu3['bleu']['score']:.3f}")
    print(f"BLEU-4: {bleu4['bleu']['score']:.3f}")
    print(f"METEOR: {meteor['meteor']['score']:.3f}")
    print(f"ROUGE-L: {rouge['rouge']['rougeL']:.3f}")

print("Results of llm-sft:")
df = pd.read_csv(f"{args.results_dir}/sft_predictions.csv", encoding="utf-8")
calculate_metrics([df["prediction_sft"].to_list()], [df["findings"].to_list()])

Data looks like this:

prediction_sft column: There is a mild pneumothorax on the right side of the chest, along with consolidation in the left lower lobe of the lung, likely due to infection. Additionally, there is bilateral pleural effusion, possibly related to heart failure or other conditions. Overall, the chest x-ray shows no acute cardiopulmonary abnormalities or processes.

findings column: heart size and mediastinal contour are normal. pulmonary vascularity is normal. lungs are clear. no pleural effusions or pneumothoraces.

requirements.txt file

bitsandbytes==0.43.1
jury==2.3.1
numpy==1.26.4
pandas==2.2.2
peft==0.10.0
Pillow==10.4.0
scikit_learn==1.3.0
--extra-index-url https://download.pytorch.org/whl/cu121
torch==2.1.1
torchvision==0.16.1
tqdm==4.65.0
transformers==4.36.2
trl==0.8.6
gdown
tensorboard

NOTE: The error pops up when I'm using Ubuntu 22.04. It never appears on Windows 11.

Hi @AI-14, I've run your code in Colab, and it didn't throw an error.

The following code snippet, with jury==2.3.1, worked without an error.

from jury import metrics
import pandas as pd

def calculate_metrics(
    predictions: list[list[str]], references: list[list[str]]
) -> None:
    """Computes NLG metrics.

    Args:
        predictions (list[list[str]]): Containing all the predicted sentences.
        references (list[list[str]]): Containing all the ground truth sentences.
    """

    bleu = metrics.Bleu.construct()
    bleu1 = bleu.compute(predictions=predictions, references=references, max_order=1)
    bleu2 = bleu.compute(predictions=predictions, references=references, max_order=2)
    bleu3 = bleu.compute(predictions=predictions, references=references, max_order=3)
    bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)
    meteor = metrics.Meteor.construct().compute(
        predictions=predictions, references=references
    )
    rouge = metrics.Rouge.construct().compute(
        predictions=predictions, references=references
    )

    print(f"BLEU-1: {bleu1['bleu']['score']:.3f}")
    print(f"BLEU-2: {bleu2['bleu']['score']:.3f}")
    print(f"BLEU-3: {bleu3['bleu']['score']:.3f}")
    print(f"BLEU-4: {bleu4['bleu']['score']:.3f}")
    print(f"METEOR: {meteor['meteor']['score']:.3f}")
    print(f"ROUGE-L: {rouge['rouge']['rougeL']:.3f}")

mt_predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"], 
    ["Look! a wonderful day."]
]
mt_references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."],
    ["Today is a wonderful day", "The weather outside is wonderful."],
]

print("Results of llm-sft:")
# df = pd.read_csv(f"{args.results_dir}/sft_predictions.csv", encoding="utf-8")
calculate_metrics(mt_predictions, mt_references)

The output is as follows:

Results of llm-sft:
BLEU-1: 0.882
BLEU-2: 0.753
BLEU-3: 0.636
BLEU-4: 0.424
METEOR: 0.727
ROUGE-L: 0.743

Can you please double-check or share a notebook?

Btw, as a side note, your calculate_metrics contains low-level construct() calls. Metric construction shouldn't be done this way; ideally, you should use factory/helper functions such as load_metric to get metrics. Please refer to the README for this, and see the sketch below.
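
For reference, it'd look roughly like this (off the top of my head, so please check the README for the exact usage; mt_predictions/mt_references are the toy lists above):

from jury import Jury, load_metric

# Obtain metric instances via the load_metric helper instead of Metric.construct();
# compute() is then called the same way as in the snippet above.
bleu = load_metric("bleu")
bleu1 = bleu.compute(predictions=mt_predictions, references=mt_references, max_order=1)
meteor = load_metric("meteor").compute(predictions=mt_predictions, references=mt_references)
rouge = load_metric("rouge").compute(predictions=mt_predictions, references=mt_references)

# Alternatively, a single Jury scorer can run several metrics at once.
scorer = Jury(metrics=["bleu", "meteor", "rouge"])
scores = scorer(predictions=mt_predictions, references=mt_references)
print(scores)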

Hi @devrimcavusoglu, I double-checked the code; it runs fine in Colab too. It runs fine everywhere except on Ubuntu 22.04. I think I'll switch the VM to Windows, since it runs fine on Windows 11. Thank you for responding!

@AI-14 Good to hear. I will double-check on Ubuntu 22.04 if I find time, but I'm closing this issue for now as the problem is resolved.