Problems when trying to run the albert example
soodoshll opened this issue · 3 comments
soodoshll commented
Hi! Thank you for this amazing project! I'm trying to reproduce the experimental results from the paper, but I've run into some problems:
I'm using Python 3.9 and following the instructions in the README file.
1. Data pre-processing
When I tried to run the command python tokenize_wikitext103.py, it showed an error message like:
Traceback (most recent call last):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2357, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2052, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 82, in tokenize_function
    instances = create_instances_from_document(tokenizer, text, max_seq_length=512)
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 24, in create_instances_from_document
    segmented_sents = list(nltk.sent_tokenize(document))
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
2. The API URL does not exist
When I tried to run the GPU trainer, it showed this error message:
Traceback (most recent call last):
  File "/home/su/DeDLOC/albert/run_trainer.py", line 297, in <module>
    main()
  File "/home/su/DeDLOC/albert/run_trainer.py", line 225, in main
    tokenizer = AlbertTokenizerFast.from_pretrained(dataset_args.tokenizer_path, cache_dir=dataset_args.cache_dir)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/file_utils.py", line 2103, in get_list_of_files
    return list_repo_files(path_or_repo, revision=revision, token=token)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
    info = self.model_info(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 586, in model_info
    r.raise_for_status()
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer
soodoshll commented
I downgraded nltk to 3.6.2, and the first problem is solved.
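For anyone who can't pin the version, a minimal sketch of a guard around `nltk.sent_tokenize` that skips the inputs observed to trigger the crash. The function name and the empty-document check are my own assumptions, not part of the repository; the fix confirmed in this thread remains downgrading nltk to 3.6.2:

```python
def safe_sent_tokenize(document: str) -> list:
    """Sentence-split a document, skipping inputs that can trigger the
    punkt IndexError seen on some newer nltk releases.

    Hypothetical workaround, not the repository's code.
    """
    document = document.strip()
    if not document:
        # Empty/whitespace-only documents are one known trigger of
        # the "list index out of range" error in punkt.
        return []
    import nltk  # deferred import so the guard itself needs no nltk
    return nltk.sent_tokenize(document)
```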
borzunov commented
Hi!
The second problem is a consequence of the first one: this (admittedly obscure) error message is shown when the script can't find the ./data directory, which is the output of the ./tokenize_wikitext103.py script. Running that script again should help.
The seemingly unrelated `requests.exceptions.HTTPError` is raised because, when the script fails to find the tokenizer locally, it falls back to looking up the provided name online.
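This failure mode can be surfaced earlier by checking for the local directory before loading the tokenizer. A minimal sketch; the helper name is hypothetical, and the default path is inferred from the 404 URL in the traceback (`.../api/models/data/tokenizer`):

```python
import os

def ensure_preprocessed_data(tokenizer_path: str = "data/tokenizer") -> str:
    """Fail early with a clear message when preprocessed data is missing.

    When the local path does not exist, from_pretrained treats it as a
    Hugging Face Hub repo name, producing a confusing 404 instead.
    (Hypothetical helper, not part of run_trainer.py.)
    """
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(
            f"'{tokenizer_path}' not found; run tokenize_wikitext103.py first"
        )
    return tokenizer_path
```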
**Note:** If you are verifying the plots/numbers reported in the paper, you're correct to use this repository. In contrast, if your goal is to try out collaborative training (or to set up your own experiment), consider using a newer version of the hivemind library together with the newer version of the ALBERT example from the https://github.com/learning-at-home/hivemind repository. It has many substantial improvements, including a fix for this obscure error message.
soodoshll commented
Thanks, Alexander. That solves the problem.