Problems when trying to run the albert example
soodoshll opened this issue · 3 comments
soodoshll commented
Hi! Thank you for this amazing project! I'm trying to reproduce the experimental results from the paper, but I've run into some problems:
I'm using Python 3.9 and following the instructions in the README file.
1. Data pre-processing
When I tried to run the command python tokenize_wikitext103.py, it showed an error message like:
Traceback (most recent call last):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 485, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/fingerprint.py", line 411, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2357, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2052, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 82, in tokenize_function
    instances = create_instances_from_document(tokenizer, text, max_seq_length=512)
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 24, in create_instances_from_document
    segmented_sents = list(nltk.sent_tokenize(document))
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize
    for sentence in slices:
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
2. The API URL does not exist
When I tried to run the GPU trainer, it showed this error message:
Traceback (most recent call last):
  File "/home/su/DeDLOC/albert/run_trainer.py", line 297, in <module>
    main()
  File "/home/su/DeDLOC/albert/run_trainer.py", line 225, in main
    tokenizer = AlbertTokenizerFast.from_pretrained(dataset_args.tokenizer_path, cache_dir=dataset_args.cache_dir)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/file_utils.py", line 2103, in get_list_of_files
    return list_repo_files(path_or_repo, revision=revision, token=token)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
    info = self.model_info(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 586, in model_info
    r.raise_for_status()
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer
soodoshll commented
I downgraded nltk to 3.6.2, and the first problem is solved.
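For anyone who can't pin the version, a minimal sketch of a guard around `nltk.sent_tokenize` that skips the inputs observed to trigger the crash. The function name and the empty-document check are my own assumptions, not part of the repository; the fix confirmed in this thread remains downgrading nltk to 3.6.2:

```python
def safe_sent_tokenize(document: str) -> list:
    """Sentence-split a document, skipping inputs that can trigger the
    punkt IndexError seen on some newer nltk releases.

    Hypothetical workaround, not the repository's code.
    """
    document = document.strip()
    if not document:
        # Empty/whitespace-only documents are one known trigger of
        # the "list index out of range" error in punkt.
        return []
    import nltk  # deferred import so the guard itself needs no nltk
    return nltk.sent_tokenize(document)
```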
borzunov commented
Hi!
The second problem is a consequence of the first one: this (admittedly obscure) error message is shown when the script can't find the ./data directory, which is the output of the ./tokenize_wikitext103.py script. Running that script again should help.
The seemingly unrelated `requests.exceptions.HTTPError` is raised because, when the script fails to find the tokenizer locally, it falls back to looking up the provided name online.
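This failure mode can be surfaced earlier by checking for the local directory before loading the tokenizer. A minimal sketch; the helper name is hypothetical, and the default path is inferred from the 404 URL in the traceback (`.../api/models/data/tokenizer`):

```python
import os

def ensure_preprocessed_data(tokenizer_path: str = "data/tokenizer") -> str:
    """Fail early with a clear message when preprocessed data is missing.

    When the local path does not exist, from_pretrained treats it as a
    Hugging Face Hub repo name, producing a confusing 404 instead.
    (Hypothetical helper, not part of run_trainer.py.)
    """
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(
            f"'{tokenizer_path}' not found; run tokenize_wikitext103.py first"
        )
    return tokenizer_path
```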
**Note:** If you are verifying the plots/numbers reported in the paper, you're correct to use this repository. In contrast, if your goal is to try out collaborative training (or to set up your own experiment), consider using a newer version of the hivemind library together with the newer version of the ALBERT example from the https://github.com/learning-at-home/hivemind repository. It has many substantial improvements, including a fix for this obscure error message.
soodoshll commented
Thanks, Alexander. That solves the problem.