yandex-research/DeDLOC

Problems when trying to run the albert example

soodoshll opened this issue · 3 comments

Hi! Thank you for this amazing project! I'm trying to reproduce the experimental results from the paper but ran into a couple of issues:

I'm using Python 3.9 and following the instructions in the README file.

1. Data pre-processing

When I tried to run the command python tokenize_wikitext103.py, it failed with an error message like:

Traceback (most recent call last):                                                                                                       
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker                           
    result = (True, func(*args, **kwds))                                                                                                 
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 518, in wrapper                     
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 485, in wrapper                     
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)                                                                   
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/fingerprint.py", line 411, in wrapper                       
    out = func(self, *args, **kwargs)                                                                                                    
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2469, in _map_single                
    batch = apply_function_on_filtered_inputs(                                                                                           
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2357, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)                                                                 
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2052, in decorated                  
    result = f(decorated_item, *args, **kwargs)                                                                                          
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 82, in tokenize_function                                                   
    instances = create_instances_from_document(tokenizer, text, max_seq_length=512)                                                      
  File "/home/su/DeDLOC/albert/tokenize_wikitext103.py", line 24, in create_instances_from_document                                      
    segmented_sents = list(nltk.sent_tokenize(document))                                                                                 
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 107, in sent_tokenize               
    return tokenizer.tokenize(text)                                                                                                      
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1276, in tokenize                      
    return list(self.sentences_from_text(text, realign_boundaries))                                                                      
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in sentences_from_text           
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1332, in <listcomp>                    
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]                                                          
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1322, in span_tokenize                 
    for sentence in slices:                                                                                                              
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1421, in _realign_boundaries
    for sentence1, sentence2 in _pair_iter(slices):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 318, in _pair_iter
    prev = next(iterator)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1395, in _slices_from_text
    for match, context in self._match_potential_end_contexts(text):
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1382, in _match_potential_end_contexts
    before_words[match] = split[-1]
IndexError: list index out of range
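
The failing frame is inside nltk's punkt tokenizer rather than in tokenize_wikitext103.py itself, so I suspect it depends on which nltk version is installed. A quick snippet I used to isolate it (my own check, not part of the repo):

```python
# Isolate the crash: call nltk's sentence tokenizer directly, outside the
# datasets .map() pipeline, and report the installed nltk version.
import nltk

print("nltk version:", nltk.__version__)
nltk.download("punkt", quiet=True)  # punkt model is required by sent_tokenize
print(nltk.sent_tokenize("First sentence of a test document. Second sentence."))
```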

2. The API URL does not exist

When I tried to run the GPU trainer, it showed this error message:

Traceback (most recent call last):
  File "/home/su/DeDLOC/albert/run_trainer.py", line 297, in <module>
    main()
  File "/home/su/DeDLOC/albert/run_trainer.py", line 225, in main
    tokenizer = AlbertTokenizerFast.from_pretrained(dataset_args.tokenizer_path, cache_dir=dataset_args.cache_dir)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1654, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/transformers/file_utils.py", line 2103, in get_list_of_files
    return list_repo_files(path_or_repo, revision=revision, token=token)
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 602, in list_repo_files
    info = self.model_info(
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/huggingface_hub/hf_api.py", line 586, in model_info
    r.raise_for_status()
  File "/home/su/miniconda3/envs/albert/lib/python3.9/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/data/tokenizer

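My guess (I haven't confirmed this against the repo code) is that the tokenizer path is meant to be a local directory that step 1 should have produced; since preprocessing failed, the directory was never created, and transformers falls back to looking the string up on the Hugging Face Hub. A rough check of that assumption:

```python
# Rough check of my guess: the trainer passes a local tokenizer path
# ("data/tokenizer", taken from the failing URL above). If that directory is
# missing, transformers treats the string as a Hub repo id and queries
# https://huggingface.co/api/models/data/tokenizer, which returns the 404.
import os

from transformers import AlbertTokenizerFast

tokenizer_path = "data/tokenizer"
if os.path.isdir(tokenizer_path):
    tokenizer = AlbertTokenizerFast.from_pretrained(tokenizer_path)
    print("loaded local tokenizer with vocab size", tokenizer.vocab_size)
else:
    print(f"{tokenizer_path} does not exist locally, so from_pretrained() "
          f"falls back to the Hub lookup that fails above")
```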
I downgraded nltk to 3.6.2, and that solved the first problem.
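
In case it helps anyone else hitting the same IndexError, roughly what that looked like (the pip command and the assert are just my own setup, not something from the repo's instructions):

```python
# Pin nltk back before re-running the preprocessing step:
#   pip install nltk==3.6.2
# and double-check that the pin took effect in the active environment.
import nltk

assert nltk.__version__ == "3.6.2", f"expected nltk 3.6.2, found {nltk.__version__}"
```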

Thanks, Alexander. That solves the problem.