hezarai/hezar

Problem in loading the dataset using the pre-trained model

Closed this issue · 5 comments

Hello, thanks for your efforts in building this powerful library. I created a dataset with exactly the same structure as the "hezarai/persian-license-plate-v1" dataset and also changed the related settings (paths, etc.) in the config files. When I try to load this dataset with the pre-trained model "hezarai/crnn-fa-64x256-license-plate-recognition" (as tokenizer_path), the following problem occurs.
Thanks.
eval_dataset = Dataset.load(dataset_path, split="test", tokenizer_path=base_model_path)

Downloading data: 100%|██████████| 14.5k/14.5k [00:00<00:00, 39.3kB/s]
Downloading data: 100%|██████████| 14.5k/14.5k [00:00<00:00, 31.2kB/s]
Downloading data: 100%|██████████| 14.5k/14.5k [00:00<00:00, 37.4kB/s]
Generating train split: 2 examples [00:00, 352.12 examples/s]
Generating validation split: 2 examples [00:00, 215.20 examples/s]
Generating test split: 2 examples [00:00, 502.10 examples/s]

.../myenv/Lib/site-packages/hezar/data/datasets/ocr_dataset.py

    135 for i, sample in enumerate(list(iter(data))):
    136     path, text = sample.values()
--> 137     if len(text) <= self.config.max_length and is_text_valid(text, self.config.id2label.values()):
    138         valid_indices.append(i)
    139     else:

TypeError: object of type 'int' has no len()

Hello @Mostafa79modaqeq, thanks for the feedback ❤
As far as I can tell, this error can only be caused by the order of path, text = sample.values() being reversed, so that calling len(text) raises this error (since text is actually the index of the sample, not the text itself).
I think this code can help you check the order of the columns:

from datasets import load_dataset

data = load_dataset(dataset_path, split="test")
print(data[0])

The output must be something like below:

{'image_path': 'path/to/image.jpg', 'label': 'label_of_image'}

But yours is probably in reverse order or completely different.
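
If the columns are indeed swapped or named differently, one quick fix is to rewrite the CSV itself so the image path column comes first and the two columns are named image_path and label. This is only a sketch: it assumes your CSV is available locally and that pandas is installed, and the column names being renamed are placeholders you should adjust to whatever your file actually uses:

import pandas as pd

# Hypothetical cleanup: make the CSV match the expected {'image_path', 'label'} layout
df = pd.read_csv("path/to/your.csv")
df = df.rename(columns={"path": "image_path", "text": "label"})  # adjust to your actual column names
df = df[["image_path", "label"]]  # image path first, text label second
df.to_csv("path/to/fixed.csv", index=False)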

Note that you can also use your own custom dataset class so that everything is under your control. See the example below:

import pandas as pd

from hezar.models import CRNNImage2TextConfig, CRNNImage2Text
from hezar.preprocessors import ImageProcessor
from hezar.trainer import Trainer, TrainerConfig

from hezar.data import OCRDataset, OCRDatasetConfig


class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        # Load a dataframe here and make sure the split is fetched
        data = pd.read_csv(self.config.path)
        # preprocess if needed
        return data

    def __getitem__(self, index):
        # Do anything you want with your data, just make sure the output is a dictionary of "pixel_values" and "labels"
        path, text = self.data.iloc[index].values
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs


dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    reverse_digits=True,
)

train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
    CRNNImage2TextConfig(
        id2label=train_dataset.config.id2label,
        map2seq_in_dim=1024,
        map2seq_out_dim=96
    )
)
preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

train_config = TrainerConfig(
    output_dir="crnn-plate-fa-v1",
    task="image2text",
    device="cuda",
    batch_size=8,
    num_epochs=20,
    metrics=["cer"],
    metric_for_best_model="cer"
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
    preprocessor=preprocessor,
)
trainer.train()

Thanks for the advice. Now I have another question: my data has exactly the same structure as the Hezar dataset ("hezarai/persian-license-plate-v1"), but to define separate sets for training and evaluation I have to give each split its own dedicated CSV file. In that case, I would implement this in the _load() of the PersianOCRDataset class, something like the following:
csv_files = {
    "train": "path-of-persian_license_plate_train.csv",
    "test": "path-of-persian_license_plate_test.csv",
    "val": "path-of-persian_license_plate_val.csv",
}
csv_file_path = csv_files.get(split)
data = pd.read_csv(csv_file_path)
So what is the role of the path argument in OCRDatasetConfig?
When I read the CSV files this way (without setting path when initializing OCRDatasetConfig), this error occurs:
preprocessor = ImageProcessor(train_dataset.config.image_processor_config)

.../hezar/preprocessors/image_processor.py:81
    81     Initializes the ImageProcessor.

.../hezar/preprocessors/preprocessor.py
    27 def __init__(self, config: PreprocessorConfig, **kwargs):
    28     verify_dependencies(self, self.required_backends)  # Check if all the required dependencies are installed
--->    self.config = config.update(kwargs)

AttributeError: 'NoneType' object has no attribute 'update'
It's probably because it can't find a .yaml file for the image processor config, or an object that initializes its parameters.
How can I solve this problem?

Hi @Mostafa79modaqeq. Sorry for my late response. (GitHub does not notify me if I'm not @mentioned in the issues.)
Your method is actually pretty solid. The only thing is that your dataset needs to receive an image_processor_config object in its config, which is an instance of the ImageProcessorConfig dataclass, and the previous code I gave you actually misses it too!
I don't know how you have defined the other parameters in your dataset config, but a sample like the one below would do the trick:

import pandas as pd

from hezar.models import CRNNImage2TextConfig, CRNNImage2Text
from hezar.preprocessors import ImageProcessor, ImageProcessorConfig
from hezar.trainer import Trainer, TrainerConfig

from hezar.data import OCRDataset, OCRDatasetConfig


class PersianOCRDataset(OCRDataset):
    def __init__(self, config: OCRDatasetConfig, split=None, **kwargs):
        super().__init__(config=config, split=split, **kwargs)

    def _load(self, split=None):
        # Load a dataframe here and make sure the split is fetched
        data = pd.read_csv(self.config.path)
        # preprocess if needed
        return data

    def __getitem__(self, index):
        # Do anything you want with your data, just make sure the output is a dictionary of "pixel_values" and "labels"
        path, text = self.data.iloc[index].values
        pixel_values = self.image_processor(path, return_tensors="pt")["pixel_values"][0]
        labels = self._text_to_tensor(text)
        inputs = {
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return inputs


dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    reverse_digits=True,
    image_processor_config=ImageProcessorConfig(
        gray_scale=True,
        mean=[0.6595],
        std=[0.1501],
        mirror=True,
        rescale=1/255.0,
        size=(256, 64),
    )
)

train_dataset = PersianOCRDataset(dataset_config, split="train")
eval_dataset = PersianOCRDataset(dataset_config, split="test")

model = CRNNImage2Text(
    CRNNImage2TextConfig(
        id2label=train_dataset.config.id2label,
        map2seq_in_dim=1024,
        map2seq_out_dim=96
    )
)
model.preprocessor = train_dataset.image_processor

train_config = TrainerConfig(
    output_dir="crnn-plate-fa-v1",
    task="image2text",
    device="cuda",
    batch_size=8,
    num_epochs=20,
    metrics=["cer"],
    metric_for_best_model="cer"
)

trainer = Trainer(
    config=train_config,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=train_dataset.data_collator,
)
trainer.train()

Hello @arxyzan, I sincerely appreciate your responsiveness; I understand how valuable your time is under these working conditions. I apologize for my frequent questions, which are due to my lack of experience in programming. I made the suggested changes and started training, and the following error occurs:
KeyError.txt
Training info:
Output Directory: crnn-plate-fa-v1
Task: image2text
Model: CRNNImage2Text
Init Weights: N/A
Device(s): cpu
Batch Size: 8
Epochs: 20
Training Dataset: PersianOCRDataset(path=ocr['train'], size=7962)
Evaluation Dataset: PersianOCRDataset(path=ocr['test'], size=995)
Optimizer: adam
Scheduler: None
Initial Learning Rate: 2e-05
Learning Rate Decay: 0.0
Number of Parameters: 9269001
Number of Trainable Parameters: 9269001
Mixed Precision: Full (fp32)
Metrics: ['cer']
Checkpoints Path: crnn-plate-fa-v1\checkpoints
Logs Path: crnn-plate-fa-v1\logs\Mar17_12-05-35_DESKTOP-EL4M7VQ

ChatGPT suggests changing the _text_to_tensor method of the OCRDataset class.
Is that correct?
Thanks a lot

@Mostafa79modaqeq This error occurs because a character (\u200d, the zero-width joiner) is not present in the list of available labels. You can actually inspect the id2label dictionary:

print(train_dataset.config.id2label)
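
If you want to see exactly which characters appear in your labels but are missing from that mapping, a small check like this should do (a sketch, assuming df is the pandas DataFrame your CSV was loaded into):

# Characters present in the labels but absent from the dataset's id2label mapping
known_chars = set(train_dataset.config.id2label.values())
unknown_chars = set("".join(df["label"])) - known_chars
print(unknown_chars)  # e.g. {'\u200d'}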

You can also extract all the labels you want from your dataset and pass them in the dataset config like below:

...
# Extract id2label from your dataset
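# (df here is assumed to be the pandas DataFrame loaded from your CSV, e.g. df = pd.read_csv("path/to/csv"))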
labels_set = list(set("".join(df["label"])))
id2label = {i: c for i, c in enumerate(labels_set)}

dataset_config = OCRDatasetConfig(
    path="path/to/csv",
    text_split_type="char_split",
    text_column="label",
    images_paths_column="image_path",
    # 
    id2label=id2label,  # PASS ID2LABEL SO THAT THE KEY ERROR DOES NOT HAPPEN ANYMORE
    # 
    reverse_digits=True,
    image_processor_config=ImageProcessorConfig(
        gray_scale=True,
        mean=[0.6595],
        std=[0.1501],
        mirror=True,
        rescale=1/255.0,
        size=(256, 64),
    )
)
...
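
Alternatively, if you don't want \u200d to become a label of its own, a small preprocessing step (an assumption on my side, not something the library requires) is to strip it from the label column before building id2label:

# Optional cleanup: remove the zero-width joiner from the labels so it never needs an id
df["label"] = df["label"].str.replace("\u200d", "", regex=False)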