google-research/text-to-text-transfer-transformer

Question on reproducing (Please help me delete this post, posted here by mistake)

zluw1117 opened this issue · 2 comments

Hi @tscholak, I'm trying to reproduce your CoSQL model based on t5.1.1.lm100k.large.

I trained the CoSQL model on a p3.16xlarge EC2 instance without db_content (8 GPUs, per-device mini-batch size = 1, gradient accumulation steps = 250, so my effective batch size is 8 × 1 × 250 = 2000). The model at /home/ubuntu/code/src/t5-v1_1-large was downloaded from gs://t5-data/pretrained_models/t5.1.1.lm100k.large. Here is the config I used for training:

{
    "run_name": "t5-cosql",
    "model_name_or_path": "/home/ubuntu/code/src/t5-v1_1-large",
    "dataset": "cosql+spider",
    "source_prefix": "",
    "schema_serialization_type": "peteshaw",
    "schema_serialization_randomized": false,
    "schema_serialization_with_db_id": true,
    "schema_serialization_with_db_content": false,
    "normalize_query": true,
    "target_with_db_id": true,
    "output_dir": "/home/ubuntu/code/src/code_train",
    "cache_dir": "/home/ubuntu/code/src/code_transformers_cache",
    "do_train": true,
    "do_eval": true,
    "fp16": false,
    "num_train_epochs": 250,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 250,
    "label_smoothing_factor": 0.0,
    "learning_rate": 1e-4,
    "adafactor": true,
    "adam_eps": 1e-6,
    "lr_scheduler_type": "constant",
    "warmup_ratio": 0.0,
    "warmup_steps": 0,
    "seed": 1,
    "report_to": ["wandb"],
    "logging_strategy": "steps",
    "logging_first_step": true,
    "logging_steps": 4,
    "load_best_model_at_end": true,
    "metric_for_best_model": "exact_match",
    "greater_is_better": true,
    "save_total_limit": 64,
    "save_steps": 64,
    "evaluation_strategy": "steps",
    "eval_steps":64,
    "predict_with_generate": true,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_picard": false
}
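
As a sanity check, the effective batch size implied by this config can be recomputed from the values above. This is a minimal sketch: the config file path is hypothetical, and it assumes the standard HuggingFace semantics of effective batch size = number of GPUs × per-device batch size × gradient accumulation steps.

import json

# Load the training config shown above ("train_cosql.json" is a hypothetical path).
with open("train_cosql.json") as f:
    cfg = json.load(f)

n_gpus = 8  # a p3.16xlarge instance has 8 V100 GPUs
effective_batch_size = (
    n_gpus
    * cfg["per_device_train_batch_size"]  # 1
    * cfg["gradient_accumulation_steps"]  # 250
)
print(effective_batch_size)  # 8 * 1 * 250 = 2000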

For both your CoSQL model (https://huggingface.co/tscholak/2jrayxos) and my model, I ran evaluation with the eval Docker image with PICARD enabled. Here is what I got:

Your model achieved

eval_exact_match = 0.5433 
eval_exec = 0.6324

while my model only obtained

eval_exact_match = 0.5069
eval_exec = 0.5935

For both metrics, I am nearly 4 percentage points below your model's performance. That seems like a big difference.
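
For concreteness, the gaps can be computed directly from the numbers above (a quick sketch, assuming nothing beyond the reported values):

reference = {"eval_exact_match": 0.5433, "eval_exec": 0.6324}
reproduced = {"eval_exact_match": 0.5069, "eval_exec": 0.5935}

# Percentage-point gap per metric.
for metric, ref_value in reference.items():
    gap_pp = (ref_value - reproduced[metric]) * 100
    print(f"{metric}: {gap_pp:.1f} pp")
# eval_exact_match: 3.6 pp
# eval_exec: 3.9 pp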
Does the config look good to you? Do you have any tips for training t5.1.1.lm100k.large-based models? Is there anything I'm missing in this reproduction experiment? Thank you.

Hi, did you mean to open this issue in the PICARD repository? Putting it here is a bit odd.
Your config looks fine. Without db content, though, you won't get the same performance I got. Furthermore, you want to turn on PICARD constrained inference for maximum accuracy.
Torsten
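
Concretely, those two suggestions map onto the config above as follows. This is a hedged sketch: the file paths are hypothetical, the key names are taken from the config posted earlier, and PICARD's own inference options (mode, beam size, etc.) are configured separately and not shown here.

import json

# Start from the training config shown above ("train_cosql.json" is a hypothetical path).
with open("train_cosql.json") as f:
    cfg = json.load(f)

cfg["schema_serialization_with_db_content"] = True  # serialize database content into the model input
cfg["use_picard"] = True  # enable PICARD constrained decoding at inference time

# Write out the modified config ("train_cosql_db_content.json" is a hypothetical path).
with open("train_cosql_db_content.json", "w") as f:
    json.dump(cfg, f, indent=4)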

Thank you for your quick reply, Torsten.
Yeah, you are right. I posted the issue to the wrong repository. Let me copy it over to the PICARD repository.

Can anyone help me delete this issue? I posted it here by mistake. Sorry.