jcyk/gtos

path to data directory

jzhoubu opened this issue · 6 comments

Hi, thanks for sharing.

I am quite confused about the path to the AMR dataset directory. During data preprocessing, I put my data under gtos/generator_data/data/AMR/LDC2017T10 and things went well. However, during Vocab & Data Preparation, I need to move the data directory to gtos/data/AMR/... to make sh prepare.sh work.

Below is the log when I execute sh prepare.sh under gtos/generator. Is this running as expected?

bad attribute li p2 None
bad concept instance i3 imperative
bad attribute value s Auf
bad attribute li n None
bad concept instance i imperative
bad attribute value s compatible
bad concept instance i imperative
bad concept instance i2 imperative
bad attribute value s2 Qui
bad concept instance e expressive
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s3 (
bad attribute value s hope
bad attribute li a None
bad attribute value s2 critical
bad attribute quant p None
bad attribute value s4 It's
mode DATE_ATTRS_1 interrogative abstract concept cannot have an attribute
bad attribute value a yar
bad concept instance e2 expressive
bad attribute value a2 are
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s Erneuerbare
polarity DATE_ATTRS_1 - abstract concept cannot have an attribute
bad attribute value s2 no
bad attribute value s Qaid
bad attribute value s Rahmatul
bad attribute value s Auf
bad concept instance i3 imperative
bad concept instance i imperative
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad concept instance i imperative
read from ../data/AMR/amr_2.0/train.txt.features.preproc, 36521 amrs
tot_paths 8993632 avg_path_length 4.038885068902085
extreme_long_paths 477832 extreme_long_paths_percentage 0.053130036897217944
multi_path_percentage 993488 7720428 0.12868302120037906
predictable token coverage (1. - copyable token coverage) 401995 624750 0.6434493797519008
make vocabularies
bad attribute li p2 None
bad concept instance i3 imperative
bad attribute value s Auf
bad attribute li n None
bad concept instance i imperative
bad attribute value s compatible
bad concept instance i imperative
bad concept instance i2 imperative
bad attribute value s2 Qui
bad concept instance e expressive
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s3 (
bad attribute value s hope
bad attribute li a None
bad attribute value s2 critical
bad attribute quant p None
bad attribute value s4 It's
mode DATE_ATTRS_1 interrogative abstract concept cannot have an attribute
bad attribute value a yar
bad concept instance e2 expressive
bad attribute value a2 are
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s Erneuerbare
polarity DATE_ATTRS_1 - abstract concept cannot have an attribute
bad attribute value s2 no
bad attribute value s Qaid
bad attribute value s Rahmatul
bad attribute value s Auf
bad concept instance i3 imperative
bad concept instance i imperative
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad concept instance i imperative
read from ../data/AMR/amr_2.0/train.txt.features.preproc, 36521 amrs
read from ../data/AMR/amr_2.0/dev.txt.features.preproc, 1368 amrs
bad attribute quant t None
bad concept instance i imperative
day DATE_ATTRS_3 1 abstract concept cannot have an attribute
read from ../data/AMR/amr_2.0/test.txt.features.preproc, 1371 amrs

By the way, what are the proper parameters to reproduce the AMR-to-text task?
I am running experiments on a 4*P100 (16GB per GPU) server. By default, --train_batch_size is 66666 and --dev_batch_size is 44444. It is unusual to see such a large batch size; is this a random number, or is it the correct parameter to reproduce the results?

jcyk commented

@sysu-zjw Hi, thanks for your interest.

The printed log looks good! Actually, you can change the dataset path in prepare.sh rather than moving the data. See

dataset=../data/AMR/amr_2.0

The default settings in this repo should reproduce our results. As for the unusual batch size: it counts sum(n + m^2) over the examples in a batch, where n is the number of tokens and m is the number of nodes, which is why it looks so large. You can find more details in data.py, specifically in the following line.

num_tokens += len(self.data[i]['token']) + len(self.data[i]['concept'])**2
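In other words, --train_batch_size is a cost budget rather than a sentence count. Below is a minimal sketch of how such budget-based batching could work; it is an illustrative rewrite, not the exact code in data.py, and it assumes each example is a dict with 'token' and 'concept' lists.

```python
def make_batches(data, budget):
    """Group examples so that the accumulated cost
    len(tokens) + len(concepts)**2 stays within the budget."""
    batches, current, cost = [], [], 0
    for example in data:
        example_cost = len(example['token']) + len(example['concept']) ** 2
        # start a new batch once adding this example would exceed the budget
        if current and cost + example_cost > budget:
            batches.append(current)
            current, cost = [], 0
        current.append(example)
        cost += example_cost
    if current:
        batches.append(current)
    return batches
```

With this scheme, a budget of 66666 can easily correspond to only a few dozen sentences per batch, since the m^2 term dominates for graphs with many nodes.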

Hi @jcyk, thanks for your reply!
It seems I cannot train with such a batch size on my device (4 * 16GB GPUs). May I ask what kind of device you trained with?

jcyk commented

hi @sysu-zjw

Our AMR experiments are conducted using a single GPU with 24GB memory.

I think you may need to adjust the batch size to fit your devices. One hint: the master GPU needs more memory than the others when training with multiple GPUs.

jcyk commented

Also, I recall that the maximum memory cost of a batch is much larger than the average cost. A possible workaround is to identify and split those very memory-consuming batches.
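As a rough illustration of that workaround, here is a hypothetical helper (not from the repo) that estimates each batch's cost with the same n + m^2 formula and splits any batch exceeding a cap in half until all batches fit.

```python
def split_heavy_batches(batches, max_cost):
    """Split batches whose estimated cost exceeds max_cost,
    to avoid occasional out-of-memory spikes on smaller GPUs."""
    def cost(batch):
        return sum(len(ex['token']) + len(ex['concept']) ** 2 for ex in batch)

    result, stack = [], list(batches)
    while stack:
        batch = stack.pop()
        if cost(batch) > max_cost and len(batch) > 1:
            mid = len(batch) // 2
            stack.extend([batch[:mid], batch[mid:]])  # retry both halves
        else:
            result.append(batch)
    return result  # note: batch order is not preserved, which is fine for training
```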

jcyk commented

Let me know if you have further questions.