path to data directory
jzhoubu opened this issue · 6 comments
Hi, thanks for sharing.
I am quite confused about the path to the AMR dataset directory. During data preprocessing, I put my data under gtos/generator_data/data/AMR/LDC2017T10 and things went well. However, during Vocab & Data Preparation, I had to move the data directory to gtos/data/AMR/... to make sh prepare.sh work.
Below is the log printed when I execute sh prepare.sh under gtos/generator. Is this running as expected?
bad attribute li p2 None
bad concept instance i3 imperative
bad attribute value s Auf
bad attribute li n None
bad concept instance i imperative
bad attribute value s compatible
bad concept instance i imperative
bad concept instance i2 imperative
bad attribute value s2 Qui
bad concept instance e expressive
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s3 (
bad attribute value s hope
bad attribute li a None
bad attribute value s2 critical
bad attribute quant p None
bad attribute value s4 It's
mode DATE_ATTRS_1 interrogative abstract concept cannot have an attribute
bad attribute value a yar
bad concept instance e2 expressive
bad attribute value a2 are
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s Erneuerbare
polarity DATE_ATTRS_1 - abstract concept cannot have an attribute
bad attribute value s2 no
bad attribute value s Qaid
bad attribute value s Rahmatul
bad attribute value s Auf
bad concept instance i3 imperative
bad concept instance i imperative
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad concept instance i imperative
read from ../data/AMR/amr_2.0/train.txt.features.preproc, 36521 amrs
tot_paths 8993632 avg_path_length 4.038885068902085
extreme_long_paths 477832 extreme_long_paths_percentage 0.053130036897217944
multi_path_percentage 993488 7720428 0.12868302120037906
predictable token coverage (1. - copyable token coverage) 401995 624750 0.6434493797519008
make vocabularies
bad attribute li p2 None
bad concept instance i3 imperative
bad attribute value s Auf
bad attribute li n None
bad concept instance i imperative
bad attribute value s compatible
bad concept instance i imperative
bad concept instance i2 imperative
bad attribute value s2 Qui
bad concept instance e expressive
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s3 (
bad attribute value s hope
bad attribute li a None
bad attribute value s2 critical
bad attribute quant p None
bad attribute value s4 It's
mode DATE_ATTRS_1 interrogative abstract concept cannot have an attribute
bad attribute value a yar
bad concept instance e2 expressive
bad attribute value a2 are
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad attribute value s Erneuerbare
polarity DATE_ATTRS_1 - abstract concept cannot have an attribute
bad attribute value s2 no
bad attribute value s Qaid
bad attribute value s Rahmatul
bad attribute value s Auf
bad concept instance i3 imperative
bad concept instance i imperative
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
polarity ORDINAL_ENTITY_1 - abstract concept cannot have an attribute
bad concept instance i imperative
read from ../data/AMR/amr_2.0/train.txt.features.preproc, 36521 amrs
read from ../data/AMR/amr_2.0/dev.txt.features.preproc, 1368 amrs
bad attribute quant t None
bad concept instance i imperative
day DATE_ATTRS_3 1 abstract concept cannot have an attribute
read from ../data/AMR/amr_2.0/test.txt.features.preproc, 1371 amrs
By the way, what are the proper parameters to reproduce the AMR-to-text task?
I am running experiments on a server with 4 P100 GPUs (16GB each). By default, --train_batch_size is 66666 and --dev_batch_size is 44444. It is unusual to see such a large batch size; is this a random number, or is it the correct setting to reproduce the results?
@sysu-zjw Hi, thanks for your interest.
The printed log looks good! Actually, you can change the dataset path config in prepare.sh (see line 1 at commit e81578b) rather than moving the data.
The default settings in this repo should reproduce our results. As for the unusual batch size: the size is counted as sum(n + m^2), where n is the number of tokens and m is the number of nodes, so the number looks quite large. You can find more details in data.py, specifically around line 300 at commit e81578b.
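For intuition only, here is a rough sketch of how such a budget can be consumed when forming batches. This is not the repo's actual data.py logic; the function name and the (num_tokens, num_nodes) input format are illustrative assumptions.

```python
# Minimal sketch (not the repo's actual code): group examples until the
# accumulated cost, counted as tokens + nodes^2 per example, would exceed
# the budget given by --train_batch_size.
def batchify(examples, budget=66666):
    """examples: iterable of (num_tokens, num_nodes) pairs."""
    batch, cost = [], 0
    for n_tokens, n_nodes in examples:
        item_cost = n_tokens + n_nodes ** 2
        if batch and cost + item_cost > budget:
            yield batch
            batch, cost = [], 0
        batch.append((n_tokens, n_nodes))
        cost += item_cost
    if batch:
        yield batch
```

Under this counting, a "batch size" of 66666 corresponds to only a few dozen AMR graphs per batch, not 66666 sentences.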
Hi @jcyk, thanks for your reply!
It seems I cannot train with such a batch size on my devices (4 x 16GB GPUs). May I ask what kind of device you trained on?
Hi @sysu-zjw,
Our AMR experiments were conducted on a single GPU with 24GB of memory.
I think you may need to adjust the batch size to fit your devices. One hint: when using multiple GPUs, the master GPU needs more memory than the others.
Also, I recall that the peak memory cost is much larger than the average cost, so a possible workaround is to identify and split the most memory-consuming batches.
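For illustration, a hypothetical helper along those lines (the function name, the cap value, and the batch representation are assumptions for this sketch, not code from the repo):

```python
# Hypothetical helper (not part of the repo): estimate each batch's cost with
# the same tokens + nodes^2 measure and split any batch above a cap in half,
# so that no single batch dominates peak GPU memory.
def split_oversized(batches, cap=40000):
    for batch in batches:
        cost = sum(n + m ** 2 for n, m in batch)
        if cost > cap and len(batch) > 1:
            mid = len(batch) // 2
            yield batch[:mid]
            yield batch[mid:]
        else:
            yield batch
```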
Let me know if you have further questions.