errors while preprocessing data
zhangyue1013 opened this issue · 8 comments
When preprocessing data with the shell script 'scripts/extract-deepbank-sdp.sh', I ran into some PyDelphin errors, such as "keyerror: e3" on the "DMRS" data.
Because I only want to work with the "EDS" data, I removed the "dmrs" type from the shell script.
But with 'scripts/preprocess.sh', I also ran into a lot of errors.
So I want to confirm: can this code be applied to the 'eds' type alone, without 'dmrs'?
If not, does PyDelphin have any version requirement or other constraints?
Hello, I did not write this code, but I maintain PyDelphin. I recently pushed a bunch of commits to its primary branch ("develop") on GitHub for the next major release, many of which are not backwards compatible. The new version is not yet on PyPI, so `pip install pydelphin` should still get a compatible version (for now), but if you cloned the repository recently you'll have the new version. See here for some information about installing: https://pydelphin.readthedocs.io/en/v0.9.2/tutorials/setup.html
DeepDeepParser does not specify a version of PyDelphin, but it uses the previous version, so please ensure you're using v0.9.2 or perhaps lower.
Can you try this and see if it works?
$ pip uninstall pydelphin
$ pip install pydelphin==0.9.2
If you installed PyDelphin with `setup.py install` instead of using `pip`, then you may need to manually remove the installed files. I recommend using a virtual environment to get a clean install.
If you're sure you're using v0.9.2 or lower, please paste the stack trace of the errors you now get.
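To confirm which version is actually installed before pasting a stack trace, a small check along these lines may help. It is only a sketch: it assumes Python 3.8 or later (for `importlib.metadata`), and the helper names `parse_version`, `is_compatible`, and `installed_version` are made up for illustration and are not part of PyDelphin or DeepDeepParser.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_compatible(version_text, newest_allowed=(0, 9, 2)):
    """True if the version is no newer than v0.9.2 (the limit suggested above)."""
    return parse_version(version_text) <= newest_allowed


def installed_version(dist_name="pydelphin"):
    """Return the installed distribution version, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("PyDelphin is not installed in this environment")
    else:
        verdict = "ok" if is_compatible(found) else "too new"
        print(f"PyDelphin {found}: {verdict}")
```

Running this inside the same virtual environment that runs the preprocessing scripts shows whether an old `setup.py install` copy is still shadowing the pip-installed one.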
@janmbuys Can you add to `dependencies.md` the version of PyDelphin you used and perhaps the `pip install` command above?
see also #1 for a related query.
there is a ready-to-use reference package with the DeepBank 1.1 EDSs (corresponding to ERG release 1214) available for public download as part of the Open SDP release (look for the ‘2015/eds/’ sub-directory):
http://sdp.delph-in.net/index.php?page=5
in the context of the 2019 CoNLL Shared Task on Cross-Framework Meaning Representation Parsing, we have just minted a fresh release of these EDSs (and more, including out-of-domain evaluation data). if you have a running parser for EDS more or less available, you might still manage to obtain the ‘official’ 2019 version of the data and participate in the evaluation (with a deadline of july 22, 2019):
Thanks for your response.
I have downloaded the Open SDP release and got the ‘2015/eds/’ sub-directory.
My problem occurs while preparing the data following the "Data preparation" section of https://github.com/janmbuys/DeepDeepParser.
I preprocessed the EDS data with 'scripts/extract-deepbank-sdp.sh', without DMRS.
But while running 'scripts/preprocess.sh', some other errors appeared.
'python $HOME/DeepDeepParser/mrs/extract_data_lexicon.py $MRS_DIR $MRS_WDIR' in preprocess.sh fails, because in 'deepbank-sdp-eds', generated by 'extract-deepbank-sdp.sh', there is no file whose name ends with 'card' or 'sdmr'.
A similar error occurs while running 'python $HOME/DeepDeepParser/mrs/stanford_to_linear.py $MRS_DIR $MRS_WDIR $MRS_WDIR'.
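For anyone hitting the same failure, a quick way to see which inputs are missing before running the later steps is sketched below. The function `missing_suffixes` is a hypothetical helper, not part of DeepDeepParser, and the suffixes to check are simply the ones reported above; the scripts themselves remain the authority on what they expect.

```python
from pathlib import Path


def missing_suffixes(directory, suffixes):
    """Return the suffixes for which no file in `directory` has a matching name."""
    names = [path.name for path in Path(directory).iterdir() if path.is_file()]
    return [
        suffix
        for suffix in suffixes
        if not any(name.endswith(suffix) for name in names)
    ]
```

For example, `missing_suffixes("deepbank-sdp-eds", ["card", "sdmr"])` lists whichever of the two kinds of file the extraction step did not produce.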
In fact, we are preparing for the CoNLL 2019 shared task, but we have no parser available for EDS, so we want to build on your work.
During this work, one question has been troubling me a lot.
In the EDS graphs, we find that there are lots of extra nodes that do not appear in the raw sentence, like 'implicit_conj' or 'udef_q'.
I don't know how you generate those nodes.
It seems that those nodes have to be generated before training the transition-based parser.
Thank you for your reply.
The code was developed in 2017 using PyDelphin v0.6.*, so to avoid backward compatibility issues it might be safest to use that - I'll add a note to Dependencies.md.
Unfortunately, for legacy reasons the EDS preprocessing does indeed depend on the DMRS data. The two share the same lexicon (which is what is being extracted in this step), and for reasons I can't fully recall it was simpler to use the DMRS data for that, so I'm not sure if there is an easy fix for this.
I don't understand your question about EDS nodes completely, but I'll address two issues you might be referring to:
(1) The transition-based parser can generate arbitrary nodes as long as they are annotated with a span during training (so there can be a many-to-many relation between words and nodes), so all the nodes are generated by the parser.
(2) There is more than one version of the EDS graphs; I believe the data I used originally (not the OpenSDP release) might contain node types that were seen as redundant and removed in other releases.
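To make point (1) concrete, here is a toy illustration of span-annotated nodes. It is not the parser's actual data format: the tuple layout, the token indices, and the function `nodes_by_token` are assumptions made for this sketch, and the spans are hand-written rather than taken from real ERG output.

```python
from collections import defaultdict


def nodes_by_token(nodes):
    """Map each token index to the predicates whose span covers it.

    `nodes` is a list of (predicate, start, end) triples, where start and end
    are token indices and end is exclusive.
    """
    index = defaultdict(list)
    for predicate, start, end in nodes:
        for token in range(start, end):
            index[token].append(predicate)
    return dict(index)


# Hand-written, illustrative spans for a tokenized example sentence.
tokens = ["Kim", "likes", "apples", ",", "bananas", ",", "and", "pears"]
nodes = [
    ("proper_q", 0, 1),
    ("named", 0, 1),
    ("_like_v_1", 1, 2),
    ("implicit_conj", 2, 8),
    ("_apple_n_1", 2, 3),
    ("_and_c", 4, 8),
    ("_banana_n_1", 4, 5),
    ("_pear_n_1", 7, 8),
]

alignment = nodes_by_token(nodes)
```

Here the token "Kim" is covered by two nodes, and `implicit_conj` covers several tokens, which is the many-to-many relation described above: a node needs a span, not a word of its own.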
just to be careful: for the CoNLL Shared Task, only the training and companion data provided for the competition can be used, plus additional resources (corpora and word embeddings) that have been specifically white-listed. please see the MRP task web pages for details:
http://mrp.nlpl.eu
http://svn.nlpl.eu/mrp/2019/public/resources.txt
The parser uses the ERG lexicon to map surface forms to ERG surface predicate lemmas; the DMRS data is used to extract additional information for this mapping, but this should be exactly the same as in the EDS data (unless the surface predicate forms are different).
The code should still work (but might be slightly less accurate) if mrs/extract_data_lexicon.py is not run, as it just complements the mapping extracted from the ERG lexicon. The ERG SEM-I is white-listed as a resource for the shared task - the rest of the ERG is not, so I don't know if using the lexicon would be a problem.
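The "complements" relationship can be pictured roughly as follows. This is a sketch of the idea only, not the code in mrs/extract_data_lexicon.py: the function `build_lemma_map`, the plain-dict representation, and the choice to let ERG lexicon entries take precedence are all assumptions, and the entries in the usage note are toy values.

```python
def build_lemma_map(erg_lexicon, data_lexicon=None):
    """Combine a lexicon-derived mapping with an optional data-derived one.

    Both arguments map surface forms to predicate lemmas. Entries from the
    data only fill gaps; entries from the ERG lexicon are kept as they are
    (an assumed precedence, chosen for this sketch).
    """
    merged = dict(data_lexicon or {})
    merged.update(erg_lexicon)
    return merged
```

With toy entries, `build_lemma_map({"apples": "apple"})` returns the ERG-derived mapping unchanged, while passing `{"pears": "pear"}` as `data_lexicon` adds coverage for a form the first mapping lacks. Skipping the data-derived step therefore loses coverage rather than breaking the pipeline.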
i am afraid at this point the white-list for the MRP 2019 task is pretty much carved in stone, and we have to interpret it strictly. thus, any other part of the ERG except the SEM-I (the ‘etc/’ sub-directory of the 1214 release of the ERG) would be illegitimate, including of course the original Redwoods treebank (‘tsdb/gold/’ in the ERG).
@zhangyue1013 just to address the following point...
> In the EDS graphs, we find that there are lots of extra nodes that do not appear in the raw sentence, like 'implicit_conj' or 'udef_q'.
These nodes ensure the graph is connected and well-formed. `implicit_conj` conjoins subgraphs that do not have an overt conjunction in the surface string, as in "Kim likes apples, bananas, and pears", where `_and_c` conjoins `_banana_n_1` and `_pear_n_1`, but `implicit_conj` conjoins that subgraph with `_apple_n_1`. The `udef_q` node is inserted because MRS-based representations (including DMRS and EDS) currently require all noun-like nodes to have a quantifier. In the example sentence, "apples", "bananas", and "pears" do not have an overt quantifier or determiner ("the", "all", etc.), so `udef_q` is inserted. The conjunctions themselves (`_and_c`, `implicit_conj`, etc.) also get `udef_q` when conjoining nominal things, so you end up with a lot of these in the graph. Generally you can ignore these and have a post-processing rule that inserts them on any unquantified nominal node (although some, like proper names and pronouns, may have a different default quantifier).
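A post-processing rule of that kind might look roughly like the sketch below. It is not PyDelphin code and makes several assumptions that should be checked against real data: the graph is a plain dict of node ids to predicates plus a list of (source, role, target) edges; nominal nodes are recognized by ids starting with "x"; a quantifier attaches to its node with a `BV` edge; and `named` and `pron` nodes default to `proper_q` and `pronoun_q`. The node ids and edges in the example are hand-written for illustration.

```python
from dataclasses import dataclass

# Assumed non-default quantifiers for proper names and pronouns.
SPECIAL_QUANTIFIERS = {"named": "proper_q", "pron": "pronoun_q"}


@dataclass
class Graph:
    nodes: dict  # node id -> predicate
    edges: list  # (source id, role, target id)


def insert_default_quantifiers(graph, default="udef_q"):
    """Return a copy of `graph` in which every nominal node has a quantifier."""
    quantified = {target for _, role, target in graph.edges if role == "BV"}
    nodes = dict(graph.nodes)
    edges = list(graph.edges)
    counter = 0
    for node_id, predicate in graph.nodes.items():
        # Skip non-nominal nodes and nodes that already have a quantifier.
        if not node_id.startswith("x") or node_id in quantified:
            continue
        counter += 1
        while f"_{counter}" in nodes:  # avoid clashing with existing ids
            counter += 1
        quantifier_id = f"_{counter}"
        nodes[quantifier_id] = SPECIAL_QUANTIFIERS.get(predicate, default)
        edges.append((quantifier_id, "BV", node_id))
    return Graph(nodes, edges)


# "Kim likes apples, bananas, and pears" with all quantifiers stripped.
example = Graph(
    nodes={
        "e2": "_like_v_1",
        "x3": "named",
        "x8": "implicit_conj",
        "x13": "_apple_n_1",
        "x18": "_and_c",
        "x23": "_banana_n_1",
        "x28": "_pear_n_1",
    },
    edges=[
        ("e2", "ARG1", "x3"),
        ("e2", "ARG2", "x8"),
        ("x8", "L-INDEX", "x13"),
        ("x8", "R-INDEX", "x18"),
        ("x18", "L-INDEX", "x23"),
        ("x18", "R-INDEX", "x28"),
    ],
)
completed = insert_default_quantifiers(example)
```

In this toy graph the three nouns and the two conjunction nodes each receive a `udef_q`, and the proper name receives `proper_q`, which matches the description above of why these nodes are so frequent.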
I hope that clarifies things a bit.