USCDataScience/parser-indexer-py

Investigate why jSRE returns no relations for MPF/PHX documents

Closed this issue · 20 comments

wkiri commented

Yuan's extracted feature vectors and predictions are here:
mlia-fn:/home/yzhuang/jsre-examples/Merged/test-goldentity

wkiri commented

Suggestions:

  • Test on MSL 2015 and 2016 docs (in MTE repo) to see if they generate any output

stevenlujpl commented

@wkiri The problem of the jSRE parser not finding relations should now be resolved. I suspect it was caused by the absence of the following two jSRE configuration files from the working directory where parse_all.py is executed.

/proj/mte/jSRE/jsre-1.1/log-config.txt
/proj/mte/jSRE/jsre-1.1/jsre-config.xml

These two files must be present in the working directory the first time jSRE is invoked. After the first time, they are cached for as long as the session on mlia-compute1 stays alive.

I haven't confirmed that this is why parse_all.py failed to find relations in the PHX/MPF docs. I could confirm it for parse_all.py, but I don't think that is necessary (please let me know if you would like me to) since we have moved on to the new parser scripts. The absence of the two jSRE config files from the working directory was definitely the cause of the jsre_parser.py failure.

I've updated jsre_parser.py to copy the two jSRE config files into the working directory, and this approach seems effective. I tested lpsc_parser.py on the 591 MPF docs again, and we now find relations in 28 docs. The output JSONL of the test run is at the following location if you would like to take a look (please grep for "rel": [{ to find the extracted relations).

/home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_lpsc_parser6.jsonl
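For reference, the copy step amounts to something like this minimal sketch, assuming the jSRE root is /proj/mte/jSRE/jsre-1.1 and using the two config file names listed above (the actual code in jsre_parser.py may differ):

import os
import shutil

# Copy the two jSRE configuration files into the current working directory
# before the first jSRE invocation; skip any file that is already present.
JSRE_ROOT = '/proj/mte/jSRE/jsre-1.1'
for config_file in ('log-config.txt', 'jsre-config.xml'):
    src = os.path.join(JSRE_ROOT, config_file)
    dst = os.path.join(os.getcwd(), config_file)
    if not os.path.exists(dst):
        shutil.copy(src, dst)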
wkiri commented

@stevenlujpl Excellent! I am so glad you figured out the issue. I agree that we do not need to investigate parse_all.py further but instead can focus on the new scripts.

wkiri commented

@stevenlujpl Can you share the full command line arguments you used to run jsre_parser.py? I am getting a 422 error from Tika, and it is not clear to me how the Tika server URL gets specified. In jsre-parser-log.txt, I have:

[2021-07-22 09:14:51]: Input parameters
[2021-07-22 09:14:51]: in_file: None
[2021-07-22 09:14:51]: in_list: /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_subset.list
[2021-07-22 09:14:51]: out_file: mpf-jsre-test.jsonl
[2021-07-22 09:14:51]: tika_server_url: None
[2021-07-22 09:14:51]: corenlp_server_url: http://localhost:9000
[2021-07-22 09:14:51]: ner_model: /proj/mte/trained_models/mpf_ner_train_lpsc15n16_emt_gazette.ser.gz
[2021-07-22 09:14:51]: jsre_root: /proj/mte/jSRE/jsre-1.1
[2021-07-22 09:14:51]: jsre_model: /proj/mte/trained_models/jSRE-lpsc15-merged-binary.model
[2021-07-22 09:14:51]: jsre_tmp_dir: /tmp
[2021-07-22 09:14:51]: ads_url: https://api.adsabs.harvard.edu/v1/search/query
[2021-07-22 09:14:51]: ads_token: jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U
[2021-07-22 09:14:51]: JSRE parser failed: /proj/mte/data/corpus-lpsc/mpf-pdf/1999_1736.pdf
[2021-07-22 09:14:51]: 'NoneType' object has no attribute 'keys'
Traceback (most recent call last):
  File "../git/parser-indexer-py/src/parserindexer/jsre_parser.py", line 229, in process
    ads_dict = ads_parser.parse(f)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 172, in parse
    query_str = self.construct_query_string(tika_dict, query_dict)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 92, in construct_query_string
    query_str = AdsParser.construct_title_query_string(tika_dict)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 102, in construct_title_query_string
    if 'grobid:header_Title' in tika_dict['metadata'].keys():
AttributeError: 'NoneType' object has no attribute 'keys'
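For context, the failure here is that tika_dict['metadata'] comes back as None (presumably because the Tika request failed with the 422 error), so the .keys() lookup raises AttributeError. A minimal guard sketch, assuming that structure; the function name follows the traceback, but the body and query format are illustrative only:

def construct_title_query_string(tika_dict):
    # Illustrative guard: treat a missing or None metadata block as
    # "no title available" instead of raising AttributeError.
    metadata = (tika_dict or {}).get('metadata')
    if not metadata or 'grobid:header_Title' not in metadata:
        return None
    # The query format below is a placeholder, not necessarily what
    # ads_parser.py actually constructs.
    return 'title:"%s"' % metadata['grobid:header_Title']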
wkiri commented

To clarify, I know that I can specify a Tika URL, but since it is not required, there should be a default somewhere? It says "None" here so maybe I do need to specify something?

wkiri commented

Well, this is strange. The output above is what I get on mlia-fn. If I run on mlia-compute1, I do not get the Tika error. However, it seems there are multiple matches coming from ADS for the first document, and this causes downstream issues:

[2021-07-22 09:22:25]: Input parameters
[2021-07-22 09:22:25]: in_file: None
[2021-07-22 09:22:25]: in_list: /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_subset.list
[2021-07-22 09:22:25]: out_file: mpf-jsre-test.jsonl
[2021-07-22 09:22:25]: tika_server_url: None
[2021-07-22 09:22:25]: corenlp_server_url: http://localhost:9000
[2021-07-22 09:22:25]: ner_model: /proj/mte/trained_models/mpf_ner_train_lpsc15n16_emt_gazette.ser.gz
[2021-07-22 09:22:25]: jsre_root: /proj/mte/jSRE/jsre-1.1
[2021-07-22 09:22:25]: jsre_model: /proj/mte/trained_models/jSRE-lpsc15-merged-binary.model
[2021-07-22 09:22:25]: jsre_tmp_dir: /tmp
[2021-07-22 09:22:25]: ads_url: https://api.adsabs.harvard.edu/v1/search/query
[2021-07-22 09:22:25]: ads_token: jON4eu4X43ENUI5ugKYc6GZtoywF376KkKXWzV8U
[2021-07-22 09:22:27]: /home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py:153: UserWarning: [Warning] There are multiple documents returned from the ADS database, and we are using the first document.
  warnings.warn('[Warning] There are multiple documents returned '

[2021-07-22 09:22:27]: JSRE parser failed: /proj/mte/data/corpus-lpsc/mpf-pdf/1999_1736.pdf
[2021-07-22 09:22:27]: list indices must be integers, not str
Traceback (most recent call last):
  File "../git/parser-indexer-py/src/parserindexer/jsre_parser.py", line 229, in process
    ads_dict = ads_parser.parse(f)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 178, in parse
    ads_dict = self.query_ads_database(query_str)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 156, in query_ads_database
    warnings.warn(json.dumps(data_docs['title']))
TypeError: list indices must be integers, not str
[2021-07-22 09:22:37]: /home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py:153: UserWarning: [Warning] There are multiple documents returned from the ADS database, and we are using the first document.
  warnings.warn('[Warning] There are multiple documents returned '

[2021-07-22 09:22:37]: JSRE parser failed: /proj/mte/data/corpus-lpsc/mpf-pdf/2006_2191.pdf
[2021-07-22 09:22:37]: list indices must be integers, not str
Traceback (most recent call last):
  File "../git/parser-indexer-py/src/parserindexer/jsre_parser.py", line 229, in process
    ads_dict = ads_parser.parse(f)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 178, in parse
    ads_dict = self.query_ads_database(query_str)
  File "/home/wkiri/Research/MTE/git/parser-indexer-py/src/parserindexer/ads_parser.py", line 156, in query_ads_database
    warnings.warn(json.dumps(data_docs['title']))
TypeError: list indices must be integers, not str
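The traceback points at data_docs['title'], where data_docs appears to be the list of documents returned by ADS (the warning itself says multiple documents were returned and the first one is used), so indexing the list with a string key is the bug. A possible fix, sketched under that assumption; the helper name below is illustrative and the real ads_parser.py code may differ:

import json
import warnings

def pick_first_ads_doc(data_docs):
    # data_docs is assumed to be the list of result dicts from the ADS query.
    # Report the candidate titles from each element rather than indexing the
    # list itself with the string key 'title'.
    if len(data_docs) > 1:
        warnings.warn('[Warning] There are multiple documents returned '
                      'from the ADS database, and we are using the first '
                      'document.')
        warnings.warn(json.dumps([doc.get('title') for doc in data_docs]))
    return data_docs[0]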
wkiri commented

I used /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf_subset.list as the input here assuming it might be a smaller test set (vs. all 591 in lpsc_mpf.list). However, maybe instead these are a "problematic" subset? :) I am now running on lpsc_mpf.list and so far it is going fine. I will let you know the outcome.

stevenlujpl commented

Are you using jsre_parser.py?

wkiri commented

@stevenlujpl Yes, from my local checkout (I wanted to test here before updating in /proj/mte):

$ python ../git/parser-indexer-py/src/parserindexer/jsre_parser.py -li /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf.list -o mpf-jsre-test.jsonl -jr /proj/mte/jSRE/jsre-1.1 -jm /proj/mte/trained_models/jSRE-lpsc15-merged-binary.model -n /proj/mte/trained_models/mpf_ner_train_lpsc15n16_emt_gazette.ser.gz 

stevenlujpl commented

It seems there is a problem with the jsre_parser.py script. It was last tested before I changed the ADS parser to query using year + abstract id + venue, and that change might have broken some code in jsre_parser.py.

For now, can you please use lpsc_parser.py? The arguments should be the same as the ones you used for jsre_parser.py.
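For example, mirroring your jsre_parser.py command above, something like this should be the equivalent invocation (the output file name here is just a placeholder):

$ python ../git/parser-indexer-py/src/parserindexer/lpsc_parser.py -li /home/youlu/MTE/working_dir/mte_parse_journals/verification_test/lpsc_mpf.list -o mpf-lpsc-test.jsonl -jr /proj/mte/jSRE/jsre-1.1 -jm /proj/mte/trained_models/jSRE-lpsc15-merged-binary.model -n /proj/mte/trained_models/mpf_ner_train_lpsc15n16_emt_gazette.ser.gz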

I am not sure about the Tika URL. This is how it was implemented previously; it has been working fine without specifying a URL, so I didn't investigate further to see how it works.

wkiri commented

If I just run lpsc_parser.py then I won't get relations extracted, right? I am trying to test that the "rel" field is populated.

wkiri commented

(I am trying to repeat your above experiment)

stevenlujpl commented

You will. In fact, we should always use lpsc_parser.py or other parsers that are subclasses of paper_parser.

Each parser script contains a class and a function tailored to our specific use case. For example, lpsc_parser.py has a class LpscParser and a function process. The parse function of the LpscParser class is responsible for removing the LPSC header, while the process function invokes multiple parsers for our use case. The process function in lpsc_parser.py actually invokes the tika, ads, paper, lpsc, corenlp, and jsre parsers, so relations will be extracted and saved in the rel field.
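A rough conceptual sketch of that structure (names other than LpscParser and process are illustrative placeholders, not the actual code in src/parserindexer/):

class PaperParser(object):
    # Shared parsing logic for journal/abstract papers.
    def parse(self, text, metadata):
        return {'text': text, 'metadata': metadata}

class LpscParser(PaperParser):
    # Adds the LPSC-specific step of removing the LPSC header.
    def parse(self, text, metadata):
        doc = super(LpscParser, self).parse(text, metadata)
        doc['text'] = self.remove_header(doc['text'])
        return doc

    def remove_header(self, text):
        # Placeholder for the actual LPSC header-removal logic.
        return text

def process(in_file, out_file, **kwargs):
    # Chains the tika, ads, paper, lpsc, corenlp, and jsre parsers so the
    # output JSON lines end up with a populated "rel" field.
    pass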

wkiri commented

Thank you for reminding me of that ordering. Your comment above said "I tested the jsre_parser.py on 591 MPF docs again", so I thought I should try jsre_parser.py here :) Actually, that run just finished (22 mins) and gave me 24 results with relations:

$ grep '"rel": \[{' mpf-jsre-test.jsonl  | wc -l

I will now run with lpsc_parser.py as another test.

stevenlujpl commented

Sorry about the confusion. It is a typo; I meant lpsc_parser.py.

I edited my post above to change "I tested jsre_parser.py" to "I tested lpsc_parser.py".

@wkiri I haven't confirmed it yet, but if the change I made to the ADS parser breaks the process function in the jSRE parser, it most likely breaks the process functions in the other parsers as well, except for lpsc_parser.py. The functionality in the class definitions should be fine. This means that if you directly run a parser other than the LPSC parser, it will most likely fail.

Everything in the LPSC parser should work properly, and I think that is what we need for now. I will fix the problems in the other parsers after I come back from vacation.

wkiri commented

@stevenlujpl Thank you for this update! My run with lpsc_parser.py on the full MPF list completed fine (took 16.5 mins on mlia-compute1) and I get 28 documents with "rel" contents, as you reported. Looking good!

wkiri commented

I will go ahead and close this issue since it is working with lpsc_parser.py. I'll use your comment above to open a new issue that can be addressed after your vacation. Thank you for your help with this!