/gnrd_performance_test

Primary LanguageJupyter NotebookBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

GNFinder Performance Test

This repository shows the analysis and results for a test of GNFinder performance as a tool for finding taxonomic names in text and adding them to metadata.

The code for GNFinder is here.

dryad-paper is a nested repository containing the text files that were read by GNFinder and the csv and json output. The name of the subfolder is a number that matches the data package identifier.

GNRD_performance_text.ipynb is a Jupyter notebook showing how the performance metrics were calculated from the results in dryad-paper.

annotator_results.txt is the annotator results. Tab-delimited file

  • column 'string' contains the name string as found in the text
  • column 'data_package' contains the number identifier of the data package from which the name string was found
  • column 'data_type' either data or pdf. Indicates if the name string was found in a data file or in the published pdf
  • column 'source' either annotator_1 or annotator_2. Indicates which annotator found the name string
  • column 'string_type' contains the 'vernacular' tag

errors.txt is a tab-delimited list of all false positives and false negatives

  • column 'data_package' is the number identifier of the data package where the error originates
  • column 'name' is the name string that is the false positive or false negative
  • column 'error_type' is false positive or false negative

results.tsv is a tab-delimited list of the results per data package publication. This is the output of the Jupyter notebook

  • first column is unnamed index
  • column 'data_package' number identifier for the data package for the publication
  • column 'true_positives' number of true positives
  • column 'false_positives' number of false positives
  • column 'false_negatives' number of false negatives
  • column 'returned_results' total number of results returned by GNFinder
  • column 'annotator_results' total number of results returned by the annotator
  • column 'precision' calculated precision
  • column 'recall' calculated recall
  • column 'F1' calculated F1 Score
  • column 'total_words' total words in the publication

linnaeus_results is a folder that contains output from the LINNAEUS tool. Each file is the output from one text file.

taxonerd_results is a folder that contains output from the TaxoNERD tool. Each file is the output from one text file.

testing is a folder that contains the calculated performance for each tool: GNFinder, LINNAEUS, Quaesitor, and TaxoNERD. Every file has the same structure.

  • column 'data_package' number identifier for the data package for the publication
  • column 'true_positives' number of true positives. The last row is the sum of the column.
  • column 'false_positives' number of false positives. The last row is the sum of the column.
  • column 'false_negatives' number of false negatives. The last row is the sum of the column.
  • column 'returned_results' total number of results returned by GNFinder. The last row is the sum of the column.
  • column 'annotator_results' total number of results returned by the annotator. The last row is the sum of the column.
  • column 'precision' calculated precision. The last row is the precision calculated from the sums of each column.
  • column 'recall' calculated recall. The last row is the recall calculated from the sums of each column.
  • column 'F1' calculated F1 Score. The last row is the F1 Score calculated from the sums of each column.
  • column 'total_words' total words in the publication. The last row is the sum of the column.

all_pubs.txt is a tab delimited list of all the data package identifiers and their corresponding publication DOIs

  • column 'data_package_id' number identifier for the data package
  • column 'publication_doi' DOI for publication in data package

quaesitor_results.txt is a tab delimited file containing the results from the Quaesitor too.

  • column 'data_package' number identifier for the data package for the publication
  • column 'quaesitor_results name returned by Quaesitor for the publication in the data package

testing_pubs.txt is a tab delimited list of the data package identifiers and their corresponding publication DOIs for the data packages that were used to test LINNAEUS, GNFinder, TaxoNERD, and Quaesitor

  • column 'data_package_id' number identifier for the data package
  • column 'publication_doi' DOI for publication in data package