Primary language: Ruby. License: MIT.

gnfinder

A Ruby gem to access the functionality of the GNfinder project, which is written in Go. This gem performs fast and accurate scientific name finding in texts, web pages, and a large variety of documents. Document files can be accessed either locally or via a URL.

Requirements

This gem uses a REST API to access a running GNfinder server. Instructions for running the server are in the GNfinder README file. By default the gem uses https://gnfinder.globalnames.org/api/v0

Installation

gem install gnfinder

Usage

The purpose of this gem is to access GNfinder functionality from Ruby applications. If you need to find names from other languages, use the source code of this gem as a reference. For other uses, read the original Go GNfinder README file.

First, create an instance of a gnfinder client:

require 'gnfinder'

gf = Gnfinder::Client.new

By default the client tries to connect to https://gnfinder.globalnames.org/api/v0. If your server is at another location, use:

require 'gnfinder'

# a GNfinder server at another address (the host name is only an example)
gf = Gnfinder::Client.new('finder.example.org', 80)

# a local server on port 8000
gf = Gnfinder::Client.new('0.0.0.0', 8000)

Finding names in a text using default settings

The format of the returned result is described in the GNfinder API docs.

txt = File.read('utf8-text-with-names.txt')

res = gf.find_names(txt)
puts res.names[0].value
puts res.names[0].odds

Finding names by a URL

If you need to find names in an HTML page or a PDF document available on the Internet, use the find_url method.

url = 'https://en.wikipedia.org/wiki/Monochamus_galloprovincialis'
res = gf.find_url(url)
puts res.names[0].value
puts res.names[0].odds

Finding names in a file

Many different file types are supported (PDF, JPG, TIFF, MS Word, MS Excel, etc.).

path = "/path/to/file.pdf"
res = gf.find_file(path)
puts res.names[0].value

File uploading uses the 'multipart/form-data' approach. Here is an illustration with curl:

curl -v -F sources[]=1 -F sources[]=12 -F file=@file.pdf \
    https://finder.globalnames.org/api/v0/find
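
For readers who want to see the same upload expressed in Ruby, the following is a minimal sketch using only the standard library. It is not how the gem itself performs uploads; the helper name build_upload_request is hypothetical, and the field names and URL are simply copied from the curl illustration above.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not send) a multipart/form-data POST that mirrors the
# curl illustration. `io` is any readable IO object holding the document.
# The helper name is hypothetical and not part of the gnfinder gem.
def build_upload_request(endpoint, io, filename, sources: [])
  request = Net::HTTP::Post.new(URI(endpoint))
  # Repeated `sources[]` fields correspond to curl's `-F sources[]=N` flags.
  fields = sources.map { |id| ['sources[]', id.to_s] }
  # The `file` field corresponds to curl's `-F file=@file.pdf`.
  fields << ['file', io, { filename: filename }]
  request.set_form(fields, 'multipart/form-data')
  request
end

# Usage (performs a real HTTP request, so it is left commented out):
# File.open('file.pdf', 'rb') do |f|
#   req = build_upload_request('https://finder.globalnames.org/api/v0/find',
#                              f, 'file.pdf', sources: [1, 12])
#   uri = req.uri
#   res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
#   puts res.body
# end
```

Building the request separately from sending it makes it easy to inspect the form fields before any network traffic happens.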

The returned result is quite detailed and provides many accessor methods, for example:

  • value: name-string cleaned up for verification.
  • verbatim: name-string as it was found in the text.
  • odds: Bayes' odds value. For example, odds of 0.1 mean that, according to the algorithm, the chances that the name-string is a scientific name are 1 to 10 (a probability of about 9%). This field will be empty if Bayes algorithms did not run.
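
Odds and probability are related by probability = odds / (1 + odds). The sketch below shows that conversion and a simple filter on the odds accessor. The helper names, the Candidate struct, and the sample values are illustrative stand-ins, not part of the gem; the filter assumes an empty odds field is nil.

```ruby
# Converts odds in favour (e.g. 0.1, meaning 1 to 10) into a probability.
def odds_to_probability(odds)
  odds / (1.0 + odds)
end

# Keeps only candidates whose odds reach `min_odds`. Candidates only need
# to respond to `odds`; an empty (nil) odds value is treated as "no match".
def likely_names(names, min_odds: 1.0)
  names.select { |n| n.odds && n.odds >= min_odds }
end

# Stand-in for results returned by the client, for illustration only.
Candidate = Struct.new(:value, :odds)
sample = [
  Candidate.new('Pardosa moesta', 250.0),
  Candidate.new('Bubo', 0.1),
  Candidate.new('Something', nil)
]

puts likely_names(sample).map(&:value).inspect # prints ["Pardosa moesta"]
puts odds_to_probability(0.1).round(3)         # prints 0.091
```

The threshold of 1.0 (even odds) is an arbitrary choice for the example; pick a value that suits the tolerance for false positives in your texts.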

Optionally disable Bayes search

Texts in some languages that are close to Latin (Italian, French, Portuguese) can generate too many false positives. To decrease the number of false positives you can disable the Bayes algorithm by running:

names = gf.find_names(txt, no_bayes: true).names

Set a language for the text

It is possible to set the prevalent language of a text by hand. That might help Bayes algorithms work better.

The list of supported languages will grow with time.

res = gf.find_names(txt, language: 'eng')
puts res.language
res = gf.find_names(txt, language: 'deu')
puts res.language

# The setting is ignored if the language string is not known by gnfinder.
# Only 3-character ISO 639-2 codes are supported.
res = gf.find_names(txt, language: 'rus')
puts res.language

Set automatic detection of text's language

To enable automatic detection of the prevalent language of a text use:

res = gf.find_names(txt, detect_language: true)
puts res.language
puts res.detect_language
puts res.language_detected

If the detected language is not yet supported by the Bayes algorithm, the default language (English) will be used.
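
One way to notice this fallback is to compare the two accessors on the result. The sketch below assumes that language and language_detected hold comparable language codes (such as 'eng'), which this README does not state explicitly. The helper name and the FinderResult struct are hypothetical stand-ins for a real result object.

```ruby
# Returns true when a language was detected but a different one was used,
# which suggests the default language was applied instead.
# Assumes both accessors return comparable codes such as 'eng'.
def language_fallback?(result)
  detected = result.language_detected.to_s
  !detected.empty? && detected != result.language.to_s
end

# Stand-in for a result returned by the client, for illustration only.
FinderResult = Struct.new(:language, :language_detected)

puts language_fallback?(FinderResult.new('eng', 'eng')) # prints false
puts language_fallback?(FinderResult.new('eng', 'fin')) # prints true
```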

Set verification option

If found names need to be validated against a large collection of name-strings, use the verification option. For each name, the algorithm returns the following information:

  • match type:
    • NONE: name-string is unknown.
    • EXACT: name-string matched exactly.
    • CANONICAL_EXACT: canonical form of a name-string matched exactly.
    • CANONICAL_FUZZY: fuzzy match of a canonical string.
    • PARTIAL_EXACT: only part of a name matched. For example, only the genus of a species was found.
    • PARTIAL_FUZZY: fuzzy match of a partial result. For example, the canonical form of a trinomial matched on a species level with some corrections.
  • source_id: ID of the data-source of the best matched result. Data source IDs can be compared with the data-source list.
  • curated: true if the name-string was found in data-sources that are deemed to be curated by humans.
  • path: the classification path of a matched name (if available).

res = gf.find_names(txt, verification: true)
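
When post-processing verified names it can help to rank the match types listed above. The following is a sketch only: the strongest-to-weakest ordering and the helper names are assumptions made for this example, and the README does not specify how the match type is exposed on a result, so the helpers take the type as a plain string or symbol.

```ruby
# Match types from the list above, ordered from strongest to weakest.
# The ordering is an assumption made for this sketch.
MATCH_TYPES = %w[
  EXACT CANONICAL_EXACT CANONICAL_FUZZY PARTIAL_EXACT PARTIAL_FUZZY NONE
].freeze

# Lower rank means a stronger match; unknown values rank like NONE.
def match_rank(match_type)
  MATCH_TYPES.index(match_type.to_s) || MATCH_TYPES.index('NONE')
end

# True when the whole name matched, exactly or fuzzily.
def full_name_match?(match_type)
  %w[EXACT CANONICAL_EXACT CANONICAL_FUZZY].include?(match_type.to_s)
end

types = %w[PARTIAL_FUZZY EXACT NONE CANONICAL_FUZZY]
puts types.sort_by { |t| match_rank(t) }.inspect
puts types.select { |t| full_name_match?(t) }.inspect
```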

Set preferred data-sources list

Sometimes it is important to know whether a name was found in one or more particular data-sources. There is a parameter that takes IDs from the data-source list. If a name-string is found in these data-sources, the match results are returned.

res = gf.find_names(txt, verification: true, sources: [1, 4, 179])

Combination of parameters

It is possible to combine parameters. However, if a parameter makes no sense in a particular context, it is silently ignored.

# Runs Bayes' algorithms using English training set, runs verification and
# returns matched results for 3 data-sources if they are available.
res = gf.find_names(txt, language: 'eng', verification: true,
                           sources: [1, 4, 179])

# Ignores the `sources:` setting, because `verification` is not set to `true`
res = gf.find_names(txt, language: 'eng', sources: [1, 4, 179])
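
Because ignored parameters produce no warning, it can be useful to see up front which options will actually take effect. The sketch below models only the one rule documented above (sources requires verification). The helper name effective_options is hypothetical, and this is not the gem's own code.

```ruby
# Returns a copy of the options with `:sources` dropped when verification
# is not enabled, mirroring the "silently ignored" rule described above.
def effective_options(opts)
  effective = opts.dup
  effective.delete(:sources) unless effective[:verification] == true
  effective
end

with_verification = { language: 'eng', verification: true, sources: [1, 4, 179] }
without_verification = { language: 'eng', sources: [1, 4, 179] }

puts effective_options(with_verification).keys.inspect    # prints [:language, :verification, :sources]
puts effective_options(without_verification).keys.inspect # prints [:language]
```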

Development

If you get an error, you might need to set a GOPATH environment variable.

After starting the server with the default host and port (localhost:8778), you can run the tests for this Ruby client with:

bundle exec rake

To run the rubocop check:

bundle exec rake rubocop

To run the tests without rubocop:

bundle exec rspec