TeamHeka/medkit

Add IAMsystemMatcher (NER)

Closed this issue · 11 comments

I would like to add support for IAMsystem, an alternative to the QuickUMLS matcher already used in medkit.

In my opinion, the easiest solution is to deploy an IAMsystem server in a Docker container with a configuration file (dictionary file, approximate string matching algorithms...) and to have medkit communicate with the Java server over a socket: medkit sends the text to annotate, then parses the JSON response.

olvb commented

How would it compare to using HTTP? I have close to zero experience with socket programming, but won't we end up reimplementing things like the handling of variable-length messages, which we would get for free with HTTP client and server libs?

Hi @scossin, it would be really great to have IAMsystem in medkit! What about a solution such as py4j to instantiate it directly in Python?

@olvb
We could use HTTP, but it may be overkill for our needs; the application protocol can be simpler.
For example, Mgrep, the SIFR BioPortal annotator, provides a socket API: a TCP/IP socket connection with an ad hoc application protocol to transmit the text to annotate. To handle variable-length messages, the client (Python) can transmit the content length (the document size in bytes) in the header. With this approach, Mgrep could be integrated in medkit the same way as IAMsystem.
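
To make the content-length idea concrete, here is a minimal sketch of length-prefixed framing over a TCP socket. It is not Mgrep's or IAMsystem's actual protocol: the 4-byte big-endian header and the function names `send_message` / `recv_message` are assumptions chosen for illustration.

```python
import socket
import struct

# Assumed header layout: a 4-byte unsigned big-endian integer holding the
# payload size in bytes. The real Mgrep/IAMsystem protocols may differ.
HEADER = struct.Struct("!I")


def send_message(sock: socket.socket, text: str) -> None:
    """Send one UTF-8 document, prefixed by its size in bytes."""
    payload = text.encode("utf-8")
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exactly(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_message(sock: socket.socket) -> str:
    """Read the length header, then exactly that many payload bytes."""
    (length,) = HEADER.unpack(_recv_exactly(sock, HEADER.size))
    return _recv_exactly(sock, length).decode("utf-8")
```

The same two functions would work in both directions, so the server could return its JSON response with the same framing. This is roughly the part that an HTTP library would otherwise handle through the `Content-Length` header.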

@aneuraz
I prefer the network approach over sharing objects in memory. One advantage is that the annotator can be located on another machine. In my opinion, it would also be easier for the user to deploy medkit (no Java configuration nor py4j dependency) and easier for the developer to maintain: I made an R wrapper with rJava, but in retrospect I think it was a mistake, because there are a lot of classes and methods called from R that need to be updated whenever IAMsystem is updated. With the network approach, the client is completely agnostic, and as long as the annotator output doesn't change there is no need to update the client.

We can discuss the pros and cons of each method; the final choice of how to do it will be yours, of course :-)

I see, @scossin. The reason I suggested that is the execution of pipelines in restricted environments. For example, depending on the platform used, it might be difficult to open a socket or TCP/IP connection. In that case, executing the whole pipeline on the same machine becomes an advantage.
I understand that it represents more work, so let us start with your proposal.
Would it be a lot of work to reimplement everything in Python?

Thanks @scossin for the proposal! If you want to encapsulate IAMsystem in a Docker container, why not just use Docker entrypoints? @aneuraz: what do you think about this option? If we use a Docker container as an executable, is that OK for restricted environments? By the way, I also prefer the option of reimplementing everything in Python if possible.
However, it may be interesting to integrate a 'docker-like' annotator if we would like to reuse some existing containerized annotators.

@aneuraz It would be feasible to deploy the Docker container on localhost (TCP/IP connections to 127.0.0.1 are not affected by a router firewall) and so execute the pipeline on a single machine, but it would be harder on a Windows machine/server, where the Docker dependency has to be managed. I agree that reimplementing the algorithm in Python would be the best option, although time-consuming.
@khuynh11 At initialization, the algorithm loads the dictionary into RAM, which takes a few seconds for a large thesaurus. It also uses a cache mechanism. Passing the text to annotate as a parameter of an executable Docker container would therefore not be the fastest option, since each call would pay the loading cost again and start with an empty cache.
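
As a rough illustration of why a long-lived process helps, here is a toy sketch in which the dictionary is indexed once at construction and lookups are cached across calls. `ToyAnnotator` is a hypothetical stand-in, not IAMsystem's API or algorithm; it only does exact single-token matching.

```python
import functools
from typing import Dict, List, Tuple


class ToyAnnotator:
    """Hypothetical dictionary-based annotator (not IAMsystem's API)."""

    def __init__(self, terms: Dict[str, str]) -> None:
        # Expensive step, done once per process: index the terminology in memory.
        self._index = {label.lower(): code for label, code in terms.items()}
        # Per-instance cache, kept alive between calls to annotate().
        self._lookup = functools.lru_cache(maxsize=10_000)(self._index.get)

    def annotate(self, text: str) -> List[Tuple[str, str]]:
        """Return (token, code) pairs for tokens found in the dictionary."""
        found = []
        for token in text.lower().split():
            code = self._lookup(token)
            if code is not None:
                found.append((token, code))
        return found
```

A server would build one such object at startup and reuse it for every request, whereas a container run as a one-shot executable would rebuild the index and lose the cache on each invocation.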

I've released a Python implementation of the IAMsystem algorithm: https://github.com/scossin/iamsystem_python
Check it out, and if you think it's worthwhile to integrate it in medkit, I am available to help.

Wow, nice work! It would definitely be interesting to have a medkit module!

coulet commented

@scossin: thanks a lot! We'll come back to you soon to find the best way to integrate your library in medkit.

@scossin: sorry for the delay, it is now integrated in the develop branch. It will be in the next release.