Datashare's user guide can be found here: https://icij.gitbook.io/datashare/
Datashare is a free, open-source desktop application developed by the non-profit International Consortium of Investigative Journalists (ICIJ).
Datashare allows investigative journalists to:
- access all their documents in one place, locally on their computer, while securing them from potential third-party interference
- search PDFs, images, texts, spreadsheets, slides and any other files simultaneously
- automatically detect and filter by people, organizations and locations
You're welcome to suggest translations on Datashare's Crowdin https://crwd.in/datashare. Please contact us if you would like to add a language.
You can download the script at datashare.icij.org.
To access the web GUI, go to your documents folder and launch path/to/datashare.sh,
then open http://localhost:8080 in your browser.
You can also use the Datashare Docker container solely for its name-finding API, which is exposed over HTTP.
Just run:
docker run -ti -p 8080:8080 -v /path/to/dist/:/home/datashare/dist icij/datashare:0.10 -m NER
A bit of explanation:
- -w tells Datashare to run the web server. It listens on port 8080, which is why that port is mapped for Docker.
- -m NER runs Datashare in stateless mode, without any index.
- -v /path/to/dist:/home/datashare/dist maps the directory from which the NLP models are read (and into which they are downloaded if they don't exist).
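The command above can be wrapped in a small shell function so the models directory is passed as an argument. This is only a sketch: the function name run_ner_container is illustrative and not part of Datashare, and the image tag and flags are copied unchanged from the command above. The absolute-path check is there because docker treats a non-absolute -v source as a named volume rather than a host directory.

```shell
# Start the name-finding container with the given host directory mounted
# as the NLP models directory.
# Usage: run_ner_container /absolute/path/to/dist
run_ner_container() {
  case "$1" in
    /*) ;;  # absolute host path: docker will bind-mount it
    *)  echo "models directory must be an absolute path: $1" >&2
        return 1 ;;
  esac
  # Same image, port mapping and flags as the documented command.
  docker run -ti -p 8080:8080 -v "$1:/home/datashare/dist" icij/datashare:0.10 -m NER
}
```

For example, run_ner_container /path/to/dist starts the container with that directory mounted.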
Then query the server with curl:
curl -i localhost:8080/ner/findNames/CORENLP --data-binary @path/to/a/file.txt
The last path part (CORENLP) is the framework. You can choose it among CORENLP, IXAPIPE, MITIE or OPENNLP.
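The request above can be wrapped in a shell helper that rejects unknown frameworks before contacting the server. This is a sketch under stated assumptions: the names ner_url, find_names and DATASHARE_URL are illustrative and not part of Datashare, while the endpoint path and the four framework names come from the curl example above.

```shell
# Base URL of the running Datashare container (assumed default; override if needed).
DATASHARE_URL="${DATASHARE_URL:-http://localhost:8080}"

# Print the name-finding endpoint for a framework, or fail for an unknown one.
ner_url() {
  case "$1" in
    CORENLP|IXAPIPE|MITIE|OPENNLP)
      printf '%s/ner/findNames/%s\n' "$DATASHARE_URL" "$1" ;;
    *)
      echo "unknown framework: $1 (expected CORENLP, IXAPIPE, MITIE or OPENNLP)" >&2
      return 1 ;;
  esac
}

# Send a text file to the server.
# Usage: find_names FRAMEWORK FILE
find_names() {
  url=$(ner_url "$1") || return 1
  [ -r "$2" ] || { echo "cannot read file: $2" >&2; return 1; }
  curl -i "$url" --data-binary "@$2"
}
```

For example, find_names CORENLP path/to/a/file.txt sends the same request as the curl command above.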
Implementations
- TikaDocument from ICIJ/extract: Apache Tika v1.18 (Apache Licence v2.0) with Tesseract v4.0 alpha
Support
Implementations
- org.icij.datashare.text.nlp.corenlp.CorenlpPipeline: Stanford CoreNLP v3.8.0 (Conditional Random Fields), Composite GPL v3+
- org.icij.datashare.text.nlp.ixapipe.IxapipePipeline: Ixa Pipes Nerc v1.6.1 (Perceptron), Apache Licence v2.0
- org.icij.datashare.text.nlp.mitie.MitiePipeline: MIT Information Extraction v0.8 (Structural Support Vector Machines), Boost Software License v1.0
- org.icij.datashare.text.nlp.opennlp.OpennlpPipeline: Apache OpenNLP v1.6.0 (Maximum Entropy), Apache Licence v2.0
Natural Language Processing Stages Support
| NlpStage |
|---|
| TOKEN |
| SENTENCE |
| POS |
| NER |
Named Entity Recognition Language Support
| NlpStage.NER | ENGLISH | SPANISH | GERMAN | FRENCH | CHINESE |
|---|---|---|---|---|---|
| NlpPipeline.Type.CORENLP | X | X | X | (w/ EN) | X |
| NlpPipeline.Type.OPENNLP | X | X | - | X | - |
| NlpPipeline.Type.IXAPIPE | X | X | X | - | - |
| NlpPipeline.Type.MITIE | X | X | X | - | - |
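The support table above can be turned into a quick lookup when scripting against the API, for instance to pick a framework for a given language. This is a sketch only: the function name ner_pipelines_for is illustrative, and the mapping is transcribed by hand from the table rather than queried from Datashare.

```shell
# Print the pipelines marked as supporting NER for a language in the table above.
ner_pipelines_for() {
  case "$1" in
    ENGLISH|SPANISH) echo "CORENLP OPENNLP IXAPIPE MITIE" ;;
    GERMAN)          echo "CORENLP IXAPIPE MITIE" ;;
    FRENCH)          echo "CORENLP OPENNLP" ;;  # CORENLP is listed as "(w/ EN)" for French
    CHINESE)         echo "CORENLP" ;;
    *)               echo "language not in the table: $1" >&2
                     return 1 ;;
  esac
}
```

For example, ner_pipelines_for GERMAN prints the three pipelines marked X in the GERMAN column.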
Named Entity Categories Support
| NamedEntity.Category |
|---|
| ORGANIZATION |
| PERSON |
| LOCATION |
Parts-of-Speech Language Support
| NlpStage.POS | ENGLISH | SPANISH | GERMAN | FRENCH |
|---|---|---|---|---|
| NlpPipeline.Type.CORENLP | X | X | X | X |
| NlpPipeline.Type.OPENNLP | X | X | X | X |
| NlpPipeline.Type.IXAPIPE | X | X | X | X |
| NlpPipeline.Type.MITIE | - | - | - | - |
Implementations
- org.icij.datashare.text.indexing.elasticsearch.ElasticsearchIndexer: Elasticsearch v6.1.0, Apache Licence v2.0
Requires JDK 8, Maven 3 and a running PostgreSQL database (hostname postgresql) with two databases, datashare and test, with write access for user test / password test. You will also need a running Elasticsearch instance with elasticsearch as hostname, and a Redis server named redis as well.
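Before building, it can help to confirm that the command-line tools are installed. The sketch below only checks that commands are on the PATH; it does not verify versions or that the PostgreSQL, Elasticsearch and Redis services are reachable. The function name missing_tools is illustrative and not part of the project.

```shell
# Print each named command that is not on PATH.
# Returns 0 only if every command was found.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}
```

For example, missing_tools java mvn prints nothing and succeeds when both tools are available.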
mvn validate
mvn -pl datashare-api -am install
mvn -pl datashare-db liquibase:update
mvn test
Datashare is released under the GNU Affero General Public License.
We welcome feedback as well as contributions!
For any bug, question, comment or (pull) request,
please contact us at datashare@icij.org