opensemanticsearch/open-semantic-etl
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition) and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database
Python, GPL-3.0
Issues
Improve error message in case of Solr error
#161 opened by horde3d - 0
Neo4j crashed during import
#137 opened by NetwarSystem - 7
Ability to throttle overall ETL process?
#108 opened by NetwarSystem - 0
Flower not installed in team OVA
#113 opened by NetwarSystem - 3
Handling Maltego files correctly
#109 opened by NetwarSystem - 3
Extract amounts of money
#156 opened by opensemanticsearch - 0
Management of RDF properties / URIs
#155 opened by opensemanticsearch - 1
Unittest test_warc (test_enhance_warc.Test_enhance_warc) fails due to bug in pysolr
#154 opened by opensemanticsearch - 0
Solr exporter: Raise exception
#153 opened by Mandalka - 2
Docker container fails with AttributeError: 'Celery' object has no attribute 'worker_main'
#145 opened by itg-dave - 5
Upgrade to Tika 2.x
#142 opened by opensemanticsearch - 2
enhance_extract_text_tika_server.py fails unless headers=headers commented out
#150 opened by jgillum - 0
Remove scantailor
#151 opened by opensemanticsearch - 0
Setting Stemmer for unlisted languages
#139 opened by deeplearning101 - 0
etl_error_txt: Could not find stanford-ner.jar jar file at /usr/share/java/stanford-ner/stanford-ner.jar
#111 opened by aiscom - 0
Document Crawl have not changed for days
#134 opened by movanet - 0
Docker build of ETL image fails
#132 opened by soma-kurisu - 0
export_neo4j: cannot assign requested address on fresh docker-compose build and deploy
#131 opened by sharkymcdongles - 1
Performance bottleneck in solr?
#130 opened by dbsanfte - 1
Disable Tika-Python logs
#128 opened by Mandalka - 0
How can I download modules
#127 opened by vmsv - 1
Disable OCR / tesseract
#125 opened by nevermind2001 - 0
TSV to contenttype group spreadsheet
#126 opened by opensemanticsearch - 1
Extract law codes
#123 opened by opensemanticsearch - 0
Law code subreferences / taxonomy
#124 opened by opensemanticsearch - 1
Later stages configured: Add them if running opensemanticsearch-index-file
#122 opened by opensemanticsearch - 2
OCR of embedded images by Tika-Server
#121 opened by opensemanticsearch - 1
Upgrade Tika-Python
#120 opened by opensemanticsearch - 1
Dedupe content type group
#118 opened by opensemanticsearch - 2
If OCR in later stage, only start (re)process with enabled OCR, if (embedded) images
#116 opened by opensemanticsearch - 0
Adding apache manifoldcf to etl
#115 opened by kichenin - 1
ETL runtime stats more granular
#114 opened by opensemanticsearch - 11
Plugin core class
#112 opened by opensemanticsearch - 2
Maximum concurrent Tesseract jobs?
#110 opened by NetwarSystem - 1
Twitter import: Add linked websites to indexing queue only, if yet not in index
#102 opened by opensemanticsearch - 4
Indexing file system visible via nginx?
#107 opened by NetwarSystem - 1
failed tasks while.. on my python script.
#106 opened by hpiedcoq - 1
Option to disable automatic reindexing if configured new/additional plugin
#105 opened by opensemanticsearch - 5
No additional ETL errors by following plugins, if main plugin failed
#104 opened by opensemanticsearch - 0
Twitter import: Date filter
#103 opened by opensemanticsearch - 2
Twitter scraper
#101 opened by opensemanticsearch - 6
Make hosts of microservices / REST-APIs configurable by environment variables
#100 opened by opensemanticsearch - 1