Minethat is a new kind of ETL dedicated to text mining.
This project has been discontinued, but feel free to contact me if you wish to use some of the code! To get details about the project, see PIVOT.md
It contains a web server (in Node.js) and some background services in Java (data updates, mining services, and so on).
To get started, run make in your terminal:
make
To run the whole Minethat system, clone two other repositories, web and corpora, into the main directory (this repo).
/conf # Configuration
/datasets # Datasets used by java services
/java-apps # Java services
/logs # All logs for all processes and apps
/utils # Libraries
- MongoDB
- RabbitMQ
node web/src/server/aggregator.js
- corpora/corpora -r datasets/corpora
- java-apps/dist/bin/mail_service
- java-apps/dist/bin/extractor_service
- java-apps/dist/bin/miner_service
node web/src/server/index.js
Uses gulp.js for build:
$ cd web-server
$ gulp
Use gulp watch while working to refresh files.
This module serves:
- Homepage (/)
- Web application (/app)
- Blog (/blog)
- Developer center (/developers)
- Private documentation (/private)
Root folder "static" is automatically generated from "src/static"
How to add a new user/password to users.htpasswd:
npm install -g htpasswd
htpasswd -bc users.htpasswd user pass
- Logging: log4j 2
- Testing/code quality:
  - JUnit
  - FindBugs, PMD, Checkstyle, Cobertura
  - In-IDE code analysis: IntelliJ IDEA code analyzer
- NLP: Apache OpenNLP, Stanford POS Tagger & NER
- Text extraction: Apache Tika, PDFBox
- HTML parsing: Jsoup
- MaxMind GeoIP 2
- OpenRDF
MailInputService and the web server each generate Jobs, save them in MongoDB to obtain an ID, and submit that ID to the queue of the input service, which runs the job and processes each document in it.
    MailInputService >
                       > ExtractorService > MinerService
    Web-server       >
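
The job flow described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (JobPipelineSketch, JobStore, submit, drain) are hypothetical and not taken from java-apps, an in-memory map stands in for MongoDB, and a plain Queue stands in for RabbitMQ. The point it shows is that only the job ID travels through the queue, while the job itself stays in the store.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: in-memory stand-ins replace MongoDB and RabbitMQ.
class JobPipelineSketch {

    // A job groups the documents submitted together (field names are assumptions).
    static final class Job {
        final String id;
        final List<String> documents;

        Job(String id, List<String> documents) {
            this.id = id;
            this.documents = documents;
        }
    }

    // Stand-in for the MongoDB collection: saving a job returns its generated ID.
    static final class JobStore {
        private final Map<String, Job> jobs = new HashMap<>();
        private int nextId = 1;

        String save(List<String> documents) {
            String id = "job-" + nextId++;
            jobs.put(id, new Job(id, documents));
            return id;
        }

        Job find(String id) {
            return jobs.get(id);
        }
    }

    // Producer side (mail input or web server): persist first, then queue only the ID.
    static String submit(JobStore store, Queue<String> queue, List<String> documents) {
        String id = store.save(documents);
        queue.add(id);
        return id;
    }

    // Consumer side: take IDs off the queue, load each job, and visit each document.
    static List<String> drain(JobStore store, Queue<String> queue) {
        List<String> processed = new ArrayList<>();
        String id;
        while ((id = queue.poll()) != null) {
            Job job = store.find(id);
            if (job == null) {
                continue; // unknown ID: nothing to process
            }
            for (String document : job.documents) {
                processed.add(id + ":" + document);
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        JobStore store = new JobStore();
        Queue<String> queue = new ArrayDeque<>();
        submit(store, queue, Arrays.asList("report.pdf", "notes.md"));
        submit(store, queue, Arrays.asList("mail.txt"));
        System.out.println(drain(store, queue));
    }
}
```

Queuing the ID rather than the whole job keeps queue messages small and leaves the store as the single source of truth for job contents.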
These tasks are to be done before first deployment:
- Configure log4j file appenders so they target log files
- Configure log4j mongodb appender
- Configure tracer file appenders + mongodb appenders
- Configure MongoDB replication (x2)
- Bug UI
- Builds UI
- Customer history UI
- SSL
- OAuth
We believe that text mining should be simple and accessible.
Here are a few examples of use:
- Rate website comments, business proposals
- Annotate your content
- Business cards
- Network of people
- Email footer
- Blog
- Reddit (r/linguistics/, r/MachineLearning/, r/LanguageTechnology/, r/compsci/, r/statistics/, r/opendata, r/startups)
Just drag a text file (PDF, Word, Markdown...) and wait for the result. Are you a developer? We have some APIs for you.
We use best-in-class open source solutions in a modular way, letting you select which mining operations you want to run on your texts. Once submitted, your text is streamed across dozens of processors that analyse and annotate it.
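
As a rough illustration of this modular approach, the sketch below lets a caller pick which operations run on a text. Everything in it is an assumption made for the example: the class name ProcessorChainSketch, the processor names "tokens" and "characters", and the trivial counting logic, which stands in for real mining operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of selectable processors; not the actual Minethat services.
class ProcessorChainSketch {

    // Registry of available operations, keyed by an illustrative name.
    static final Map<String, Function<String, String>> PROCESSORS = new LinkedHashMap<>();

    static {
        // Counts whitespace-separated tokens.
        PROCESSORS.put("tokens", text -> "tokens=" + text.trim().split("\\s+").length);
        // Counts characters, including spaces.
        PROCESSORS.put("characters", text -> "characters=" + text.length());
    }

    // Runs only the selected processors, in the order requested, and collects their annotations.
    static List<String> annotate(String text, List<String> selected) {
        List<String> annotations = new ArrayList<>();
        for (String name : selected) {
            Function<String, String> processor = PROCESSORS.get(name);
            if (processor == null) {
                throw new IllegalArgumentException("Unknown processor: " + name);
            }
            annotations.add(processor.apply(text));
        }
        return annotations;
    }

    public static void main(String[] args) {
        System.out.println(annotate("Minethat is simple", Arrays.asList("tokens", "characters")));
    }
}
```

Keeping each processor behind the same small interface is what makes the selection possible: adding an operation means registering one more entry, without touching the loop that runs them.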
Minethat relies on several tools.
The core text mining services, the heart of the Minethat offering, run as Java services. The main reason is the large number of open source and licensed Java APIs dedicated to various text mining tasks. All code lives in java-apps (IntelliJ IDEA project included).
All APIs are exposed through Node.js servers. Node's non-blocking I/O model handles many concurrent requests efficiently.
- Design to scale
- Code grammar nazi
- Save time and money
- Gain knowledge
- Improve your writing
- Text annotation
- Sentiment analysis
- Trend discovery
- Document encryption
- SDKs: Java, Node.js, Python
- 3 APIs (Mail, REST, web) + Chrome Extension
- We are Minethat, a company that aims to help everyone better use and understand textual content.
- With our tool, customers benefit from actionable metrics (quality, statistics, annotations).
- Simplicity.
- Pricing.
By subscribing to a Business plan, you automatically get access to premium support.
Premium support includes:
- Email tickets answered within 12 hours, 24/7
Startup and business owners
| | Basic | Startup | Business |
|---|---|---|---|
| Documents/month | 10 | 1000 | Unlimited |
| Web app submission | x | x | x |
| Email submission | | x | x |
| API submission | | x | x |
| Premium support | | | x |
| Initial training | | | x |
| Price | Free | $49/month | $499/month |
Right now we fully support English and French. We are working hard to add Chinese and Japanese soon, as well as German and Spanish.
Three ways:
- manually through our web application (app.minethat.com)
- programmatically using our REST API
- or just send us your text by email, we'll send you back the result in minutes
When you submit a document, a Job is automatically created and queued in our stream processing infrastructure. The document then goes through different kinds of processors that split the text into simple, analyzable sequences of tokens. Once all processors are done, the job is
Definitely yes. We use corpora based on content from Wikipedia, Google, New York Times, and more. You can also submit your own corpora for your custom classification process.
For Enterprise plans, we ensure that our infrastructure has an availability rate of 99.90%.