A system that estimates the relevance of arbitrary content to a set of learned categories.
The system scores an unseen document by its content (and potentially other attributes), based on its contextual similarity to the documents it has already seen. It also includes a score tuning mechanism, so that a search engine can use the documents' relevance scores directly: it separates relevant from irrelevant results with a single fixed threshold and still gets close to optimal performance.
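The single-threshold filtering described above can be sketched as follows. This is only an illustration: the threshold value, the function name `relevant_docs` and the sample scores are assumptions made for the example, not values or code taken from the project.

```python
from typing import Dict, List

# Hypothetical threshold; in practice it would come from the system's score tuning.
THRESHOLD = 0.5


def relevant_docs(scores_by_doc: Dict[str, float],
                  threshold: float = THRESHOLD) -> List[str]:
    """Return ids of documents whose score for one category reaches the
    threshold, ordered from the highest score to the lowest."""
    kept = [(doc_id, score) for doc_id, score in scores_by_doc.items()
            if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in kept]


# Made-up scores of three documents for a single category.
scores = {"DOC_1": 0.91, "DOC_2": 0.12, "DOC_3": 0.57}
print(relevant_docs(scores))  # ['DOC_1', 'DOC_3']
```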
The system will be integrated into RH content search services using the DCP content indexing tool.
It might also be extended into a smart content-based recommender system for web portals that have a sufficient amount of training documents (regardless of their categorization).
The project currently contains two main components:
- A deployable search service that provides an intuitive REST API for scoring arbitrary content against the trained categories:
  Request:

      {
        "sys_meta": false,
        "doc": {
          "id": "DOC_123",
          "title": "One smart doc",
          "content": "This is one dull piece of text."
        }
      }

  Response:

      {
        "scoring": {
          "softwarecollections": 0.000060932611962771777,
          "brms": 0.00080337037910394038,
          "bpmsuite": 0.00026477703963384558,
          "...": "..."
        }
      }

- A content downloader that provides tools for convenient bulk download of the indexed content (from DCP and access.redhat), categorized by Red Hat product.
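As a rough illustration of the request and response shapes shown for the search service, the sketch below builds a request body and ranks the categories of a response. The helper names `build_request` and `ranked_categories` are hypothetical, the meaning of `sys_meta` is not described here so it is passed through unchanged, and no endpoint URL is assumed: sending the body over HTTP is left out.

```python
import json
from typing import List, Tuple


def build_request(doc_id: str, title: str, content: str,
                  sys_meta: bool = False) -> str:
    """Serialize a scoring request in the shape shown above."""
    payload = {
        "sys_meta": sys_meta,
        "doc": {"id": doc_id, "title": title, "content": content},
    }
    return json.dumps(payload)


def ranked_categories(response_body: str) -> List[Tuple[str, float]]:
    """Parse a scoring response and return (category, score) pairs,
    highest score first. Non-numeric entries are skipped."""
    scoring = json.loads(response_body)["scoring"]
    numeric = {name: value for name, value in scoring.items()
               if isinstance(value, (int, float))}
    return sorted(numeric.items(), key=lambda item: item[1], reverse=True)


# The sample response from above, without the elided entries.
sample_response = json.dumps({
    "scoring": {
        "softwarecollections": 0.000060932611962771777,
        "brms": 0.00080337037910394038,
        "bpmsuite": 0.00026477703963384558,
    }
})

body = build_request("DOC_123", "One smart doc",
                     "This is one dull piece of text.")
for category, score in ranked_categories(sample_response):
    print(category, score)
```

In the sample response, `brms` ranks first, followed by `bpmsuite` and `softwarecollections`.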
In addition, the project contains an analytical part that drove the selection of the classifier and the configuration of the system parameters.
The architecture and the technologies used are briefly introduced in an overview presentation and a slightly more technical presentation.
If you are interested in the technical background of the project, see the technical documentation of the system.
Further evaluations of the current system, using some more involved metrics, are summarized in the most recent analysis.
The overall progress and objectives of the project are tracked here.