Allow topics to override primary category
mnahkies opened this issue · 2 comments
User Story
As a tool developer, I'd like to be able to override the category classification given to my tool. Specifically I'd like https://github.com/mnahkies/openapi-code-generator to be labelled as a "Code Generator" rather than a "Parser"
Context
Currently the category is assigned using https://www.npmjs.com/package/bayes which essentially compares the frequency of tokens in a provided text against the frequency of tokens in already classified text to assign a class.
However, because the current category/class distributions are pretty uneven (>30% are assigned to "Parsers") it seems to have ended up overly biasing assignment to "Parsers". For example, Redoc is assigned "User Interfaces" and "Parsers", but not "Documentation"
And these are all assigned to "Parsers" as well:
- OpenAPI Server Code Generator (oapi-codegen)
- OpenAPI Mocker
- docs
- php-openapi-faker
- ...
Rather than "Code Generator" / "Mock" / "Documentation" / "Testing Tools"
I'm not sure if this is inherent to the classification approach / problem space (e.g. the written language used for different types of tool may lack enough distinguishing tokens to give a good signal), or a self-reinforcing feedback loop from the existing classifications, but either way I think it would be good to have a way to override this behavior.
I'm hopeful that introducing this would, over time, improve the accuracy of the classification using bayes as a result of the accurate manually labelled data.
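To illustrate the suspected bias, here is a minimal sketch of a multinomial naive Bayes classifier with Laplace smoothing. It is a toy re-implementation written for this issue, not the `bayes` package's actual code, and the training texts and the `skewed` / `manualLabels` names are invented for the example.

```typescript
type Example = { text: string; category: string };

// Lowercase and split on anything that is not a letter or digit.
const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Count documents per category, tokens per category, and the shared vocabulary.
function train(examples: Example[]) {
  const docCounts = new Map<string, number>();
  const tokenCounts = new Map<string, Map<string, number>>();
  const vocabulary = new Set<string>();
  for (const { text, category } of examples) {
    docCounts.set(category, (docCounts.get(category) ?? 0) + 1);
    const counts = tokenCounts.get(category) ?? new Map<string, number>();
    for (const token of tokenize(text)) {
      counts.set(token, (counts.get(token) ?? 0) + 1);
      vocabulary.add(token);
    }
    tokenCounts.set(category, counts);
  }
  return { docCounts, tokenCounts, vocabulary, total: examples.length };
}

// Pick the category with the highest log prior + summed log likelihoods.
function classify(model: ReturnType<typeof train>, text: string): string {
  const tokens = tokenize(text);
  let best = "";
  let bestScore = -Infinity;
  model.docCounts.forEach((docs, category) => {
    const counts = model.tokenCounts.get(category) ?? new Map<string, number>();
    let totalTokens = 0;
    counts.forEach((n) => (totalTokens += n));
    let score = Math.log(docs / model.total); // class prior
    for (const token of tokens) {
      // Laplace (add-one) smoothing so unseen tokens do not zero the score.
      const seen = counts.get(token) ?? 0;
      score += Math.log((seen + 1) / (totalTokens + model.vocabulary.size));
    }
    if (score > bestScore) {
      bestScore = score;
      best = category;
    }
  });
  return best;
}

// Invented corpus: "Parsers" heavily outnumbers "Code Generator".
const skewed: Example[] = [
  { text: "openapi parser library", category: "Parsers" },
  { text: "parse openapi documents", category: "Parsers" },
  { text: "openapi schema parser", category: "Parsers" },
  { text: "validate and parse openapi", category: "Parsers" },
  { text: "generate code from openapi", category: "Code Generator" },
];

// Invented manually labelled entries, as the proposed override would produce.
const manualLabels: Example[] = [
  { text: "go server code generator", category: "Code Generator" },
  { text: "typescript client code generator", category: "Code Generator" },
  { text: "generate sdk code from openapi", category: "Code Generator" },
];

const query = "openapi code generator";
const before = classify(train(skewed), query);
const after = classify(train(skewed.concat(manualLabels)), query);
console.log({ before, after });
```

With this toy corpus the query is classified as "Parsers" before the manual labels are added and as "Code Generator" afterwards, which is the kind of improvement I'd hope for. Whether the real data behaves the same way is an open question.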
Detailed Requirement
Propose adding a way to manually label a primary category for a tool. I see two main options:
- Field on the `tools.yaml` entries like `manualCategoryOverride`
- Looking for new topics on the source entries like the existing `openapi3` / `openapi31` ones that indicate the primary category
I see the primary benefit of the first option being that it gives control of curation to the maintainers of this repository, whilst the second option allows tool writers to self-serve. It's possible that both might be desirable, especially to account for entries that aren't scraped from GitHub (though I guess their categories are essentially manually configured already).
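If both options were supported, the resolution could look roughly like the sketch below. Everything here is an assumption for discussion: the `ToolEntry` shape, the topic names in `TOPIC_CATEGORIES`, the `resolvePrimaryCategory` helper, and the precedence order (maintainer override first, then repository topics, then the existing classifier) are hypothetical and not taken from the current codebase.

```typescript
// Hypothetical shape of a tool entry; only manualCategoryOverride is named in this proposal.
interface ToolEntry {
  name: string;
  topics?: string[];
  manualCategoryOverride?: string;
}

type CategorySource = "manual" | "topic" | "classifier";

// Hypothetical topic names; the real ones would need to be agreed on.
const TOPIC_CATEGORIES = new Map<string, string>([
  ["openapi-code-generator", "Code Generator"],
  ["openapi-mock", "Mock"],
  ["openapi-documentation", "Documentation"],
]);

// Assumed precedence: maintainer override > repository topic > classifier guess.
function resolvePrimaryCategory(
  tool: ToolEntry,
  classifierGuess: string,
): { category: string; source: CategorySource } {
  if (tool.manualCategoryOverride) {
    return { category: tool.manualCategoryOverride, source: "manual" };
  }
  for (const topic of tool.topics ?? []) {
    const category = TOPIC_CATEGORIES.get(topic);
    if (category) {
      return { category, source: "topic" };
    }
  }
  return { category: classifierGuess, source: "classifier" };
}

// Example: a topic set by the tool author wins over the classifier's guess.
const resolved = resolvePrimaryCategory(
  { name: "openapi-code-generator", topics: ["openapi3", "openapi-code-generator"] },
  "Parsers",
);
console.log(resolved);
```

Recording the `source` alongside the category would make it easy to feed only the manually labelled entries back into the classifier's training data.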
I think some amount of rationalization (eg: Testing vs Testing Tools) of the existing categories may be useful as well, and potentially adding a description of each category explaining what is in/out of scope for it.
@SensibleWood do you have any thoughts on this? I'm open to attempting an implementation, but would appreciate some feedback on whether it would be likely to be accepted before investing the effort.