Continuous checks and evaluation of HTML translation feature
jerinphilip opened this issue · 2 comments
Translating HTML to provide value to the user is doable, and it is done quite well here. Translating all sorts of HTML with error correction is a decent research problem. This issue collects what is already scattered across several internal messaging threads into a public document, for future reference and visibility.
**Problem**

There are at least two aspects to this issue:
- Is our mechanism able to handle all sorts of HTML thrown at it without crashing? If it does crash, do we have a means to communicate the failure to other consumers (looking at you, WebAssembly) so they can handle it gracefully?
- Are the rules we encode here the best fit for the noise and corruption in real-world HTML? Treating HTML elements as word-breaking works in some cases but fails miserably in others. Currently, we are engineering a rule-based system for error correction, assuming malformed HTML (#286 (comment)). While I still doubt whether bergamot-translator should have taken this up, the HTML feature appears to have reached a satisfactory state.
We, however, have no consensus on whether what we are doing is better than the existing setup, or whether one HTML assumption is better than another, beyond the developer's instincts based on experience.
**A skeleton solution**

The infrastructure to know better could work like this: obtain a representative sample of noisy real-world HTML, have experts correct it to create an evaluation dataset, then define a few metrics we consider valuable and continuously watch a single scalar score that aggregates those metrics over the evaluation dataset.
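A rough sketch of what I mean, with a placeholder metric, dataset format, and aggregation (none of this is an existing interface in bergamot-translator or tagtransfer):

```python
# Sketch only: the dataset format, the single metric, and the aggregation
# are placeholders for illustration, not an existing interface.
from difflib import SequenceMatcher
from html.parser import HTMLParser
from statistics import mean


class _TagCollector(HTMLParser):
    """Collects the sequence of opening tag names in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str):
    collector = _TagCollector()
    collector.feed(html)
    return collector.tags


def tag_preservation(output_html: str, reference_html: str) -> float:
    """Similarity of tag sequences between system output and the
    expert-corrected reference; 1.0 means the structure survived intact."""
    return SequenceMatcher(None, tag_sequence(output_html),
                           tag_sequence(reference_html)).ratio()


def aggregate_score(dataset) -> float:
    """dataset: iterable of (system_output_html, reference_html) pairs.
    Returns the single scalar we would track over time."""
    return mean(tag_preservation(out, ref) for out, ref in dataset)
```

The point is less this particular metric than the shape: a handful of such functions, averaged over an expert-corrected evaluation set, reduced to one number per commit.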
**An existing implementation**

https://github.com/jerinphilip/tagtransfer is an exploratory undertaking towards the above problem, in pursuit of setting up this infrastructure. It's in Python, which, unlike WebAssembly, gives us richer HTML parsing, validation, and debugging tools. From there we can expand to:
- Crawl many web pages and check that the HTML translation mechanism doesn't crash (a sketch of an automated check follows this list). I don't believe we can handle all invalid user input, but if we survive something like 95% of representative web pages without crashing, we can either shift the blame to bad developers or ask them to correct their HTML. A manual, google-translate-website-like mechanism is already in place, but it is straightforward to automate this.
- Use an already existing XML dataset and its evaluation data to provide a straightforward array of metrics for now. In the future we can enhance this by allowing force-decoding and restricting the scope to evaluating the HTML algorithm alone.
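For the first bullet, the automated sweep could be as simple as the sketch below; `translate_html` stands in for whatever binding exposes the HTML translation pipeline (e.g. the Python bindings tagtransfer uses) and is not a confirmed API:

```python
# Sketch of the automated "does it crash" sweep. `translate_html` is a
# placeholder for the actual binding into the HTML translation pipeline.
import traceback
import urllib.request


def fetch(url: str) -> str:
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crash_rate(urls, translate_html) -> float:
    """Fraction of crawled pages on which the HTML pipeline blows up."""
    failures = 0
    for url in urls:
        try:
            translate_html(fetch(url))
        except Exception:
            failures += 1
            traceback.print_exc()
    return failures / len(urls) if urls else 0.0
```

A hard crash in the native code would take the whole interpreter down with it, so in practice each page should probably run in a subprocess with the exit code checked, rather than relying on catching exceptions.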
Alternate ideas, improvements and suggestions are welcome and much appreciated.
I started implementing some tests in Python, particularly for the things I'm focussing on with parsing & restoring HTML: https://colab.research.google.com/drive/1asuIT1OffBxKz-88pQrGDBxgmYWVvF6J?usp=sharing
@jerinphilip I'd like to add this to CI somehow, not as a pass-or-fail test but as a "hey, this gets a score of N" type of job. It would help with comparing #312 (and future changes like it) to main. Could you help with adding this to CI?
Edit: to clarify, I'm thinking of something a bit more like the second bullet above. We have a standard set of pages for which we write measures (e.g. like the ones in my colab example) and then report scores for each of those measures per push/pull request.
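Concretely, I'm picturing a small non-gating script along these lines; the measure names and the JSON report are placeholders, and the real measures would come from the notebook above:

```python
# Placeholder sketch of a score-reporting CI job: run every measure over a
# fixed set of pages and emit the averages for the current revision, so a
# pull request can be compared against main.
import json


def report(pages, measures) -> str:
    """pages: list of HTML strings; measures: mapping of name -> callable
    taking a page and returning a float. Returns a JSON blob CI can post."""
    count = max(len(pages), 1)
    scores = {
        name: sum(measure(page) for page in pages) / count
        for name, measure in measures.items()
    }
    return json.dumps(scores, indent=2)
```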