- Create embeddings: LaBSE/USE/LASER.
- Determine the optimal number of clusters with k-means and linear regression.
- Produce the clusters using that optimal number.
- Find the texts nearest to the center of each cluster.
- Summarize texts for each cluster.
- The kind of data being analyzed matters a lot. Tweets, for example, give poor results without preprocessing.
- The final summarization does not work well: it tries to compose a single text from different news items instead of highlighting the essence of the collection. Instead of a summarizer, it is better to label the resulting categories manually and then train a classifier on them.
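
The clustering steps in the list above (choosing the number of clusters, clustering, and picking the texts nearest to each center) could be sketched roughly as follows. This is only an illustration under several assumptions. The embeddings are synthetic random vectors standing in for LaBSE/USE/LASER output. The notes do not say how linear regression is combined with k-means, so the sketch assumes one common reading: fit two least-squares lines to the inertia curve and take the breakpoint with the smallest total error as the elbow. The function names `inertia_curve`, `elbow_by_two_line_fit`, and `nearest_texts` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(embeddings, k_values, seed=0):
    """Run k-means for each k and collect the inertia (within-cluster sum of squares)."""
    return np.array([
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings).inertia_
        for k in k_values
    ])


def elbow_by_two_line_fit(k_values, inertias):
    """Assumed elbow rule: pick the k where two fitted lines describe the curve best."""
    k_values = np.asarray(k_values, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    best_k, best_err = None, np.inf
    # Both segments share the breakpoint and need at least two points each.
    for i in range(1, len(k_values) - 1):
        err = 0.0
        for ks, ys in ((k_values[: i + 1], inertias[: i + 1]),
                       (k_values[i:], inertias[i:])):
            slope, intercept = np.polyfit(ks, ys, 1)  # simple linear regression
            err += np.sum((slope * ks + intercept - ys) ** 2)
        if err < best_err:
            best_k, best_err = int(k_values[i]), err
    return best_k


def nearest_texts(embeddings, centers, top_n=3):
    """Return an array of shape (n_clusters, top_n) with indices of the closest texts."""
    dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
    return np.argsort(dists, axis=0)[:top_n].T


# Synthetic stand-in for sentence embeddings: 4 blobs in 16 dimensions.
rng = np.random.default_rng(0)
true_centers = rng.normal(scale=5.0, size=(4, 16))
embeddings = np.vstack([c + rng.normal(size=(50, 16)) for c in true_centers])
texts = [f"text {i}" for i in range(len(embeddings))]

k_values = list(range(1, 11))
inertias = inertia_curve(embeddings, k_values)
best_k = elbow_by_two_line_fit(k_values, inertias)

model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(embeddings)
for cluster_id, idx in enumerate(nearest_texts(embeddings, model.cluster_centers_)):
    print(cluster_id, [texts[i] for i in idx])
```

With real data, `embeddings` would be the matrix produced by the chosen encoder, and the texts returned for each cluster would be the ones passed on to the summarizer or to manual labeling.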