pharmapsychotic/clip-interrogator

Text De-Duplication + Simplification via Sentence Similarity

torridgristle opened this issue · 1 comment

By sentence similarity I mean: tokenize the text and, instead of running it through CLIP, just average the token embeddings together. Don't include padding or the start/end-of-text tokens. The result might not be too different from running the text through CLIP to determine how similar two entries are, but it should be a lot faster.
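A minimal sketch of that averaging, assuming a toy 2-d embedding table in place of CLIP's real token embedding matrix. The special-token ids, the table values, and the function names are all made up for illustration and are not from clip-interrogator or CLIP's tokenizer:

```python
import math

# Hypothetical special-token ids; CLIP's real tokenizer uses different values.
START_ID, END_ID, PAD_ID = 1, 2, 0
SPECIAL_IDS = {START_ID, END_ID, PAD_ID}

def mean_token_embedding(token_ids, embedding_table):
    """Average the embedding rows of all non-special tokens."""
    rows = [embedding_table[t] for t in token_ids if t not in SPECIAL_IDS]
    if not rows:
        raise ValueError("no content tokens to average")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy embedding table (assumption): token id -> 2-d vector.
toy_table = {
    3: [1.0, 0.0],   # "octane"
    4: [0.0, 1.0],   # "render"
    5: [0.1, 0.9],   # "renderer"
    6: [-1.0, 0.2],  # "watercolor"
}

octane_render = mean_token_embedding([START_ID, 3, 4, END_ID, PAD_ID], toy_table)
octane_renderer = mean_token_embedding([START_ID, 3, 5, END_ID, PAD_ID], toy_table)
watercolor = mean_token_embedding([START_ID, 6, END_ID, PAD_ID, PAD_ID], toy_table)

sim_close = cosine_similarity(octane_render, octane_renderer)  # near-duplicates
sim_far = cosine_similarity(octane_render, watercolor)         # unrelated
```

With real CLIP the table would be the text model's token embedding weights, and the lookup skips the transformer layers entirely, which is where the speedup would come from.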

There are some typos, like "anime asthetic", and duplicate entries, like "8k" and "8 k". Some entries are probably also near-duplicates, like "octane render" and "octane renderer".

I believe a theoretically simple way of cutting it all down would be to tokenize each line with CLIP, pair the tokens with the original text in a dict for later reference, get the embeddings, and average the embeddings for each line of text. Then cut out anything that's too similar and grab the original text from the dict for the lines that were kept. I'm not sure how to go about the cutting step, though: if 5 of them are similar, which one do you choose to keep?
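One possible answer to "which do you keep", sketched under assumptions: order the list by preference (for example, most common or best-spelled first) and greedily keep an entry only if it is not too similar to anything already kept. The vectors below are invented stand-ins for averaged embeddings, and the 0.95 threshold and the `dedupe_by_similarity` name are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedupe_by_similarity(embedded, threshold=0.95):
    """Greedy de-duplication.

    embedded: list of (text, vector) pairs in preference order.
    An entry is kept only if its similarity to every already-kept
    entry is below the threshold, so the first member of each
    similar group wins.
    """
    kept = []
    for text, vec in embedded:
        if all(cosine_similarity(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

# Made-up 2-d vectors standing in for averaged token embeddings.
flavors = [
    ("8k", [1.0, 0.0]),
    ("8 k", [0.99, 0.05]),
    ("octane render", [0.0, 1.0]),
    ("octane renderer", [0.05, 0.99]),
    ("anime aesthetic", [0.7, 0.7]),
]

kept = dedupe_by_similarity(flavors, threshold=0.95)
```

This compares every candidate against every kept entry, so it would likely need batching or an approximate nearest-neighbor index at the scale of 100k or more flavors. Clustering and keeping one representative per cluster would be an alternative to the greedy pass.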

Yes, good ideas! I have some WIP tooling for clustering the flavors, which overlaps with this direction. Currently there are 100k flavors included, but I have a dataset with nearly a million. That gets quite slow to evaluate, of course, so I have done some experiments to merge similar flavors like you describe. I was disappointed in testing so far, though: a reduced set from the one million, pitted against the current implementation, didn't win out by much on the benchmark. I haven't had much time to experiment more with it or other CLIP Interrogator stuff in the past few weeks.