Is it possible to get how many texts are summarized by the summarizer?
darwinharianto opened this issue · 7 comments
Suppose I have this kind of text
Check this out.
Everything checked out.
Not so much is checked.
I am not sure what is happening.
The dog is burnt.
Running the LSA algorithm with a sentence count of 3 gives me
Everything checked out.
Not so much is checked.
The dog is burnt.
Is it possible to get the count of summarized sentences?
I assume it would look like this
Everything checked out. -> [Check this out., Everything checked out.] -> 2
Not so much is checked. -> [Not so much is checked., I am not sure what is happening.] -> 2
The dog is burnt. -> [The dog is burnt.] -> 1
Can I get this info from the SVD matrix?
Edit: fixed the wrong count for the dog sentence
Hello, sorry, but I can't see a pattern there. How do you determine which sentences you want to return for a given summarized sentence? The first time it is the sentence before, the second time the sentence after. Also, what are the numbers? I thought it was the count of sentences in the context, but it is always 2, even for one sentence.
How do you determine which sentences you want to return for a given summarized sentence? The first time it is the sentence before, the second time the sentence after.
The order follows the input order
Check this out. -> 1st sentence
Everything checked out. -> 2nd sentence
Not so much is checked. -> 3rd sentence
I am not sure what is happening. -> 4th sentence
The dog is burnt. -> 5th sentence
The results would be
Everything checked out. -> close to 1st and 2nd sentence [Check this out., Everything checked out.] -> this sentence represents 2 sentences
Not so much is checked. -> close to 3rd and 4th sentence [Not so much is checked., I am not sure what is happening.] -> this sentence represents 2 sentences
The dog is burnt. -> not close to any other sentence [The dog is burnt.] -> this sentence represents itself
Also, what are the numbers? I thought it's count of sentences in context but it is always 2 even for one sentence
Ah, sorry, I wrote the wrong count
I am under the impression that the LSA algorithm would only show the most distinct sentences and hide those that are already represented by other sentences. Is this correct?
Thank you for more info.
I am under the impression that the LSA algorithm would only show the most distinct sentences and hide those that are already represented by other sentences. Is this correct?
Yes, you could say it like this, I think. LSA works with the concept of (very abstract) topics and tries to get representative sentences for them.
I believe if you say "close to 1st and 2nd sentence" you don't mean close as in sentence position but in vector space, right? You would like to know, for every sentence in the resulting summary, the list of removed sentences it represents in the original text and how many of them there are. I am afraid there is no easy way to get this info from the LSA summarizer. Summarizers are a black box, and sometimes one can tweak them slightly. But this would require creating a completely new one that also picks sentences as you described, and I don't really know how I would approach it.
If this is just a one-time thing, maybe it is easier to use ChatGPT for this 😃
I believe if you say "close to 1st and 2nd sentence" you don't mean close as in sentence position but in vector space, right?
Ah, yes, from the vector space
You would like to know, for every sentence in the resulting summary, the list of removed sentences it represents in the original text and how many of them there are. I am afraid there is no easy way to get this info from the LSA summarizer. Summarizers are a black box, and sometimes one can tweak them slightly. But this would require creating a completely new one that also picks sentences as you described, and I don't really know how I would approach it.
Yes, I wasn't trying to tweak the LSA itself. I was thinking that maybe, by looking at powered_sigma or v_matrix, one could derive such a relation (I believe this is the similarity matrix?).
Something that looks like this.
Maybe, given the similarity matrix, one could get the number of sentences represented by them using hierarchical clustering.
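To make the clustering idea concrete, here is a minimal sketch of average-linkage agglomerative clustering over a sentence similarity matrix. Everything in it is an assumption for illustration: the function name `agglomerative_clusters`, the `threshold` parameter, and the similarity values, which are made up and not produced by the summarizer.

```python
def agglomerative_clusters(similarity, threshold):
    """Merge the most similar clusters until no pair reaches `threshold`."""
    # Start with every sentence (by index) in its own cluster.
    clusters = [[i] for i in range(len(similarity))]

    def average_similarity(a, b):
        # Average linkage: mean similarity over all cross-cluster pairs.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        best_score, (x, y) = max(
            (average_similarity(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical similarity matrix for the five example sentences (made-up values).
similarity = [
    [1.0, 0.8, 0.1, 0.2, 0.0],
    [0.8, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.2, 1.0, 0.7, 0.1],
    [0.2, 0.1, 0.7, 1.0, 0.2],
    [0.0, 0.1, 0.1, 0.2, 1.0],
]

clusters = agglomerative_clusters(similarity, threshold=0.5)
for cluster in clusters:
    print(cluster, "->", len(cluster))
```

The size of each cluster would then be the number of sentences that one summary sentence stands for. The threshold decides how aggressively sentences are merged and would need tuning on real data.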
Oh, yes. If you are willing to try some clustering algorithm and make your own modifications to LSA, it is definitely doable. You know all the vectors, so you can cluster them together. You even know the initial cluster leaders (the summarized sentences). It sounds like you know what you are doing, so you should be fine :)
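A rough sketch of the "cluster leaders" idea, assuming the summary sentences are already known: attach every original sentence to the summary sentence it is most similar to, then count the members. To keep it self-contained, plain bag-of-words cosine similarity stands in for the LSA sentence vectors, which is a simplification and not what the summarizer computes. The helper names `bag_of_words`, `cosine` and `group_by_leader` are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    # Lowercased word counts; a crude stand-in for real LSA sentence vectors.
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_leader(sentences, leaders):
    # Attach every original sentence to its most similar summary sentence.
    groups = {leader: [] for leader in leaders}
    for sentence in sentences:
        vector = bag_of_words(sentence)
        best = max(leaders, key=lambda leader: cosine(vector, bag_of_words(leader)))
        groups[best].append(sentence)
    return groups


sentences = [
    "Check this out.",
    "Everything checked out.",
    "Not so much is checked.",
    "I am not sure what is happening.",
    "The dog is burnt.",
]
# Summary sentences taken from the example at the top of the thread.
leaders = [
    "Everything checked out.",
    "Not so much is checked.",
    "The dog is burnt.",
]

groups = group_by_leader(sentences, leaders)
for leader, members in groups.items():
    print(leader, "->", members, "->", len(members))
```

Swapping `bag_of_words` and `cosine` for vectors taken from the decomposition would keep the grouping logic unchanged.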
You know all the vectors so you can cluster them together.
Yes, about this. I am not sure how to read this part. How can I get the vector matrices?
I believe it is somewhere over here?
Which variable should I look at?
It is a bit more complicated. LSA gives you 2 matrices and a vector. I use only one of the matrices, but their combination always has some meaning. You can read more in the documentation and in the linked original article by Steinberger and Jezek.
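For orientation, here is a small NumPy sketch of what "2 matrices and a vector" means: the singular value decomposition of a term-by-sentence matrix. It does not reproduce the library's code. The input matrix is made up, the number of kept topics `k` is an arbitrary choice, and names such as `sentence_vectors` are hypothetical. The scoring line follows the Steinberger and Jezek idea of ranking a sentence by the length of its sigma-weighted topic vector.

```python
import numpy as np

# Hypothetical term-by-sentence matrix: rows are terms, columns are sentences.
matrix = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# The decomposition returns two matrices (u, vt) and a vector (sigma)
# such that matrix == u @ diag(sigma) @ vt.
u, sigma, vt = np.linalg.svd(matrix, full_matrices=False)

# Keep only the k strongest topics (k = 2 is an arbitrary choice here).
k = 2

# Column j of vt describes sentence j in topic space; scaling each topic
# by its singular value weights the topics by importance.
sentence_vectors = (sigma[:k, None] * vt[:k]).T  # shape: (sentences, topics)

# Sentence score: length of the sigma-weighted topic vector.
ranks = np.linalg.norm(sentence_vectors, axis=1)

# Cosine similarity between sentences in the reduced topic space.
unit_vectors = sentence_vectors / ranks[:, None]
similarity = unit_vectors @ unit_vectors.T

print(np.round(ranks, 3))
print(np.round(similarity, 3))
```

A matrix like `similarity` is what a clustering step would consume. The right-hand matrix on its own is not a similarity matrix; it only holds the per-topic coordinates of each sentence.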
Here is a relevant result
Line 45 in 7fd4970
Here is the computation of the sentences for the topics
Lines 119 to 120 in 7fd4970
Also, keep in mind that I implemented the library years ago and am giving you advice from my poor memory and from what I see in the code now. Unfortunately, I can't dedicate more time to studying the LSA details in order to advise you further.