Finding repetitive content in a large audio file
Closed this issue · 4 comments
bharat-patidar commented
Hi,
Your work is awesome.
I just wanted to know if this code can be used to find repetitive content in a large audio file (say, 7 hours). Is there a way to match my file against itself to get the portions that are repeated? If it is possible, it would be great if you could guide me through the changes that are required.
Thank you!!
dpwe commented
Here’s one thing you could do:
- break the 7h recording into ~800 one-minute, 50% overlapping segments.
- build an audfprint database containing them all
- run each one as a query against the database. Of course, each segment will match itself and its overlapping neighbours as the top hits, so ignore those, but any other hits will be actual repetitions.
You’ll need --max-matches 5 or more to be able to see beyond those degenerate matches.
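A minimal sketch of the segmentation step above, in Python. Only the segment timing is computed; the ffmpeg/audfprint invocations in the trailing comments are illustrative, and the file names (long.wav, segs.pklz) are hypothetical:

```python
SEGMENT_S = 60           # one-minute segments
HOP_S = SEGMENT_S // 2   # 50% overlap -> 30 s hop
TOTAL_S = 7 * 3600       # 7-hour recording

def segment_starts(total_s, segment_s, hop_s):
    """Start times (in seconds) of overlapping segments covering the file."""
    starts = []
    t = 0
    while t + segment_s <= total_s:
        starts.append(t)
        t += hop_s
    return starts

starts = segment_starts(TOTAL_S, SEGMENT_S, HOP_S)
print(len(starts))  # -> 839 segments for 7 h, i.e. the "~800" mentioned above

# Each segment could then be cut out, indexed, and queried along these lines
# (sketch only; check your audfprint version's options):
#   ffmpeg -i long.wav -ss <start> -t 60 seg_<start>.wav
#   python audfprint.py new --dbase segs.pklz seg_*.wav
#   python audfprint.py match --dbase segs.pklz --max-matches 10 seg_*.wav
```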
DAn.
bharat-patidar commented
Okay, I will try this method.
Thanks for the help.
brendon-wong commented
@dpwe What's the rationale for splitting the recording into overlapping segments?
dpwe commented
Matching a very long recording is inefficient, because a long recording will include nearly every hash somewhere, so the first-pass pruning by common matches won't do much, and the list of matching hashes that has to be sorted by time difference will be very large.
Typically, you are interested in knowing roughly where a match occurs; if you break the material up into shorter segments, you get some of that information just from knowing which segment matched. But if you have no idea where the targets are going to occur, there's a chance that arbitrary chopping will cut through the middle of a matching region, making it less likely to find the match at all (since, in the worst case, each of the two resulting halves contains only half the matching duration). With 50% overlapped segments, however, there's always a segment centered over the split point, so if a split falls inside a match region, the overlapping segment will have the match squarely in the middle, giving the best chance of matching.
So, segments should be longer than the excerpts you expect to match.
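A tiny sketch of that geometry (toy numbers only, nothing audfprint-specific): a 25-second repeat straddling a 60-second segment boundary is split in half by non-overlapping chopping, but always fits entirely inside one of the 50%-overlapped segments:

```python
# Toy illustration: why 50% overlap rescues matches that straddle a cut point.
SEG = 60     # segment length (s)
HOP = 30     # 50% overlap -> 30 s hop; using HOP = SEG gives non-overlapping chopping
TOTAL = 600  # toy 10-minute file

def covering_segment(start, dur, seg, hop, total):
    """Start time of the first segment fully containing [start, start+dur), else None."""
    t = 0
    while t + seg <= total:
        if t <= start and start + dur <= t + seg:
            return t
        t += hop
    return None

# A 25 s repeated excerpt straddling the cut at t = 120 s:
assert covering_segment(110, 25, SEG, SEG, TOTAL) is None  # non-overlapping: split in half
assert covering_segment(110, 25, SEG, HOP, TOTAL) == 90    # overlapped: fits in [90, 150)
```

In general, with hop = SEG/2 any match region no longer than SEG/2 lies entirely within some segment, which is the "segments should be longer than the excerpts you expect to match" rule above.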
DAn.