Finding repetitive content in a large audio file
Closed this issue · 4 comments
bharat-patidar commented
Hi,
Your work is awesome.
I just wanted to know if this code can be used to find repetitive content in a large audio file (say, 7 hours). Is there a way to match my file against itself to get the portions that are repeated? If it is possible, it would be great if you could guide me through the changes that are required.
Thank you!!
dpwe commented
Here’s one thing you could do:
- break the 7h recording into ~800 one-minute, 50% overlapping segments.
- build an audfprint database containing them all
- run each one as a query against the database. Of course, each segment will match itself and its overlapping neighbours as the top hits, so ignore those, but any other hits will be actual repetitions.
You’ll need --max-matches 5 or more to be able to see beyond those degenerate matches.
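A minimal sketch of the segmentation step above, in Python. Only the segment timing is computed; the ffmpeg/audfprint invocations in the trailing comments are illustrative, and the file names (long.wav, segs.pklz) are hypothetical:

```python
SEGMENT_S = 60           # one-minute segments
HOP_S = SEGMENT_S // 2   # 50% overlap -> 30 s hop
TOTAL_S = 7 * 3600       # 7-hour recording

def segment_starts(total_s, segment_s, hop_s):
    """Start times (in seconds) of overlapping segments covering the file."""
    starts = []
    t = 0
    while t + segment_s <= total_s:
        starts.append(t)
        t += hop_s
    return starts

starts = segment_starts(TOTAL_S, SEGMENT_S, HOP_S)
print(len(starts))  # -> 839 segments for 7 h, i.e. the "~800" mentioned above

# Each segment could then be cut out, indexed, and queried along these lines
# (sketch only; check your audfprint version's options):
#   ffmpeg -i long.wav -ss <start> -t 60 seg_<start>.wav
#   python audfprint.py new --dbase segs.pklz seg_*.wav
#   python audfprint.py match --dbase segs.pklz --max-matches 10 seg_*.wav
```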
DAn.
bharat-patidar commented
Okay, I will try this method.
Thanks for the help.
brendon-wong commented
@dpwe What's the rationale for splitting the recording into overlapping segments?
dpwe commented
Matching a very long recording is inefficient, because a long recording will include nearly every hash somewhere, so the first-pass pruning by common matches won't do much, and the list of matching hashes that has to be sorted by time difference will be very large.
Typically, you are interested in knowing roughly where a match occurs; if you break the material up into shorter segments, you get some of that information just from knowing which segment matched. But if you have no idea where the targets are going to occur, there's a chance that arbitrary chopping will cut through the middle of a matching region, making it less likely to find the match at all (since, in the worst case, each of the two resulting halves contains only half the matching duration). With 50% overlapped segments, however, there's always a segment centered over the split point, so if a split falls inside a match region, the overlapping segment will have the match squarely in the middle, giving the best chance of matching.
So, segments should be longer than the excerpts you expect to match.
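A tiny sketch of that geometry (toy numbers only, nothing audfprint-specific): a 25-second repeat straddling a 60-second segment boundary is split in half by non-overlapping chopping, but always fits entirely inside one of the 50%-overlapped segments:

```python
# Toy illustration: why 50% overlap rescues matches that straddle a cut point.
SEG = 60     # segment length (s)
HOP = 30     # 50% overlap -> 30 s hop; using HOP = SEG gives non-overlapping chopping
TOTAL = 600  # toy 10-minute file

def covering_segment(start, dur, seg, hop, total):
    """Start time of the first segment fully containing [start, start+dur), else None."""
    t = 0
    while t + seg <= total:
        if t <= start and start + dur <= t + seg:
            return t
        t += hop
    return None

# A 25 s repeated excerpt straddling the cut at t = 120 s:
assert covering_segment(110, 25, SEG, SEG, TOTAL) is None  # non-overlapping: split in half
assert covering_segment(110, 25, SEG, HOP, TOTAL) == 90    # overlapped: fits in [90, 150)
```

In general, with hop = SEG/2 any match region no longer than SEG/2 lies entirely within some segment, which is the "segments should be longer than the excerpts you expect to match" rule above.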
DAn.