Gagravarr/VorbisJava

ClassCastError on opening OGG video

Closed this issue · 18 comments

Exception in thread "main" java.lang.ClassCastException: org.gagravarr.vorbis.VorbisAudioData cannot be cast to org.gagravarr.vorbis.VorbisInfo
    at org.gagravarr.vorbis.VorbisFile.<init>(VorbisFile.java:78)
    at org.gagravarr.vorbis.VorbisFile.<init>(VorbisFile.java:55)
    at OggBug.main(OggBug.java:10)

This can be reproduced by downloading http://mirror.bigbuckbunny.de/peach/bigbuckbunny_movies/big_buck_bunny_720p_stereo.ogg and loading it with this code:

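import java.io.FileInputStream;
import org.gagravarr.ogg.OggFile;
import org.gagravarr.vorbis.VorbisFile;

public class OggBug {
  public static void main(String[] args) throws Exception {
    FileInputStream fin = new FileInputStream("/Users/fabian/Downloads/big_buck_bunny_720p_stereo.ogg");
    OggFile ogg = new OggFile(fin);
    // Fails here with the ClassCastException shown above
    VorbisFile vorbis = new VorbisFile(ogg);
    System.out.println(vorbis);
  }
}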

Are you able to find / produce a much smaller (sub 1mb) video file that reproduces the problem? We'll really need a test file to go with any fix + unit test, but I don't really fancy committing a ~200mb file to the repo to try to test against...

I understand. But you are not saying that you will ignore the problem until you have a smaller file, are you? It should be possible to find and fix the issue even if we cannot find a smaller video that also fails.
Fabian

Here is a 400k file that also has the problem:
http://techslides.com/demos/sample-videos/small.ogv

http://playground.html5rocks.com/samples/html5_misc/chrome_japan.ogv does not work either. Do you actually have any Ogg video that works that way?

Maybe a VorbisFile is not supposed to be created from it? We got there from the Tika parser:

Caused by: java.lang.ClassCastException: org.gagravarr.vorbis.VorbisAudioData cannot be cast to org.gagravarr.vorbis.VorbisInfo
at org.gagravarr.vorbis.VorbisFile.<init>(VorbisFile.java:78) ~[vorbis-java-core-0.1.jar:na]
at org.gagravarr.vorbis.VorbisFile.<init>(VorbisFile.java:55) ~[vorbis-java-core-0.1.jar:na]
at org.gagravarr.tika.VorbisParser.parse(VorbisParser.java:58) ~[vorbis-java-tika-0.1.jar:na]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ~[tika-core-1.5.jar:na]

Maybe the Vorbis parser should not be used?

Looks like it originates from here:

else if(streams > 0) {
// Something else...
// TODO Detect video
}

The detector claims it to be general application/ogg.
Then the Tika MimeTypes detector comes along and says "oh, I know it's audio/ogg, and that's better than application/ogg", so it goes with that.
Would it be OK to return OGG_VIDEO in the TODO part? That would prevent Tika from overruling it.

Rather than blindly returning OGG_VIDEO, it should probably be updated to detect the various kinds of video streams above, so that it could then (say) have a check like if (theora_streams > 0 || dirac_streams > 0) { return OGG_VIDEO; }. That'd also want a unit test or two, hence the need for some very small test files!
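
To make that suggestion concrete, here is a rough sketch of what such a check might look like. It is not the detector's actual code: the class, enum and method names are all hypothetical, and it only assumes the usual first-packet signatures (0x01 followed by "vorbis", 0x80 followed by "theora", and "BBCD" for Dirac) together with the standard audio/ogg, video/ogg and application/ogg media types.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch only; none of these names come from VorbisJava or Tika.
class OggTypeSketch {
    enum Kind { VORBIS, THEORA, DIRAC, OTHER }

    // True if the packet contains the ASCII bytes of magic at the given offset.
    static boolean hasMagic(byte[] packet, int offset, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (packet.length < offset + m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (packet[offset + i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies a logical stream from the first packet it carries.
    static Kind classify(byte[] first) {
        if (first.length > 0 && first[0] == 0x01 && hasMagic(first, 1, "vorbis")) {
            return Kind.VORBIS;
        }
        if (first.length > 0 && (first[0] & 0xFF) == 0x80 && hasMagic(first, 1, "theora")) {
            return Kind.THEORA;
        }
        if (hasMagic(first, 0, "BBCD")) {
            return Kind.DIRAC;
        }
        return Kind.OTHER;
    }

    // Counts the stream kinds, then applies the check proposed above.
    static String detect(List<byte[]> firstPackets) {
        int vorbis = 0, theora = 0, dirac = 0, other = 0;
        for (byte[] packet : firstPackets) {
            switch (classify(packet)) {
                case VORBIS: vorbis++; break;
                case THEORA: theora++; break;
                case DIRAC:  dirac++;  break;
                default:     other++;  break;
            }
        }
        if (theora > 0 || dirac > 0) {
            return "video/ogg";
        }
        if (vorbis > 0 && other == 0) {
            return "audio/ogg";
        }
        return "application/ogg"; // unknown mix: stay with the generic type
    }
}
```

With something along these lines, a file carrying both a Theora and a Vorbis stream would be reported as video rather than falling through to the generic type.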

Is the 400k file good enough?
I also proposed to Tika that the MIME magic detection be fixed so that application/ogg does not incorrectly get overwritten by audio/ogg.

Do you know what license the small.ogv or chrome_japan.ogv files are under? A 400kb file is probably alright, as long as it's under a license where we can distribute it!

Maybe take them from https://wiki.xiph.org/TheoraTestsuite since they are explicitly intended for testing.

If in doubt, why not include an automatic download in the pom? That way you are not distributing the file with your source.

OK, I'll have a play over the weekend, and see what I can manage. (I've got a plan now, just need the time to implement it!)

If you have a spare little bit of time, any chance you could review the Ogg Checksum code, and see if you can work out why the code is generating warnings? The spec says you should ignore packets which don't have a valid checksum, but I'm reluctant to do that until I'm sure the code calculates them correctly! (Your TIKA-1112 will need this fix)

I don't care about TIKA-1112, but I have a bit of time today, so I will look into the checksumming.

I had limited time. What I noticed is that the checksum is a "long" while the CRC value is an "int".
I am also puzzled by the Ogg documentation, which says "LSb of LSB first.", but the value for the sequence number seems to be OK.
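
To illustrate the kind of type mismatch meant here, below is a small standalone sketch. It is not the library's code, and whether this is the actual cause of the warnings is unverified. The class and method names are hypothetical. It assumes the Ogg page checksum is an unreflected CRC-32 with polynomial 0x04C11DB7, initial value 0 and no final XOR, computed over the page with the four checksum bytes (offsets 22 to 25) zeroed, and that header fields are stored least-significant byte first.

```java
// Hypothetical sketch only; not taken from VorbisJava.
class OggCrcSketch {
    private static final int[] TABLE = new int[256];
    static {
        // Build the MSB-first lookup table for polynomial 0x04C11DB7.
        for (int i = 0; i < 256; i++) {
            int r = i << 24;
            for (int bit = 0; bit < 8; bit++) {
                r = (r & 0x80000000) != 0 ? (r << 1) ^ 0x04C11DB7 : (r << 1);
            }
            TABLE[i] = r;
        }
    }

    // CRC of the given bytes; the caller is assumed to have zeroed
    // the checksum field of the page beforehand.
    static int crc(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc = (crc << 8) ^ TABLE[((crc >>> 24) ^ b) & 0xFF];
        }
        return crc;
    }

    // Reads a 32-bit little-endian field as an unsigned value in a long.
    static long readUInt32LE(byte[] buf, int off) {
        return (buf[off] & 0xFFL)
             | ((buf[off + 1] & 0xFFL) << 8)
             | ((buf[off + 2] & 0xFFL) << 16)
             | ((buf[off + 3] & 0xFFL) << 24);
    }

    // Compares a signed int CRC with an unsigned stored value.
    // Without the mask, an int with the top bit set is sign-extended
    // to a negative long and never equals the stored value.
    static boolean matches(int computed, long stored) {
        return (computed & 0xFFFFFFFFL) == stored;
    }
}
```

If the computed value is an int and the stored value is an unsigned long, a plain `==` would report a mismatch for roughly half of all checksums (those with the top bit set), which is one way spurious warnings could arise.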

Any chance you could grab the latest code from git, build, bump the dependency in tika parsers to 0.4-snapshot, and test?

I believe it's now fixed, and with a sample theora file I'm seeing:

$ java -jar tika-app-1.6-SNAPSHOT.jar --metadata chrome_japan.ogv
Content-Length: 7868057
Content-Type: video/theora
resourceName: chrome_japan.ogv
streams-annodex: 1
streams-audio: 1
streams-metadata: 1
streams-theora: 1
streams-total: 3
streams-video: 1
streams-vorbis: 1

All my Ogg video files now parse correctly. The checksumming is still broken somehow, but as it only produces parse warnings, I am happy with it so far. I tried to look at the checksumming, but apart from possible type conversion problems I could not find anything that deviates from the spec.

OK, I've released v0.4 and upgraded Tika to use it, so I believe we're now all sorted for this.

(Issue 5 has been opened to track the checksum problem)