lintool/twitter-tools

Using the Collection with Terrier

maiststudent opened this issue · 1 comment

Hello,

I'm trying to index the corpus using Terrier. But according to
http://ir.dcs.gla.ac.uk/wiki/Terrier/Tweets11, I need the collection
in JSON format first. I'm using the HTML collection.

Where do I find the HTML scraper that the page mentions to write out
the collection in JSON format? And how would I go about using it?

Thank you,

Look at the ReadStatuses demo; it should help you see how to write out the HTML collection in another format. Alternatively, look at the IndexStatuses demo and teach Terrier to read directly from the HTML sequence files.
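
For the first option, the conversion step could look roughly like the sketch below: once you have each status's fields in hand (for example, from a loop modeled on ReadStatuses), write one JSON object per line. Everything in it is an assumption for illustration: `StatusJsonWriter`, `escape`, and `toJsonLine` are hypothetical names and not part of twitter-tools, and the field names follow the layout of Twitter's API JSON. I have not checked which fields Terrier's Tweets11 reader actually requires, so adjust to whatever it expects.

```java
// Hypothetical helper, not part of twitter-tools: turns the fields of one
// status into a single line of JSON. Field names mirror Twitter's API JSON
// (an assumption; check what Terrier's reader actually needs).
final class StatusJsonWriter {
    private StatusJsonWriter() {}

    // Escape a string so it is safe inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build one JSON object (no trailing newline) for a single status.
    static String toJsonLine(long id, String screenName,
                             String createdAt, String text) {
        return "{\"id\":" + id
            + ",\"created_at\":\"" + escape(createdAt) + "\""
            + ",\"text\":\"" + escape(text) + "\""
            + ",\"user\":{\"screen_name\":\"" + escape(screenName) + "\"}}";
    }
}
```

Inside the read loop you would call `StatusJsonWriter.toJsonLine(...)` with the values taken from each status and print the result to a file, one line per tweet. For anything beyond a quick conversion, a JSON library is a safer choice than hand-built strings.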