/flounder

Flounder is an old corpus collector I wrote, but it still works. You just need a Bing API key.

Primary language: Python

Usage

First generate a list of URLs using a Bing search with flounder:

python3 flounder.py <bing api key> <bing api search term>

Then download the files:

python3 download.py <flounder URL list> <file extension>

Download.py

download.py implements a filter in Python that deletes downloaded files which do not match a given criterion. Right now it's an example with an RTF filter, but feel free to add whatever filter you dream of.
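
As a rough idea of what such a filter can look like, here is a minimal sketch, not the actual code in download.py. It assumes the RTF check is a comparison against the `{\rtf` signature that RTF files start with; the names `looks_like_rtf` and `filter_file` are made up for illustration.

```python
import os

# RTF documents begin with the bytes "{\rtf" (usually followed by a version digit).
RTF_MAGIC = b"{\\rtf"


def looks_like_rtf(path):
    """Return True if the file at path starts with the RTF signature."""
    with open(path, "rb") as f:
        return f.read(len(RTF_MAGIC)) == RTF_MAGIC


def filter_file(path, predicate=looks_like_rtf):
    """Keep the file if predicate accepts it, otherwise delete it.

    Hypothetical helper: returns True when the file was kept.
    """
    if predicate(path):
        return True
    os.remove(path)
    return False
```

Adding another filter then comes down to writing a new predicate that takes a path and returns True or False, and passing it as `predicate`.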

Cool features

  • It downloads things in parallel.
  • It cycles through market codes, so results are not limited to your local locale. This is great for collecting good Unicode inputs, which usually help stress localization code in target software.
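
Both features can be sketched in a few lines. This is only an illustration under assumptions: the market codes below are a made-up sample (the list flounder.py actually cycles through may differ), and `assign_markets` and `run_parallel` are hypothetical names, not functions from this repo.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle, islice

# Hypothetical sample of market codes; the real list may differ.
MARKETS = ["en-US", "ja-JP", "de-DE", "ru-RU"]


def assign_markets(num_queries, markets=MARKETS):
    """Return one market code per query, wrapping around the list."""
    return list(islice(cycle(markets), num_queries))


def run_parallel(worker, items, max_workers=8):
    """Apply worker to every item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))
```

In a downloader, `worker` would be a function that fetches one URL (for example with requests) and `items` would be the lines of the URL list. Threads fit here because the work is dominated by waiting on the network.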

Example

python3 flounder.py YOURBINGAPIKEY filetype:rtf
python3 download.py urllog_1555546041.8930097.txt

And that's it! The urllog filename varies with the time it was created, and download.py just downloads all files listed in the urllog and filters them accordingly (in this case it checks for an RTF header).
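
Since the urllog name changes on every run, a small helper can pick the newest one for you. This is a sketch that assumes the name has the form urllog_<timestamp>.txt, as in the example above; `latest_urllog` is a hypothetical helper and not part of flounder.

```python
import glob
import os

PREFIX = "urllog_"
SUFFIX = ".txt"


def latest_urllog(directory="."):
    """Return the path of the urllog file with the largest timestamp, or None."""
    best_path, best_stamp = None, None
    for path in glob.glob(os.path.join(directory, PREFIX + "*" + SUFFIX)):
        name = os.path.basename(path)
        try:
            # Assumed format: the part between prefix and suffix is a float timestamp.
            stamp = float(name[len(PREFIX):-len(SUFFIX)])
        except ValueError:
            continue  # skip files that do not follow the assumed naming
        if best_stamp is None or stamp > best_stamp:
            best_path, best_stamp = path, stamp
    return best_path
```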

Dependencies

You need requests:

python3 -m pip install requests