This package provides some command line tools for working iwth GDELT data. It assumes that you are comfortable with Unix style IO Redirection. Note if you are using raw GDELT files you can port multiple files into stdin via wildcards (e.g. *.export.csv)
To use the aggregator right now use the following syntax from the command line:
python aggregator.py < data/train-sampled.csv --limit_rows 10000 > outfile.csv
-s
start year-e
end year-l
maximum number of rows to read in-p
character that splits columns in input-f
fill missing months with zeroes (True/False)-d
specify header from separate file--header_sep
like-p
but for the-d
file
python subsetter.py < gdelt-file.txt -d header.txt --country2 AFG > subset.tsv
-s
start year-e
end year-l
maximum number of rows to read in-p
character that splits columns in input-d
specify header from separate file--header_sep
like-p
but for the-d
file--country1
country desired in Actor1CountryCode column--country2
country desired in Actor2CountryCode column
To sample from a file:
perl -ne 'print if (rand() < .01)' large_input_file.txt > sampled_output.txt
Note that the 0.01 yields 1% of the rows.