theocjr/social-media-forensics

Classification

Closed this issue · 7 comments

scday commented

I've had great success running everything up to the classification step. I've run the n-grams script, and now when I try to run any of the classifiers I get this error:

"Traceback (most recent call last):
File "pmsvm_pca_classifier.py", line 253, in <module>
authors_list = filter_authors(args.source_dir_data, args.min_tweets)
File "pmsvm_pca_classifier.py", line 114, in filter_authors
if threshold <= int(os.path.basename(filename).split('_')[0]):
ValueError: invalid literal for int() with base 10:"

This happens with each classifier. Any help would be greatly appreciated.

Hi @scday, I believe you're passing the --source-dir-data option a directory of tweets used in previous steps rather than the one generated by the --dest-dir option of ngrams_generator.py. Please double-check that the directory passed to the --source-dir-data option of the classifiers contains subdirectories named with the pattern <number-of-tweets>_<userid> (like 03152_1929050916).
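One quick way to run that check is the sketch below. It is not part of the repository: the function name find_bad_author_dirs is hypothetical, and it only mirrors the parsing visible in the traceback (the text before the first underscore must convert with int()). It lists the subdirectories whose names would trigger the ValueError.

```python
import os


def find_bad_author_dirs(source_dir_data):
    """Return names of subdirectories whose prefix before '_' is not an integer.

    Hypothetical helper: it mimics the int(name.split('_')[0]) parsing
    shown in the traceback, not the classifier's full logic.
    """
    bad = []
    for name in sorted(os.listdir(source_dir_data)):
        # Only author folders matter; skip plain files.
        if not os.path.isdir(os.path.join(source_dir_data, name)):
            continue
        prefix = name.split('_')[0]
        try:
            int(prefix)
        except ValueError:
            bad.append(name)
    return bad
```

If the returned list is non-empty, those folders do not follow the <number-of-tweets>_<userid> pattern and the classifiers will fail on them.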

scday commented

I used this command for the n-grams: ngrams_generator.py --source-dir my_input_dir --dest-dir my_output_dir --features all --debug. I changed the --dest-dir to my_ngrams_dir, which created a folder for each author (I'm using your dataset). Each author's folder contains about 11 files (see attached).
[Attached screenshot: 2017-11-15 09 45 26]

Is that not the correct directory to pass in for the classification? Thanks again for your assistance.

Hi @scday. In our full pipeline (collecting data from Twitter and pre-processing the tweets), these author folders are generated with the pattern <number-of-tweets>_<userid>. This tweet count is computed in the first pre-processing step, in filter_language_by_tweet.py, and is used throughout the classification pipeline to filter out authors with too few messages.

I believe you got this dataset from us (skipping some of the pipeline steps). We were obligated by Twitter's terms to anonymize the data, and I think this is the problem you are facing: the folders must follow the pattern <number-of-tweets>_<userid>, but the ones you have only carry an opaque string in place of the user id, with no tweet count. You could write a script that renames each of these folders to follow the pattern <number-of-tweets>_<userid>. You can get the number of tweets by opening any .pkl file in the folder and taking the length of the array inside.
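A minimal sketch of such a renaming script follows. It rests on several assumptions that are not confirmed by the repository: that each .pkl file can be read with the standard pickle module, that len() of the loaded object equals the number of tweets (for a matrix you would likely need shape[0] instead), and that a five-digit zero-padded count is acceptable, as in the 03152_1929050916 example. The function names count_tweets and rename_author_dirs are hypothetical. Only unpickle files you trust, and try it on a copy of the data first.

```python
import glob
import os
import pickle


def count_tweets(author_dir):
    """Return len() of the object stored in the first .pkl file found.

    Assumption: the pickled object is a sequence with one entry per tweet.
    """
    pkl_files = sorted(glob.glob(os.path.join(author_dir, '*.pkl')))
    if not pkl_files:
        raise ValueError('no .pkl files in ' + author_dir)
    with open(pkl_files[0], 'rb') as handle:
        return len(pickle.load(handle))


def rename_author_dirs(ngrams_dir, dry_run=True):
    """Prefix each author folder with its tweet count.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed. Running it twice would prefix the folders twice.
    """
    renames = []
    for name in sorted(os.listdir(ngrams_dir)):
        old_path = os.path.join(ngrams_dir, name)
        if not os.path.isdir(old_path):
            continue
        new_name = '{:05d}_{}'.format(count_tweets(old_path), name)
        new_path = os.path.join(ngrams_dir, new_name)
        renames.append((old_path, new_path))
        if not dry_run:
            os.rename(old_path, new_path)
    return renames
```

Calling rename_author_dirs('my_ngrams_dir') first with the default dry_run=True lets you inspect the planned names before committing to them with dry_run=False.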

I hope this helps.

scday commented

You have helped tremendously. The only step I did not follow was running filter_language_by_tweet. Question: if I go back and begin the process again, including that step, will that resolve everything, or do you think it's better to just write a script that renames the folders?

I suggest you begin the process again with that step involved.

Good luck.

scday commented

Ok. Thank you.