Classification

Question

Closed this issue 7 years ago · 7 comments

I've had great success running everything up to the classification step. I've run the n-grams script, but when I try to run any of the classifiers I get this error:

Traceback (most recent call last):
  File "pmsvm_pca_classifier.py", line 253, in <module>
    authors_list = filter_authors(args.source_dir_data, args.min_tweets)
  File "pmsvm_pca_classifier.py", line 114, in filter_authors
    if threshold <= int(os.path.basename(filename).split('_')[0]):
ValueError: invalid literal for int() with base 10:

This happens with each classifier. Any help would be greatly appreciated.

Answer 1 · 2017-11-15T10:37:02.000Z

Please check.

…

-- Anderson
Prof. Dr. Anderson Rocha
Associate Director, Institute of Computing
UNIVERSITY OF CAMPINAS, SP - BRAZIL
Digital Forensics and Machine Intelligence
http://www.ic.unicamp.br/~rocha


Answer 2 · 2017-11-15T13:26:56.000Z

Hi @scday, I believe you're passing the --source-dir-data option a directory of tweets used in previous steps, not the one generated by the --dest-dir option of ngrams_generator.py. Please double-check that the directory passed to the --source-dir-data option of the classifiers contains subdirectories named with the pattern <number-of-tweets>_<userid> (like 03152_1929050916).
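A quick way to run that check is sketched below. It assumes the pattern means "digits, an underscore, then a non-empty id"; `find_mismatched_dirs` and `AUTHOR_DIR_PATTERN` are hypothetical names used for illustration, not part of the repository.

```python
import os
import re

# Assumed reading of <number-of-tweets>_<userid>: digits, underscore, non-empty id.
AUTHOR_DIR_PATTERN = re.compile(r"^\d+_.+$")


def find_mismatched_dirs(source_dir):
    """Return the sorted names of subdirectories that do not match the pattern."""
    mismatched = []
    for name in os.listdir(source_dir):
        path = os.path.join(source_dir, name)
        # Only subdirectories are checked; loose files are ignored.
        if os.path.isdir(path) and not AUTHOR_DIR_PATTERN.match(name):
            mismatched.append(name)
    return sorted(mismatched)
```

An empty result for the directory you pass to --source-dir-data suggests the names are in the expected shape; any name it returns is a candidate for the ValueError in the traceback.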

Answer 3 · 2017-11-15T14:46:52.000Z

I used this command for the n-grams: ngrams_generator.py --source-dir my_input_dir --dest-dir my_output_dir --features all --debug. I changed --dest-dir to my_ngrams_dir, which created a folder for each author (I'm using your dataset). Each author folder contains about 11 files (see attached).

Is that not the correct item to pass in for the classification? Thanks again for your assistance.

Answer 4 · 2017-11-15T15:22:38.000Z

Hi @scday. In our full pipeline (collecting data from Twitter and pre-processing the tweets), these author folders are generated with the pattern <number-of-tweets>_<userid>. The "number of tweets" count is computed in the first pre-processing step, in the filter_language_by_tweet.py code, and is used throughout the classification pipeline to filter out authors with too few messages.
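To illustrate how the count prefix is used, here is a minimal sketch modeled on the line shown in the traceback (`threshold <= int(os.path.basename(filename).split('_')[0])`). It is not the repository's filter_authors; the function name and the clearer error message are assumptions made for illustration.

```python
import os


def filter_authors_sketch(source_dir, min_tweets):
    """Keep author folders whose leading tweet count is at least min_tweets."""
    kept = []
    for name in sorted(os.listdir(source_dir)):
        if not os.path.isdir(os.path.join(source_dir, name)):
            continue
        # Everything before the first underscore is expected to be the count.
        prefix = name.split("_")[0]
        try:
            count = int(prefix)
        except ValueError:
            # Same failure as in the traceback, with the folder name reported.
            raise ValueError("folder %r does not start with a tweet count" % name)
        if min_tweets <= count:
            kept.append(name)
    return kept
```

A folder named with only an opaque id has no numeric prefix, so the int() conversion fails in the same way as in the reported error.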

I believe you got this dataset directly from us, skipping some of the pipeline steps. Twitter's terms obliged us to anonymize the data, and I think that is the problem you are facing: the folders must follow the pattern <number-of-tweets>_<userid>, but the ones you have are named with only an opaque string in place of the user id. You could write a script that renames each of these folders to this pattern (<number-of-tweets>_<userid>). You can get the number of tweets by opening any .pkl file in the folder and taking the length of the array inside.
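A rough sketch of such a renaming script follows. It assumes every subdirectory of the root is an author folder that has not been renamed yet, that the first .pkl file found in it unpickles to a sequence with one entry per tweet, and that a five-digit zero-padded count (as in 03152_1929050916) is wanted. The function names are hypothetical. Only unpickle files you trust; pickles written under Python 2 may need `encoding="latin1"` passed to `pickle.load` when read under Python 3.

```python
import os
import pickle


def count_tweets(author_dir):
    """Return the length of the object stored in the first .pkl file found."""
    for name in sorted(os.listdir(author_dir)):
        if name.endswith(".pkl"):
            with open(os.path.join(author_dir, name), "rb") as handle:
                return len(pickle.load(handle))
    raise FileNotFoundError("no .pkl file in %s" % author_dir)


def plan_renames(root, width=5):
    """Build (old_path, new_path) pairs without touching the disk."""
    plan = []
    for name in sorted(os.listdir(root)):
        old_path = os.path.join(root, name)
        if not os.path.isdir(old_path):
            continue
        # Prefix the existing (opaque) id with the zero-padded tweet count.
        new_name = str(count_tweets(old_path)).zfill(width) + "_" + name
        plan.append((old_path, os.path.join(root, new_name)))
    return plan


def apply_renames(plan):
    """Rename each folder according to the plan."""
    for old_path, new_path in plan:
        os.rename(old_path, new_path)
```

Printing the result of `plan_renames` before calling `apply_renames` gives a chance to review the new names. Running the script twice would add a second prefix, so it is meant to be applied once to a fresh copy of the data.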

I hope this helps.

Answer 5 · 2017-11-15T15:27:44.000Z

You have helped tremendously. The only step I did not follow was running filter_language_by_tweet. One question: if I go back, begin the process again, and include that step, will that resolve everything, or do you think it's better to just write a script that renames the folders?

Answer 6 · 2017-11-15T15:37:52.000Z

I suggest you begin the process again with that step involved.

Good luck.

Answer 7 · 2017-11-15T15:47:37.000Z

Ok. Thank you.