complementizer/wcep-mds-dataset

Error creating the dataset

mikelewis0 opened this issue · 4 comments

Hi, I'm trying to follow the steps in the README to create the dataset. The first two steps seemed to work ok, but then I hit this error in combine_and_split. Can you tell me how to fix this?

350000 cc articles done, 2677/10200 clusters done
360000 cc articles done, 2706/10200 clusters done
Traceback (most recent call last):
  File "combine_and_split.py", line 127, in <module>
    main(parser.parse_args())
  File "combine_and_split.py", line 109, in main
    clusters, args.cc_articles, id_to_cluster_idx, tmp_clusters_path
  File "combine_and_split.py", line 38, in add_cc_articles_to_clusters
    c.setdefault('cc_articles_filled', [])
AttributeError: 'NoneType' object has no attribute 'setdefault'

Will look into this ASAP!

I'm still not 100% sure what caused your error. Most likely an article from Common Crawl was stored multiple times: I could only reproduce it by inserting duplicate articles into data/cc_storage/cc_articles.jsonl. I made some changes to combine_and_split.py that should prevent the bug in that case. I also found and fixed another bug in combine_and_split.py, which only occurred if --max-cluster-size had been used before. Let me know if that fixes it for you.
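For anyone hitting the same traceback, here is a minimal sketch of the kind of guard described above. It is not the actual change made in combine_and_split.py. The names `clusters`, `id_to_cluster_idx`, and `cc_articles_filled` come from the traceback; the helper names, the `"id"` key, and the idea that a cluster slot can be `None` are assumptions made for illustration.

```python
import json


def iter_unique_articles(lines, key="id"):
    """Yield parsed JSONL records, skipping any whose `key` was already seen.

    `key="id"` is an assumed field name, not confirmed by the repository.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        article = json.loads(line)
        article_id = article.get(key)
        if article_id in seen:
            continue  # duplicate article, e.g. stored twice in cc_articles.jsonl
        seen.add(article_id)
        yield article


def add_article(clusters, id_to_cluster_idx, article, key="id"):
    """Attach an article to its cluster; return False if no cluster is available."""
    idx = id_to_cluster_idx.get(article.get(key))
    cluster = clusters[idx] if idx is not None else None
    if cluster is None:
        # Assumed failure mode: the slot is empty, so calling
        # cluster.setdefault(...) would raise the AttributeError above.
        return False
    cluster.setdefault("cc_articles_filled", []).append(article)
    return True
```

Passing an open file handle for `lines` would filter duplicates while streaming, so the whole file never needs to be loaded into memory at once.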

I get this error on the first step

Traceback (most recent call last):
  File "extract_wcep_articles.py", line 142, in <module>
    main(parser.parse_args())
  File "extract_wcep_articles.py", line 118, in main
    write_jsonl(articles, outpath, mode='a')
NameError: name 'write_jsonl' is not defined
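As a stopgap for the NameError, a helper matching the call `write_jsonl(articles, outpath, mode='a')` could look roughly like the sketch below. The signature is inferred from the traceback only; the body is an assumption and may differ from the repository's own implementation.

```python
import json


def write_jsonl(items, path, mode="a"):
    """Write each item as one JSON object per line (assumed behaviour)."""
    with open(path, mode, encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```

With `mode='a'` repeated calls append to the same file, which matches how the call in the traceback is used.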

This is fixed now.