Error creating the dataset
mikelewis0 opened this issue · 4 comments
Hi, I'm trying to follow the steps in the README to create the dataset. The first two steps seemed to work ok, but then I hit this error in combine_and_split. Can you tell me how to fix this?
350000 cc articles done, 2677/10200 clusters done
360000 cc articles done, 2706/10200 clusters done
Traceback (most recent call last):
  File "combine_and_split.py", line 127, in <module>
    main(parser.parse_args())
  File "combine_and_split.py", line 109, in main
    clusters, args.cc_articles, id_to_cluster_idx, tmp_clusters_path
  File "combine_and_split.py", line 38, in add_cc_articles_to_clusters
    c.setdefault('cc_articles_filled', [])
AttributeError: 'NoneType' object has no attribute 'setdefault'
Will look into this ASAP!
I'm still not 100% sure what caused your error, but it was probably an article from Common Crawl being stored multiple times. I could only reproduce it by inserting duplicate articles into data/cc_storage/cc_articles.jsonl. I made some changes to combine_and_split.py that should prevent the bug in that case. I also found another bug in combine_and_split.py (it only occurs if --max-cluster-size was used before). Let me know if that fixes it for you.
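If duplicate entries in the Common Crawl article file are indeed the cause, one possible workaround is to deduplicate that file before running combine_and_split.py. The following is only a minimal sketch: it assumes each line is a JSON object with an "id" field, and both that field name and the dedupe_jsonl function are hypothetical, not taken from the repository.

```python
import json


def dedupe_jsonl(in_path, out_path, key="id"):
    """Copy in_path to out_path, keeping only the first record per key.

    Assumption: every line is a JSON object and `key` identifies an article.
    Records that lack the key are always kept.
    Returns the number of records written.
    """
    seen = set()
    kept = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            value = record.get(key)
            if value is not None:
                if value in seen:
                    continue  # duplicate article, drop it
                seen.add(value)
            dst.write(json.dumps(record) + "\n")
            kept += 1
    return kept
```

Writing to a separate output file rather than overwriting the input keeps the original data intact in case the assumed key turns out to be wrong.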
I get this error on the first step
Traceback (most recent call last):
  File "extract_wcep_articles.py", line 142, in <module>
    main(parser.parse_args())
  File "extract_wcep_articles.py", line 118, in main
    write_jsonl(articles, outpath, mode='a')
NameError: name 'write_jsonl' is not defined
This is fixed now.
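For anyone on an older checkout: the NameError means extract_wcep_articles.py called write_jsonl without the name being defined or imported in that file. A helper compatible with the call in the traceback might look roughly like the sketch below. This is an assumption based only on the call signature write_jsonl(articles, outpath, mode='a'); the repository's own helper may differ.

```python
import json


def write_jsonl(items, path, mode="a"):
    """Write each item as one JSON object per line.

    Hypothetical stand-in matching the call seen in the traceback;
    mode="a" appends to an existing file, mode="w" overwrites it.
    """
    with open(path, mode, encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")
```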