error on 01_parse.py
danielmoore19 opened this issue · 2 comments
UnboundLocalError: local variable 'output_file' referenced before assignment
Traceback (most recent call last):
  File "scripts/01_parse.py", line 61, in <module>
    plac.call(main)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.6/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "scripts/01_parse.py", line 53, in main
    with output_file.open("wb") as f:
When I looked at the 01_parse.py code, it appears that output_file is not always assigned:
for doc in tqdm.tqdm(docs, desc="Docs", unit=""):
    if count < max_docs:
        doc_bin.add(doc)
        count += 1
    else:
        batch_num += 1
        count = 0
        msg.good(f"Processed {len(doc_bin)} docs")
        doc_bin_bytes = doc_bin.to_bytes()
        output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
        with output_file.open("wb") as f:
            f.write(doc_bin_bytes)
        msg.good(f"Saved parsed docs to file", output_file.resolve())
        doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
with output_file.open("wb") as f:
So if the doc count is lower than the max_docs setting, the else branch never runs and output_file is never assigned. Obviously it is simple to reduce the max_docs setting and force the else branch, but it would seem that output_file should always be created, correct?
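The failure mode described above can be reproduced without spaCy. The following is only a minimal sketch: the function name last_batch_name and the string file names are made up for illustration, and the loop only mimics the control flow of the script, where the name is bound inside the else branch and used after the loop.

```python
def last_batch_name(items, max_docs):
    """Mimic the script's control flow: output_file is bound only in the else branch."""
    count = 0
    batch_num = 0
    for _ in items:
        if count < max_docs:
            count += 1
        else:
            batch_num += 1
            count = 0
            output_file = f"batch-{batch_num}.spacy"
    # If the else branch never ran, output_file was never assigned,
    # so this line raises UnboundLocalError.
    return output_file


try:
    last_batch_name(range(3), max_docs=10)  # fewer items than max_docs
    raised = False
except UnboundLocalError:
    raised = True

print(raised)
```

With more items than max_docs the else branch runs at least once and the function returns normally, which matches the workaround of lowering max_docs.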
I noticed this too. I think the issue is actually the second with statement being opened too early. Currently it's written as:
with input_path.open("r", encoding="utf8") as texts:
    docs = nlp.pipe(texts, n_process=n_process)
    for doc in tqdm.tqdm(docs, desc="Docs", unit=""):
        if count < max_docs:
            doc_bin.add(doc)
            count += 1
        else:
            batch_num += 1
            count = 0
            msg.good(f"Processed {len(doc_bin)} docs")
            doc_bin_bytes = doc_bin.to_bytes()
            output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
            with output_file.open("wb") as f:
                f.write(doc_bin_bytes)
            msg.good(f"Saved parsed docs to file", output_file.resolve())
            doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
    with output_file.open("wb") as f:
        batch_num += 1
        output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
        doc_bin_bytes = doc_bin.to_bytes()
        f.write(doc_bin_bytes)
        msg.good(
            f"Complete. Saved final parsed docs to file", output_file.resolve()
        )
whereas it should be:
            with output_file.open("wb") as f:
                f.write(doc_bin_bytes)
            msg.good(f"Saved parsed docs to file", output_file.resolve())
            doc_bin = DocBin(attrs=["POS", "TAG", "DEP", "ENT_TYPE", "ENT_IOB"])
    batch_num += 1
    output_file = output_path / f"{input_path.stem}-{batch_num}.spacy"
    doc_bin_bytes = doc_bin.to_bytes()
    with output_file.open("wb") as f:
        f.write(doc_bin_bytes)
    msg.good(f"Complete. Saved final parsed docs to file", output_file.resolve())
The current code actually won't save the last output_file regardless of the doc count, because the last output file is never opened: the final with output_file.open(...) runs before output_file is reassigned, so it either reopens the second-to-last output_file (and overwrites it with the final batch) or raises the UnboundLocalError. I'm submitting a pull request to fix it.
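For anyone who wants to see the pattern in isolation, here is a rough sketch of the fix: build the path first, then open it, and always flush whatever is left after the loop. This is not the code from the pull request. The helper write_batches, the .bin extension, and the use of plain bytes in place of a DocBin are assumptions made so the example runs with the standard library only.

```python
import tempfile
from pathlib import Path


def write_batches(items, output_path, stem, max_docs):
    """Write byte strings in batches of max_docs and flush the remainder."""
    batch, batch_num, written = [], 0, []

    def flush():
        nonlocal batch, batch_num
        batch_num += 1
        # Assign the path before opening it, so the name is always bound
        # and each batch goes to its own file.
        output_file = output_path / f"{stem}-{batch_num}.bin"
        with output_file.open("wb") as f:
            f.write(b"\n".join(batch))
        written.append(output_file)
        batch = []

    for item in items:
        batch.append(item)
        if len(batch) >= max_docs:
            flush()
    if batch:  # final partial batch, written even if no full batch occurred
        flush()
    return written


with tempfile.TemporaryDirectory() as tmp:
    files = write_batches([b"a", b"b", b"c"], Path(tmp), "input", max_docs=2)
    names = [p.name for p in files]
    contents = [p.read_bytes() for p in files]

print(names)
```

One difference from the script's loop: this sketch appends each item before checking the batch size, so the item that triggers a flush still ends up in a batch. In the quoted loop the else branch does not call doc_bin.add(doc) for that item.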
This solves the issue. Follow ericfeunekes' post.