Only A unit words
sorami opened this issue · 8 comments
Prepare "A unit word only" resource.
chiVe 1.1
Dictionary: SudachiDict ver.20191030
- Extract A unit words (normalized forms for chiVe 1.1) from the dictionary
- Select only those words from the original chiVe
Extract A unit words
(Core dictionary = "small_lex.csv" + "core_lex.csv")
("Normalized Form" for chiVe 1.1)
import csv

# Collect the normalized forms of all A unit entries in the core dictionary.
a_words_normalized_form = set()
for fpath in ["SudachiDict/src/main/text/small_lex.csv",
              "SudachiDict/src/main/text/core_lex.csv"]:
    with open(fpath, encoding="utf-8") as f:
        for row in csv.reader(f, quotechar=None):
            # Column 12: normalized form; column 14: split unit type.
            nf, unit = row[12], row[14]
            assert unit in ("A", "B", "C")
            if unit == "A":
                a_words_normalized_form.add(nf)

len(a_words_normalized_form)  # 439729
Select A unit word vectors
from pathlib import Path

for fpath_chive in Path("chiVe/").glob("*/chive-1.1-*.txt"):
    fname_out = fpath_chive.name.replace(".txt", "-a-unit-only.txt")
    with open(fpath_chive, "r", encoding="utf-8") as fin, \
            open(fname_out, "w", encoding="utf-8") as fout:
        next(fin)  # skip the original header line
        for line in fin:
            word = line.split(" ")[0]
            # Keep only vectors whose word is an A unit normalized form.
            if word in a_words_normalized_form:
                print(line, end="", file=fout)
Then, add the header line <#vocab> <#dimension> to each file.
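The header step could be sketched as follows. This is a minimal sketch, not the script actually used for the release: `add_header` is a hypothetical helper, and it assumes each line is a word followed by space-separated vector components, with no spaces inside the word.

```python
import os
import tempfile


def add_header(path):
    """Prepend a "<#vocab> <#dimension>" line to a headerless vector file."""
    vocab, dim = 0, 0
    # First pass: count the lines and read the dimension from the first one.
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            if vocab == 0:
                # Dimension = number of fields after the word itself.
                dim = len(line.rstrip().split(" ")) - 1
            vocab += 1

    # Second pass: write header + body to a temporary file in the same
    # directory, then swap it in place of the original.
    dir_name = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=dir_name, delete=False
    ) as fout:
        fout.write(f"{vocab} {dim}\n")
        with open(path, encoding="utf-8") as fin:
            for line in fin:
                fout.write(line)
    os.replace(fout.name, path)
    return vocab, dim
```

Reading the file twice avoids holding all the vectors in memory, which matters for files of this size.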
I have prepared the "A unit word only" chiVe.
I made resources for all mc editions (mc is the minimum number of occurrences in the training corpus), just in case.
Each file contains the vectors in text format.
- chive-1.1-mc5-20200318-a-unit-only.tar.gz (465MB)
- chive-1.1-mc15-20200318-a-unit-only.tar.gz (400MB)
- chive-1.1-mc30-20200318-a-unit-only.tar.gz (350MB)
- chive-1.1-mc90-20200318-a-unit-only.tar.gz (274MB)
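For a quick sanity check of an extracted file, something like the following could work. It is a minimal sketch: `read_vectors` is a hypothetical helper, and it assumes the text file starts with a "<#vocab> <#dimension>" header line followed by one space-separated word and vector per line.

```python
def read_vectors(path, limit=None):
    """Read a word2vec-style text file into {word: [float, ...]}.

    Returns the vectors together with the vocab size and dimension
    declared in the header line.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<#vocab> <#dimension>"
        vocab, dim = (int(x) for x in next(f).split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            # Every row should carry exactly `dim` components.
            assert len(values) == dim
            vectors[word] = [float(v) for v in values]
    return vectors, vocab, dim
```

Passing a small `limit` lets you inspect the first few entries without loading the whole file into memory.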
Please have a look!
Word count
| | all | A unit only |
|---|---|---|
| v1.1 mc5 | 3,196,481 | 322,094 (10.1%) |
| v1.1 mc15 | 1,452,205 | 276,866 (19.1%) |
| v1.1 mc30 | 910,424 | 242,658 (26.7%) |
| v1.1 mc90 | 480,443 | 189,775 (39.5%) |
Note: This does NOT contain OOV.
(The original chiVe comprises "A unit words", "B unit words", "C unit words", and many "OOV words".)
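The percentages in the word count table are the A-unit-only count divided by the full vocabulary size of the same edition. A small sketch that reproduces them from the reported counts (the `coverage` helper is only illustrative):

```python
# (all words, A unit only) per edition, copied from the word count table.
counts = {
    "v1.1 mc5": (3_196_481, 322_094),
    "v1.1 mc15": (1_452_205, 276_866),
    "v1.1 mc30": (910_424, 242_658),
    "v1.1 mc90": (480_443, 189_775),
}


def coverage(total, a_unit):
    """Share of the full vocabulary kept in the A-unit-only file, in percent."""
    return 100 * a_unit / total


for name, (total, a_unit) in counts.items():
    print(f"{name}: {coverage(total, a_unit):.1f}%")
```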
Thank you! @sorami
I'd like to use v1.1 mc90 a-unit-only for spaCy Japanese models.
Could you please add v1.1 mc90 a-unit-only to the list of released archives in README.md?
Sure
@hiroshi-matsuda-rit
Added! ae8ad5f