WorksApplications/chiVe

Only A unit words

sorami opened this issue · 8 comments

Prepare "A unit word only" resource.

chiVe 1.1

Dictionary: SudachiDict ver.20191030

  1. Extract A unit words (normalized forms for chiVe 1.1) from the dictionary
  2. Select only those words from the original chiVe

Extract A unit words

(Core dictionary = "small_lex.csv" + "core_lex.csv")

("Normalized Form" for chiVe 1.1)

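import csv

a_words_normalized_form = set()

for fpath in ["SudachiDict/src/main/text/small_lex.csv",
              "SudachiDict/src/main/text/core_lex.csv"]:
    # Read explicitly as UTF-8 so the result does not depend on the platform default.
    with open(fpath, encoding="utf-8") as f:
        # Quote handling is disabled: fields are split on commas only.
        for row in csv.reader(f, quotechar=None, quoting=csv.QUOTE_NONE):
            # row[12] is the normalized form, row[14] is the unit type.
            nf, unit = row[12], row[14]
            assert unit in ("A", "B", "C")
            if unit == "A":
                a_words_normalized_form.add(nf)

len(a_words_normalized_form) # 439729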

Select A unit word vectors

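from pathlib import Path

for fpath_chive in Path("chiVe/").glob("*/chive-1.1-*.txt"):
    # The output file is written to the current working directory.
    fname_out = fpath_chive.name.replace(".txt", "-a-unit-only.txt")
    with open(fpath_chive, "r", encoding="utf-8") as fin, \
         open(fname_out, "w", encoding="utf-8") as fout:
        next(fin) # skip header
        for line in fin:
            # The word is the first space-separated field of each line.
            word = line.split(" ")[0]
            if word in a_words_normalized_form:
                print(line, end="", file=fout)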

Then, add the header line <#vocab> <#dimension> for each file.
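
The header step could be scripted roughly as below. This is only a sketch, not the script actually used: the helper name `add_header` is hypothetical, and it assumes each line of the filtered file is a word followed by space-separated vector values, with no spaces inside the word itself.

```python
import os
import tempfile


def add_header(path):
    """Prepend a '<#vocab> <#dimension>' line to a headerless vector file.

    Hypothetical helper; assumes lines look like 'word v1 v2 ... vN'.
    """
    n_vocab, dim = 0, None
    # First pass: count the rows and check that every row has the same width.
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            n_fields = len(line.rstrip("\n").split(" ")) - 1
            if dim is None:
                dim = n_fields
            elif n_fields != dim:
                raise ValueError(f"inconsistent dimension on line {n_vocab + 1}")
            n_vocab += 1
    if dim is None:
        raise ValueError("file is empty")

    # Second pass: write header + body to a temporary file in the same
    # directory, then swap it in place of the original.
    dir_name = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                     dir=dir_name, delete=False) as fout:
        fout.write(f"{n_vocab} {dim}\n")
        with open(path, encoding="utf-8") as fin:
            for line in fin:
                fout.write(line)
        tmp_name = fout.name
    os.replace(tmp_name, path)
    return n_vocab, dim
```

Calling `add_header(fname_out)` once per output file would rewrite that file with the header in place. Writing to a temporary file first avoids leaving a half-written file behind if the process is interrupted.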

@hiroshi-matsuda-rit

I have prepared the "A unit word only" chiVe.

I made resources for all mc editions (mc is the minimum appearance count in the training corpus), just in case.

Each file contains the vectors in text format.

  1. chive-1.1-mc5-20200318-a-unit-only.tar.gz (465MB)
  2. chive-1.1-mc15-20200318-a-unit-only.tar.gz (400MB)
  3. chive-1.1-mc30-20200318-a-unit-only.tar.gz (350MB)
  4. chive-1.1-mc90-20200318-a-unit-only.tar.gz (274MB)

Please have a look!
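
For anyone who wants to inspect the text files without extra libraries, a reader might look like the sketch below. It assumes the extracted file starts with the `<#vocab> <#dimension>` header line and that each following line is a word and its values separated by single spaces; `read_vectors` is a hypothetical name, not part of chiVe.

```python
def read_vectors(path):
    """Yield (word, vector) pairs from a text-format vector file.

    Hypothetical helper; assumes a '<#vocab> <#dimension>' header line.
    """
    with open(path, encoding="utf-8") as f:
        n_vocab, dim = map(int, next(f).split())
        n_read = 0
        for line in f:
            # First field is the word, the rest are the vector values.
            word, *values = line.rstrip("\n").split(" ")
            if len(values) != dim:
                raise ValueError(f"expected {dim} values for {word!r}")
            n_read += 1
            yield word, [float(v) for v in values]
    if n_read != n_vocab:
        raise ValueError(f"header says {n_vocab} words, found {n_read}")
```

Because it is a generator, the file is streamed line by line, which matters for archives of this size.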

Word count

| | all | A unit only |
|---|---|---|
| v1.1 mc5 | 3,196,481 | 322,094 (10.1%) |
| v1.1 mc15 | 1,452,205 | 276,866 (19.1%) |
| v1.1 mc30 | 910,424 | 242,658 (26.7%) |
| v1.1 mc90 | 480,443 | 189,775 (39.5%) |
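
The percentages are the A unit count divided by the full vocabulary size for the same mc edition. A quick check, using only the counts reported above (the variable names are just for illustration):

```python
# (all words, A unit only) per edition, copied from the word counts above.
counts = {
    "mc5": (3_196_481, 322_094),
    "mc15": (1_452_205, 276_866),
    "mc30": (910_424, 242_658),
    "mc90": (480_443, 189_775),
}

# Share of the vocabulary that survives the A unit filter.
shares = {name: a_unit / total for name, (total, a_unit) in counts.items()}

for name, share in shares.items():
    print(f"v1.1 {name}: {share:.1%}")
```

The share grows with mc because raising the minimum count removes rare words, which are mostly OOV, faster than it removes dictionary words.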

Note: The "A unit only" resource does NOT contain OOV words.

(The original chiVe comprises "A unit words", "B unit words", "C unit words", and a lot of "OOV words".)

Thank you! @sorami
I'd like to use v1.1 mc90 a-unit-only for spaCy Japanese models.

Could you please add v1.1 mc90 a-unit-only to the list of released archives in README.md?

Sure