[bug] codec.txt generated in the encoding stage cannot be read directly?
zyy-fc opened this issue · 1 comments
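Following the provided encoding_decoding.sh script, the encoding stage generates a codec.txt file.
Each line of this file has a form similar to:
utts_id "space" json.dumps(codecs)
This format cannot be read directly by read_text.py; the "load_jsonl_trans_int" function needs to be rewritten as follows: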
def load_jsonl_trans_int(path: Union[Path, str]) -> Dict[str, np.ndarray]:
    d = read_2column_text(path)
    retval = {}
    for k, v in d.items():
        try:
            value = json.loads(v)
            if isinstance(value, dict):
                retval[k] = np.array(value["trans"], dtype=int)
            elif isinstance(value, list):
                retval[k] = np.array(value, dtype=int)
            else:
                raise TypeError
        except TypeError:
            logging.error(f'Error happened with path="{path}", id="{k}", value="{v}"')
            raise
    return retval
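For illustration, the line format described above (an utterance ID, a space, then a JSON string) can be parsed with the standard library alone. This is only a minimal sketch: the helper names `parse_codec_line` and `load_codec_lines` are hypothetical, the nested-list shape of the codes is an assumption, and it is not the FunCodec implementation.

```python
import json
from typing import Dict, Iterable, List, Tuple


def parse_codec_line(line: str) -> Tuple[str, list]:
    """Split one 'utt_id<space>json' line into its ID and decoded JSON payload."""
    # Split only on the first run of whitespace, so spaces inside the JSON
    # string (json.dumps emits ", " between items) stay in the payload.
    utt_id, payload = line.strip().split(maxsplit=1)
    return utt_id, json.loads(payload)


def load_codec_lines(lines: Iterable[str]) -> Dict[str, list]:
    """Build an utt_id -> codes mapping from an iterable of text lines."""
    result: Dict[str, list] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id, codes = parse_codec_line(line)
        result[utt_id] = codes
    return result


# Hypothetical sample data; the real shape of the codes may differ.
sample_codes: List[List[int]] = [[1, 2, 3], [4, 5, 6]]
sample_lines = [
    "utt_001 " + json.dumps(sample_codes),
    "utt_002 " + json.dumps([[7, 8, 9]]),
]
loaded = load_codec_lines(sample_lines)
```

To read an actual file, the same helper could be fed an open file handle, for example `load_codec_lines(open("codec.txt", encoding="utf-8"))`.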
Thanks for your report. As expected, the codec.txt should be loaded with the load_codec_json function in funcodec/datasets/iterable_dataset.py. The function load_jsonl_trans_int in read_text.py is not used to load codec tokens. Thanks for your modification as well.