Walleclipse/ChineseAddress_OCR

Can you share how to generate the following file?

AnddyWang opened this issue · 6 comments

Can you share how to generate the following file? Looking forward to your reply. Thanks very much.

full_address1.csv
so_stupid_smart_adrs_lib_fuck.me.txt
strokes.txt

  1. full_address1.csv
    It was simply downloaded from the internet. I searched for "Chinese address library" or "中文地址库" on Baidu and found it.
  2. so_stupid_smart_adrs_lib_fuck.me.txt
    It is based on full_address1.csv and is generated by concatenating address lines of several levels. For example, according to full_address1.csv, "河南省" is a first-level address, "洛阳市" is second level, "宜阳县" is third level, and "穆册乡" is fourth level. I generate all possible combinations of these address levels, such as "河南省" + "洛阳市" (first level + second level), "洛阳市" + "宜阳县" (second level + third level), and "河南省" + "洛阳市" + "宜阳县" (first level + second level + third level). Then I sort them by the length of the address string. PS: It is a rather stupid method of generating an address library, but it is helpful for fuzzy matching according to the length of the address.
  3. strokes.txt
    This Chinese stroke table (中文笔画表) was downloaded from the internet.
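
For readers who want to rebuild the library file, the concatenation step described in item 2 might look roughly like the sketch below. It is not the repository's actual script. It assumes each CSV row holds one address with its levels in separate columns (the real layout of full_address1.csv may differ) and that only combinations of two or more adjacent levels are wanted. The names `contiguous_combinations` and `build_library` are hypothetical.

```python
import csv
import io


def contiguous_combinations(levels):
    """Return every concatenation of two or more adjacent address levels."""
    combos = []
    n = len(levels)
    for start in range(n):
        # end is exclusive, so start + 2 guarantees at least two levels
        for end in range(start + 2, n + 1):
            combos.append("".join(levels[start:end]))
    return combos


def build_library(rows):
    """Collect unique combinations from all rows, sorted by string length."""
    seen = set()
    for row in rows:
        levels = [part for part in row if part]  # drop empty columns
        seen.update(contiguous_combinations(levels))
    # Sort by length first, then alphabetically to keep the order stable
    return sorted(seen, key=lambda s: (len(s), s))


# Assumed row layout: one column per address level
sample_csv = "河南省,洛阳市,宜阳县,穆册乡\n"
rows = list(csv.reader(io.StringIO(sample_csv)))
library = build_library(rows)

for entry in library:
    print(entry)
```

Writing `library` to a text file, one entry per line, would give a file shaped like the one described above.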

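Thank you very much for your reply. I tried "吉林省长白山保泸开发区管理委员会池北区" (the character "护" is mistyped as "泸"). With the threshold set to 95, it is always corrected to "吉林省长春市汽车开发区管理委员会池北区"; with the threshold set to 99, the input is returned unchanged. How can I fix this?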

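Hello. The address library "so_stupid_smart_adrs_lib_fuck.me.txt" contains only "吉林省白山市抚松县长白山保护开发区管委会池北区"; it has no entry for "吉林省长白山保护开发区". In other words, "吉林省长白山保护开发区" is a first-level address joined directly to a fourth-level address. When I built "so_stupid_smart_adrs_lib_fuck.me.txt", I did not consider cases that skip the second- and third-level addresses, so this input cannot be corrected. You could rebuild a similar address library that also includes combinations that skip the intermediate levels. Of course, this is only the most naive solution; I have not yet thought of a more effective one.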
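
One possible way to rebuild the library so that intermediate levels can be skipped is to take every order-preserving subset of the levels rather than only adjacent ones. The sketch below uses `itertools.combinations`, which keeps the input order. The split of the example address into four levels and the name `ordered_subsets` are assumptions made for illustration, not taken from the repository.

```python
from itertools import combinations


def ordered_subsets(levels, min_parts=2):
    """Concatenate every subset of levels, keeping the original order.

    Unlike adjacent-only concatenation, this also yields entries that
    skip intermediate levels (for example first level + fourth level).
    """
    result = []
    for size in range(min_parts, len(levels) + 1):
        for picked in combinations(levels, size):
            result.append("".join(picked))
    return result


# Assumed split of the library entry into its four levels
levels = ["吉林省", "白山市", "抚松县", "长白山保护开发区管委会池北区"]
entries = ordered_subsets(levels)

for entry in sorted(entries, key=len):
    print(entry)
```

With four levels this produces 11 entries instead of 6, so the library grows noticeably; whether the extra candidates help or hurt matching at a given threshold would need to be checked against real inputs.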

OK, thank you very much. Do you still have links for the two files below? Could you share them?

  1. full_address1.csv
    It was simply downloaded from the internet. I searched for "Chinese address library" or "中文地址库" on Baidu and found it.
  2. strokes.txt
    This Chinese stroke table (中文笔画表) was downloaded from the internet.
  1. strokes.txt: https://github.com/helmz/Corpus/tree/master/zh_dict , https://www.cnblogs.com/Comero/p/8997585.html

  2. full_address1.csv: Sorry, I have forgotten the exact link. You can search for "中文地址库".
    This link may serve as a reference: https://blog.csdn.net/qq_37352702/article/details/78933321

OK, thank you very much.