中文知识图谱实体链接服务。
为了方便按照需求改造jieba,所以项目中直接将jieba的源码作为了项目内的一个module。
对原始的jieba做了如下的修改:
- 在init文件中修改re_han_default,添加了空格、/、<、>。
- 自定义词典改成用tab作为分隔,以支持英文自定义词典。
- 修改了tokenize方法对于mode=search时的分词
- 修改了cut中cut_all的bug
之所以将whoosh的代码添加到当前项目,是为了解决这里越界的问题。
修改了whoosh\column.py中201行和202行的代码。
# 把原来的allow_longs=False改成了allow_longs=True
self._lengths = GrowableArray(allow_longs=True)
self._offsets = GrowableArray(allow_longs=True)
...
完成34690000行,耗时18036.621724200002
完成34700000行,耗时18043.1189176
完成34710000行,耗时18046.7838767
完成34720000行,耗时18053.1160476
完成34730000行,耗时18056.683092
完成34740000行,耗时18060.2204738
Traceback (most recent call last):
File "D:\anaconda\lib\site-packages\whoosh\util\numlists.py", line 57, in append
self.array.append(n)
OverflowError: Python int too large to convert to C unsigned long
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "index_create.py", line 10, in <module>
triple_index_create(data, index_dir)
File "E:\PythonCodes\EntityLinking\entitylinking\index_whoosh\triple_index_creator.py", line 31, in triple_index_create
_create_triple_index(storage, data)
File "E:\PythonCodes\EntityLinking\entitylinking\index_whoosh\triple_index_creator.py", line 44, in _create_triple_index
_write_document(writer, data)
File "E:\PythonCodes\EntityLinking\entitylinking\index_whoosh\triple_index_creator.py", line 82, in _write_document
object=fields[2].strip()
File "D:\anaconda\lib\site-packages\whoosh\writing.py", line 786, in add_document
perdocwriter.finish_doc()
File "D:\anaconda\lib\site-packages\whoosh\codec\whoosh3.py", line 250, in finish_doc
self.add_column_value("_stored", STORED_COLUMN, sf)
File "D:\anaconda\lib\site-packages\whoosh\codec\base.py", line 821, in add_column_value
self._get_column(fieldname).add(self._docnum, value)
File "D:\anaconda\lib\site-packages\whoosh\columns.py", line 1265, in add
self._child.add(docnum, v)
File "D:\anaconda\lib\site-packages\whoosh\columns.py", line 855, in add
VarBytesColumn.Writer.add(self, docnum, v)
File "D:\anaconda\lib\site-packages\whoosh\columns.py", line 276, in add
self._offsets.append(self._offset_base)
File "D:\anaconda\lib\site-packages\whoosh\util\numlists.py", line 59, in append
self._retype(n)
File "D:\anaconda\lib\site-packages\whoosh\util\numlists.py", line 48, in _retype
raise OverflowError("%r is too big to fit in an array" % maxnum)
OverflowError: 4294967357 is too big to fit in an array
import whoosh.index as index
import whoosh.qparser as qparser
from whoosh.qparser import QueryParser
import sys
sys.path.append('/Users/caoxiaojie/pythonCode/EntityLinking')
index_dir = '/Users/caoxiaojie/pythonCode/EntityLinking/data/index_test'
ix = index.open_dir(index_dir, indexname='triple')
s = ix.searcher()
qp = QueryParser('subject',schema=ix.schema,group=qparser.OrGroup)
qp2 = QueryParser('subject',schema=ix.schema)
q = qp.parse(u'subject:(李娜) AND object:(唱歌) OR subject(唱歌)')
results = s.search(q, limit=20)
for r in results:
print(r['subject'])
在百度大脑网站上,提供了一个Entity Linking任务的数据集,这个数据集包含了一个有39万实体,及三百多万的三元组的数据集,也提供了测试数据、训练数据。而且这些数据也很契合搜索场景中的实体链接任务,可以用它们进行模型的评估。