kg build with unstructured data, an attempt!
wey-gu opened this issue · 5 comments
Input Data: paul_graham_essay.txt
Schema:
(RDF Style, will try PropGraph Style later)
CREATE TAG `entity` `name` string;
CREATE EDGE `relationship` `name` string;
Log: all.log
Model: Azure OpenAI: 3.5-turbo 2023-07-01-preview
It's smooth! Thanks! Some things we may consider changing:
...
the knowledge graph has following schema and node name must be a real :
...
NodeType "entity" ("name":string )
EdgeType "relationship" ("name":string )
...
{userPrompt}
...
- feat: consider putting the trailing nGQL DML at the end of all.log in a separate, downloadable file?
- feat: support parsing edge prop
It seems that with an RDF-style schema like the one below, which has only one edge type and stores the relationship kind in edge.name, our current implementation does not guide the LLM to generate edge props. I guess we should first address typical Property Graph style modeling?
"""graph-schema
NodeType "entity" ("name":string )
EdgeType "relationship" ("name":string ) #<---------- name
"""
{userPrompt}
Return the results directly, without explain and comment. The results should be in the following JSON format:
{
"nodes":[{ "name":string,"type":string,"props":object }],
"edges":[{ "src":string,"dst":string,"edgeType":string,"props":object }] #<----------- there is no name here.
}
Thus, the extracted edges come without the edge type (the name prop):
INSERT EDGE `relationship` () VALUES "software as a service"->"Aspra":();
INSERT EDGE `relationship` () VALUES "still life"->"visual cues":();
INSERT EDGE `relationship` () VALUES "color changes"->"visual cues":();
INSERT EDGE `relationship` () VALUES "software"->"documents":();
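For comparison, here is a minimal sketch of how an extracted edge could be turned into an `INSERT EDGE` statement that carries its props. The helper name `insert_edge_ngql` and the relationship name `"produces"` are hypothetical, chosen only for illustration; this is not the project's actual generator, and it only handles simple scalar prop values.

```python
import json


def insert_edge_ngql(edge):
    """Build an nGQL INSERT EDGE statement from one extracted edge dict.

    Hypothetical helper: assumes the edge uses the JSON shape from the
    prompt (src, dst, edgeType, props) and that prop values are scalars.
    """
    props = edge.get("props") or {}
    edge_type = edge["edgeType"]
    # Prop names are backtick-quoted, values rendered as double-quoted literals.
    names = ", ".join(f"`{key}`" for key in props)
    values = ", ".join(json.dumps(v, ensure_ascii=False) for v in props.values())
    src = json.dumps(edge["src"], ensure_ascii=False)
    dst = json.dumps(edge["dst"], ensure_ascii=False)
    return f"INSERT EDGE `{edge_type}` ({names}) VALUES {src}->{dst}:({values});"


# "produces" is an invented relationship name, not taken from the log.
with_name = insert_edge_ngql(
    {"src": "software", "dst": "documents",
     "edgeType": "relationship", "props": {"name": "produces"}}
)
without_name = insert_edge_ngql(
    {"src": "software", "dst": "documents",
     "edgeType": "relationship", "props": {}}
)
print(with_name)
print(without_name)
```

With empty props the sketch reproduces the statements above; with a `name` prop the statement would list `name` in the property list and its value after the colon.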
This prompt is just an example output to ensure the result format. The graph schema is already injected into the prompt, like NodeType "entity" ("name":string ).
Also, some spaces may have many edge types and edge props, which would lead to a very long context.
In my case the edge prop is generated successfully:
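{
"nodes": [
{ "name": "I", "type": "person", "props": {"role": "test engineer"} },
{ "name": "Explorer project", "type": "item", "props": {"version": ["v3.6.0", "v3.7.0"], "work content": ["testing work", "submitting issues", "designing and implementing test cases", "API automated testing", "creating test plans", "writing test reports"] } },
{ "name": "Analytics project", "type": "item", "props": {"version": ["v3.6.0"], "work content": ["iteration testing", "functional testing", "performance testing", "submitting issues", "writing test cases and test reports"] } },
{ "name": "bank project", "type": "item", "props": {"work content": ["deploying an internal graph computing test cluster", "data development", "constructing risk-control business scenarios", "writing and executing test cases", "tracking issues"] } },
{ "name": "confluence", "type": "item", "props": {"purpose": "recording test cases"} },
{ "name": "cloud code repository", "type": "item", "props": {"purpose": "storing API automated test code"} }
],
"edges": [
{ "src": "I", "dst": "Explorer project", "edgeType": "relationship", "props": {"relationship type": "responsible for testing"} },
{ "src": "I", "dst": "Analytics project", "edgeType": "relationship", "props": {"relationship type": "responsible for testing"} },
{ "src": "I", "dst": "bank project", "edgeType": "relationship", "props": {"relationship type": "responsible for support and testing"} },
{ "src": "I", "dst": "confluence", "edgeType": "relationship", "props": {"relationship type": "records test cases in it"} },
{ "src": "I", "dst": "cloud code repository", "edgeType": "relationship", "props": {"relationship type": "commits API automated test code to it"} }
]
}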
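A result like this can be checked mechanically before it is written to the graph. The following is a rough sketch, assuming the JSON shape above; the function name `edges_missing_props` and the `required` mapping are hypothetical and not part of the current implementation.

```python
def edges_missing_props(result, required):
    """Return (src, dst) pairs whose props lack a required key.

    `required` maps an edge type to the prop names expected on it.
    Both the function and the mapping are illustrative assumptions.
    """
    missing = []
    for edge in result.get("edges", []):
        props = edge.get("props") or {}
        needed = required.get(edge.get("edgeType"), [])
        # A prop counts as missing when it is absent, None, or empty.
        if any(props.get(key) in (None, "") for key in needed):
            missing.append((edge["src"], edge["dst"]))
    return missing


sample = {
    "nodes": [],
    "edges": [
        {"src": "Lisp", "dst": "AI", "edgeType": "relationship", "props": {}},
        {"src": "RISD", "dst": "art", "edgeType": "relationship",
         "props": {"name": "teaches"}},  # invented value for illustration
    ],
}
flagged = edges_missing_props(sample, {"relationship": ["name"]})
print(flagged)
```

Edges flagged this way could be retried or logged rather than inserted with a null prop.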
I see, we could tune the prompt to better address this. In my case (where I think the schema causes no obvious confusion), it failed in most cases:
> match ()-[e]->() RETURN e
+------------------------------------------------------------------------------------------------------+
| e |
+------------------------------------------------------------------------------------------------------+
| [:relationship "teacher"->"grade" @0 {name: __NULL__}] |
| [:relationship "essay question"->"Cezanne" @0 {name: __NULL__}] |
| [:relationship "roommate"->"Robert" @0 {name: __NULL__}] |
| [:relationship "Florence"->"Duomo" @0 {name: __NULL__}] |
| [:relationship "Florence"->"Orsanmichele" @0 {name: __NULL__}] |
| [:relationship "Florence"->"Pitti" @0 {name: __NULL__}] |
| [:relationship "Florence"->"Via Ricasoli" @0 {name: __NULL__}] |
| [:relationship "Florence"->"budget" @0 {name: __NULL__}] |
| [:relationship "bedroom"->"night" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"AI" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"C++" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"John McCarthy" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"Lisp hacker" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"McCarthy" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"On Lisp" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"Platonic form of Lisp" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"Turing machine" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"book" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"expression" @0 {name: __NULL__}] |
| [:relationship "Lisp"->"shot" @0 {name: __NULL__}] |
| [:relationship "comment"->"angry people" @0 {name: __NULL__}] |
| [:relationship "RISD"->"HTML" @0 {name: __NULL__}] |
| [:relationship "RISD"->"Providence" @0 {name: __NULL__}] |
| [:relationship "RISD"->"art" @0 {name: __NULL__}] |
| [:relationship "RISD"->"job" @0 {name: __NULL__}] |
| [:relationship "RISD"->"money" @0 {name: __NULL__}] |
| [:relationship "experts"->"talks" @0 {name: __NULL__}] |
| [:relationship "run"->"air conditioners" @0 {name: __NULL__}] |
| [:relationship "Arc"->"interpreter" @0 {name: __NULL__}] |
| [:relationship "brain"->"feature" @0 {name: __NULL__}] |
I will look into the prompt and tune it to improve this in a PR then.
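One possible direction for such tuning, sketched under assumptions: render the JSON format section of the prompt from the schema, so that each edge type shows its props explicitly instead of a generic `"props":object`. The function names and the dict-based schema representation are hypothetical; the actual prompt template may be organized differently.

```python
def props_shape(props):
    """Render a prop mapping such as {"name": "string"} as {"name":string}."""
    return "{" + ",".join(f'"{name}":{ptype}' for name, ptype in props.items()) + "}"


def render_result_format(node_types, edge_types):
    """Build the result-format section with one example entry per type.

    Hypothetical sketch: node_types and edge_types map a type name to its
    props, e.g. {"relationship": {"name": "string"}}.
    """
    nodes = ",".join(
        f'{{ "name":string,"type":"{t}","props":{props_shape(p)} }}'
        for t, p in node_types.items()
    )
    edges = ",".join(
        f'{{ "src":string,"dst":string,"edgeType":"{t}","props":{props_shape(p)} }}'
        for t, p in edge_types.items()
    )
    return "{\n" + f'"nodes":[{nodes}],\n"edges":[{edges}]\n' + "}"


rendered = render_result_format(
    {"entity": {"name": "string"}},
    {"relationship": {"name": "string"}},
)
print(rendered)
```

The rendered text grows with the number of edge types, so for spaces with many edge types and props this trades a longer context for more explicit guidance.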
OK, maybe this prompt does not work well with some spaces.