ljsabc/Fujisaki

Feature request: WeChat or QQ support

Opened this issue · 5 comments

Thank you! I really enjoy it and can't wait to train a version of myself!
Could you add a feature to use message logs from WeChat or QQ?
Most of us in mainland China use these apps more frequently, so this feature would make the project usable for more people.

To be frank, this was initially a PoC project to demonstrate the ability to extrapolate (or interpolate) from general LLM knowledge to enjoy "personalities".

From my point of view, it's a good idea to parse any sort of text (or conversation), but I really do not have much time to build additional parsers.
Would you (or someone else) mind showing a proof of concept, or a repo that can parse WeChat/QQ dialogues, so that we can consider integrating it into this project?

I will leave this issue open and, if I get the chance, I will also do some investigation myself.

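QQ chat logs can be exported with QQ's built-in export function, which supports several formats.
The question is: what format do they need to be converted into?
Also, how should context in group chats be handled?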

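Essentially, only two things need to be done:

  • Determine which run of messages belongs to the same conversation. You can split by time, or simply consider the previous N messages; a few false positives are not a big problem.
  • Know which messages were sent by you and which were sent by others.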
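
As a rough illustration of these two requirements, here is a minimal sketch of time-based splitting. The `Message` class, the `split_sessions` and `is_mine` helpers, and the 30-minute gap are all assumptions made for this example; none of them come from the project or from any QQ/WeChat export format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    """Hypothetical parsed chat message; the fields are assumptions."""
    sender: str
    time: datetime
    text: str


def split_sessions(messages, max_gap=timedelta(minutes=30)):
    """Group messages into conversations by time.

    Messages are sorted by timestamp; a pause longer than max_gap
    (an arbitrary threshold chosen for this sketch) starts a new session.
    """
    sessions = []
    for msg in sorted(messages, key=lambda m: m.time):
        if sessions and msg.time - sessions[-1][-1].time <= max_gap:
            sessions[-1].append(msg)
        else:
            sessions.append([msg])
    return sessions


def is_mine(msg, my_name):
    """Second requirement: tell your own messages apart from others'."""
    return msg.sender == my_name
```

A parser for any export format would only need to produce a list of such messages; everything after that is format-independent.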

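Once those two things are settled, the rest is much easier. Within each group of messages that belongs to the same conversation:

  1. Randomly pick one of your own messages.
  2. Randomly pick N chat messages preceding that message and join them with "\n" to form the instruction. N should not be too large; drawing it from a Poisson distribution is suggested, since a context that is too long is also hard for the network to train on. If you have a better way to generate the instruction or to concatenate the strings, even better.
  3. The preceding N messages may include your own messages.
  4. Use the randomly picked message of yours as the response.
  5. Prepare the dataset according to the requirements in #1.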
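
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: a session is a chronological list of `(sender, text)` tuples, "the N preceding messages" is read as the N messages immediately before the chosen one, and the `instruction`/`response` dictionary keys are placeholders rather than the exact dataset format asked for in #1. The function names and the defaults `lam=3.0` and `max_context=8` are likewise made up for this example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def make_example(session, my_name, lam=3.0, max_context=8, sep="\n", rng=None):
    """Build one instruction/response pair from a single conversation.

    session: chronological list of (sender, text) tuples (assumed layout).
    Returns None if you never replied to anything in this session.
    """
    rng = rng or random.Random()
    # Step 1: candidates are your own messages with at least one message before them.
    candidates = [i for i, (sender, _) in enumerate(session)
                  if sender == my_name and i > 0]
    if not candidates:
        return None
    idx = rng.choice(candidates)
    # Step 2: N is Poisson-distributed, clamped to [1, max_context] and to what exists.
    n = min(max(1, sample_poisson(lam, rng)), max_context, idx)
    # Step 3: the context may contain your own earlier messages as well.
    context = [text for _, text in session[idx - n:idx]]
    # Step 4: your chosen message becomes the response.
    return {"instruction": sep.join(context), "response": session[idx][1]}
```

Calling `make_example` several times per session yields different context lengths for the same conversation, which is one simple way to get more training pairs out of a small log.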

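Note that if you build the instruction in some way other than joining with "\n", you must build the instruction the same way at test (inference) time.
That is the general idea. As for which export format to use, it depends on which one is easiest to parse while still meeting the two requirements described at the start.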

I'd like to ask: would it be much easier to process if the QQ group chat records were provided as an SQL file?