v0.2
Python 2.7
- Extract body text from analyst reports and annual reports.
- Use TextRank to extract weighted keywords from analyst reports.
- Use the analyst-report keywords to extract key sentences from annual reports and assign them weights.
  - Process all files in a directory.
  - Strip special Chinese punctuation marks from the results.
  - Raise an error if the training-set and annual-report directories contain no files.
  - Update the keyword dictionary when the training set changes.
- Classify analyst reports and annual reports by sector.
- Exclude redundant text from the results.
- Improve the comparison conditions for LTTextBox (determine_obj_text()).
- Create an exclusion list for redundant keywords.
- Find a suitable ML model to detect redundant items.
- Extract sentences that contain key data from the body text.
- Exclude page headers and footers.
- Extract images.
- Refactor to object-oriented code.
- Train a custom jieba dictionary for text segmentation.
- Add part-of-speech (POS) tags to new words in the custom jieba dictionary.
- Extract keywords and their associated data from the body text and import them into a database.
- Run OCR with Tesseract or ABBYY FineReader Engine.
- Extract keywords from annual reports.
  - TextRank weight
  - Word frequency
  - Extract the "Directors' Report" (董事会报告) or "Business Conditions Analysis" (经营情况讨论分析) section first.
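
The step "use analyst-report keywords to extract weighted key sentences from annual reports" could be sketched roughly as below. This is only an illustration, not the project's actual code: the names KEYWORD_WEIGHTS, split_sentences, score_sentence and key_sentences are hypothetical, the weights are made-up stand-ins for TextRank output, and matching is plain substring search rather than jieba segmentation.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re

# Hypothetical keyword weights, standing in for TextRank output
# computed from analyst reports (values are made up for illustration).
KEYWORD_WEIGHTS = {"revenue": 0.5, "profit": 0.25, "growth": 0.125}

# Split on Chinese or Western sentence-ending punctuation.
# Assumption: a naive split; it would also break decimals such as "3.5".
SENTENCE_END = re.compile(r"[。!?!?.]+")


def split_sentences(text):
    """Split body text into non-empty, stripped sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def score_sentence(sentence, weights):
    """Sum the weights of every keyword that occurs in the sentence."""
    lowered = sentence.lower()
    return sum(w for kw, w in weights.items() if kw in lowered)


def key_sentences(text, weights, top_n=3):
    """Return up to top_n (score, sentence) pairs with score > 0, best first."""
    scored = [(score_sentence(s, weights), s) for s in split_sentences(text)]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


SAMPLE = "Revenue and profit rose. The weather was mild. Growth slowed!"
for score, sentence in key_sentences(SAMPLE, KEYWORD_WEIGHTS):
    print(score, sentence)
```

Substring matching is used here because Chinese text has no word boundaries. A real pipeline would more likely segment each sentence with jieba first and match on tokens.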