anaylyse_file_structure

This tool converts a file that follows a specific format into a structured CSV document.

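Description

The original goal of this project is to convert the content of documents that carry structural markers into a tree structure. The tree can then be exported to CSV, or used to extract key content.

Main modules: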

node.py

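Provides an abstraction of the document structure. Each Node object has a next node and child nodes. Only top-level nodes have a next node; the parent-child structure of the document is defined through child nodes, and the order of child nodes at the same level is their order in the document.

Example: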

from node import Node

node1 = Node("Top Node 1")
node2 = Node("Top Node 2")
node2.val = "Change its value"
node1.next = node2
node1.addChild(Node("Child Node"))
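
The linking rules described above can be pictured with a small stand-in class. This is only a sketch under assumptions: `SketchNode` and `walk` are hypothetical names introduced for illustration, and the real `Node` in node.py may store its children differently.

```python
class SketchNode:
    """Hypothetical stand-in for Node: a value, ordered children, a next sibling."""

    def __init__(self, val=""):
        self.val = val
        self.children = []  # child nodes, kept in document order
        self.next = None    # only used to chain top-level nodes

    def addChild(self, child):
        self.children.append(child)


def walk(node, depth=0):
    """Yield (depth, val) pairs for each node and its subtree, in document order."""
    while node is not None:
        yield depth, node.val
        for child in node.children:
            yield from walk(child, depth + 1)
        node = node.next


top1 = SketchNode("Top Node 1")
top2 = SketchNode("Top Node 2")
top1.next = top2
top1.addChild(SketchNode("Child Node"))

rows = list(walk(top1))
```

Here `rows` lists the first top-level node, then its child, then the second top-level node, which is the order a flat export would naturally follow.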

convert.py

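Provides a method for processing a piece of a document, converting a structured document into a Node object. It exposes the convertStringListToNode method, which takes these parameters: str_list, the full content of the document to process; rex_list, the regular expressions that define the document's hierarchy levels; current_level, the starting level of the document (default 0); node, the Node object passed in.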

This could certainly be improved by exposing an easier-to-use method, but I am not very experienced with Python.
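
To make the role of a list of level patterns more concrete, here is one plausible reading of level-based parsing, written as a self-contained sketch. It is not the code in convert.py: `parse_levels` is a hypothetical helper, it returns plain dictionaries instead of Node objects, and the rule that unmatched lines are appended to the most recent entry is an assumption.

```python
import re


def parse_levels(lines, patterns):
    """Build a nested list of {"val", "children"} dicts from heading-like lines.

    A line matching patterns[k] opens a new entry at level k; any other
    non-empty line is appended to the text of the most recently opened entry.
    """
    root = {"val": "", "children": []}
    stack = [root]  # stack[k + 1] is the currently open entry at level k
    for line in lines:
        level = next((k for k, p in enumerate(patterns) if p.match(line)), None)
        if level is None:
            if line.strip() and len(stack) > 1:
                stack[-1]["val"] += "\n" + line.strip()
            continue
        # Close entries at this level or deeper; a skipped level
        # attaches to the nearest open ancestor.
        del stack[min(level, len(stack) - 1) + 1:]
        entry = {"val": line.strip(), "children": []}
        stack[-1]["children"].append(entry)
        stack.append(entry)
    return root["children"]


# Hypothetical two-level document: numbered headings and lettered sub-items.
patterns = [re.compile(r"^\d+\."), re.compile(r"^\s+\(\w\)")]
lines = ["1. First", "body text", "  (a) Sub item", "2. Second"]
tree = parse_levels(lines, patterns)
```

The stack keeps one open entry per level, so a new heading at level k only needs to discard the entries at level k and deeper before attaching itself to what remains.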

Usage:

Example 1:

import re
import convert
from node import Node

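def convertToString():
    # Sample document: numbered headings followed by indented body text.
    text = """1. Why use React?

    This question is hard to answer directly, so let's ask a different one: what are the advantages of using React?

    It is fast, and it is good. First, HTML has no built-in capacity for reuse; it only marks up the elements on a web page, so identical elements can only be reused by copy and paste, which is neither efficient nor elegant.

    So people turned to JS. JS can achieve reuse through functions, but to operate on DOM elements it has to locate them first and then act on the elements it finds. The native DOM is tedious to write, and wrapping that part of JS gave us jQuery. What was wrong with jQuery? Nothing in itself; it just is not as convenient as JSX. Manipulating DOM objects directly from JS is actually the fastest approach, but setting things up through function calls is not the structure of the HTML itself, so it is neither intuitive nor convenient. React lets us define a tree structure directly in JS code, adding one more layer on top of jQuery's element operations. With React, developers only need to care about what the page should look like, not how to make it look that way, which simplifies development work. As for whether it actually became simpler, I don't think so; the catch is that there is more you need to understand.

    So React's advantage is that it simplifies development, speeds up the workflow, and makes the resulting code cleaner and easier to maintain.

    
2. Why use JSX?

    First, what is JSX? JSX is a language that lets you write XML-like markup inside JS code; it works by converting JSX code into JS code with a tool.

    Defining structure directly in JSX lets the structure of a front-end page be displayed intuitively and reused.
    """

    # Raw string so that \d and \. reach the regex engine unchanged.
    rex_list_str = [r"^\d{0,3}\..*"]

    lines = text.splitlines()

    # Compile each level pattern once.
    rex_list = [re.compile(x) for x in rex_list_str]
    node = Node()
    convert.convertStringListToNode(lines, rex_list, 0, node)
    node.toDataFrams('nodejs+react.csv')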
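
To see which lines the heading pattern in Example 1 picks up, a quick check like the following can help. It uses only the standard `re` module, and the sample lines are made up for illustration.

```python
import re

# Same pattern as in Example 1, written as a raw string.
heading = re.compile(r"^\d{0,3}\..*")

samples = [
    "1. Why use React?",
    "    indented body text",
    "12. Another heading",
    ".leading dot",
]
matched = [s for s in samples if heading.match(s)]
```

Because `\d{0,3}` also accepts zero digits, a line that starts with a bare dot counts as a heading; `\d{1,3}` would rule that out if it is not intended.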

Example 2:

import re
from itertools import islice
from node import Node

# Read at most MAX_LINES lines from the UTF-16 encoded source file.
MAX_LINES = 100000
with open("./TransferPDFToCSV/2022 AWS SAP-C01(494题中文).txt", "r", encoding="utf-16") as file:
    contents = list(islice(file, MAX_LINES))

# rex_list[0]: line starts with "问题" (question).
# rex_list[1]: line contains "答案" (answer).
rex = ["^\u95EE\u9898", ".*\u7B54\u6848.*"]
rex_list = [re.compile(x) for x in rex]

node = Node()
firstNode = node

currentQuestion = ""
currentAnswer = ""
appendQuestion = False

for line in contents:
    if rex_list[0].match(line):
        # A new question starts: store the previous question/answer pair.
        if currentQuestion != "":
            node.val = currentQuestion
            node.addChild(Node(currentAnswer))
            node.next = Node()
            node = node.next

        appendQuestion = True
        currentQuestion = line
    elif rex_list[1].match(line):
        line = line.strip()
        if not line.startswith("答案"):
            # Text before "答案" still belongs to the question.
            first, rest = line.split("答案", 1)
            currentQuestion = currentQuestion + "\n" + first
            line = "答案" + rest

        appendQuestion = False
        currentAnswer = line
    elif appendQuestion:
        currentQuestion = currentQuestion + "\n" + line
    else:
        currentAnswer = currentAnswer + "\n" + line

# Store the last question/answer pair, which the loop itself never writes.
if currentQuestion != "":
    node.val = currentQuestion
    node.addChild(Node(currentAnswer))

firstNode.toDataFrams()
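
The question/answer grouping in Example 2 can also be tried without the input file or the Node class. The sketch below is a simplified variant under assumptions: `pair_questions` is a hypothetical helper, the sample lines are invented, lines are stripped before use, and the standard `csv` module stands in for whatever `toDataFrams` does internally.

```python
import csv
import io
import re

QUESTION = re.compile("^\u95EE\u9898")   # line starts with "问题"
ANSWER = re.compile(".*\u7B54\u6848.*")  # line contains "答案"


def pair_questions(lines):
    """Group lines into (question, answer) pairs, similar to the loop in Example 2."""
    pairs, question, answer, in_question = [], "", "", False
    for raw in lines:
        line = raw.strip()
        if QUESTION.match(line):
            if question:
                pairs.append((question, answer))
            question, answer, in_question = line, "", True
        elif ANSWER.match(line):
            # Anything before "答案" on the same line belongs to the question.
            head, tail = line.split("\u7B54\u6848", 1)
            if head:
                question += "\n" + head
            answer, in_question = "\u7B54\u6848" + tail, False
        elif in_question:
            question += "\n" + line
        elif question:
            answer += "\n" + line
    if question:
        pairs.append((question, answer))  # keep the final pair too
    return pairs


sample = ["问题1 Which service?", "A. S3", "B. EC2", "答案: A", "问题2 Next one?", "答案: B"]
pairs = pair_questions(sample)

# Write one CSV row per question/answer pair into an in-memory buffer.
buffer = io.StringIO()
csv.writer(buffer).writerows(pairs)
```

Keeping the grouping in a pure function makes it easy to test on a few lines before running it against the full file.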
