/doc-merger

Merge two relatively incomplete documents into one complete document via a Python script.


doc-merger

doc-merger compares the contents of two documents. Using document one as the base, it overwrites the corresponding parts of document one with the data from document two, writes out the merged result, and separately lists the data that exists only in document two.

Demo

Suppose you have two text files, doc1.txt and doc2.txt. Both contain information about TV episodes, but neither is complete. You want to use doc2.txt to supplement and overwrite parts of doc1.txt, and to extract the data that exists only in doc2.txt.

The content of doc1.txt is as follows:

1;2023-01-01;45;The first episode
2;2023-01-08;45;The second episode
3;2023-01-15;45;The third episode

The content of doc2.txt is as follows:

2023-01-01;45;The first episode;This is the first episode of the show.
2023-01-08;45;The second episode;This is the second episode of the show.
2023-01-22;45;The special episode;This is the special episode of the show.

Running the doc-merger.py script generates a new file named result.txt, which contains the merged data:

1;2023-01-01;45;The first episode;This is the first episode of the show.
2;2023-01-08;45;The second episode;This is the second episode of the show.
3;2023-01-15;45;The third episode;

The script also generates a file named doc2_only.txt, which contains the data that exists only in doc2.txt:

2023-01-22;45;The special episode;This is the special episode of the show.

The script also prints summary statistics:

Merged: 2
Doc1 only: 1
Doc2 only: 1
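
The demo above can be approximated with the minimal sketch below. It is not the code of doc-merger.py: the function name merge_documents is hypothetical, and it assumes that a doc1.txt row and a doc2.txt row match when their date, duration, and title fields are all identical. The actual script may match rows differently.

```python
def merge_documents(doc1_lines, doc2_lines):
    """Return (merged_rows, doc2_only_rows, stats) for two lists of text lines.

    Assumption: rows match on the (date, duration, title) triple.
    """
    # Parse doc2 rows: date;duration;title;description
    doc2_rows = []
    for line in doc2_lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue  # skip blank lines
        date, duration, title, description = text.split(";", 3)
        doc2_rows.append(((date, duration, title), description))
    descriptions = dict(doc2_rows)

    merged_rows = []
    matched_keys = set()
    merged_count = 0
    # Parse doc1 rows: number;date;duration;title
    for line in doc1_lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue
        number, date, duration, title = text.split(";", 3)
        key = (date, duration, title)
        if key in descriptions:
            matched_keys.add(key)
            merged_count += 1
        # Unmatched rows get an empty description, hence the trailing ";"
        description = descriptions.get(key, "")
        merged_rows.append(";".join([number, date, duration, title, description]))

    # Rows of doc2 that no doc1 row matched, in their original order
    doc2_only_rows = [
        ";".join(key + (description,))
        for key, description in doc2_rows
        if key not in matched_keys
    ]
    stats = {
        "Merged": merged_count,
        "Doc1 only": len(merged_rows) - merged_count,
        "Doc2 only": len(doc2_only_rows),
    }
    return merged_rows, doc2_only_rows, stats


doc1 = [
    "1;2023-01-01;45;The first episode",
    "2;2023-01-08;45;The second episode",
    "3;2023-01-15;45;The third episode",
]
doc2 = [
    "2023-01-01;45;The first episode;This is the first episode of the show.",
    "2023-01-08;45;The second episode;This is the second episode of the show.",
    "2023-01-22;45;The special episode;This is the special episode of the show.",
]

merged, doc2_only, stats = merge_documents(doc1, doc2)
for name in ("Merged", "Doc1 only", "Doc2 only"):
    print("{}: {}".format(name, stats[name]))
```

With the demo data, this sketch yields the same three merged rows, the same single doc2-only row, and the same counts as shown above.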

Requirements

  • Make sure you have Python 3.0 or higher installed on your system.

Notes

  • Make sure that the data in the text files matches the format the script expects. Each line in doc1.txt should contain four semicolon-separated fields: episode number, date, duration, and title. Each line in doc2.txt should also contain four semicolon-separated fields: date, duration, title, and description.
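
A quick way to check the format before running the script is sketched below. The helper find_malformed_lines is hypothetical and not part of doc-merger; it only counts semicolon-separated fields, so a field that itself contains a semicolon would be reported as malformed.

```python
def find_malformed_lines(lines, expected_fields=4):
    """Return (line_number, text) pairs for non-blank lines with the wrong field count."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank lines are ignored in this sketch
        if len(text.split(";")) != expected_fields:
            problems.append((line_number, text))
    return problems


sample = [
    "1;2023-01-01;45;The first episode",
    "2;2023-01-08;The second episode",  # duration field is missing
]
for line_number, text in find_malformed_lines(sample):
    print("line {}: expected 4 fields: {}".format(line_number, text))
```

To check a real file, the same function could be given the result of open("doc1.txt", encoding="utf-8").readlines(); the file encoding here is an assumption.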

Usage

  1. Clone or download the repository to a directory on your computer.
  2. Edit the path in start.command (Mac) or start.bat (Win) so that it points to the directory containing the doc-merger.py script.
  3. Save the two texts to be processed as doc1.txt and doc2.txt, and place them in the same directory as the script.
  4. Double-click start.command or start.bat to run the doc-merger.py script.
  5. The results are written to result.txt and doc2_only.txt in the same directory.