A Transformer-based machine translation project built with the PyTorch deep-learning framework, with a Gradio web page for entering sentences to translate.

This project implements the Transformer machine translator from Attention Is All You Need in PyTorch. For a detailed explanation, please click here.
Project Environment
Device: server
GPU: NVIDIA GA102 [GeForce RTX 3090]
Anaconda environment: Python 3.6.13, PyTorch 1.9.0+cu111, Tokenizers 0.12.1, Transformers 4.18.0
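A quick sanity check that your environment matches the one above (a minimal sketch; the exact version strings depend on the build you installed):

```python
import torch

# Verify the PyTorch build and that the GPU is visible.
print(torch.__version__)              # e.g. 1.9.0+cu111
print(torch.cuda.is_available())      # should be True on the RTX 3090 server
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. GeForce RTX 3090
```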
1. Project structure

- en_it
  - configs (some dataset information; not actually used)
  - dataset (the training dataset)
    - data-00000-of-00001.arrow (dataset downloaded from HuggingFace)
    - dataset_info.json (information about the downloaded dataset)
    - state.json
  - tokenizer_en.json
  - tokenizer_it.json
  - tokenizer_zh.json
  - opus_books_weights (weight files saved during training)
  - runs (training logs)
  - config.py (training configuration)
  - dataset.py (loads the training and test datasets)
  - model.py (the complete Transformer architecture)
  - predict.py (translates a single sentence)
  - train.py (English-to-Italian training)
- en_zh
  - dataset
    - zh_en_dataset
      - myProcess
        - zh_en01 (the first English-to-Chinese dataset)
          - zh_en.json (the dataset converted from the .txt file into JSON)
          - zh_en.txt
          - zh_en_process.py (preprocesses the raw dataset; a sketch appears in section 3)
        - zh_en02 (the second English-to-Chinese dataset)
          - same structure as zh_en01
      - zh_en (not actually used; an experiment with a BERT pretrained model for splitting sentences into tokens)
  - tokenizer_en.json (the generated English tokenizer; see the loading sketch after this list)
  - tokenizer_zh.json (the generated Chinese tokenizer)
  - runs (training logs for the two datasets)
    - en_zh01
    - en_zh02
  - weights (weight files from training on the two datasets)
    - en_zh01_weights
    - en_zh02_weights
  - predict_en_zh.py
  - zh_en_config.py
  - zh_en_dataset.py
  - zh_en_train.py
- website
  - flagged
  - app.py (the Gradio web interface)
  - train_wb.py (essentially the same as train.py, except that it logs the training run with wandb)
  - translate.py (uses a trained model to translate a sentence entered by the user)

Note: when training on your own dataset, adjust the file paths in the files above.
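The tokenizer_*.json files are saved with the HuggingFace Tokenizers library and can be loaded back directly. A minimal sketch, assuming you run it from the en_zh directory (adjust paths for en_it):

```python
from tokenizers import Tokenizer

# Load the tokenizers that were generated during training.
tokenizer_en = Tokenizer.from_file("tokenizer_en.json")
tokenizer_zh = Tokenizer.from_file("tokenizer_zh.json")

enc = tokenizer_en.encode("Hello, world!")
print(enc.tokens)  # subword tokens
print(enc.ids)     # their vocabulary ids
```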
2. Train en_it

Training is straightforward and the configuration lives in config.py. A weights file from 720 epochs of training is provided: link: https://pan.baidu.com/s/1g5Y38okBPb4AnE7A2RFaww extraction code: 1n54
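To inspect a downloaded checkpoint before training or inference, something like the following works; note the file name "tmodel_720.pt" and the stored keys are assumptions — check how train.py actually names and saves its state:

```python
import torch

# Load the checkpoint onto the CPU and list what it contains.
ckpt = torch.load("opus_books_weights/tmodel_720.pt", map_location="cpu")
print(list(ckpt.keys()))  # e.g. epoch, model_state_dict, optimizer_state_dict
```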
3. Train en_zh

Training is likewise straightforward; the configuration lives in zh_en_config.py. Weight files from 146 and 14 epochs of training are provided:

Weights from training on the zh_en01 dataset:
link: https://pan.baidu.com/s/1mj_qZ4xadH9T7WtJYQ_L-A
extraction code: wjgk

Weights from training on the zh_en02 dataset:
link: https://pan.baidu.com/s/1FduLVLHnnkf2vXMf39lDgQ
extraction code: evhe
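The zh_en01 and zh_en02 datasets are produced by zh_en_process.py from raw .txt files. A hypothetical sketch of that conversion — the tab-separated input format and the JSON field names here are assumptions, not the script's confirmed behavior:

```python
import json

# Turn a raw parallel corpus (assumed: one "English<TAB>Chinese" pair per
# line) into the zh_en.json file used for training.
pairs = []
with open("zh_en.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            pairs.append({"en": parts[0], "zh": parts[1]})

with open("zh_en.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)
```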
4. Inference

Run app.py (check whether the weight-file paths need changing for your setup); once it starts, it prints a URL to open: http://127.0.0.1:7860
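For reference, a minimal sketch of what such a Gradio app looks like — the real app.py calls the trained Transformer via translate.py, whereas this placeholder just echoes the input:

```python
import gradio as gr

def translate(sentence: str) -> str:
    # Placeholder: substitute the real translation call from translate.py.
    return sentence

demo = gr.Interface(fn=translate, inputs="text", outputs="text")
demo.launch()  # serves at http://127.0.0.1:7860 by default
```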
(1) English-to-Italian translation:
(2) English-to-Chinese translation:
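Under the hood, predict.py and translate.py turn model outputs into a sentence token by token. A rough sketch of greedy decoding, assuming model.py exposes separate encode and decode steps (the method names below are assumptions about its interface):

```python
import torch

def greedy_decode(model, src_ids, sos_id, eos_id, max_len=100):
    """Generate target ids one at a time, always taking the argmax token."""
    memory = model.encode(src_ids)         # assumed encoder entry point
    ys = torch.tensor([[sos_id]])          # start with the <sos> token
    for _ in range(max_len):
        logits = model.decode(memory, ys)  # assumed decoder entry point
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_id], dim=1)
        if next_id.item() == eos_id:       # stop once <eos> is produced
            break
    return ys.squeeze(0).tolist()
```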