This project provides traditional Chinese medicine (TCM) data for use in machine learning.
The raw data currently comes from https://lib-nt2.hkbu.edu.hk/database/cmed/cmfid/index.asp
GloVe vectors are built as the embeddings for each traditional Chinese medicine. To run the shell script tch.sh, first git clone the GloVe project (https://github.com/stanfordnlp/GloVe). The input file "textTCM" contains 182 Chinese medicine prescriptions, one per line. Running the script generates the files "vocabTCM.txt" and "vectorsTCM.txt"; "vectorsTCM.txt" contains the vector representations of the TCM words. For more information about GloVe, see https://nlp.stanford.edu/projects/glove/
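As a rough sketch of how "vectorsTCM.txt" could be read back into Python: GloVe's text output puts one token per line, followed by its space-separated vector components. The function name `load_glove_vectors` and the sample tokens and values below are made up for illustration; they are not part of this project.

```python
import io


def load_glove_vectors(lines):
    """Parse GloVe text output: each line is a token followed by its float components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for vectorsTCM.txt (hypothetical tokens and values).
sample = io.StringIO("herb_a 0.1 0.2 0.3\nherb_b -0.4 0.5 0.6\n")
vectors = load_glove_vectors(sample)
print(len(vectors), len(vectors["herb_a"]))
```

With the real file, the same function could be called as `load_glove_vectors(open("vectorsTCM.txt", encoding="utf-8"))`, assuming the file is UTF-8 encoded.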
The csv file IndexName.csv is generated by the Python file writeIndexName.py with the command "python3 writeIndexName.py".
IndexName.csv contains the index of each traditional Chinese medicine, which can be used for one-hot encoding of the medicines.
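A minimal sketch of the one-hot idea, assuming the contents of IndexName.csv have already been loaded into a name-to-index mapping (the exact column layout of the file is not described here, so the `name_to_index` dictionary and the herb names below are hypothetical):

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def multi_hot(names, name_to_index):
    """Encode a whole prescription: 1 for every medicine it contains."""
    vec = [0] * len(name_to_index)
    for name in names:
        vec[name_to_index[name]] = 1
    return vec


# Hypothetical mapping standing in for the contents of IndexName.csv.
name_to_index = {"herb_a": 0, "herb_b": 1, "herb_c": 2}

single = one_hot(name_to_index["herb_b"], len(name_to_index))
prescription = multi_hot(["herb_a", "herb_c"], name_to_index)
print(single, prescription)
```

The multi-hot variant is one possible way to represent a prescription made of several medicines; it is an illustration, not something the scripts in this project produce.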
The csv file TCM.csv is generated by the python file writeDataSet.py with the command "python3 writeDataSet.py".
The "composition" and "label" columns of this csv file can be used to train a text categorizer model, which can then be used to predict the "label" of any "composition".
For an example of how to do the training, see https://www.kaggle.com/matleonard/text-classification
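Before training, the two columns have to be pulled out of TCM.csv. The sketch below assumes the file has a header row naming the "composition" and "label" columns; the helper `read_examples` and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def read_examples(csv_file):
    """Return (composition, label) pairs from a csv file with a header row."""
    reader = csv.DictReader(csv_file)
    return [(row["composition"], row["label"]) for row in reader]


# In-memory stand-in for TCM.csv (hypothetical rows).
sample = io.StringIO(
    "composition,label\n"
    "herb_a herb_b,label_x\n"
    "herb_c,label_y\n"
    "herb_a herb_c,label_x\n"
)
examples = read_examples(sample)

# Counting labels is a quick check of how balanced the classes are.
label_counts = Counter(label for _, label in examples)
print(label_counts)
```

With the real data, the file could be opened with `open("TCM.csv", newline="", encoding="utf-8")`, assuming UTF-8 encoding.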
-
Add more Chinese medicine prescriptions to the file "textTCM".
-
Perform supervised learning (as a first try, predict "label" from "composition") based on the vector representations of TCM words.
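One possible first attempt, sketched under stated assumptions: represent each composition as the average of its word vectors and classify with a nearest-centroid rule using cosine similarity. This is a deliberately simple stand-in, not the method used in this project, and all names, vectors, and labels below are hypothetical.

```python
import math


def average_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(examples, vectors):
    """Compute one centroid per label from (tokens, label) training pairs."""
    grouped = {}
    for tokens, label in examples:
        feat = average_vector(tokens, vectors)
        if feat is not None:
            grouped.setdefault(label, []).append(feat)
    return {
        label: [sum(col) / len(feats) for col in zip(*feats)]
        for label, feats in grouped.items()
    }


def predict(tokens, centroids, vectors):
    """Return the label whose centroid is most similar to the composition."""
    feat = average_vector(tokens, vectors)
    if feat is None:
        return None
    return max(centroids, key=lambda label: cosine(feat, centroids[label]))


# Hypothetical 2-dimensional vectors and training data.
vectors = {
    "herb_a": [1.0, 0.0], "herb_b": [0.9, 0.1],
    "herb_c": [0.0, 1.0], "herb_d": [0.1, 0.9],
}
examples = [(["herb_a", "herb_b"], "label_x"), (["herb_c", "herb_d"], "label_y")]
centroids = train_centroids(examples, vectors)
print(predict(["herb_a"], centroids, vectors))
```

With only 182 prescriptions, a simple baseline like this may be a reasonable sanity check before trying a larger model.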
Enjoy! Please feel free to contact me at "zuguoxiang@foxmail.com" if you are interested in this project, have any questions, or wish to cooperate.