TCMDataForML

Basic

This project provides traditional Chinese medicine (TCM) data for use in machine learning.

The raw data currently comes from https://lib-nt2.hkbu.edu.hk/database/cmed/cmfid/index.asp

Vector representations for TCM words (unsupervised learning)

GloVe vectors are built as the embedding for each traditional Chinese medicine. To run the shell script tch.sh, first git clone the GloVe project (https://github.com/stanfordnlp/GloVe). The input file "textTCM" contains 182 Chinese medicine prescriptions, one per line. Running the script generates the files "vocabTCM.txt" and "vectorsTCM.txt"; "vectorsTCM.txt" contains the vector representations of the TCM words. For more information about GloVe, see https://nlp.stanford.edu/projects/glove/
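As a rough illustration of how the generated vectors might be used, the sketch below parses GloVe's plain-text output (one word per line, followed by its space-separated vector components) and compares two words by cosine similarity. The function names are hypothetical and not part of this repository, and the sketch assumes "vectorsTCM.txt" is the text output rather than the binary one.

```python
import math


def parse_glove_lines(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def load_glove(path):
    """Read a GloVe text file such as vectorsTCM.txt (assumed UTF-8)."""
    with open(path, encoding="utf-8") as f:
        return parse_glove_lines(f)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Possible usage (the two keys are placeholders for words present in the file):
# vectors = load_glove("vectorsTCM.txt")
# print(cosine(vectors["word_a"], vectors["word_b"]))
```

With only 182 prescriptions as training text, similarities computed this way are likely to be noisy and should be treated as a sanity check rather than a result.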

Index for TCM words

The csv file IndexName.csv is generated by the Python file writeIndexName.py with the command "python3 writeIndexName.py".

IndexName.csv contains the index of each traditional Chinese medicine, which can be used for one-hot encoding.
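To make the one-hot idea concrete, here is a minimal sketch that turns a list of medicine names into a multi-hot vector. The column layout of IndexName.csv is not documented here, so the sketch assumes a header row followed by rows of the form "index,name" with zero-based indices; check this against the real file before using it. The function names are hypothetical.

```python
import csv


def read_index(lines):
    """Build {name: index} from CSV lines.

    Assumption: the first row is a header and each later row is 'index,name'.
    """
    reader = csv.reader(lines)
    next(reader, None)  # skip the assumed header row
    return {row[1]: int(row[0]) for row in reader if len(row) >= 2}


def multi_hot(names, name_to_index):
    """Return a 0/1 list with a 1 at the index of every given name."""
    size = max(name_to_index.values()) + 1  # assumes zero-based indices
    vec = [0] * size
    for name in names:
        vec[name_to_index[name]] = 1
    return vec


# Possible usage (the names are placeholders for entries in the file):
# with open("IndexName.csv", encoding="utf-8", newline="") as f:
#     name_to_index = read_index(f)
# print(multi_hot(["name_a", "name_b"], name_to_index))
```

A single medicine gives an ordinary one-hot vector; a whole prescription gives a vector with one 1 per ingredient.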

Predicting "label" from "composition" (supervised learning), in progress

The csv file TCM.csv is generated by the Python file writeDataSet.py with the command "python3 writeDataSet.py".

The "composition" and "label" columns in the csv file can be used to train a text categorizer model, which can then predict the "label" of any "composition".

For an example of how such training can be done, see https://www.kaggle.com/matleonard/text-classification
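Before any model is trained, the two columns have to be read into (text, label) pairs and split into training and validation sets. The sketch below shows one way to do that with the standard library only. The column names "composition" and "label" come from the description above; the function names, the 80/20 split, and the seed are illustrative assumptions.

```python
import csv
import random


def read_examples(lines):
    """Return [(composition, label), ...] from CSV lines with a header row."""
    reader = csv.DictReader(lines)
    return [(row["composition"], row["label"]) for row in reader]


def train_valid_split(examples, valid_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, valid)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Possible usage:
# with open("TCM.csv", encoding="utf-8", newline="") as f:
#     examples = read_examples(f)
# train, valid = train_valid_split(examples)
```

The resulting pairs can be passed to whichever text classification library is chosen for the training step.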

TODO

Add more Chinese medicine prescriptions to the file "textTCM".
Perform supervised learning based on the vector representations of TCM words, starting with predicting "label" from "composition".
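One possible starting point for the second item, offered only as a sketch and not as the planned implementation, is to represent each composition by the average of the vectors of its ingredients and feed that to a classifier. The function name is hypothetical, and how a "composition" string is split into tokens is left open because it depends on the format of TCM.csv.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of the known tokens.

    tokens:  list of words from one composition (tokenization is up to the caller)
    vectors: {word: list of floats}, e.g. parsed from vectorsTCM.txt
    dim:     vector length; an all-zero vector is returned if no token is known
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]
```

Averaging discards ingredient order and dosage, so it is best seen as a simple baseline to compare later models against.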

Last

Enjoy! Please feel free to contact me at "zuguoxiang@foxmail.com" if you are interested in this project, have any questions, or wish to collaborate.