bellamkondaprakash/Classification_Text_Cancer_Data

Any NLP task requires basic preprocessing to clean and label the dataset before NLP-specific preprocessing can be done.

Jupyter Notebook, MIT License

Text Processing and Labelling the Dataset

To label the semi-structured data (XML files), it must first be preprocessed so that the data is clean and labelled; NLP processing can then be done on the result.

Load the Data and Remove XML/HTML Tags and Numeric Characters

Load the whole XML dataset, using the tqdm library to track progress, and clean all the files with the BeautifulSoup and regex libraries to strip the markup from the XML documents.
Replace numeric characters with spaces.
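
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names clean_document and load_and_clean, the "*.xml" file pattern, and the choice of the built-in "html.parser" backend are all assumptions made for the example.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup
from tqdm import tqdm


def clean_document(raw: str) -> str:
    """Strip XML/HTML tags, replace digits with spaces, collapse whitespace."""
    # Assumption: the built-in "html.parser" backend is used here only to
    # strip tags; the original notebook may use a different parser.
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    # Replace every numeric character with a space.
    text = re.sub(r"\d", " ", text)
    # Collapse the runs of whitespace left behind by the two steps above.
    return re.sub(r"\s+", " ", text).strip()


def load_and_clean(data_dir: str) -> dict:
    """Read every .xml file in data_dir and return {file name: cleaned text}."""
    paths = sorted(Path(data_dir).glob("*.xml"))  # hypothetical file layout
    cleaned = {}
    # tqdm only wraps the iterable to show a progress bar while loading.
    for path in tqdm(paths, desc="Cleaning XML files"):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        cleaned[path.name] = clean_document(raw)
    return cleaned
```

Calling load_and_clean on a directory of XML files returns a dictionary keyed by file name, which keeps each cleaned text tied to its source file for the labelling step.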

Labelling the Dataset

Using pandas, append the summarized documents together with their labels.
Load a DataFrame containing the documents and their labels.
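
A minimal sketch of the labelling step is shown below. The README does not say where the labels come from, so the labels mapping, the label values, the column names, and the function name build_labelled_frame are hypothetical. The rows are collected in a list and passed to the DataFrame constructor because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd


def build_labelled_frame(documents: dict, labels: dict) -> pd.DataFrame:
    """Pair each cleaned document with its label in a two-column DataFrame."""
    rows = [
        {"document": text, "label": labels[name]}
        for name, text in documents.items()
        if name in labels  # documents without a label are skipped
    ]
    return pd.DataFrame(rows, columns=["document", "label"])


# Hypothetical inputs: cleaned texts and labels, both keyed by file name.
docs = {"a.xml": "tumour found in lung", "b.xml": "no malignancy seen"}
labels = {"a.xml": "cancer", "b.xml": "non-cancer"}

df = build_labelled_frame(docs, labels)
print(df)
```

The resulting DataFrame has one row per labelled document and can be passed directly to later NLP preprocessing steps.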