/Classification_Text_Cancer_Data

Any NLP task requires basic preprocessing to clean and label the dataset before NLP processing can be done.

Primary language: Jupyter Notebook. License: MIT.

Text Processing and Labelling the Dataset

To label the semi-structured data (XML files), it must first be preprocessed so that the data is clean and labelled before NLP processing can be done.

Load and Remove the XML, HTML tags & Alphanumeric characters

  1. Load the whole XML dataset, using the tqdm library to track progress, and clean every file with the BeautifulSoup and regex libraries to strip the tags from the XML documents.
  2. Replace numeric characters with spaces.
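
The two steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_text` and `load_and_clean`, the `*.xml` file pattern, the `html.parser` backend, and the final whitespace-collapsing step are all assumptions made for the example.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup
from tqdm import tqdm


def clean_text(raw):
    """Strip XML/HTML tags and replace digits with spaces (hypothetical helper)."""
    # Keep only the text nodes; tags and attributes are dropped.
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    # Replace every run of digits with a single space.
    text = re.sub(r"\d+", " ", text)
    # Assumption: collapse the leftover whitespace so the output is tidy.
    return re.sub(r"\s+", " ", text).strip()


def load_and_clean(xml_dir):
    """Read every .xml file in xml_dir and return {file name: cleaned text}."""
    paths = sorted(Path(xml_dir).glob("*.xml"))
    cleaned = {}
    # tqdm only wraps the iterable to display a progress bar.
    for path in tqdm(paths, desc="Cleaning XML files"):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        cleaned[path.name] = clean_text(raw)
    return cleaned
```

Calling `load_and_clean` on the folder that holds the XML files would return a dictionary keyed by file name, which is convenient for attaching labels afterwards.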

Labelling the Dataset

  1. Using pandas, append the summarized documents together with their labels.
  2. Load the labelled documents into a DataFrame.
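
A minimal sketch of the labelling step, under stated assumptions: the README does not say where the labels come from, so the `labels` mapping, the column names `document` and `label`, the function name `build_labelled_frame`, and the sample values are all hypothetical placeholders. Rows are collected in a list and passed to `pd.DataFrame` in one go, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd


def build_labelled_frame(documents, labels):
    """Pair each cleaned document with its label (hypothetical helper).

    documents: dict mapping file name -> cleaned text
    labels:    dict mapping file name -> label (assumed source of labels)
    """
    rows = [
        {"document": text, "label": labels[name]}
        for name, text in documents.items()
        if name in labels  # skip documents that have no label
    ]
    return pd.DataFrame(rows, columns=["document", "label"])


# Illustrative placeholder data, not taken from the real dataset.
documents = {
    "doc_a.xml": "text of the first document",
    "doc_b.xml": "text of the second document",
    "doc_c.xml": "text of an unlabelled document",
}
labels = {"doc_a.xml": "label_one", "doc_b.xml": "label_two"}

df = build_labelled_frame(documents, labels)
```

The resulting `df` has one row per labelled document and can be passed on to the later NLP steps.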