Before the semi-structured data (XML files) can be labeled, it must be preprocessed so that the data is clean enough to label and to run NLP processing on:
- Load the whole XML dataset, using the tqdm library to show progress, and clean all the files using the BeautifulSoup and regex libraries.
- Replace numeric characters with spaces.
- Using pandas, append the summarized documents together with their labels.
- Load the resulting DataFrame of documents and their labels.
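
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: plain regular expressions stand in for BeautifulSoup when stripping tags, and a list of dicts stands in for the pandas DataFrame. The function names (`clean_xml_text`, `build_rows`), the column names, and the idea that labels come from a dict keyed by file name are all assumptions made for the example.

```python
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")   # any XML tag (simplified stand-in for BeautifulSoup)
DIGIT_RE = re.compile(r"\d+")     # runs of digits
SPACE_RE = re.compile(r"\s+")     # runs of whitespace


def clean_xml_text(raw):
    """Strip tags, replace numbers with spaces, then collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = DIGIT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def build_rows(xml_dir, labels):
    """Return one {"document", "label"} row per *.xml file in xml_dir.

    `labels` is a hypothetical mapping from file stem to label;
    files without an entry get an empty label.
    """
    rows = []
    for path in sorted(Path(xml_dir).glob("*.xml")):
        raw = path.read_text(encoding="utf-8")
        rows.append({
            "document": clean_xml_text(raw),
            "label": labels.get(path.stem, ""),
        })
    return rows
```

With the libraries named above installed, `BeautifulSoup(raw, "html.parser").get_text(" ")` could replace the `TAG_RE` step (it also copes with entities and comments, which the regex does not), wrapping the path list in `tqdm(...)` would show a progress bar, and `pandas.DataFrame(rows)` would turn the rows into a DataFrame.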