A simple Python text reader/cleaner to handle HTML tagging in text files parsed from the web
14 May 2016
A simple text reader/cleaner to handle HTML tagging in text files parsed from the web. Please use responsibly and ethically, on files/content where you hold the necessary copyright permissions!
The program opens and reads a user-specified text file (or uses hard-coded default filename on 'enter'). Searching line by line based on html tagging, it appends only the relevant html-tagged lines to a new text string ('taggedtext'), then cleans those same lines (strips out the html following tagging) and appends the cleaned text to a new text string ('cleanedtext'), with additional labelling where relevant. Tags are handled as follows:
- <h1> tags: cleaned and saved, labelled as 'Title';
- <h2> tags: cleaned and saved, labelled as 'Header';
- <h3> to <h6> tags: cleaned and saved, labelled as 'Sub-header';
- <em> tags (italics): cleaned and saved, labelled as 'Para-header' (Paragraph header);
- <p> tags: indicate text paragraphs, cleaned and saved only (no additional labels added);
- all other tags: ignored.
The program has been hard-coded to deal with some additional commonly-identified exceptions in the cleaning process (with user input required to check and confirm action required: add to text string or ignore).
Two files are saved (user-entry required for filenames):
- One for the edited text with html tagging still retained;
- The other for the final edited, cleaned and labelled text.
Some problems to be ironed out:
- This program has problems with ASCII codes. The current work-around is to go in and perform a second, manual clean-up on the final output text file. This is too much manual intervention - need to find a way to fix this in-program.
Further improvements to be made:
- Use the 'Title' label (lowercase and stripped of whitespace) as a default output filename (with user override option);
- Add option to identify keywords, calculate their counts and append to the output file if required (see tag_eng & blog_tagger);
- Add option for user to identify key phrases (of 2, 3 or 4 words' length), calculate their counts and append to the output file if required (program still to be built);
- Handle other html tags as required (e.g. html tags inside cleaned/detagged text lines, such as italics, strong, etc.).
Future developments planned:
- Build program into a relevance engine by using all identified keywords/key phrases. Those keywords/phrases identified as coming from lines with header tags ( <h1>, <h2>, etc.) are to be weighted most highly, followed by words/phrases appearing most often in the paragraph text;
- Delve further into the web data headers and meta data to extract further useful info if required (+ save to output file)
- Search for all links on the page/in the text (in the body text and/or in the headers/footers) adn save to a directory (and/or append to the output file)
- Provide user with an option to follow any or the URL links identified (both internally to the domain host, and externally).