The Python script runs two versions of cleaning and returns a file with four additional columns:
- Regex matching: anything between "<" and ">", and anything between "&" and ";" (with 4 or 5 characters in between), is removed together with the delimiters, and "\*" is replaced with a whitespace character. Note: the special characters are simply removed, not converted, e.g. &rpos; etc.
- BeautifulSoup HTML-to-text conversion: this removes HTML tags and converts special characters (HTML entities) into the characters they represent.
- Two parity columns, which return the difference in the number of characters between each newly generated column and the original column. (This is basically a flag you can check to see whether too many characters have been removed or replaced.)
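The regex cleaning and the parity value described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the patterns, the function names `regex_clean` and `parity`, and the sign of the parity value (original length minus cleaned length) are all assumptions.

```python
import re

# Assumed patterns; the actual script may use different expressions.
TAG_RE = re.compile(r"<[^>]*>")            # "<", anything, ">"
ENTITY_RE = re.compile(r"&[^;&\s]{4,5};")  # "&", 4 or 5 characters, ";"


def regex_clean(text: str) -> str:
    """Remove tags and entities, then replace '*' with a space."""
    text = TAG_RE.sub("", text)
    text = ENTITY_RE.sub("", text)
    return text.replace("*", " ")


def parity(original: str, cleaned: str) -> int:
    """Number of characters lost by cleaning (assumed sign convention)."""
    return len(original) - len(cleaned)


sample = "<p>Tom&apos;s *big* day</p>"
cleaned = regex_clean(sample)
print(repr(cleaned))
print(parity(sample, cleaned))
```

Note that the entity is dropped rather than turned into an apostrophe, which is the behaviour the note in the first bullet warns about. A large parity value on a short string is a hint that the cleaning removed more than markup.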
You need to install these modules:
- pandas
- bs4
- lxml
Example:
python -m pip install bs4 lxml pandas
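Once the modules are installed, a short snippet like the one below can confirm that they work together. It is only a sketch of the BeautifulSoup-style conversion on a tiny DataFrame; the column names and the `html_to_text` helper are hypothetical and are not taken from remove_html.py.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_text(value: str) -> str:
    """Strip tags and decode entities using the lxml parser."""
    return BeautifulSoup(value, "lxml").get_text()


# Hypothetical column names, for illustration only.
df = pd.DataFrame({"description": ["<p>Tom &amp; Jerry</p>", "plain text"]})
df["description_bs4"] = df["description"].apply(html_to_text)

# Parity: characters lost by the conversion (assumed sign convention).
df["description_bs4_parity"] = (
    df["description"].str.len() - df["description_bs4"].str.len()
)
print(df)
```

If this runs without an import error, all three modules are available. Unlike a regex-only approach, the entity `&amp;` is converted to `&` rather than dropped.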
- Place the script (remove_html.py) in the same directory as the CSV file
- Open a terminal at the file location. On Windows: press Win + R, then type cmd, then run cd <path to file>
- Type:
python remove_html.py
and hit Enter
- Follow the instructions
- You are done.
- Auto detect filetype
- Multi-column support