How to Handle / Remove HTML Tags

Question

I'm working on a Ruby on Rails project and I'm using the 'caracal-rails' gem.

My project has a table called Service Standards with a column called description, stored in MySQL with type TEXT. When the description is rendered in a browser, it displays correctly as HTML. The text saved in MySQL looks like this:

"PREAMBLE\r\n<br>Each Facility shall have a body ultimately responsible for all aspects of the Facility’s operations."

When it is rendered in the browser, it looks like this:

"Preamble
Each Facility shall have a body ultimately responsible for all aspects of the Facility’s operations."

But when the docx is generated with Caracal, the description shows the <br> tag as well, as in the example below:

"Preamble <br>Each Facility shall have a body ultimately responsible for all aspects of the Facility’s operations."

So how do I "tune" the docx so that it looks the same as in the browser (second example)?

I need some guidance on this. I'm totally out of ideas on how to make it work in the generated docx.

Answer 1 · 2018-08-13T12:45:05.000Z

Hi, there.

This is ultimately more of a Ruby question than a Caracal question per se, but since I'm a capital fellow, I'll answer anyway.

The reason this looks okay in the browser is that it treats the carriage return \r and line feed \n as white space, which, like all white space, it condenses into a single blank space. The browser understands what the <br> command means, so it does what you expect.

The Word document also treats the carriage return and line feed as white space, but it has no idea what the <br> command means so it renders it as text. So, you'll need to use Ruby to convert the commands in the text string Word cannot understand into those that it can.

So instead of something like this:

docx.p your_text_string

You'd want to do something more like this:

str = your_text_string.gsub(/\r/, ' ')    # replace carriage returns
str = str.gsub(/\n/, ' ')                 # replace line feeds
str = str.gsub(/\s+/, ' ')                # squash white spaces
arr = str.split('<br>')                   # split text by break tags

docx.p do
  arr.each_with_index do |s, index|
    br  unless index == 0
    text s
  end
end
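
If the stored text might also contain variants such as <br/>, <br /> or <BR>, the normalization and split steps above can be folded into one small helper. This is only a sketch: split_on_breaks is a hypothetical name, not part of Caracal, and it assumes the only markup in the string is break tags.

```ruby
# Hypothetical helper (not part of Caracal). Collapses whitespace the way a
# browser would, then splits the string on <br> tags in any common spelling.
def split_on_breaks(html_text)
  html_text
    .gsub(/\s+/, ' ')              # squash \r, \n, tabs, and runs of spaces
    .split(/\s*<br\s*\/?>\s*/i)    # split on <br>, <br/>, <br />, <BR>
    .map(&:strip)                  # trim leading/trailing space on each piece
end

segments = split_on_breaks("PREAMBLE\r\n<br>Each Facility shall have a body.")
segments.each { |s| puts s }
```

The resulting array can be passed to the docx.p block above in place of arr. Any other HTML tags in the text would still show up literally and would need their own handling.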

You can read more about more complex paragraph formatting in the README.