Text Replacement not working as Expected
manovasanth1227 opened this issue · 6 comments
Describe the bug
Hi Team, I am trying to replace text that matches a specific pattern, such as {{sample}}. The existing substitute method in the TextRun class does not replace it.
While debugging, I found that when the DOCX file is read, the text of a single word can be split across multiple nodes (specifically w:t elements).
For example, if the word is {{vorname}},
paragraph.each_text_run yields several text_run objects, one per fragment.
For ease of explanation, I will represent those text_run objects by their text: ['{{', 'v', 'orname', '}}'].
Consequently, the substitute method, which checks the TextRun objects one by one, never sees the whole placeholder in any single run and replaces nothing.
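To make the failure concrete, here is a minimal sketch in plain Ruby. It does not use the docx gem: run_texts is a hypothetical stand-in for the text of each run reported for {{vorname}}, and the check only imitates a per-run lookup.

```ruby
# Hypothetical stand-in for the text of each TextRun in the paragraph.
run_texts = ['{{', 'v', 'orname', '}}']
placeholder = '{{vorname}}'

# A per-run check never finds the placeholder, because no single run holds all of it.
matching_runs = run_texts.select { |text| text.include?(placeholder) }

# The joined paragraph text does contain it.
in_paragraph = run_texts.join.include?(placeholder)

puts "runs containing the placeholder: #{matching_runs.length}"
puts "paragraph contains the placeholder: #{in_paragraph}"
```

Any fix therefore has to look at the paragraph text as a whole, or merge the fragments first.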
To Reproduce
Edit the same content in a docx file several times. The text will eventually be split into multiple nodes as described above.
Example:
require 'docx'

path = "/Users/mb/Downloads/sample.docx"
doc = Docx::Document.new(path)
doc.paragraphs.each_with_index do |paragraph, index|
  puts "paragraph: #{index} text: #{paragraph.text}"
  paragraph.each_text_run do |text_run|
    puts "text: #{text_run.text}"
    text_run.substitute('{{vorname}}', 'Not Working Fine')
    # it fails to replace the placeholder when only a portion of the placeholder text is present, rather than the complete word.
  end
end
## output for the above code :
paragraph: 0 text: 24 Nov, 2023
text: 24
text: Nov,
text: 2023
paragraph: 1 text:
paragraph: 2 text: {{vorname}}
text: {{
text: v
text: orname
text: }}
paragraph: 3 text:
paragraph: 4 text: {{nachname}}
text: {{
text: n
text: achname
text: }}
paragraph: 5 text: {{Stellentitel}}
text: {{
text: Stellentitel
text: }}
paragraph: 6 text:
paragraph: 7 text: Subject: Test
text: Subject:
text: Test
paragraph: 8 text:
paragraph: 9 text: {{Name}}
text: {{
text: Name}}
paragraph: 10 text: {{asaso}}
text: {{
text: asas
text: o}}
paragraph: 11 text: {{Manovasanth}}
text: {{Manovasanth}}
Sample docx file
Expected behavior
Placeholder text (e.g. {{vorname}}, {{Manovasanth}}) is replaced correctly with the given replacement text.
Environment
- Ruby version: [e.g 2.6.9]
- docx gem version: [e.g 0.6.2]
- OS: [e.g. MacOS 13.5.2]
I have also tried replacing the text at the paragraph level. That changes the style of the existing text,
so replacing the placeholder text at the paragraph level is not a workable approach.
doc.paragraphs.each do |paragraph|
  paragraph.text = paragraph.text.gsub('{{vorname}}', 'Not Working Fine')
end
Hi @manovasanth1227, have you found any update or workaround?
Hey @ArisNance.
I have just overridden the initialize method in the Paragraph class.
The override blanks out the corrupted (split) text_run nodes and writes the complete placeholder text into one of them.
Please note that it relies on a regex pattern that matches any text enclosed in double curly braces:
starting - "{{"
ending - "}}"
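As a rough illustration of what that pattern picks up, the sketch below runs it over a made-up sentence. The variable names and sample strings are assumptions for demonstration only; the regex itself is the one used in the override.

```ruby
# Same pattern as the override: lazily capture whatever sits between "{{" and "}}".
pattern = /\{\{(.*?)\}\}/

# Hypothetical sample text, not taken from the attached document.
sample = 'Dear {{vorname}} {{nachname}}, re: {{Stellentitel}}'

# With one capture group, scan returns an array of one-element arrays, so flatten it.
names = sample.scan(pattern).flatten
puts names.inspect

# A stray brace inside the delimiters is still captured, which is why the
# override skips captures that contain "{" or "}".
messy = 'x {{a{b}} y'.scan(pattern).flatten
filtered = messy.reject { |name| name.include?('{') || name.include?('}') }
puts messy.inspect
puts filtered.inspect
```

The override that follows applies this pattern to the whole paragraph text rather than to individual runs.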
module Docx
  module Elements
    module Containers
      class Paragraph
        PLACEHOLDER_REGEX = /\{\{(.*?)\}\}/

        # @param [w:body/w:p tag - Nokogiri Object] :node
        # @param [Hash] :document_properties
        # This method overrides the existing initialize in the docx gem Paragraph class.
        # It calls validate_placeholder_content, which is responsible for
        # correcting the corrupted text nodes in paragraphs.
        def initialize(node, document_properties = {})
          @node = node
          @properties_tag = 'pPr'
          @document_properties = document_properties
          @font_size = @document_properties[:font_size]
          validate_placeholder_content
        end

        # This method detects and replaces the corrupted nodes, if any exist.
        def validate_placeholder_content
          placeholder_position_hash = detect_placeholder_positions
          # content_size[i] is the offset of text run i within the paragraph text.
          content_size = [0]
          text_runs.each_with_index do |text_node, index|
            content_size[index + 1] = text_node.text.length + content_size[index]
          end
          content_size.pop
          placeholder_position_hash.each do |placeholder, placeholder_positions|
            placeholder_positions.each do |p_start_index|
              p_end_index = (p_start_index + placeholder.length - 1)
              tn_start_index = content_size.index(content_size.select { |size| size <= p_start_index }.max)
              tn_end_index = content_size.index(content_size.select { |size| size <= p_end_index }.max)
              next if tn_start_index == tn_end_index

              # Non-negative offsets of the placeholder's first and last characters inside their runs.
              start_offset = p_start_index - content_size[tn_start_index]
              end_offset = p_end_index - content_size[tn_end_index]
              replace_incorrect_placeholder_content(placeholder, tn_start_index, tn_end_index, start_offset, end_offset)
            end
          end
        end

        # This method detects each placeholder's starting indexes and returns them in an array.
        # Ex: Assume text = 'This is Placeholder Text with {{Placeholder}} {{Text}} {{Placeholder}}'
        # It will detect the placeholders' starting indexes in the given text.
        # Here, the starting indexes are '{{Placeholder}}' => [30, 55], '{{Text}}' => [46]
        # @return [Hash]
        # Ex: {'{{Placeholder}}' => [30, 55], '{{Text}}' => [46]}
        def detect_placeholder_positions
          text.scan(PLACEHOLDER_REGEX).flatten.uniq.each_with_object({}) do |placeholder, placeholder_hash|
            next if placeholder.include?('{') || placeholder.include?('}')

            placeholder_text = "{{#{placeholder}}}"
            current_index = text.index(placeholder_text)
            arr_of_index = [current_index]
            until current_index.nil?
              current_index = text.index(placeholder_text, current_index + 1)
              arr_of_index << current_index unless current_index.nil?
            end
            placeholder_hash[placeholder_text] = arr_of_index
          end
        end

        # @param [String] :placeholder
        # @param [Integer] :start_index, end_index (indexes of the first and last text run),
        #   p_start_index, p_end_index (character offsets within the first and last text run)
        # This method replaces the following:
        # 1. Corrupted text node content with an empty string
        # 2. The complete placeholder content within a single text node
        # Ex: Assume we have an array of text node contents:
        #   text_runs = ['This is ', 'Placeh', 'older Text', 'with ', '{{', 'Place', 'holder}}', '{{Text}}', '{{Placeholder}}']
        # Here the first '{{Placeholder}}' is not contained in a single text node.
        # We need to merge the content of text_runs[4], text_runs[5] and text_runs[6].
        # So we will replace the content as below:
        # 1. text_runs[4] = '{{Placeholder}}'
        # 2. text_runs[5] = ''
        # 3. text_runs[6] = ''
        def replace_incorrect_placeholder_content(placeholder, start_index, end_index, p_start_index, p_end_index)
          (start_index..end_index).each do |index|
            if index == start_index
              # Keep any text before the placeholder and put the whole placeholder here.
              current_text = text_runs[index].text.to_s
              current_text[p_start_index..-1] = placeholder
              text_runs[index].text = current_text
            elsif index == end_index
              # Drop the tail of the placeholder and keep any text after it.
              current_text = text_runs[index].text.to_s
              current_text[0..p_end_index] = ''
              text_runs[index].text = current_text
            else
              # Runs strictly inside the placeholder are emptied.
              text_runs[index].text = ''
            end
          end
        end
      end
    end
  end
end
Not sure if this will work in all cases. If you have any other thoughts on this solution, please share.
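For anyone who wants to test the merging idea outside the gem, here is a minimal sketch on plain strings. It is not the docx gem's API and not the exact code above: merge_split_placeholders is a hypothetical helper, the array of strings stands in for the text of each run, and the default pattern is an assumed stricter variant that rejects braces inside the delimiters.

```ruby
# Hypothetical helper: returns a copy of run_texts in which every placeholder
# that spans several runs is moved, whole, into the run where it starts.
def merge_split_placeholders(run_texts, pattern = /\{\{[^{}]*\}\}/)
  runs = run_texts.map(&:dup)

  # starts[i] is the offset of run i within the joined paragraph text.
  starts = []
  offset = 0
  run_texts.each do |text|
    starts << offset
    offset += text.length
  end

  # Collect [start offset, matched text] for every placeholder in the joined text.
  matches = []
  run_texts.join.scan(pattern) { |found| matches << [Regexp.last_match.begin(0), found] }

  # Work right to left so the original offsets stay valid while runs are edited.
  matches.reverse_each do |p_start, placeholder|
    p_end = p_start + placeholder.length - 1
    first = starts.rindex { |s| s <= p_start }
    last = starts.rindex { |s| s <= p_end }
    next if first == last # already inside one run

    # Keep the text before the placeholder, then write the whole placeholder.
    runs[first] = runs[first][0, p_start - starts[first]] + placeholder
    # Empty the runs that lie entirely inside the placeholder.
    ((first + 1)...last).each { |i| runs[i] = '' }
    # Drop the placeholder's tail from the last run and keep what follows it.
    runs[last] = runs[last][(p_end - starts[last] + 1)..-1]
  end

  runs
end

merged = merge_split_placeholders(['This is ', '{{', 'v', 'orname', '}}', ' and ', '{{Name}}'])
puts merged.inspect
```

The joined text is unchanged by the merge, so a per-run substitution afterwards sees each placeholder in one piece. Formatting that differed between the merged runs would still collapse to the first run's formatting.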
@satoryu can you please help here ?
@manovasanth1227 thanks for the quick response and solution. It's a nice workaround, and I can confirm it works for my needs. I really appreciate it, friend!
@satoryu any update on this ?
@manovasanth1227 I think you can try this instead of monkey patching it:
doc.paragraphs.each do |p|
  p.each_text_run do |tr|
    tr.substitute(tr.text, tr.text.to_s.gsub('{{vorname}}', 'Working Fine')) if tr.text =~ /\{\{vorname\}\}/i
  end
end