OfficeDev/office-js

Paragraph.text is returning wrong character when logging

Closed this issue · 7 comments

Your Environment

  • Platform [PC desktop, Mac, iOS, Office on the web]: Mac
  • Host [Excel, Word, PowerPoint, etc.]: Word
  • Office version number: 16.85
  • Operating System: 14.3
  • Browser (if using Office on the web): Chrome

Expected behavior/Current behavior

I have a Word document (Test_doc.docx) that contains this paragraph:

image

I expect it to be printed out as:

(name, address), registered with [the commercial register of the Lower Court (Amtsgericht) of [*] under registration no. (the Purchaser)

But instead I get:

(name, address), registered with [the commercial register of the Lower Court (Amtsgericht) of [(] under registration no. (the Purchaser)

Steps to reproduce

  1. Open File
  2. Use this snippet in Script Lab:
$("#run").on("click", () => tryCatch(run));

async function run() {
  await Word.run(async (context) => {
    const paragraphs = context.document.body.paragraphs;
    paragraphs.load("items");

    await context.sync();

    // Output the text of each paragraph using both the property and getText method
    for (const paragraph of paragraphs.items) {
      console.log("Paragraph text property:", paragraph.text);

      // Using getText method
      const textRange = paragraph.getText();

      await context.sync();

      console.log("Paragraph getText method:", textRange.value);
    }
  });
}

// Default helper for invoking an action and handling errors.
async function tryCatch(callback) {
  try {
    await callback();
  } catch (error) {
    // Note: In a production add-in, you'd want to notify the user through your add-in's UI.
    console.error(error);
  }
}
  3. Check the output logs; you can see the differences there.

Context

Useful logs

  • Console errors
  • [ X] Screenshots
  • [ X] Test file (if only happens on a particular file)

Test file:
Test_doc.docx

Thank you for letting us know about this issue. We will take a look shortly. Thanks.

Hi @datitran, thanks for reporting this case. So that we can better understand it, would you please show us how this character was inserted? Thanks.

@jipyua actually I don't know how it was inserted. It's a file from a client. I believe they have some tools that let them insert blobs. Weirdly, I checked the OOXML representation and it gives me the right Unicode character back. The paragraph I attached is one I just copied and pasted into a new Word file. Somehow it inherits the "style".

@datitran thanks for providing the information. We have logged this issue and will investigate on our side. We will share an update with you as soon as we make progress. Thanks.

@jipyua hey, any update on this one? I found another Word document with this problem, but with a different symbol. You can find the document here: https://nvca.org/recommends/nvca-ira-updated-april-2024/ (see page 42).

Screenshot 2024-06-03 at 16 15 17

So getText is correct but the `text` property is wrong.

I'm the creator of LegalLint (https://appsource.microsoft.com/de-de/product/office/wa200006962?tab=overview), and this creates false positives like this:

Screenshot 2024-06-03 at 16 15 05

What would you recommend? Switching to getText? getText is very slow as I need to do another context.sync. Is there a better way than this?

I actually found a solution to make getText fast. I didn't know I could collect the results in a map first and then call context.sync just once. This works and it's super fast now. Nevertheless, the bug in .text is still important to fix.
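
For readers following along, the batching described above might look roughly like the sketch below. It is only an illustration, not the code used in LegalLint: the function name `readParagraphTexts` is hypothetical, and it assumes it receives the `context` object that `Word.run` passes to its callback.

```javascript
// Hypothetical helper: read every paragraph's text through getText(),
// using one extra context.sync() in total instead of one per paragraph.
async function readParagraphTexts(context) {
  const paragraphs = context.document.body.paragraphs;
  paragraphs.load("items");
  await context.sync();

  // Queue a getText() call for each paragraph. Each call returns a result
  // object whose .value is only filled in by the next sync.
  const results = paragraphs.items.map((paragraph) => paragraph.getText());

  // One sync resolves all queued results at once.
  await context.sync();

  return results.map((result) => result.value);
}
```

In Script Lab this could be called as `const texts = await Word.run((context) => readParagraphTexts(context));`, since `Word.run` resolves with whatever the callback returns.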

@datitran thank you, and sorry for the late reply. Yes, as you mentioned, getText() can be queued together with other API calls in the same batch, which should give you better performance. For the .text API, we have noticed that some special characters can't be fetched correctly; that's one of the reasons the new getText() API is provided. We will be very careful about changing the behavior of .text, for backward-compatibility reasons. In this case, we suggest always using getText().
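
To find which paragraphs are affected while both APIs exist, one option is to compare the two values and report the first code point where they diverge. The helper below is a hypothetical sketch (the names `firstDifference` and `describeCodePoint` are not part of Office.js); it works on plain strings, so it can be fed the `.text` value and the `getText()` value for the same paragraph.

```javascript
// Hypothetical helper: find the first position where two strings differ,
// comparing by Unicode code point rather than by UTF-16 code unit.
function firstDifference(a, b) {
  const left = Array.from(a);
  const right = Array.from(b);
  const length = Math.max(left.length, right.length);
  for (let i = 0; i < length; i++) {
    if (left[i] !== right[i]) {
      return {
        index: i,
        left: describeCodePoint(left[i]),
        right: describeCodePoint(right[i]),
      };
    }
  }
  return null; // the strings are identical
}

// Format a single character as "U+XXXX", or mark the end of the string.
function describeCodePoint(ch) {
  if (ch === undefined) return "(end of string)";
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}
```

Called with the two outputs quoted in the report, it would point at the character inside the square brackets; a `null` result means the two APIs agree for that paragraph.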