page.GetContentAsync throwing Cannot read incomplete UTF-16 JSON text as string with missing low surrogate
Tiggerito opened this issue · 4 comments
Description
Navigating to some pages causes the GetContentAsync method to throw an exception.
var options = new LaunchOptions { /* */ };
var chromiumRevision = BrowserFetcher.DefaultRevision;
var browser = await Puppeteer.LaunchAsync(options, chromiumRevision);
var page = await browser.NewPageAsync();
await page.GoToAsync("https://domain.com/");
var content = await page.GetContentAsync(); // exception
Replace domain with getglowingnowskincare.
Expected behavior:
The content is returned.
Actual behavior:
The following exception is thrown:
The JSON value could not be converted to System.String. Path: $ | LineNumber: 0 | BytePositionInLine: 751401. | Cannot read incomplete UTF-16 JSON text as string with missing low surrogate.
at System.Text.Json.ThrowHelper.ReThrowWithPath(ReadStack& state, Utf8JsonReader& reader, Exception ex)
at System.Text.Json.Serialization.JsonConverter`1.ReadCore(Utf8JsonReader& reader, JsonSerializerOptions options, ReadStack& state)
at System.Text.Json.JsonSerializer.ReadFromSpan[TValue](ReadOnlySpan`1 utf8Json, JsonTypeInfo`1 jsonTypeInfo, Nullable`1 actualByteCount)
at System.Text.Json.JsonSerializer.Deserialize[TValue](JsonElement element, JsonSerializerOptions options)
at PuppeteerSharp.Helpers.Json.JsonHelper.ToObject[T](JsonElement element, JsonSerializerOptions options) in /home/runner/work/puppeteer-sharp/puppeteer-sharp/lib/PuppeteerSharp/Helpers/Json/JsonHelper.cs:line 53
at PuppeteerSharp.Helpers.RemoteObjectHelper.ValueFromType[T](JsonElement value, RemoteObjectType objectType, Boolean stringify) in /home/runner/work/puppeteer-sharp/puppeteer-sharp/lib/PuppeteerSharp/Helpers/RemoteObjectHelper.cs:line 74
at PuppeteerSharp.Helpers.RemoteObjectHelper.ValueFromRemoteObject[T](RemoteObject remoteObject, Boolean stringify) in /home/runner/work/puppeteer-sharp/puppeteer-sharp/lib/PuppeteerSharp/Helpers/RemoteObjectHelper.cs:line 15
at PuppeteerSharp.ExecutionContext.RemoteObjectTaskToObject[T](Task`1 remote)
at PuppeteerSharp.IsolatedWorld.EvaluateFunctionAsync[T](String script, Object[] args)
Versions
PuppeteerSharp 19.0.2
.NET target: net8.0
Solution
I believe there was a recent change in which JSON parser is used, which may have introduced this issue.
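The exception relates to malformed text on the page: a lone UTF-16 surrogate (a high surrogate with no low surrogate after it), which the JSON deserializer refuses to convert to a string.
This can be fixed in the page, before the string is returned, by calling its toWellFormed() function, which replaces each lone surrogate with U+FFFD.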
I created my version of GetContentAsync with the following line changed, and the content was successfully returned:
content += document.documentElement.outerHTML.toWellFormed();
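To illustrate what toWellFormed() does to such a string, here is a minimal sketch that runs outside the browser. The helper name toWellFormedCompat and the regex fallback are my own assumptions for runtimes that lack the native method; they are not part of PuppeteerSharp.

```javascript
// Matches a UTF-16 surrogate code unit that is not part of a valid pair.
// No `u` flag, so the pattern works on code units rather than code points.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// Hypothetical helper (illustration only): uses the native toWellFormed()
// when the runtime has it, otherwise replaces lone surrogates with U+FFFD.
function toWellFormedCompat(text) {
  if (typeof text.toWellFormed === "function") {
    return text.toWellFormed();
  }
  return text.replace(LONE_SURROGATE, "\uFFFD");
}

// A high surrogate (U+D83D) with no low surrogate after it.
const broken = "<p>caf\u00E9 \uD83D menu</p>";
// A valid pair (U+1F600), which must be left alone.
const valid = "<p>\uD83D\uDE00</p>";

console.log(toWellFormedCompat(broken) === "<p>caf\u00E9 \uFFFD menu</p>");
console.log(toWellFormedCompat(valid) === valid);
```

Valid surrogate pairs pass through unchanged, so only the broken characters are altered; the trade-off is that the returned content no longer matches the page byte for byte.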
Do you have some HTML we can use as an example for a test?
getglowingnowskincare(dot)com is an example.
I tried finding a way to make the JSON serializer more forgiving, but I have not found a solid solution yet.
I like the idea of making it an option. That way, people can test for the issue.
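As a rough sketch of what an opt-in could look like on the page side, assuming the content is built from document.documentElement.outerHTML as in the line I changed: the function name serializeDocument and the wellFormed flag are hypothetical, and doctype handling is left out to keep it short.

```javascript
// Hypothetical sketch, not PuppeteerSharp's actual implementation.
// `doc` stands in for the page's `document`; `wellFormed` is an assumed opt-in flag.
function serializeDocument(doc, options = {}) {
  const wellFormed = options.wellFormed === true;
  let content = doc.documentElement ? doc.documentElement.outerHTML : "";

  if (wellFormed) {
    if (typeof content.toWellFormed === "function") {
      // Native method, where the runtime provides it.
      content = content.toWellFormed();
    } else {
      // Fallback: replace surrogates that are not part of a valid pair.
      content = content.replace(
        /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
        "\uFFFD"
      );
    }
  }
  return content;
}

// Stand-in for a page whose markup contains a lone high surrogate.
const fakeDocument = { documentElement: { outerHTML: "<html>\uD83D</html>" } };

console.log(serializeDocument(fakeDocument) === "<html>\uD83D</html>");
console.log(
  serializeDocument(fakeDocument, { wellFormed: true }) === "<html>\uFFFD</html>"
);
```

With the flag off, the string is returned untouched, so anyone can still reproduce the exception; with it on, the lone surrogate is replaced before the value is serialized.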