metaeducation/ren-c

Internal Error: invalid UTF-8 byte sequence found during decoding - on ü

Opened this issue · 8 comments

>> b: read http://www.google.com                                          
== #{
3C21646F63747970652068746D6C3E3C68746D6C206974656D73636F70653D22
22206974656D747970653D22687474703A2F2F736368656D612E6F72672F5765
...

>> x: copy/part at b 7912 5 
== #{476CFC636B}

>> to text! copy/part x 1   
== "G"

>> to text! copy/part x 2 
== "Gl"

>> to text! copy/part x 3 
** Internal Error: invalid UTF-8 byte sequence found during decoding
** Where: to console
** Near: [... copy/part x 3 ~~]
** Line: 1

>> to text! at x 4         
== "ck"

>> copy/part at x 3 1            
== #{FC}

If opened in the Firefox view source window the text is: Glück
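For anyone who wants to reproduce the byte-level behavior outside of Ren-C, here is a minimal Python sketch using the same five bytes. The helper name `try_utf8` is made up for illustration. It relies on two facts: 0xFC is the ISO-8859-1 code for ü, and 0xFC can never appear in valid UTF-8 (RFC 3629 excludes bytes 0xF5 through 0xFF).

```python
# The five bytes from the report: #{476CFC636B}
raw = bytes.fromhex("476CFC636B")


def try_utf8(data: bytes) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        return f"invalid UTF-8 at offset {err.start}: byte 0x{bad:02X}"


print(try_utf8(raw[:2]))          # the first two bytes are plain ASCII
print(try_utf8(raw[:3]))          # the third byte, 0xFC, is rejected
print(raw.decode("iso-8859-1"))   # the same bytes read as ISO-8859-1

# What a UTF-8 encoded "Glück" would look like on the wire instead
print("Glück".encode("utf-8").hex().upper())
```

In UTF-8 the ü would be the two-byte sequence C3 BC, so the single FC byte suggests the page body is not UTF-8 encoded.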

gchiu commented
>> to text! read https://www.google.de
** Internal Error: invalid UTF-8 byte sequence found during decoding
** Where: to console
** Near: [... text! read https://www.google.de ~~]
** Line: 1
>> bin-to-string: function [bin [binary!]][
    text: make text! length? bin
    for-each byte bin [append text to char! byte]
    text
]

>> bin-to-string read https://www.google.de
== {<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="sk"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title>
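A note on why the workaround above does not error: assuming `to char! byte` maps each byte value to the code point with the same number, the function is effectively an ISO-8859-1 decoder, since that encoding maps byte N to U+00NN. A rough Python equivalent (the name `bin_to_string` just mirrors the Ren-C function and is not a library call):

```python
def bin_to_string(data: bytes) -> str:
    # One code point per byte, mirroring `append text to char! byte`
    return "".join(chr(b) for b in data)


sample = bytes.fromhex("476CFC636B")

# Identical to decoding the bytes as ISO-8859-1 (Latin-1)
same_as_latin1 = bin_to_string(sample) == sample.decode("iso-8859-1")
print(bin_to_string(sample), same_as_latin1)

# Caveat: genuinely UTF-8 input is not rejected, it is silently mangled,
# because each byte of a multi-byte sequence becomes its own character.
mangled = bin_to_string("Glück".encode("utf-8"))
print(len(mangled))  # 6 characters instead of 5
```

So it never fails, but it only gives the right text when the input really is ISO-8859-1 (or pure ASCII).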

see stackoverflow

Can someone give me an executive summary of this, so I don't have to do too much research?

Is Google serving up invalid UTF-8 (hence Google's problem, IMO) or is it valid (and thus our problem)?

Note: A long time ago I had suggested to BrianH that PARSE seemed a good interface to TRANSCODE. With the "residual" return result, I imagine we could say that:

parse binary [set t text!]

Could be a way of decoding as much UTF-8 as you can, and returning the position of any residual bytes. If you get NULL then that means the decoding succeeded all the way. Something to think about.
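To make the "residual" idea concrete, here is a sketch in Python rather than PARSE. The function name `decode_prefix` and its return shape are hypothetical; `None` stands in for NULL. It is not how PARSE or TRANSCODE are implemented.

```python
from typing import Optional, Tuple


def decode_prefix(data: bytes) -> Tuple[str, Optional[int]]:
    """Decode the longest valid UTF-8 prefix of `data`.

    Returns (text, residual), where residual is the offset of the first
    byte that could not be decoded, or None if everything decoded.
    """
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte, so
        # everything before it is known to be valid UTF-8.
        return data[:err.start].decode("utf-8"), err.start


print(decode_prefix(bytes.fromhex("476CFC636B")))
print(decode_prefix("Glück".encode("utf-8")))
```

With the bytes from this issue the residual position points at the FC byte, and the caller can decide what to do with the rest.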

There is another SO page here: https://stackoverflow.com/questions/47108274/read-https-google-com-doesnt-work-anymore-in-red

I found both of them earlier by chance while looking for this error. Both claim it is a problem with Google's UTF-8 encoding. I don't know enough about UTF-8 to check myself. But if it were a problem on Google's side, why are there no complaints from people using Python, etc.? It seems only Rebol/R3-Renc/RED have this problem.

But the fix works, so I didn't investigate further.. ¯_(ツ)_/¯

Btw I do get null when executing the parse command.

Ok, I have found the problem:

$ curl -i https://www.google.de
HTTP/2 200 
date: Fri, 07 Feb 2020 22:33:03 GMT
expires: -1
cache-control: private, max-age=0
content-type: text/html; charset=ISO-8859-1

...snip...

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="sk"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
...snip...

There is this SO commentary regarding the <meta charset="utf-8"> vs <meta http-equiv="Content-Type"> in HTML:

  • Noted should be that neither is been used for parsing when the page is served over web. Instead, the one in HTTP Content-Type response header will be used. The meta tag is only used when the page is loaded from local disk file system.

So Google is serving ISO-8859-1 even though the HTML says it is UTF-8..
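A small Python sketch of what "trust the header" looks like in practice. The helper name `charset_from_header` and the UTF-8 fallback are assumptions for illustration; the header value is the one from the curl output above, and the body bytes are the five bytes from the original report.

```python
from email.message import Message


def charset_from_header(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    The fallback used when no charset is given is an assumption of
    this sketch, not something the thread specifies.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


header = "text/html; charset=ISO-8859-1"   # from the curl output
body = bytes.fromhex("476CFC636B")         # bytes from the original report

charset = charset_from_header(header)
print(charset, body.decode(charset))
```

Decoding with the charset from the HTTP header gives the expected text, while the charset named in the meta tag would fail on the same bytes.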

Well, good to know. :-/ Thanks for digging into it.

I've said that there needs to be a clear organization of the meaning of things like READ vs. LOAD, and how it all works. This is yet-another-piece-of-evidence that READ needs to stay in the world of bytes. LOAD then needs to be able to automatically sense content types and give you what you want, or give you an error if you do not have a codec for it.
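Purely as a sketch of that split, in Python rather than Ren-C: the reading step hands back bytes plus the content type, and a separate `load` picks a codec or raises. The `CODECS` registry, `decode_text`, and `load` are hypothetical names, and the UTF-8 fallback is an assumption; this is not a proposal for the actual LOAD design.

```python
from email.message import Message


def decode_text(data: bytes, charset: str) -> str:
    return data.decode(charset)


# Hypothetical registry mapping media types to codecs
CODECS = {
    "text/html": decode_text,
    "text/plain": decode_text,
}


def load(data: bytes, content_type: str):
    """Turn raw bytes into a value, based on the reported content type."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    codec = CODECS.get(media_type)
    if codec is None:
        # No codec for this type: fail loudly instead of guessing
        raise LookupError(f"no codec registered for {media_type}")
    # Assumed fallback when the header carries no charset parameter
    return codec(data, msg.get_content_charset() or "utf-8")


page = load(bytes.fromhex("476CFC636B"), "text/html; charset=ISO-8859-1")
print(page)
```

The point of the shape is that the byte-reading step never has to guess an encoding, and an unknown type produces an error rather than garbage.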

Going to have to put some thought into this; one piece of good news is that by being in the browser, we can experiment through the lens of something where all the network basics are taken care of for us. Then that design could be reused on the desktop based on what we learn.

>> to text! copy/part x 1   
== "G"

As an aside @IngoHohmann - the nature of text and binary is now such that they can be aliased between each other with AS. This does not make a copy, while TO does.

So above, you are copying a chunk out of a binary, then making another copy in order to do the TO.

You could build a single disconnected copy from the binary with as text! copy/part x 1.

After AS is used to alias a BINARY! as a TEXT!, however, that binary is constrained: every modification must keep it valid UTF-8. In this case that's obviously not a problem for you, since you didn't store the copy anywhere else and hence can't access it as a binary (unless you alias it back). But even if you alias it back, the series has already been aliased as TEXT!, so the binary would carry the same constraint.