[Bug] Characters like äüö are output incorrectly
jamal2362 opened this issue · 8 comments
I don't think it's a Readability4J issue; rather, you have to wrap the output in a structure like this to set the encoding to UTF-8 (see #2):
<html>
<head>
<meta charset="utf-8" />
</head>
<body>
<!-- output here -->
</body>
</html>
This is exactly what article.getContentWithUtf8Encoding() does. Does it work for you?
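Roughly like this, assuming the usual parse flow (the URL and HTML below are placeholders, and depending on the library version Kotlin may expose the method as the property contentWithUtf8Encoding instead of a function):
import net.dankito.readability4j.Readability4J

val url = "https://example.com"   // placeholder
val html = "..."                  // the HTML you downloaded yourself
val article = Readability4J(url, html).parse()
// returns the extracted content wrapped in the <html>/<head>/<body> structure shown above,
// with a UTF-8 <meta> tag in the head
val wrapped = article.getContentWithUtf8Encoding()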
@jamal2362 Is it possible the website uses a charset other than UTF-8 and you don't take that into account when creating your stringBuffer?
You're right, article.getContentWithUtf8Encoding() didn't take the document's charset into account.
I have now created the method article.getContentWithDocumentsCharsetOrUtf8(), which does exactly that.
But I don't think that will resolve @jamal2362's issue, as the document above, google.de, already has its charset set to UTF-8.
Try version 1.0.8 to see if it solves your issue, but I think the problem lies somewhere else.
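In code the difference would look roughly like this (same caveat as above about the exact Kotlin signatures):
// since 1.0.8: uses the charset declared by the source document, falling back to UTF-8
val wrappedWithDocumentCharset = article.getContentWithDocumentsCharsetOrUtf8()
// always wraps with a UTF-8 meta tag, regardless of the document's own charset
val wrappedWithUtf8 = article.getContentWithUtf8Encoding()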
@dankito My apologies, my question was aimed at @jamal2362, sorry if that wasn't clear. I don't think your library does anything wrong. I think the String that's being passed to your library is already wrong, because the code creating the String doesn't check the website encoding.
The same thing actually happened to me and I thought for a while that Readability4J was malfunctioning before realizing it was my own fault :-)
@dankito
Thank you for your work!
Unfortunately, this did not help.
Am I doing something wrong in my code?
Do you also have the problem with "google.de"?
@michaldvorak79
What does that mean exactly?
What should I change?
@jamal2362 What I mean is this: when you download a web page, you have a byte array, right? But Readability4J requires a String, so you have to convert the byte array to a String. And for that you need to know the web page's character encoding (or "charset"), whether it's UTF-8, Windows-1252, ISO-8859-1 or something else. You have to tell Java which character encoding the byte array uses, otherwise the String will not be created correctly. For example, if you have a web page that uses an ISO encoding and you convert it into a String using the UTF-8 encoding, it will keep regular English characters (as those are the same in both encodings), but it will mangle special characters.
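Here is a tiny self-contained example of that mangling, nothing to do with Readability4J itself:
fun main() {
    // the bytes of "Bär" as an ISO-8859-1 page would send them
    val bytes = "Bär".toByteArray(Charsets.ISO_8859_1)

    println(String(bytes, Charsets.ISO_8859_1)) // correct charset -> Bär
    println(String(bytes, Charsets.UTF_8))      // wrong charset  -> B�r (mangled)
}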
Charset can normally be obtained from the response HTTP headers, or it's included in a <meta> tag in the HTML code.
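For example, with plain JDK classes you could read the charset from the Content-Type header like this (a rough sketch: it defaults to UTF-8 when no charset is declared and doesn't look at the <meta> tag):
import java.net.URL
import java.nio.charset.Charset

fun downloadHtml(url: String): String {
    val connection = URL(url).openConnection()
    val bytes = connection.getInputStream().use { it.readBytes() }

    // e.g. "text/html; charset=ISO-8859-1" -> "ISO-8859-1"
    val charsetName = connection.contentType
        ?.substringAfter("charset=", missingDelimiterValue = "")
        ?.trim(' ', '"')
        ?.takeIf { it.isNotEmpty() }
        ?: "UTF-8"

    return String(bytes, Charset.forName(charsetName))
}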
I don't know what your code looks like exactly or how you obtain the data in your stringBuffer, but my theory was that maybe you always create the data in the stringBuffer as UTF-8, and the websites that give you trouble actually use a different character encoding.
You can check your htmlData variable after you create it and see whether it contains the proper special characters, or whether they are already mangled. If the special characters are good in your htmlData and bad in Readability4J's output, then the library is doing something wrong. If the characters are already mangled in htmlData, then you are using the wrong character encoding when turning the byte array into a String.
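A quick way to do that check (htmlData being the String you end up passing to Readability4J):
// if the umlauts are already broken here, the problem happens before the library is involved
println("contains ä/ö/ü: " + htmlData.any { it in "äöüÄÖÜ" })
println("contains replacement char \uFFFD: " + htmlData.contains('\uFFFD'))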
Can you post the code you use to download the web page's HTML, Jamal?
Maybe this code helps you:
import net.dankito.readability4j.extended.Readability4JExtended
import org.jsoup.Jsoup
import java.net.URL

val uri = "https://google.de" // set your url here
// Jsoup downloads the page itself and detects its charset from the HTTP response / meta tags
val document = Jsoup.parse(URL(uri), 10000)
val readability = Readability4JExtended(uri, document.outerHtml())
val article = readability.parse()
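Jsoup.parse(URL, timeoutMillis) reads the bytes itself and picks the charset from the HTTP response and the document's <meta> tags, so you never have to convert a byte array to a String by hand. If you keep your own download code instead, getting the charset right when you turn the bytes into a String is the important part.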