How can I encode a URL in a different character encoding?
alphonse1234 opened this issue · 7 comments
When I try to parse a URL, I sometimes get an error like:
couldn’t be opened because the text encoding of its contents can’t be determined.
How can I decode the page as EUC-KR or another character encoding when decoding as UTF-8 fails?
func getStringFromHtml(urlString : String) -> String {
    let url = URL(string: urlString)!
    var result = ""
    do {
        let html = try String(contentsOf: url)
        let doc: Document = try SwiftSoup.parse(html)
        let meta: Element = try doc.select("meta[property=og:title]").first()!
        let text: String = try meta.attr("content")
        result = text
    } catch {
        print("error")
    }
    return result
}
I have one more question.
How can I select an element not by one specific attribute value, but by a value that contains a specific string?
For example, the HTML I’m trying to parse has
<meta property="og:title" content="STRING I WANT" />
But sometimes the page has no such ‘property’ and instead has
<meta name="twitter:title" content="STRING I WANT" />
So what I want to do is find the meta element whose property contains ":title" and read its content string.
Thank you.
I realized that for some URLs, the main domain parses with no error, but a URL with a trailing path on the same domain does not.
For example,
https://ytn.co.kr
prints the parsed HTML doc.
https://ytn.co.kr/_ln/0101_202003291216267376
prints NSCocoaErrorDomain Code=264 "The file “0101_202003291216267376” couldn’t be opened because the text encoding of its contents can’t be determined."
Here is another case:
https://www.bodnara.co.kr
https://www.bodnara.co.kr/bbs/article.html?num=162106
Both of these URLs print the same error:
NSCocoaErrorDomain Code=264 "The file couldn’t be opened because the text encoding of its contents can’t be determined."
So the same domain can give different results.
Error 264 means String couldn't determine the encoding of the file or website HTML.
The error is thrown by this line: let html = try String(contentsOf: url)
This code is not part of the SwiftSoup library; it belongs to Apple's Foundation framework.
I suggest you try specifying the encoding:
let html = try String(contentsOf: url, encoding: String.Encoding.utf16)
This works for your site https://ytn.co.kr/_ln/0101_202003291216267376
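If a single fixed encoding does not fit every page, one option is to download the raw bytes and try a short list of candidate encodings in order. The following is only a sketch, assuming an Apple platform where the CoreFoundation encoding constants are available. The names `eucKR` and `decodeHTML` are made up for this example, and EUC-KR is included because the question asks about it, not because either site is confirmed to use it.

```swift
import Foundation

// EUC-KR has no built-in String.Encoding case, so it is bridged from
// the CoreFoundation constant (assumes an Apple platform).
let eucKR = String.Encoding(
    rawValue: CFStringConvertEncodingToNSStringEncoding(
        CFStringEncoding(CFStringEncodings.EUC_KR.rawValue)
    )
)

// Hypothetical helper: returns the first successful decoding, or nil
// if none of the candidate encodings can decode the bytes.
func decodeHTML(_ data: Data,
                candidates: [String.Encoding] = [.utf8, eucKR]) -> String? {
    for encoding in candidates {
        // String(data:encoding:) returns nil when the bytes are not
        // valid in the given encoding, so the loop moves on.
        if let text = String(data: data, encoding: encoding) {
            return text
        }
    }
    return nil
}
```

It could be used as `let html = decodeHTML(try Data(contentsOf: url))`. UTF-8 is tried first because its validation is strict, so bytes in another encoding usually fail it and fall through to the next candidate.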
Thank you! I’ll try.
And do you have an idea for the second issue?
See "Use selector syntax to find elements":
https://github.com/scinfu/SwiftSoup#use-selector-syntax-to-find-elements
meta[property~=:title]
This regex selector might help you. Keep in mind that it matches every title, such as:
<meta property="og:title" content="YTN">
<meta property="twitter:title" content="YTN">
<meta property="foo:title" content="YTN">
<meta property="bar:title" content="YTN">
Or you can match only twitter and og with this:
meta[property~=twitter:title|og:title]
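The twitter tag in the question uses a `name` attribute rather than `property`, so a selector on `property` alone would miss it. A possible way to cover both is to group two attribute selectors with a comma. This is a minimal sketch: the function name `titleFromMeta` is invented for this example, and it assumes the first matching tag is the one wanted.

```swift
import SwiftSoup

// Hypothetical helper: returns the content of the first <meta> tag whose
// property or name attribute matches the regex ":title", or nil if
// no such tag exists.
func titleFromMeta(_ html: String) throws -> String? {
    let doc = try SwiftSoup.parse(html)
    // A comma groups selectors, so either attribute can match.
    let metas = try doc.select("meta[property~=:title], meta[name~=:title]")
    // Optional chaining avoids the crash a force unwrap would cause
    // when the page has no matching tag.
    return try metas.first()?.attr("content")
}
```

Returning an optional instead of force-unwrapping `first()` lets the caller handle pages that carry neither tag.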
Thanks. Using selector syntax solved that problem, but I still have the problem with the URLs.
With utf16, it returns the HTML, but with unreadable characters.
Try fetching the page with a GET request.
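One reading of this suggestion is to fetch the page with URLSession, which issues a GET by default, and decode the bytes using the charset the server declares. The sketch below assumes an Apple platform. `encoding(fromIANAName:)` and `fetchHTML` are names made up for this example, and falling back to UTF-8 is an assumption, not something the thread states.

```swift
import Foundation

// Hypothetical helper: maps an IANA charset name such as "euc-kr" to a
// String.Encoding, or nil when the name is missing or unknown.
func encoding(fromIANAName name: String?) -> String.Encoding? {
    guard let name = name else { return nil }
    let cfEncoding = CFStringConvertIANACharSetNameToEncoding(name as CFString)
    guard cfEncoding != kCFStringEncodingInvalidId else { return nil }
    return String.Encoding(
        rawValue: CFStringConvertEncodingToNSStringEncoding(cfEncoding)
    )
}

// Hypothetical helper: performs a GET and decodes the body with the
// charset from the response headers, falling back to UTF-8 (assumption).
func fetchHTML(from url: URL, completion: @escaping (String?) -> Void) {
    let task = URLSession.shared.dataTask(with: url) { data, response, _ in
        guard let data = data else {
            completion(nil)
            return
        }
        let declared = encoding(fromIANAName: response?.textEncodingName)
        completion(String(data: data, encoding: declared ?? .utf8))
    }
    task.resume()
}
```

If the server sends no charset header, `textEncodingName` is nil and the decode may still fail, in which case the completion handler receives nil.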
I'm closing this issue. Reopen it if you need to.