scinfu/SwiftSoup

How can I encode url in different character encoding?

alphonse1234 opened this issue · 7 comments

When I try to parse a URL, I sometimes get an error like

couldn’t be opened because the text encoding of its contents can’t be determined.

How can I decode the page as EUC-KR or another character encoding when decoding it as UTF-8 fails?

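import Foundation
import SwiftSoup

func getStringFromHtml(urlString: String) -> String {
    // Return an empty string when the URL is malformed instead of crashing.
    guard let url = URL(string: urlString) else { return "" }

    var result = ""

    do {
        // Download the page and let Foundation detect its text encoding.
        let html = try String(contentsOf: url)
        let doc: Document = try SwiftSoup.parse(html)

        // first() returns nil when the page has no og:title meta tag.
        if let meta = try doc.select("meta[property=og:title]").first() {
            result = try meta.attr("content")
        }
    } catch {
        // Print the underlying error rather than a fixed string.
        print("error: \(error)")
    }
    return result
}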

And I have one more issue.

How can I select not one specific element, but any element whose attribute contains a specific string?

For example,

In the HTML I’m trying to parse, there is a
<meta property="og:title" content="STRING I WANT" />
But sometimes the page has no such 'property' attribute and instead has
<meta name="twitter:title" content="STRING I WANT" />

So what I want to do is,

Search the meta elements and read the content string of the one whose property contains ":title".

Thank you.

I realized that for some URLs the main domain parses without error, but pages under a path fail to parse.
For example:


https://ytn.co.kr  
prints the parsed HTML document.
https://ytn.co.kr/_ln/0101_202003291216267376  
prints NSCocoaErrorDomain Code=264 "The file “0101_202003291216267376” couldn’t be opened because the text encoding of its contents can’t be determined."

and here is another case.

https://www.bodnara.co.kr
https://www.bodnara.co.kr/bbs/article.html?num=162106

With this site, both URLs print the same error:

NSCocoaErrorDomain Code=264 "The file couldn’t be opened because the text encoding of its contents can’t be determined."

same domain but different result.

Error 264 means String could not determine the encoding of the file or website HTML.
The error is thrown by let html = try String(contentsOf: url); this code is not part of the SwiftSoup library but of Apple's Foundation framework.

I suggest specifying the encoding explicitly.
let html = try String(contentsOf: url, encoding: String.Encoding.utf16) works for your page https://ytn.co.kr/_ln/0101_202003291216267376
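
When the encoding of a page is not known in advance, one option is to download the raw bytes and try several encodings in turn. The following is only a minimal sketch: `decodeHTML` and `eucKR` are hypothetical names, and the raw value used for EUC-KR is an assumption (CoreFoundation's EUC-KR constant 0x0940 combined with the 0x80000000 flag that Foundation uses for non-built-in encodings), so verify it on your platform.

```swift
import Foundation

// Assumption: EUC-KR as an NSStringEncoding value, i.e.
// 0x80000000 | kCFStringEncodingEUC_KR (0x0940).
let eucKR = String.Encoding(rawValue: 0x8000_0940)

/// Tries each candidate encoding in order and returns the first successful decode.
/// Returns nil when none of the candidates can decode the bytes.
func decodeHTML(_ data: Data,
                candidates: [String.Encoding] = [.utf8, eucKR]) -> String? {
    for encoding in candidates {
        if let text = String(data: data, encoding: encoding) {
            return text
        }
    }
    return nil
}
```

A possible use is `let html = decodeHTML(try Data(contentsOf: url))`, passing the result to `SwiftSoup.parse` when it is not nil. Put strict encodings such as UTF-8 first, because permissive ones can "succeed" on almost any bytes and produce unreadable text.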

Thank you! I’ll try.
And do you have an idea for the second issue?

Have a look at "Use selector syntax to find elements":
https://github.com/scinfu/SwiftSoup#use-selector-syntax-to-find-elements

meta[property~=:title] uses a regex and might help you. Keep in mind that it matches every title, such as:

<meta property="og:title" content="YTN">
<meta property="twitter:title" content="YTN">
<meta property="foo:title" content="YTN">
<meta property="bar:title" content="YTN">

Or you can match only twitter and og with meta[property~=twitter:title|og:title].
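
Note that the twitter:title tag in the question uses the name attribute rather than property, so a selector on property alone would miss it. A minimal sketch that covers both attributes, assuming SwiftSoup's `[attr$=value]` suffix selector and using a hypothetical helper name `socialTitle(fromHTML:)`:

```swift
import SwiftSoup

/// Returns the content of the first meta tag whose `property` or `name`
/// attribute ends with ":title", or nil when no such tag has content.
func socialTitle(fromHTML html: String) throws -> String? {
    let doc = try SwiftSoup.parse(html)
    // The comma combines two selectors; either attribute may carry the title key.
    let metas = try doc.select("meta[property$=:title], meta[name$=:title]")
    for meta in metas.array() {
        let content = try meta.attr("content")
        if !content.isEmpty {
            return content
        }
    }
    return nil
}
```

The suffix match avoids regex escaping and still picks up og:title, twitter:title, and any other key ending in ":title".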

Thanks. Using selector syntax solved that problem, but I still have a problem with the URLs.
With utf16, I get the HTML, but with unreadable characters.

Try with a GET
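
One way to read this suggestion is to fetch the page with URLSession and decode the bytes using the charset the server declares. This is a sketch under assumptions, not a confirmed fix for these sites: `encoding(forCharsetLabel:)` and `fetchHTML(from:completion:)` are hypothetical helpers, the lookup table is deliberately tiny, and the EUC-KR raw value (0x80000000 | 0x0940) should be verified on your platform. Servers that declare the charset only in a meta tag fall back to UTF-8 here.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Maps a charset label from the Content-Type header to a String.Encoding.
/// Hypothetical, incomplete table; extend it for the sites you need.
func encoding(forCharsetLabel label: String) -> String.Encoding? {
    switch label.lowercased() {
    case "utf-8", "utf8": return .utf8
    case "utf-16": return .utf16
    case "euc-kr": return String.Encoding(rawValue: 0x8000_0940) // assumed value
    default: return nil
    }
}

/// Performs a GET request and decodes the body with the declared charset,
/// falling back to UTF-8 when the server declares none or an unknown one.
func fetchHTML(from url: URL, completion: @escaping (String?) -> Void) {
    var request = URLRequest(url: url)
    request.httpMethod = "GET"
    URLSession.shared.dataTask(with: request) { data, response, _ in
        guard let data = data else {
            completion(nil)
            return
        }
        // textEncodingName is the charset reported by the response, if any.
        let declared = response?.textEncodingName.flatMap { encoding(forCharsetLabel: $0) }
        completion(String(data: data, encoding: declared ?? .utf8))
    }.resume()
}
```

The completion handler runs on a background queue, so hop to the main queue before touching UI.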

I'm closing this issue; reopen it if you need to.