Issue w/ Encoding while using Fedora

Question

keenan-smith-data opened this issue 2 years ago · 4 comments

I received a warning while trying to scrape text data from various websites.

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
input string '^()(\s)#' cannot be translated to UTF-8, is it valid in
'ANSI_X3.4-1968'?

This second warning appears on every subsequent call after the first:

Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
restarting interrupted promise evaluation

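jacobin_pull <- function(hyperlink) {
  # Introduce the scraper to the host; polite checks robots.txt at this step.
  session <- polite::bow(hyperlink)
  temp <- polite::scrape(session)
  # Extract the paragraph text of the article body as a one-column tibble.
  text_data <-
    temp |>
    rvest::html_element(css = "#post-content") |>
    rvest::html_elements("p") |>
    rvest::html_text2() |>
    dplyr::as_tibble() |>
    dplyr::rename(text = value)
  return(text_data)
}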

jacobin_pull_try <- function(hyperlink) {
  tryCatch(
    expr = {
      message(paste("Trying", hyperlink))
      jacobin_pull(hyperlink)
    },
    error = function(cond) {
      message(paste("This URL has caused an error:", hyperlink))
      message(cond)
    },
    warning = function(cond) {
      message(paste("URL has a warning:", hyperlink))
      message(cond)
    },
    finally = {
      message(paste("Processed URL:", hyperlink))
    }
  )
}
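One thing worth noting about the wrapper above: a `warning` handler in `tryCatch()` unwinds the call as soon as the warning is signalled, so the function returns the handler's value (here `NULL`) instead of the scraped data. This may also be connected to the second warning, since R reports "restarting interrupted promise evaluation" when a promise whose evaluation was interrupted earlier is forced again. Below is a minimal sketch of an alternative that logs warnings and lets the call finish. The names `pull_with_logged_warnings` and `noisy_pull` are hypothetical and not part of the original code; `noisy_pull` is a stand-in for `jacobin_pull()` so the example runs without network access.

```r
# Hypothetical variant of jacobin_pull_try(): log warnings but keep the result.
pull_with_logged_warnings <- function(hyperlink, pull_fun) {
  tryCatch(
    withCallingHandlers(
      expr = {
        message(paste("Trying", hyperlink))
        pull_fun(hyperlink)
      },
      warning = function(cond) {
        message(paste("URL has a warning:", hyperlink))
        message(conditionMessage(cond))
        # Resume evaluation instead of unwinding the call stack.
        invokeRestart("muffleWarning")
      }
    ),
    error = function(cond) {
      message(paste("This URL has caused an error:", hyperlink))
      message(conditionMessage(cond))
      NULL
    },
    finally = message(paste("Processed URL:", hyperlink))
  )
}

# Stand-in for jacobin_pull(): warns, then returns a value (assumption for the demo).
noisy_pull <- function(hyperlink) {
  warning("simulated encoding warning")
  data.frame(text = "paragraph text")
}

result <- pull_with_logged_warnings("https://example.com/article", noisy_pull)
```

With this structure the warning is still reported, but `result` holds the data returned by the pull function rather than `NULL`.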

jacobin_test_link <- "https://jacobin.com/2022/07/we-still-have-to-take-donald-trump-seriously"

jacobin_test_link_2 <- "https://jacobin.com/2022/07/ukraine-russia-war-debt-forgiveness-us-eu"

jacobin_test_3 <- "https://jacobin.com/2022/06/american-exceptionalism-off-the-rails"

jac_test <- jacobin_pull_try(jacobin_test_link)
jac_test_2 <- jacobin_pull_try(jacobin_test_link_2)
jac_test_3 <- jacobin_pull_try(jacobin_test_3)

Sys Environment:

R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Xfce)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
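For context, `ANSI_X3.4-1968` is the glibc name for the US-ASCII character set, which is what the C/POSIX locale reports, while the locale listed above is `en_US.UTF-8`. A small diagnostic sketch like the following, using only base R, could help confirm what the running R process actually sees; the name `locale_report` is made up for this example, and the check is only a starting point, not a diagnosis of the warning.

```r
# Collect the locale and encoding settings visible to the current R process.
locale_report <- list(
  lc_ctype = Sys.getlocale("LC_CTYPE"),
  env      = Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE")),
  is_utf8  = isTRUE(l10n_info()[["UTF-8"]])
)

# Print a compact overview; is_utf8 should be TRUE in a UTF-8 session.
str(locale_report)
```

Running the same lines from the terminal and from the IDE (if one is used) would show whether both start R with the same locale.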

I tried to look into the source code to discover the issue but it's outside of my current understanding.

rvest::read_html() does not trigger the same warning.

EDIT: Forgot to mention: I ran the same code on Windows and did not have the same issue.

Answer 1 · 2022-08-02T11:20:32.000Z

I believe this is a {robotstxt} issue. @petermeissner, can you please look into it?

Answer 2 · 2022-08-02T14:12:52.000Z

Hmmm, sounds like a robotstxt issue or even something deeper … did you try running it on another Linux distribution (Debian/Ubuntu)?

Answer 3 · 2022-08-02T14:14:14.000Z

I will look into it

Answer 4 · 2022-08-02T17:09:05.000Z

I did not try any Debian-based distro; I have only tried Windows and Fedora 36. I got it to work once on Fedora, but I couldn't reproduce that. I tried changing the locale settings on Fedora, with no effect. When I switched the scraping to read_html(), I got no encoding warning.

I tried other sites and had the same issue. I have a series of functions like the one written above, and each one produced the same series of warnings.
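For reference, the read_html() workaround mentioned above might look roughly like this. It is a sketch, not a recommended fix: the function name `jacobin_pull_direct` is hypothetical, and calling `rvest::read_html()` directly skips the robots.txt check and rate limiting that polite provides.

```r
# Hypothetical variant of jacobin_pull() that parses the page directly.
# Assumption: the article body lives in the element with id "post-content".
jacobin_pull_direct <- function(hyperlink) {
  page <- rvest::read_html(hyperlink)
  paragraphs <-
    page |>
    rvest::html_element(css = "#post-content") |>
    rvest::html_elements("p") |>
    rvest::html_text2()
  # Return a one-column tibble, matching the shape of the original function.
  dplyr::tibble(text = paragraphs)
}
```

It is called the same way as the original, for example `jacobin_pull_direct(jacobin_test_link)`.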