Issue w/ Encoding while using Fedora
keenan-smith-data opened this issue · 4 comments
Received a warning while trying to scrape text data from various websites.
Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
input string '^()(\s)#' cannot be translated to UTF-8, is it valid in
'ANSI_X3.4-1968'?
This warning then appears on every subsequent call after the first:
Warning in rt_request_handler(request = request, on_redirect = on_redirect, :
restarting interrupted promise evaluation
jacobin_pull <- function(hyperlink) {
  # Introduce the scraper to the host (this is where robots.txt is checked)
  session <- polite::bow(hyperlink)
  temp <- polite::scrape(session)
  # Collect the paragraph text inside the article body
  text_data <-
    temp |>
    rvest::html_element(css = "#post-content") |>
    rvest::html_nodes("p") |>
    rvest::html_text2() |>
    dplyr::as_tibble() |>
    dplyr::rename(text = value)
  return(text_data)
}
jacobin_pull_try <- function(hyperlink) {
  tryCatch(
    expr = {
      message(paste("Trying", hyperlink))
      jacobin_pull(hyperlink)
    },
    error = function(cond) {
      message(paste("This URL has caused an error:", hyperlink))
      message(cond)
    },
    warning = function(cond) {
      message(paste("URL has a warning:", hyperlink))
      message(cond)
    },
    finally = {
      message(paste("Processed URL:", hyperlink))
    }
  )
}
jacobin_test_link <- "https://jacobin.com/2022/07/we-still-have-to-take-donald-trump-seriously"
jacobin_test_link_2 <- "https://jacobin.com/2022/07/ukraine-russia-war-debt-forgiveness-us-eu"
jacobin_test_3 <- "https://jacobin.com/2022/06/american-exceptionalism-off-the-rails"
jac_test <- jacobin_pull_try(jacobin_test_link)
jac_test_2 <- jacobin_pull_try(jacobin_test_link_2)
jac_test_3 <- jacobin_pull_try(jacobin_test_3)
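One detail of jacobin_pull_try above is worth illustrating: a warning handler inside tryCatch() unwinds the call as soon as the first warning is signalled, so the scraped tibble is never returned. Interrupting an evaluation part-way in this manner may also be what leads to the later "restarting interrupted promise evaluation" warnings, although that is not confirmed in this thread. The sketch below uses a hypothetical stand-in, noisy_pull(), rather than the real scraper, and contrasts tryCatch() with withCallingHandlers(), which logs the warning and then resumes.

```r
# Hypothetical stand-in for a scraper that warns but still produces data.
noisy_pull <- function(url) {
  warning("input string cannot be translated to UTF-8")
  data.frame(text = paste("content from", url))
}

# tryCatch() exits on the first warning, so the data frame is lost and the
# handler's value (NULL here) is returned instead.
res_trycatch <- tryCatch(
  noisy_pull("https://example.com"),
  warning = function(cond) {
    message("URL has a warning: ", conditionMessage(cond))
    NULL
  }
)

# withCallingHandlers() runs the handler without unwinding; the
# "muffleWarning" restart then lets noisy_pull() continue to its result.
pull_keep_result <- function(url) {
  withCallingHandlers(
    noisy_pull(url),
    warning = function(cond) {
      message("URL has a warning: ", conditionMessage(cond))
      invokeRestart("muffleWarning")
    }
  )
}
res_calling <- pull_keep_result("https://example.com")
```

With this pattern the warning is still reported, but the caller receives the data instead of NULL.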
Sys Environment:
R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 36 (Xfce)
Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
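The warning names 'ANSI_X3.4-1968', which is the glibc charmap name for plain ASCII, while the session info above reports en_US.UTF-8 for LC_CTYPE. A small diagnostic like the following, run in the same R session that produces the warning, could help show whether R really considers itself to be in a UTF-8 locale at that point. The function name locale_report() is made up for this sketch; it only uses base R calls.

```r
# Gather the locale facts R itself reports, for pasting into a bug report.
locale_report <- function() {
  list(
    lc_ctype   = Sys.getlocale("LC_CTYPE"),           # active character-type locale
    lang_env   = Sys.getenv("LANG", unset = NA),      # environment seen by R
    lc_all_env = Sys.getenv("LC_ALL", unset = NA),    # overrides LANG when set
    is_utf8    = isTRUE(l10n_info()[["UTF-8"]])       # TRUE in a UTF-8 locale
  )
}

info <- locale_report()
str(info)
```

If is_utf8 is FALSE, or lc_ctype is "C" or "POSIX", while the shell reports en_US.UTF-8, the locale is probably being lost somewhere between the desktop session and the R process.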
I tried to look into the source code to discover the issue but it's outside of my current understanding.
rvest::read_html() does not trigger the same warning.
EDIT: Forgot to mention that I ran the same code on Windows and did not have the same issue.
I believe this is a {robotstxt} issue. @petermeissner, can you please look into it?
Hmmm, sounds like a robotstxt issue or even something deeper … did you try running it on another Linux distribution (Debian/Ubuntu)?
I will look into it
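To check whether the warning really originates in {robotstxt} rather than in {polite} or {rvest}, one option is to call robotstxt directly and record any warnings it raises. This is only a sketch: collect_warnings() and robotstxt_warnings() are hypothetical helper names, and it assumes that robotstxt::get_robotstxt(domain = ...) is the call polite ends up making, which has not been verified here.

```r
# Evaluate an expression, returning its value together with the text of
# every warning it signalled along the way.
collect_warnings <- function(expr) {
  seen <- character()
  value <- withCallingHandlers(
    expr,
    warning = function(cond) {
      seen <<- c(seen, conditionMessage(cond))
      invokeRestart("muffleWarning")  # resume after recording the warning
    }
  )
  list(value = value, warnings = seen)
}

# Assumption: fetching robots.txt directly exercises the same code path
# that polite::bow() triggers. Requires network access when called.
robotstxt_warnings <- function(domain) {
  collect_warnings(robotstxt::get_robotstxt(domain = domain))$warnings
}
```

Calling robotstxt_warnings("jacobin.com") on the affected Fedora machine and on Windows would show whether the UTF-8 translation warning appears without polite or rvest being involved.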
I did not try any Debian-based distro, only Windows and Fedora 36. I got it to work once on Fedora, but I couldn't reproduce that. I tried changing the locale settings on Fedora, with no effect. I switched the scraping to read_html() and had no encoding issue.
I tried other sites and had the same issue. I have a series of functions like the one written above, and each one produced the same series of warnings.
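For reference, the read_html() workaround mentioned above might look roughly like this. The names extract_post_text() and jacobin_pull_direct() are invented for the sketch, and html_elements() is used as the current rvest name for html_nodes(). Note that this bypasses polite entirely, so the robots.txt check and rate limiting it provides are skipped.

```r
# Pull paragraph text out of an already-parsed page.
extract_post_text <- function(page) {
  paragraphs <- page |>
    rvest::html_element(css = "#post-content") |>
    rvest::html_elements("p") |>
    rvest::html_text2()
  dplyr::tibble(text = paragraphs)
}

# Fetch and parse the page directly, without polite::bow() / polite::scrape().
jacobin_pull_direct <- function(hyperlink) {
  extract_post_text(rvest::read_html(hyperlink))
}

# Offline check of the parsing step on a tiny hand-written document.
demo_page <- rvest::minimal_html(
  '<div id="post-content"><p>One</p><p>Two</p></div>'
)
demo_text <- extract_post_text(demo_page)
```

Separating the parsing step from the download makes it possible to test the selector logic without touching the network.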