Selector doesn't work with newline after
David-OConnor opened this issue · 8 comments
document.select with Selector::parse is not working when there's a newline directly after the tag.
Code:
let a_sel = scraper::Selector::parse("a").unwrap();
for el in document.select(&a_sel) {
    //...
}
HTML example that triggers this:
<a
href="...")"
When printing these affected elements:
Element(<a\n href="\\\"/...
Other elements in the query that are of the form Element(<a href="\\\"/...
don't trigger this problem. Happy for a workaround in the meanwhile.
I can't get that html to even parse. Are you sure that's what you used to trigger the issue?
That's a minimal example. I don't know that that's the issue, but it appears to be what separates the tags it finds from the ones it ignores.
Example link it finds:
<a href="https://github.com">
Example link it doesn't find:
<a
href="https://github.com">
That seems to work.
main.rs:
fn main() {
    let html = r#"<a
href="https://github.com">"#;
    println!("Raw HTML: {:?}", html);
    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}
Output:
Raw HTML: "<a\nhref=\"https://github.com\">"
<a href="https://github.com"></a>
Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with.
Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:
It will correctly pull the links at the header and footer of the page, but none of the articles linked in the middle will show up using the 'a' selector.
I can't reproduce that.
main.rs:
fn main() {
    let url = "https://www.anyleaf.org/blog";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();
    println!("Raw HTML: {:?}", html);
    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}
Cargo.toml:
[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"
[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"
Output:
Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"utf-8\">\n <meta name=\"viewport\" content=\"width=device-width\">\n\n <sc
ript type=\"module\">\n document.documentElement.classList.remove('no-js');\n document.documentElement.classList.add('js');\n </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n <meta property=\"og:locale\" content=\"en_US\">\n <meta property=\"og:type\" content=\"website\">\n <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n <meta property=\"og:url\" content=\"https://www.anyleaf.org\">\n\n \n <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n \n \n <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n \n <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
\n <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n <meta property=\"og:title\" content=\
"\">\n <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n <div id=\"menu\">\n <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
<a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
<a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n <a class=\"menu-item\" href=\"mailto:anyleaf@anyleaf.org\"><h3 class=\"m
enu-header\">Contact</h3></a>\n </div>\n</div>\n\n\n\n\n <div class=\"home-body\">\n <div style=\"text-align: center;\">\n <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n </div>\n\n <h1>AnyLeaf Blog</h1>\n\n <h2>Misc:</h2>\n <ul>\
n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/filter-design\"\n style=\"font-size
: 1.5em;\"\n >Digital filter design and response\n </a>\n </li>\n </ul>\n\n <h2>Articles:</h2>\n <
ul>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n style=\"font-size: 1.5em\">\n Parts you need for a quadcopter in 2022\n
</a> - Feb. 24, 2022, 7:46 p.m.\n </li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n
href=\"/blog/writing-embedded-firmware-using-rust\"\n style=\"font-size: 1.5em\">\n Writing e
mbedded firmware using Rust\n </a> - Sept. 25, 2021, 5:45 p.m.\n </li>\n \n <li style=\"margin-botto
m: 40px;\">\n <a\n href=\"/blog/measuring-ph-on-raspberry-pi\"\n style=\"font-size: 1.5
em\">\n Measuring pH on Raspberry Pi\n </a> - Feb. 6, 2021, 9:47 a.m.\n </li>\n \n
<li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/the-essence-of-embedded-computers\"\n
style=\"font-size: 1.5em\">\n The essence of embedded computers\n </a> - Sept. 6, 2020, 7:09 p.m.\n
</li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n style=\"font-size: 1.5em\">\n Electrical Conductivity (EC) for Hydroponi
cs\n </a> - Aug. 22, 2020, 4 p.m.\n </li>\n \n <li style=\"margin-bottom: 40px;\">\n
<a\n href=\"/blog/project:-building-an-automatic-ph-doser\"\n style=\"font-size: 1.5em\">\n
Project: Building an automatic pH doser\n </a> - July 21, 2020, 7:33 p.m.\n </li>\n \n
<li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/ph-measurement-for-hydroponics\"\n
style=\"font-size: 1.5em\">\n pH Measurement for Hydroponics\n </a> - July 19, 2020, 3:43 p.m.\n </
li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"/blog/how-to-calibrate-ph-sen
sors\"\n style=\"font-size: 1.5em\">\n How to Calibrate pH Sensors\n </a> - July 17, 2020,
1:23 p.m.\n </li>\n \n <li style=\"margin-bottom: 40px;\">\n <a\n href=\"
/blog/temperature-sensors:-a-comparison\"\n style=\"font-size: 1.5em\">\n Temperature sensors: A comparison\n
</a> - July 15, 2020, 6:42 p.m.\n </li>\n \n </ul>\n </div>\n\n\n\n<div id=\"footer\">\n <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n <div style=\"margin-bottom: 30px\">\n <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n <a class=fineprint href=\"/terms\">Terms and conditions</a>\n </div>\n <div style=\"display: flex; flex-directio
n: column\">\n <h5 class=\"fineprint\">\n All AnyLeaf products comply with the\n <a href=\"https://en.wikipedia.org/wiki/Restrictio
n_of_Hazardous_Substances_Directive\">\n Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="mailto:anyleaf@anyleaf.org" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
</a>
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
Parts you need for a quadcopter in 2022
</a>
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
Writing embedded firmware using Rust
</a>
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
Measuring pH on Raspberry Pi
</a>
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
The essence of embedded computers
</a>
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
Electrical Conductivity (EC) for Hydroponics
</a>
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
Project: Building an automatic pH doser
</a>
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
pH Measurement for Hydroponics
</a>
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
How to Calibrate pH Sensors
</a>
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
Temperature sensors: A comparison
</a>
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="https://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive">
Restriction of Hazardous Substances (RoHS) Directive</a>
Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.
I've also added a test for this (#82) so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.
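If it does remain a problem, one way to look for the disconnect is to compare the number of elements the 'a' selector yields with a rough textual count of `<a` openers in the exact string that was passed to the parser. Below is a minimal sketch using only the standard library. `count_anchor_openers` is a hypothetical helper, not part of scraper, and it does not understand comments or script bodies, so its result is only an approximation.

```rust
/// Hypothetical helper (not part of scraper): crude count of opening `<a` tags.
/// Counts `<a` or `<A` when the next byte is ASCII whitespace, `>` or `/`,
/// so `<a\nhref=...>` is counted but `<abbr>` and `</a>` are not.
/// Assumption: no `<a` text hidden inside comments or <script> bodies.
fn count_anchor_openers(html: &str) -> usize {
    let bytes = html.as_bytes();
    let mut count = 0;
    for i in 0..bytes.len() {
        let is_open = bytes[i] == b'<';
        let is_a = matches!(bytes.get(i + 1).copied(), Some(b'a' | b'A'));
        let ends_name = matches!(
            bytes.get(i + 2).copied(),
            Some(c) if c.is_ascii_whitespace() || c == b'>' || c == b'/'
        );
        if is_open && is_a && ends_name {
            count += 1;
        }
    }
    count
}

fn main() {
    // Same shape as the snippet from this thread, plus a one-line link
    // and a tag that merely starts with "a".
    let html = "<a\nhref=\"https://github.com\">x</a> <a href=\"/\">y</a> <abbr>z</abbr>";
    println!("anchor openers in raw HTML: {}", count_anchor_openers(html));
}
```

If this number is lower than expected, the HTML reaching the parser is probably not the HTML you think it is (for example a different response body). If it roughly matches what `document.select(&a_sel).count()` returns, the selector is finding the multi-line tags.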