
Selector doesn't work with newline after

David-OConnor opened this issue · 8 comments

Selector::parse is not working when there's a newline directly after the tag.


let a_sel = scraper::Selector::parse("a").unwrap();
for el in {

HTML example that triggers this:


When printing these affected elements:

Element(<a\n href="\\\"/...

Other elements in the query that are of the form Element(<a href="\\\"/... don't trigger this problem. Happy for a workaround in the meanwhile.

I can't get that html to even parse. Are you sure that's what you used to trigger the issue?

That's a minimal example. I don't know that's the issue, but that appears to be what's separating tags it finds vs ones it ignores.

Example link it finds:

<a href="">

Example link it doesn't find:


That seems to work.

fn main() {
    // Raw string with a newline directly after the tag name.
    let html = r#"<a
href="">"#;

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}


Raw HTML: "<a\nhref=\"\">"
<a href=""></a>

Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with

Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:

It will correctly pull the links at the header and footer of the page, but none of the articles linked in the middle will show up using the 'a' selector.

I can't reproduce that.

fn main() {
    let url = "";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}


[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"

[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"


Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n\n    <sc
ript type=\"module\">\n        document.documentElement.classList.remove('no-js');\n        document.documentElement.classList.add('js');\n    </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n    <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n    <meta property=\"og:locale\" content=\"en_US\">\n    <meta property=\"og:type\" content=\"website\">\n    <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n    <meta property=\"og:url\" content=\"\">\n\n    \n    <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n    \n    \n    <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n    \n    <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
   \n    <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <meta property=\"og:title\" content=\
"\">\n    <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n    <div id=\"menu\">\n        <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n        <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n        <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n        <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
      <a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n        <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n        <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n        <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n        <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
 <a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n        <a class=\"menu-item\" href=\"\"><h3 class=\"m
enu-header\">Contact</h3></a>\n    </div>\n</div>\n\n\n\n\n    <div class=\"home-body\">\n        <div style=\"text-align: center;\">\n        <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n        </div>\n\n        <h1>AnyLeaf Blog</h1>\n\n        <h2>Misc:</h2>\n        <ul>\
n            <li style=\"margin-bottom: 40px;\">\n                <a\n                        href=\"/filter-design\"\n                        style=\"font-size
: 1.5em;\"\n                >Digital filter design and response\n                </a>\n            </li>\n        </ul>\n\n        <h2>Articles:</h2>\n        <
ul>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n                            style=\"font-size: 1.5em\">\n                        Parts you need for a quadcopter in 2022\n
  </a> - Feb. 24, 2022, 7:46 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n
                 href=\"/blog/writing-embedded-firmware-using-rust\"\n                            style=\"font-size: 1.5em\">\n                        Writing e
mbedded firmware using Rust\n                    </a> - Sept. 25, 2021, 5:45 p.m.\n                </li>\n            \n                <li style=\"margin-botto
m: 40px;\">\n                    <a\n                            href=\"/blog/measuring-ph-on-raspberry-pi\"\n                            style=\"font-size: 1.5
em\">\n                        Measuring pH on Raspberry Pi\n                    </a> - Feb. 6, 2021, 9:47 a.m.\n                </li>\n            \n
      <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/the-essence-of-embedded-computers\"\n
             style=\"font-size: 1.5em\">\n                        The essence of embedded computers\n                    </a> - Sept. 6, 2020, 7:09 p.m.\n
          </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n                            style=\"font-size: 1.5em\">\n                        Electrical Conductivity (EC) for Hydroponi
cs\n                    </a> - Aug. 22, 2020, 4 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n
    <a\n                            href=\"/blog/project:-building-an-automatic-ph-doser\"\n                            style=\"font-size: 1.5em\">\n
             Project: Building an automatic pH doser\n                    </a> - July 21, 2020, 7:33 p.m.\n                </li>\n            \n
<li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/ph-measurement-for-hydroponics\"\n
    style=\"font-size: 1.5em\">\n                        pH Measurement for Hydroponics\n                    </a> - July 19, 2020, 3:43 p.m.\n                </
li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/how-to-calibrate-ph-sen
sors\"\n                            style=\"font-size: 1.5em\">\n                        How to Calibrate pH Sensors\n                    </a> - July 17, 2020,
1:23 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"
/blog/temperature-sensors:-a-comparison\"\n                            style=\"font-size: 1.5em\">\n                        Temperature sensors: A comparison\n
                   </a> - July 15, 2020, 6:42 p.m.\n                </li>\n            \n        </ul>\n    </div>\n\n\n\n<div id=\"footer\">\n    <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n    <div style=\"margin-bottom: 30px\">\n        <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n        <a class=fineprint href=\"/terms\">Terms and conditions</a>\n    </div>\n    <div style=\"display: flex; flex-directio
n: column\">\n        <h5 class=\"fineprint\">\n            All AnyLeaf products comply with the\n            <a href=\"
n_of_Hazardous_Substances_Directive\">\n                Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n        <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n    </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
                        Parts you need for a quadcopter in 2022
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
                        Writing embedded firmware using Rust
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
                        Measuring pH on Raspberry Pi
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
                        The essence of embedded computers
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
                        Electrical Conductivity (EC) for Hydroponics
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
                        Project: Building an automatic pH doser
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
                        pH Measurement for Hydroponics
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
                        How to Calibrate pH Sensors
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
                        Temperature sensors: A comparison
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="">
                Restriction of Hazardous Substances (RoHS) Directive</a>

Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.

I've also added a test for this (#82) so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.