rust-scraper/scraper

Selector doesn't work with newline after

David-OConnor opened this issue · 8 comments

document.select with Selector::parse does not match elements when there's a newline directly after the tag name.

Code:

let a_sel = scraper::Selector::parse("a").unwrap();
for el in document.select(&a_sel) {
    //...
}

HTML example that triggers this:

<a
                            href="...")"

When printing these affected elements:

Element(<a\n href="\\\"/...

Other elements in the query that are of the form Element(<a href="\\\"/... don't trigger this problem. Happy to use a workaround in the meantime.

I can't get that html to even parse. Are you sure that's what you used to trigger the issue?

That's a minimal example. I'm not certain that's the cause, but it appears to be what separates the tags it finds from the ones it ignores.

Example link it finds:

<a href="https://github.com">

Example link it doesn't find:

<a
    href="https://github.com">

That seems to work.

main.rs:

fn main() {
    let html = r#"<a
href="https://github.com">"#;

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Output:

Raw HTML: "<a\nhref=\"https://github.com\">"
<a href="https://github.com"></a>

Hmm. I'll dig deeper and report back; that's equivalent to the code I'm having trouble with.

Hi - Sorry about the late reply. I have tried several troubleshooting approaches, and have not been able to narrow this down. I can provide this case to reproduce it:

https://www.anyleaf.org/blog

It will correctly pull the links at the header and footer of the page, but none of the articles linked in the middle will show up using the 'a' selector.

I can't reproduce that.

main.rs:

fn main() {
    let url = "https://www.anyleaf.org/blog";
    let html = ureq::get(url).call().unwrap().into_string().unwrap();

    println!("Raw HTML: {:?}", html);

    let document = scraper::Html::parse_document(&html);
    let a_sel = scraper::Selector::parse("a").unwrap();
    for el in document.select(&a_sel) {
        println!("{}", el.html());
    }
}

Cargo.toml:

[package]
name = "scraper-issue-76"
version = "0.0.0"
edition = "2021"

[dependencies]
scraper = "0.13.0"
ureq = "2.4.0"

Output:

Raw HTML: "\n\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"utf-8\">\n    <meta name=\"viewport\" content=\"width=device-width\">\n\n    <sc
ript type=\"module\">\n        document.documentElement.classList.remove('no-js');\n        document.documentElement.classList.add('js');\n    </script>\n\n
<link rel=\"stylesheet\" href=\"/static/style.css\">\n\n\n    <meta name=\"description\" content=\"Sensors and measurement for science, hydroponics, and aquariu
ms\">\n    <meta property=\"og:locale\" content=\"en_US\">\n    <meta property=\"og:type\" content=\"website\">\n    <meta name=\"twitter:card\" content=\"summa
ry_large_image\">\n    <meta property=\"og:url\" content=\"https://www.anyleaf.org\">\n\n    \n    <link rel=\"shortcut icon\" type=\"image/png\" href=\"/static
/favicon.png\"/>\n\n    \n    \n    <link rel=\"apple-touch-icon\" href=\"/static/favicon.png\">\n    \n    <meta name=\"theme-color\" content=\"#a2c8a9\">\n\n
   \n    <meta name=\"description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <meta property=\"og:title\" content=\
"\">\n    <meta property=\"og:description\" content=\"AnyLeaf Articles: On sensors, measurements, and embedded computing\">\n\n    <title>AnyLeaf sensors: Artic
les</title>\n\n\n</head>\n<body>\n\n<div id=\"top-bar\">\n    <div id=\"menu\">\n        <a href=\"/\" class=\"menu-item\"><h3 class=\"menu-header\">Home</h3></
a>\n        <a href=\"/mercury-g4\" class=\"menu-item\"><h3 class=\"menu-header\">Quad FC</h3></a>\n        <a href=\"/stove-thermometer\" class=\"menu-item\"><
h3 class=\"menu-header\">Stove Thermometer</h3></a>\n        <a href=\"/water-monitor\" class=\"menu-item\"><h3 class=\"menu-header\">Water Monitor</h3></a>\n
      <a href=\"/ph-module\" class=\"menu-item\"><h3 class=\"menu-header\">pH</h3></a>\n        <a href=\"/ec-module\" class=\"menu-item\"><h3 class=\"menu-head
er\">Conductivity</h3></a>\n        <a href=\"/temp-module\" class=\"menu-item\"><h3 class=\"menu-header\">Temperature</h3></a>\n        <a class=\"menu-item\"
href=\"/about\"><h3 class=\"menu-header\">About</h3></a>\n        <a class=\"menu-item\" href=\"/checkout\"><h3 class=\"menu-header\">Checkout</h3></a>\n
 <a class=\"menu-item\" href=\"/blog\"><h3 class=\"menu-header\">Blog</h3></a>\n        <a class=\"menu-item\" href=\"mailto:anyleaf@anyleaf.org\"><h3 class=\"m
enu-header\">Contact</h3></a>\n    </div>\n</div>\n\n\n\n\n    <div class=\"home-body\">\n        <div style=\"text-align: center;\">\n        <img src=\"/stati
c/logo.png\" style = \"margin-top: 40px\" width=300 alt=\"AnyLeaf\" />\n        </div>\n\n        <h1>AnyLeaf Blog</h1>\n\n        <h2>Misc:</h2>\n        <ul>\
n            <li style=\"margin-bottom: 40px;\">\n                <a\n                        href=\"/filter-design\"\n                        style=\"font-size
: 1.5em;\"\n                >Digital filter design and response\n                </a>\n            </li>\n        </ul>\n\n        <h2>Articles:</h2>\n        <
ul>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/parts-you-need-for-a-qu
adcopter-in-2022\"\n                            style=\"font-size: 1.5em\">\n                        Parts you need for a quadcopter in 2022\n
  </a> - Feb. 24, 2022, 7:46 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n
                 href=\"/blog/writing-embedded-firmware-using-rust\"\n                            style=\"font-size: 1.5em\">\n                        Writing e
mbedded firmware using Rust\n                    </a> - Sept. 25, 2021, 5:45 p.m.\n                </li>\n            \n                <li style=\"margin-botto
m: 40px;\">\n                    <a\n                            href=\"/blog/measuring-ph-on-raspberry-pi\"\n                            style=\"font-size: 1.5
em\">\n                        Measuring pH on Raspberry Pi\n                    </a> - Feb. 6, 2021, 9:47 a.m.\n                </li>\n            \n
      <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/the-essence-of-embedded-computers\"\n
             style=\"font-size: 1.5em\">\n                        The essence of embedded computers\n                    </a> - Sept. 6, 2020, 7:09 p.m.\n
          </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/electrical-
conductivity-(ec)-for-hydroponics\"\n                            style=\"font-size: 1.5em\">\n                        Electrical Conductivity (EC) for Hydroponi
cs\n                    </a> - Aug. 22, 2020, 4 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n
    <a\n                            href=\"/blog/project:-building-an-automatic-ph-doser\"\n                            style=\"font-size: 1.5em\">\n
             Project: Building an automatic pH doser\n                    </a> - July 21, 2020, 7:33 p.m.\n                </li>\n            \n
<li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/ph-measurement-for-hydroponics\"\n
    style=\"font-size: 1.5em\">\n                        pH Measurement for Hydroponics\n                    </a> - July 19, 2020, 3:43 p.m.\n                </
li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"/blog/how-to-calibrate-ph-sen
sors\"\n                            style=\"font-size: 1.5em\">\n                        How to Calibrate pH Sensors\n                    </a> - July 17, 2020,
1:23 p.m.\n                </li>\n            \n                <li style=\"margin-bottom: 40px;\">\n                    <a\n                            href=\"
/blog/temperature-sensors:-a-comparison\"\n                            style=\"font-size: 1.5em\">\n                        Temperature sensors: A comparison\n
                   </a> - July 15, 2020, 6:42 p.m.\n                </li>\n            \n        </ul>\n    </div>\n\n\n\n<div id=\"footer\">\n    <h4 style=\"m
argin-top: 30px\">Assembled in Raleigh, NC, USA.</h4>\n    <div style=\"margin-bottom: 30px\">\n        <a class=\"fineprint\" style=\"margin-right: 20px\" href
=\"/privacy\">Privacy policy</a>\n        <a class=fineprint href=\"/terms\">Terms and conditions</a>\n    </div>\n    <div style=\"display: flex; flex-directio
n: column\">\n        <h5 class=\"fineprint\">\n            All AnyLeaf products comply with the\n            <a href=\"https://en.wikipedia.org/wiki/Restrictio
n_of_Hazardous_Substances_Directive\">\n                Restriction of Hazardous Substances (RoHS) Directive</a>.</h5>\n        <h5 class=\"fineprint\">© 2022 A
nyLeaf</h5>\n    </div>\n</div>\n\n\n<script src=\"/static/js/main.js\"></script>\n<script src=\"/static/js/cart.js\"></script>\n\n</body>\n</html>"
<a href="/" class="menu-item"><h3 class="menu-header">Home</h3></a>
<a class="menu-item" href="/mercury-g4"><h3 class="menu-header">Quad FC</h3></a>
<a class="menu-item" href="/stove-thermometer"><h3 class="menu-header">Stove Thermometer</h3></a>
<a class="menu-item" href="/water-monitor"><h3 class="menu-header">Water Monitor</h3></a>
<a class="menu-item" href="/ph-module"><h3 class="menu-header">pH</h3></a>
<a class="menu-item" href="/ec-module"><h3 class="menu-header">Conductivity</h3></a>
<a href="/temp-module" class="menu-item"><h3 class="menu-header">Temperature</h3></a>
<a class="menu-item" href="/about"><h3 class="menu-header">About</h3></a>
<a class="menu-item" href="/checkout"><h3 class="menu-header">Checkout</h3></a>
<a class="menu-item" href="/blog"><h3 class="menu-header">Blog</h3></a>
<a href="mailto:anyleaf@anyleaf.org" class="menu-item"><h3 class="menu-header">Contact</h3></a>
<a style="font-size: 1.5em;" href="/filter-design">Digital filter design and response
                </a>
<a href="/blog/parts-you-need-for-a-quadcopter-in-2022" style="font-size: 1.5em">
                        Parts you need for a quadcopter in 2022
                    </a>
<a href="/blog/writing-embedded-firmware-using-rust" style="font-size: 1.5em">
                        Writing embedded firmware using Rust
                    </a>
<a href="/blog/measuring-ph-on-raspberry-pi" style="font-size: 1.5em">
                        Measuring pH on Raspberry Pi
                    </a>
<a href="/blog/the-essence-of-embedded-computers" style="font-size: 1.5em">
                        The essence of embedded computers
                    </a>
<a href="/blog/electrical-conductivity-(ec)-for-hydroponics" style="font-size: 1.5em">
                        Electrical Conductivity (EC) for Hydroponics
                    </a>
<a href="/blog/project:-building-an-automatic-ph-doser" style="font-size: 1.5em">
                        Project: Building an automatic pH doser
                    </a>
<a style="font-size: 1.5em" href="/blog/ph-measurement-for-hydroponics">
                        pH Measurement for Hydroponics
                    </a>
<a href="/blog/how-to-calibrate-ph-sensors" style="font-size: 1.5em">
                        How to Calibrate pH Sensors
                    </a>
<a href="/blog/temperature-sensors:-a-comparison" style="font-size: 1.5em">
                        Temperature sensors: A comparison
                    </a>
<a class="fineprint" style="margin-right: 20px" href="/privacy">Privacy policy</a>
<a class="fineprint" href="/terms">Terms and conditions</a>
<a href="https://en.wikipedia.org/wiki/Restriction_of_Hazardous_Substances_Directive">
                Restriction of Hazardous Substances (RoHS) Directive</a>

Thanks for looking! Not sure what's up. I'll work between your code and mine and see where the disconnect is.
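
One way to look for the disconnect is to compare a rough count of <a start tags in the raw text against the number of elements the selector returns. The following is a minimal sketch using only the standard library; count_anchor_start_tags is a hypothetical helper, not part of scraper, and it is a plain textual count (it would also count tags inside comments or scripts), so treat it as a sanity check rather than a parser.

```rust
/// Roughly counts `<a` start tags in raw HTML text.
/// Any ASCII whitespace (including a newline), `>` or `/` directly after
/// the tag name is accepted, so `<a\n    href=...>` is counted as well.
/// Hypothetical debugging helper; not part of the scraper crate.
fn count_anchor_start_tags(html: &str) -> usize {
    html.as_bytes()
        .windows(3)
        .filter(|w| {
            w[0] == b'<'
                && w[1].eq_ignore_ascii_case(&b'a')
                && (w[2].is_ascii_whitespace() || w[2] == b'>' || w[2] == b'/')
        })
        .count()
}

fn main() {
    // Two anchors (one with a newline after the tag name) and one <abbr>,
    // which must not be counted.
    let html = "<a href=\"/\">Home</a>\n<a\n    href=\"/blog\">Blog</a>\n<abbr>x</abbr>";
    println!("raw <a> start tags: {}", count_anchor_start_tags(html));
}
```

If this number is close to document.select(&a_sel).count() for the same string, the selector is probably fine and the difference is more likely in how the HTML is fetched or passed to the parser.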

I've also added a test for this (#82) so I'm reasonably confident it's not a bug. Please do let us know if this remains a problem.