XPath queries to only include certain parts of a document

Question

XPath queries to only include certain parts of a document

Closed this issue 4 years ago · 9 comments

Currently, any change change on the website that is "bigger" than n characters is considered a change. However, this might not be useful for all website. For instance, some websites show a timestamp or the last page's rendering time in the footer, which continuously changes.

Instead, more advanced mechanisms for change detection are desirable. For instance, one might be able to specify XPath queries to define, which sub-trees to inspect in HTML pages. For non-HTML pages, one could think of applying Regex matching to explicitly include or exclude certain parts of the document for change detection.

Answer 1 · 2019-10-06T08:38:41.000Z

Maybe a sort of bias?
Like certain parts will have a very low threshold and others a very high.

Answer 2 · 2019-10-06T09:14:43.000Z

That's even more advanced. I'd be fine with strict inclusion / exclusion using the XPath / Regex thing mentioned above.

Answer 3 · 2019-10-06T09:19:44.000Z

Can I still present my example?
I mean is it valid?

Answer 4 · 2019-10-06T09:37:13.000Z

What kind of parts of the HTML would you like to prioritize?

Answer 5 · 2019-10-06T22:23:58.000Z

What kind of parts of the HTML would you like to prioritize?

The user should be able to specify an XPath query for which part to include. Or multiple queries for those parts to exclude.

This is definitely going to be a bigger change!

Answer 6 · 2019-10-07T05:09:18.000Z

Mmmmm ok boss!
I'll do my best!

Answer 7 · 2019-10-07T06:48:03.000Z

Take my website as an example. Say I want to use this watcher script to be notified about new blog posts, while I don't care about any changes on the rest of the website.

I could then specify an XPath query like //ul[@class="post-list"] via an additional command-line flag like --include-xpath. This query would could the watcher to parse the HTML DOM and keep only the "blue area" from the screenshot to track changes.

Since XPath is defined for XML in general, this would not only work for HTML pages, but also for RSS- or Atom feeds and more.

So far so clear? What do you think?

💡 Tip: In Firefox (and I guess Chrome as well) you can use the Developer Tools console to run XPath queries against a website by just executing $x('<your-query-here>').

Answer 8 · 2019-10-07T06:51:18.000Z

Python's ElementTree XML API might be useful to parse and query XML.

Answer 9 · 2019-10-07T16:42:35.000Z

It's an interesting way.
I was thinking about RegEx but now that you mention it...