muety/website-watcher

XPath queries to only include certain parts of a document

Closed this issue · 9 comments

muety commented

Currently, any change change on the website that is "bigger" than n characters is considered a change. However, this might not be useful for all website. For instance, some websites show a timestamp or the last page's rendering time in the footer, which continuously changes.

Instead, more advanced mechanisms for change detection are desirable. For instance, one might be able to specify XPath queries to define, which sub-trees to inspect in HTML pages. For non-HTML pages, one could think of applying Regex matching to explicitly include or exclude certain parts of the document for change detection.

Maybe a sort of bias?
Like certain parts will have a very low threshold and others a very high.

muety commented

That's even more advanced. I'd be fine with strict inclusion / exclusion using the XPath / Regex thing mentioned above.

Can I still present my example?
I mean is it valid?

What kind of parts of the HTML would you like to prioritize?

muety commented

What kind of parts of the HTML would you like to prioritize?

The user should be able to specify an XPath query for which part to include. Or multiple queries for those parts to exclude.

This is definitely going to be a bigger change!

Mmmmm ok boss!
I'll do my best!

muety commented

Take my website as an example. Say I want to use this watcher script to be notified about new blog posts, while I don't care about any changes on the rest of the website.

Screenshot from 2019-10-07 08-37-16

I could then specify an XPath query like //ul[@class="post-list"] via an additional command-line flag like --include-xpath. This query would could the watcher to parse the HTML DOM and keep only the "blue area" from the screenshot to track changes.

Since XPath is defined for XML in general, this would not only work for HTML pages, but also for RSS- or Atom feeds and more.

So far so clear? What do you think?

💡 Tip: In Firefox (and I guess Chrome as well) you can use the Developer Tools console to run XPath queries against a website by just executing $x('<your-query-here>').

muety commented

Python's ElementTree XML API might be useful to parse and query XML.

It's an interesting way.
I was thinking about RegEx but now that you mention it...