Domsi is a powerful web scraping library that allows you to query HTML elements based on DOM hierarchy, element attributes, and CSS styles. Works across *all* automated browsers, so long as they allow execution of arbitrary JavaScript. That includes non-Node.JS browsers such as Selenium too!
Run npm install domsi
.
Download /build/index.source.js and put that into your source code as a string. Evaluate the string in your automated browser before running Domsi queries.
You can use Domsi by importing initDomsi
from the domsi
module. In the following example, we’re using the puppeteer
library to interface with the browser.
const { initDomsi } = require('domsi');
const puppeteer = require('puppeteer');
(async () => {
// Load page with Puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.google.com/');
// Initialize domsi in the browser page
await page.evaluate(initDomsi());
// Query the selector in the browser page
// and return results
const results = await page.evaluate(() => {
return domsi.findAll({ tagName: 'img' }).map((result) => result.node.getAttribute('src'));
});
console.log(results);
// Close Puppeteer
await browser.close();
})();
If you know the web page you’re scraping has an existing variable named domsi
, then you can pass in a variable name in initDomsi
to initialize it under a different variable name.
await page.evaluate(initDomsi('domsiAlias'));
await page.evaluate(() => {
return domsiAlias.findAll({ tagName: 'img' }).map((result) => result.node.getAttribute('src'));
});
If you need to run a Domsi query without polluting the global environment, you can run it in an anonymous function. In this case, you no longer need to run the initDomsi
function.
const { runDomsiAnonymously } = require('domsi');
...
await page.evaluate(runDomsiAnonymously((domsi) => {
return domsi.findAll({ tagName: 'img' }).map((result) => result.node.getAttribute('src'));
}));
When called like this, the Domsi library will be initialized in an anonymous function and the query is executed there. Each call will have to re-initialize the Domsi library, so if performance is critical, you are advised to use initDomsi
.
Load the Domsi source file in your project and evaluate it in your automated browser before running Domsi queries. Here is an example in Python 3.
from selenium import webdriver
domsi_src = '''Import Domsi source here'''
driver = webdriver.Chrome()
driver.get("https://www.google.com/")
driver.execute_script(domsi_src + 'window.domsi=domsi;')
results = driver.execute('''
return domsi.findAll({ tagName: 'img' })
.map((result) => result.node.getAttribute('src'));
''')
print(results)
driver.close()
If I have the time, I will make a Python wrapper for initDomsi
and runDomsiAnonymously
too.
domsi.find(domsiSelector, [element])
Returns a DomsiObject
that represents the first element matched. If element is specified, the selector will only search for children of the element (including itself).
domsi.findAll(domsiSelector, [element])
Identical to domsi.find
, except it returns all elements matched.
Under the hood, domsi.find
calls domsi.findAll
and returns the first element.
All matches returned by domsi.find
and domsi.findAll
are DomsiObject
s. They take on the following structure:
{
node: HTMLNode;
children: {
[name: string]: DomsiObject | DomsiObject[];
};
}
Documentation on the children returned can be found in DomsiChildrenSelector.