GovTechSG/oobee

Make crawls efficient and fast, improve issue coverage by 23%, and domain coverage

Closed this issue · 1 comment

Hi, I saw this project recommended in my GitHub feed. I thought I'd share a project that may save some time and effort, since it solves several of these challenges at scale, so your environment does not suffer. Automated web testing is not something that should be done in a single language: Node.js is not well suited to this kind of concurrency (the runtime is forked for each pool and does not scale), and the community often confuses concurrency with parallelism, which are not the same thing.
The job requires a micro-service setup to scale: a crawler in a systems language (Rust, C, C++, etc.); dedicated browsers with configurations that render background processing and perform script transformation to prevent abuse; RPC streams for communication (REST does not scale for this type of workload and should not be used); a security layer; resource control for headless rendering; and efficient algorithms (most of today's have nested loops with no data-structure architecting, bad mutations, and ownership where iterators would do). We rewrote the old runners and made them efficient for people who want to use tools like axe or CodeSniffer for results, and much more. Our goal is to be accurate and performant without any drips or leaks along the way.

We left a Lite version for the community to use, as it is far superior to any combination of tools for dynamic automated web accessibility testing, with 23% more coverage than alternatives and highly efficient crawling at scale: https://github.com/a11ywatch/a11ywatch.

  • Here is a video of a crawl completing with 63% accessibility coverage, subdomain and TLD coverage, code fixes, detailed AI alt-text enhancement, and, most importantly, highly efficient crawling at scale (millions of pages within seconds to minutes).
demo.mp4

I hope this helps solve some of the challenges being worked on, as the system has many ways to integrate. We actually built several technologies along the way that are used in big tech companies today, e.g. https://github.com/spider-rs/spider.

Here is an example of a PHP integration with a tool called Equalify: https://github.com/bbertucc/equalify/tree/176-a11ywatch-integration (the exact commit with the main code required: https://github.com/j-mendez/equalify/commit/5eddd04653bf91eca15435465b78dab6c30920d8).

Integration for Node.js

You can use the sidecar to integrate without going over the wire.

npm i @a11ywatch/a11ywatch --save

import { scan, multiPageScan, crawlList } from "@a11ywatch/a11ywatch";

// single page website scan.
await scan({ url: "https://jeffmendez.com" });

// single page website scan with lighthouse results.
await scan({ url: "https://jeffmendez.com", pageInsights: true });

// all pages
await multiPageScan({ url: "https://a11ywatch.com" });

// all pages and subdomains
await multiPageScan({
  url: "https://a11ywatch.com",
  subdomains: true,
});

// all pages and tld extensions
await multiPageScan({ url: "https://a11ywatch.com", tld: true });

// all pages, subdomains, and tld extensions
await multiPageScan({
  url: "https://a11ywatch.com",
  subdomains: true,
  tld: true,
});

// all pages, subdomains, sitemap extend, and tld extensions
await multiPageScan({
  url: "https://a11ywatch.com",
  subdomains: true,
  tld: true,
  sitemap: true,
});

// multi page scan with callback on each result asynchronously
const callback = ({ data }) => {
  console.log(data);
};
await multiPageScan(
  {
    url: "https://a11ywatch.com",
  },
  callback
);
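The callback above receives each page's result as the scan streams in. As a minimal sketch of how those results might be post-processed, here is an example that collects only the errors from a single result. Note the payload shape used here (the `url`, `issues`, `type`, and `code` fields) is a hypothetical stand-in for illustration; the actual a11ywatch response shape may differ.

```javascript
// Hypothetical page-scan result, for illustration only — the real
// a11ywatch payload shape may differ.
const sampleResult = {
  url: "https://a11ywatch.com",
  issues: [
    { type: "error", code: "image-alt", message: "Image missing alt text" },
    { type: "warning", code: "contrast", message: "Low contrast text" },
  ],
};

// Keep only the errors from a page result, tagged with the page URL.
const collectErrors = ({ url, issues }) =>
  issues
    .filter((issue) => issue.type === "error")
    .map((issue) => ({ url, code: issue.code }));

const errors = collectErrors(sampleResult);
console.log(errors); // logs the collected error entries
```

A filter like this could run inside the `multiPageScan` callback to aggregate errors across every page as the crawl progresses.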

P.S.: the only reason for using multiple languages rather than Rust for everything is the need to customize and tweak browsers that are written in different languages.

Closing due to inactivity.