GovTechSG/purple-a11y

How to exclude from scanning sites that are in other subdomains?

zwiastunsw opened this issue · 9 comments

How can I exclude from scanning the sites (pages) that are in other subdomains? How can I make the crawl not go beyond the indicated domain (subdomain)?

Hi there,

I assume you are asking about the website (website crawl) scan. Purple HATS automatically excludes subdomains (or any domain that does not match the starting URL's). You can verify the list of URLs scanned by Purple HATS by examining details.json, located in your results folder.

For example, in ./purple-hats/results/PHScan_<domain name>_.../details.json you will see the list of URLs crawled and the URLs that fell outside the domain:

...
  "urlsCrawled": {
    "toScan": [],
    "scanned": [
    ...
    ],
    "invalid": [],
    "outOfDomain": [
   ...
    ]
  }
...
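If you prefer to check this programmatically rather than by eye, a small script like the sketch below can list the excluded URLs. It is based only on the snippet above: the results path is a placeholder, and it assumes the arrays hold plain URL strings (adjust if your details.json nests them in objects).

import { readFileSync } from "node:fs";

// Placeholder path -- point this at your own PHScan results folder.
const detailsPath = "./results/PHScan_lepszyweb.pl_.../details.json";

// Only the fields used below are typed; shape taken from the snippet above.
interface UrlsCrawled {
  scanned: string[];
  outOfDomain: string[];
}

const details = JSON.parse(readFileSync(detailsPath, "utf8")) as {
  urlsCrawled: UrlsCrawled;
};

console.log(`Scanned: ${details.urlsCrawled.scanned.length} URLs`);
console.log("Found but excluded (out of domain):");
for (const url of details.urlsCrawled.outOfDomain) {
  console.log(`  ${url}`);
}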

Do I understand correctly that the list of addresses given in the "outOfDomain" array is the list of addresses that were found but excluded from the scan?

PS My congratulations and huge thanks! This is a wonderfully prepared automated testing tool.

In the scan settings for the main domain address (e.g. https://lepszyweb.pl), I put the patterns of the subdomain addresses I want to exclude in the exclusions.txt file. Unfortunately, pages in the subdomains are also scanned.
My exclusions.txt file:

\.*wcag.lepszyweb.pl\.*
\.*wcag21.lepszyweb.pl\.*
\.*tad.lepszyweb.pl\.*
\.*deklaracja.lepszyweb.pl\.*
\.*przedipo.lepszyweb.pl\.*
\.*raport.lepszyweb.pl\.*
\.*walidator.lepszyweb.pl\.*
\.*kontrast.lepszyweb.pl\.*
\.*testy.lepszyweb.pl\.*

I don't know how to make it scan only the main domain's addresses. Of course, I could use a sitemap, but I would like to use the crawl option.

Do I understand correctly that the list of addresses given in the "outOfDomain" array is the list of addresses that were found but excluded from the scan?

Yes that is correct.

PS My congratulations and huge thanks! This is a wonderfully prepared automated testing tool.

Thank you! It would be great if you could share more about what we have done well, how you are using Purple-hats, and ways we can improve. 😃

In the scan settings for the main domain address (e.g. https://lepszyweb.pl), I put the patterns of the subdomain addresses I want to exclude in the exclusions.txt file. Unfortunately, pages in the subdomains are also scanned. My exclusions.txt file:

\.*wcag.lepszyweb.pl\.*
\.*wcag21.lepszyweb.pl\.*
\.*tad.lepszyweb.pl\.*
\.*deklaracja.lepszyweb.pl\.*
\.*przedipo.lepszyweb.pl\.*
\.*raport.lepszyweb.pl\.*
\.*walidator.lepszyweb.pl\.*
\.*kontrast.lepszyweb.pl\.*
\.*testy.lepszyweb.pl\.*

I don't know how to make it scan only the main domain's addresses. Of course, I could use a sitemap, but I would like to use the crawl option.

The exclusions.txt file is only needed for the custom flow scan. In the website crawl and sitemap scan modes, sub-domains and other domains that do not match the website URL you want to scan are automatically excluded.

Hope it helps.

Unfortunately, this is not the case. You can verify it by scanning the site I provided: scanning https://lepszyweb.pl yields results from both the main domain and the subdomains.
I am working on Windows 11.
I have tested both the portable Purple HATS v0.0.15 and purple-hats-master, downloaded on 2023/05/26.

Hi @zwiastunsw,

Thanks for sharing your experience. I have implemented an advanced scan option, -s "same-hostname", that makes the crawler match the hostname of the URL provided for the scan.

Default scans without the -s "same-hostname" option will keep the current behaviour, where the crawler matches "same-domain" (which includes sub-domains).
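For anyone curious how the two strategies differ in practice: the option names match Crawlee's enqueue strategies, which Purple HATS uses for crawling (an assumption about the implementation). The sketch below is illustrative only, not the project's actual code; with https://lepszyweb.pl as the start URL, "same-hostname" stays on lepszyweb.pl, while "same-domain" also follows links to wcag.lepszyweb.pl and the other sub-domains.

import { PlaywrightCrawler, EnqueueStrategy } from "crawlee";

// "same-domain" follows links on lepszyweb.pl AND its sub-domains
// (wcag.lepszyweb.pl, testy.lepszyweb.pl, ...);
// "same-hostname" stays on lepszyweb.pl only.
const strategy = EnqueueStrategy.SameHostname; // or EnqueueStrategy.SameDomain

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, enqueueLinks }) {
    console.log(`Crawling ${request.url}`);
    await enqueueLinks({ strategy });
  },
});

await crawler.run(["https://lepszyweb.pl"]);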

This is available in the new release https://github.com/GovTechSG/purple-hats/releases/tag/0.9.0

 -s, --strategy       Strategy to choose which links to crawl in a website scan.
                      Defaults to "same-domain".
                                      [choices: "same-domain", "same-hostname"]
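For example, a website crawl could then be started like this (the node cli.js entry point and the -c/-u options shown are assumed from the usual CLI and may differ in your setup; only -s "same-hostname" is the new part):

node cli.js -c website -u https://lepszyweb.pl -s "same-hostname"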

Let me know if this meets your usage scenario?

Thank you, this solves my problem perfectly.

Glad I was able to assist and provide the new option to exclude sub-domains from scanning. 😄

I will close the issue as completed.