PSSiteScraper

Primary language: PowerShell. License: GNU General Public License v3.0 (GPL-3.0).

Scrape-Warm-Site

This PowerShell script retrieves and outputs all of a site's URIs by scraping its sitemap for links and then parsing the HTML of those links. It can also warm the site, either automatically or manually through generated curl commands.
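
The sitemap step can be illustrated with a minimal sketch. It is not the script's actual implementation, which this README does not show. The helper name Get-SitemapLoc and the example.com URLs are hypothetical, and the XML is inlined here rather than downloaded so that the example runs without network access.

```powershell
# Hypothetical sample data: a parent sitemap index and one child sitemap.
$parentSitemapXml = @'
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
'@

$childSitemapXml = @'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post/</loc></url>
  <url><loc>https://example.com/blog/second-post/</loc></url>
</urlset>
'@

# Hypothetical helper: return every <loc> value found in a sitemap document.
function Get-SitemapLoc {
    param([Parameter(Mandatory)][string]$Xml)
    $doc = [xml]$Xml
    # local-name() sidesteps the sitemap XML namespace.
    $doc.SelectNodes("//*[local-name()='loc']") |
        ForEach-Object { $_.InnerText.Trim() }
}

# The parent sitemap yields child sitemaps; each child sitemap yields links.
$childSitemaps = @(Get-SitemapLoc -Xml $parentSitemapXml)
$links = @(Get-SitemapLoc -Xml $childSitemapXml)
```

In a real run, each entry in $childSitemaps would be fetched and passed through the same helper to collect its links.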

Features:

  • Starting from a site's parent sitemap, scrape for child sitemaps and collect all links
  • Output sitemaps and links as lists and as curl commands, each in its own file
  • Optionally scrape those links for URIs found in any tag-attribute combination you choose, e.g. <a href>, <img src>, <img srcset>, <img data-src>, <img data-srcset>, <link rel>, <script src>
  • URIs are restricted to the site's own domain (i.e. the same domain as the sitemaps and links)
  • Output each tag-attribute's URIs to files
  • Output each tag-attribute's URIs as curl commands to files
  • Optionally warm the site with the above URIs as part of the script run
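
The tag-attribute scraping, domain filtering and curl generation listed above could look roughly like the sketch below. Everything here is an assumption made for illustration: the helper name Get-TagAttributeUri, the sample HTML, the output file names, and the exact shape of the curl commands are hypothetical, and the real script takes its settings from config.ps1. A regular expression is used for brevity; it only handles quoted attribute values.

```powershell
# Hypothetical sample page and domain.
$html = @'
<a href="https://example.com/about/">About</a>
<a href="https://other.example.org/">Elsewhere</a>
<img src="https://example.com/img/logo.png" data-src="https://example.com/img/logo-lazy.png">
<script src="https://cdn.example.net/lib.js"></script>
'@
$domain = 'example.com'

# Hypothetical helper: collect the values of one attribute on one tag,
# keeping only absolute URIs whose host equals the given domain.
function Get-TagAttributeUri {
    param(
        [Parameter(Mandatory)][string]$Html,
        [Parameter(Mandatory)][string]$Tag,
        [Parameter(Mandatory)][string]$Attribute,
        [Parameter(Mandatory)][string]$Domain
    )
    # The \s before the attribute name stops "src" from matching inside "data-src".
    $pattern = '<{0}\b[^>]*?\s{1}\s*=\s*["'']([^"'']+)["'']' -f [regex]::Escape($Tag), [regex]::Escape($Attribute)
    [regex]::Matches($Html, $pattern, 'IgnoreCase') |
        ForEach-Object { $_.Groups[1].Value } |
        Where-Object {
            $uri = $null
            [uri]::TryCreate($_, [System.UriKind]::Absolute, [ref]$uri) -and $uri.Host -eq $Domain
        } |
        Select-Object -Unique
}

$hrefs = @(Get-TagAttributeUri -Html $html -Tag 'a' -Attribute 'href' -Domain $domain)

# One curl command per URI (assumed format: silent, response body discarded).
$curls = @($hrefs | ForEach-Object { "curl -s -o /dev/null '$_'" })

# Write the list and the curl commands to separate files (hypothetical names).
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'scrape-warm-sketch'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null
$hrefs | Set-Content -Path (Join-Path $outDir 'a-href.txt')
$curls | Set-Content -Path (Join-Path $outDir 'a-href-curls.txt')
```

Calling the helper once per configured tag-attribute pair gives one URI list, and therefore one pair of output files, per combination.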

Requirements:

Installation/usage:

  • Open config.ps1 in your favourite text editor and configure the script's settings
  • WinNT:
    • Right-click the script in Explorer and select Run with PowerShell (PowerShell should be present on Windows 7 and up).
    • Alternatively, open a command prompt in the script directory and run Powershell .\Scrape-Warm-Site.ps1
  • *nix:
    • Run powershell ./Scrape-Warm-Site.ps1 or pwsh ./Scrape-Warm-Site.ps1, depending on which version of PowerShell you're running.

FAQ

WinNT

Q: Help! I am getting an error 'File C:...Scrape-Warm-Site.ps1 cannot be loaded because the execution of scripts is disabled on this system. Please see "get-help about_signing" for more details.'

  • You need to allow the execution of unverified scripts. Open PowerShell as administrator, type Set-ExecutionPolicy Unrestricted -Force and press ENTER, then try running the script again. You can restore the security setting afterwards with Set-ExecutionPolicy Undefined -Force.

Q: Help! Upon running the script I am getting an error 'File C:...Scrape-Warm-Site.ps1 cannot be loaded. The file C:...\Scrape-Warm-Site.ps1 is not digitally signed. You cannot run this script on the current system. For more information about running scripts and setting execution policy, see about_Execution_Policies at http://go.microsoft.com/fwlink/?LinkID=135170.'

  • You need to allow the execution of unverified scripts. Open PowerShell as administrator, type Set-ExecutionPolicy Unrestricted -Force and press ENTER, then try running the script again. You can restore the security setting afterwards with Set-ExecutionPolicy Undefined -Force.

Q: Help! Upon running the script I am getting a warning 'Execution Policy change. The execution policy helps protect you from scripts that you do not trust. Changing the execution policy might expose you to the security risks described in the about_Execution_Policies help topic at http://go.microsoft.com/fwlink/?LinkID=135170. Do you want to change the execution policy?'

  • You need to allow the execution of unverified scripts. Type Y for yes and press ENTER. You can restore the security setting afterwards by opening PowerShell as administrator and running Set-ExecutionPolicy Undefined -Force.
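
As a narrower alternative to the answers above, PowerShell can also relax the execution policy for the current session only, which leaves the machine-wide setting untouched. This is a sketch of standard PowerShell usage rather than something the script itself requires, and a policy enforced through Group Policy may still override it.

```powershell
# Applies only to the current PowerShell session; the setting is discarded
# when the session closes and the machine-wide policy is left unchanged.
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass -Force
.\Scrape-Warm-Site.ps1

# Or, from Command Prompt, for a single invocation:
# powershell -ExecutionPolicy Bypass -File .\Scrape-Warm-Site.ps1
```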

*nix

Nil

Known issues

Nil

Additional Information

  • By default, the script directory (the directory you run the script from) needs read, write and execute permissions. All created files and folders are placed in the script directory.

Background:

  • Website owners may want to warm their site (i.e. "preload the cache") from a remote client, especially if they use Content Delivery Networks (CDNs).
  • Search Engine Optimization (SEO) typically involves optimizing a website's load times, and preloading or 'warming' the web cache is an effective way to do so. This script can be configured to warm the site automatically; alternatively, the curl commands it writes to separate files can be run from any other machine.
  • Website owners might want a list of links to all their resources (blog posts, media, etc.) if they intend to migrate their site (e.g. when changing a domain name). This script can find all of those and output them as a list.
  • Website owners may simply need a list of their sitemaps, or of the links in those sitemaps.
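
Warming, as described above, amounts to requesting each collected URI once so that caches along the way get populated. The sketch below shows one way to do that. The helper name Invoke-SiteWarm and its parameters are hypothetical and not taken from the script, and the request is passed in as a script block so the loop can be tried without network access.

```powershell
# Hypothetical helper: request each URI once and report the outcome.
function Invoke-SiteWarm {
    param(
        [Parameter(Mandatory)][string[]]$Uri,
        # Default request uses Invoke-WebRequest; it can be swapped for a dry run.
        [scriptblock]$Request = { param($u) (Invoke-WebRequest -Uri $u -UseBasicParsing).StatusCode }
    )
    foreach ($u in $Uri) {
        try {
            $status = & $Request $u
            [pscustomobject]@{ Uri = $u; Status = $status; Warmed = $true }
        }
        catch {
            # A failed request is recorded rather than aborting the whole run.
            [pscustomobject]@{ Uri = $u; Status = $null; Warmed = $false }
        }
    }
}

# Dry run with a stand-in request that fails for one URI.
$results = @(Invoke-SiteWarm -Uri 'https://example.com/', 'https://example.com/missing' -Request {
    param($u)
    if ($u -like '*missing') { throw 'simulated failure' }
    200
})
```

Omitting -Request makes the helper issue real HTTP requests, which is the equivalent of running the generated curl commands one by one.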