An R package to responsibly scrape gov.scot
I am looking into how the impact of NRS (National Records of Scotland) statistics on Scottish Government policy might be measured. One approach is to search through content on gov.scot for mentions of certain phrases relevant to NRS, to see whether patterns arise over time or between different directorates and topics.
There are a number of ways this could be achieved. Here is a summary of my views on the pros and cons of each.
Option 1 - Search Google

Google search results should all contain the strings you're searching for, so there should be less data to analyse.

You're relying on Google search results, which can change as Google's methodology changes. This approach doesn't return all pages (i.e. the ones without mentions), which means you don't have a denominator to create rates from.
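For illustration, a site-restricted query of this kind can be built and opened from R using only base functions. The phrase and the use of Google's `site:` operator are just examples, not part of any package:

```r
# Option 1 sketched in base R: build a site-restricted Google search URL.
# The phrase is illustrative; URLencode() percent-encodes it for the query string.
query <- 'site:gov.scot "National Records of Scotland"'
url <- paste0("https://www.google.com/search?q=",
              utils::URLencode(query, reserved = TRUE))
print(url)
# browseURL(url)  # would open the search in the default browser
```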
Option 2 - Use an API

This would return comprehensive and reasonably structured data, which might avoid some of the messiness of scraping.

As far as I know, there is no API for content on gov.scot.
Option 3 - Get a copy of all the web content

This would be a comprehensive dataset to analyse.

This would only include web text (since I presume a zip file of all the supporting documents would be prohibitively large).
Option 4 - Write a web-scraping script

This seems to be the only way to search through supporting documents (at least machine-readable ones).

Time would be needed to write the script.
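As a rough sketch of what such a script might do, the snippet below counts non-overlapping mentions of a phrase in a page's text using base R only. The commented-out lines show how the text itself might be fetched with the {rvest} package; the URL and CSS selector there are hypothetical:

```r
# A hypothetical helper: count non-overlapping mentions of a phrase in text.
# Base R only; gregexpr() returns match positions, or -1 when there are none.
count_mentions <- function(text, phrase) {
  hits <- gregexpr(phrase, text, fixed = TRUE)[[1]]
  sum(hits > 0)
}

# Fetching the text might use {rvest} (not run here; URL and selector are made up):
# page <- rvest::read_html("https://www.gov.scot/publications/some-publication/")
# body_text <- paste(rvest::html_text2(rvest::html_elements(page, "p")),
#                    collapse = " ")

count_mentions("NRS publishes data; NRS also publishes reports", "NRS")  # returns 2
```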
Option 5 - Use web crawling software (e.g. Screaming Frog)
Less time to set up.
These tools often cost money and may be time-consuming to run on a regular basis.
Option 6 - Use Google custom search
I'm not sure what the pros are.
This could be technically challenging and there may be a cost involved. It might also be overkill.
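For reference, Google's Custom Search JSON API is queried with a GET request carrying an API key (`key`), a search engine ID (`cx`) and the query (`q`). The snippet below only constructs such a request URL in base R; the key and engine ID are placeholders, not real credentials:

```r
# Option 6 sketched in base R: construct a Custom Search JSON API request URL.
# The API key and search engine ID below are placeholders.
endpoint <- "https://www.googleapis.com/customsearch/v1"
params <- c(key = "YOUR_API_KEY",    # placeholder API key
            cx  = "YOUR_ENGINE_ID",  # placeholder search engine ID
            q   = 'site:gov.scot "National Records of Scotland"')
encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
req_url <- paste0(endpoint, "?",
                  paste(names(encoded), encoded, sep = "=", collapse = "&"))
print(req_url)
# The response (fetched with e.g. jsonlite::fromJSON(req_url)) would be
# JSON-formatted search results.
```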