BU2010 Web Scraping

Introduction

Overview

This week you get to try out scraping (harvesting) the web.

Advice

Once you are done, be sure to commit your changes (to save them to the repository) and to push them to GitHub (so that I can see the work you have done).

It's best to commit early and commit often - that way you can go back if you make a mistake, and I can see all the work you have done. It doesn't matter if your early commits contain mistakes.

Task Explanation

Below are today's tasks. We will be working with a range of skills you learned in recent weeks.

For this task, you need to provide the answers in the scrape.R file you will find in this repository, not in the Markdown file you are currently reading.

Your Task

Your tasks can be found in the R script file scrape.R that is part of this repository. You have to work on these tasks by yourself. Do not work with others.

Please work on these tasks in RStudio - not on the GitHub website. If you work in RStudio you can make sure your code works as it should. If you instead edit the file in the GitHub web interface, you will have to copy and paste the code into R for testing - an unnecessary step that can introduce mistakes.

Pay attention to the autocomplete options RStudio is offering you and use them to explore how R commands work. Also, remember how useful the cursor keys and the Tab key can be. Pressing F1 will bring up the documentation for the selected command in the Help tab.

Please don't forget to commit and to push your commit.

You will find the following tasks in scrape.R. In that file, please write your code below the comment for each task. For some tasks I have started the lines of code for you.

Task 1

Set the correct Working Directory.

Hint: I showed you how to do this in previous weeks. It can be done with a command or through the GUI (Graphical User Interface).
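If you use a command, it might look like the sketch below; the path shown is a hypothetical placeholder for wherever you cloned this repository.

```r
# Hypothetical path -- replace it with the location of your own clone.
setwd("~/projects/bu2010-web-scraping")

# Confirm the working directory changed as expected.
getwd()
```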

Task 2

Load the packages you need.
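The task does not name the packages, but a typical setup for this kind of scraping might look like this (rvest for reading and querying HTML, stringr for cleaning text):

```r
# rvest for scraping HTML, stringr for cleaning the scraped text later on.
library(rvest)
library(stringr)
```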

Task 3

Read (import) the HTML file from The Pen Company. It is in this repo.
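With rvest loaded, reading a local HTML file might look like the sketch below; the file name is a hypothetical placeholder, so substitute the actual name of the file you find in the repository.

```r
# Hypothetical file name -- use the actual HTML file in this repository.
page <- read_html("penco.html")
```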

Task 4

Use Selector Gadget to find the CSS Selector for the elements that contain the brand name.

Scrape the HTML text from these elements, store it in a vector, and clean it of unwanted strings.

Do the same for the product names.

Do the same for the prices.
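A sketch covering all three sub-steps might look like the following. Every CSS selector and cleaning pattern here is a hypothetical placeholder - replace them with what Selector Gadget reports for the actual page and with whatever unwanted strings you actually find.

```r
# All selectors and cleaning patterns below are hypothetical placeholders.
brand <- page %>%
  html_elements(".product-brand") %>%   # selector found with Selector Gadget
  html_text2() %>%
  str_squish()                          # tidy stray whitespace

product <- page %>%
  html_elements(".product-name") %>%
  html_text2() %>%
  str_squish()

price <- page %>%
  html_elements(".product-price") %>%
  html_text2() %>%
  str_remove("£")                       # e.g. drop a currency symbol
```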

Task 5

Combine the three vectors into one data frame.
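Assuming the three vectors from Task 4 are the same length, the combination might look like this:

```r
# Combine the cleaned vectors into one data frame, one row per product.
pens <- data.frame(brand, product, price)

# A quick look at the result.
head(pens)
```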

Note

Thank you to James from The Pen Company for allowing us to use their website as an example.