BU2010 Web Scraping

Introduction

Overview

This week you get to try out scraping (harvesting) data from the web.

Advice

Once you are done be sure to commit your changes again (to save them to the repository) and to push them to GitHub (so that I can see the work you have done).

It's best to commit early and commit often: that way you can go back if you have made a mistake, and I can see all the work you have done. It doesn't matter if your early commits contain mistakes.

Task Explanation

Below is today's task. We will be working with a range of skills you learned in recent weeks.

For this task, you need to provide the answers in the scrape.R file you will find in this repository, not in the Markdown file you are currently reading.

Your Task

Your tasks can be found in the R script file scrape.R that is part of this repository. You have to work on these tasks by yourself. Do not work with others.

Please work on these tasks in RStudio, not on the GitHub website. Working in RStudio lets you make sure your code runs as it should. If you instead edit the file in the GitHub web interface, you will have to copy and paste the code into R for testing, an unnecessary step that can introduce mistakes.

Pay attention to the autocomplete options RStudio is offering you and use them to explore how R commands work. Also, remember how useful the cursor keys and the Tab key can be. Pressing F1 will bring up the documentation for the selected command in the Help tab.

Please don't forget to commit and to push your commit.

You will find the following tasks in scrape.R. In that file, please write your code below the comment for each task. For some answers, I have started the lines of code for you.

Task 1

Set the correct Working Directory.

Hint: I showed you how to do this in previous weeks. It can be done with a command or through the GUI (Graphical User Interface).
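As a reminder, setting the working directory with a command looks like the sketch below. The path shown is only a placeholder ("." means the current directory); replace it with the folder where you cloned this repository on your own machine.

```r
# Placeholder path: "." is the current directory. On your machine this
# would be something like "~/projects/web-scraping-assignment".
project_dir <- "."
setwd(project_dir)

# Confirm the change took effect
getwd()
```

In RStudio you can do the same through the GUI: Session > Set Working Directory > Choose Directory.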

Task 2

Load the packages you need.
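A minimal sketch, assuming the scraping package used in class is rvest (the usual tidyverse choice); adjust to whatever packages you actually covered.

```r
# rvest provides read_html(), html_elements() and html_text2() for scraping.
# Assumption: rvest is the package used in this course; load others as needed.
library(rvest)
```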

Task 3

Read (import) the HTML file scrapingStaedtler. It is in this repo.
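Reading a local HTML file is a single rvest call. The sketch below builds a tiny page inline (with rvest's minimal_html()) so it runs on its own; for the assignment you would pass the file name from the repository instead. The exact file name is an assumption, so check the repo for the real one.

```r
library(rvest)

# For the assignment, read the file from this repository, e.g.:
#   staedtler_page <- read_html("scrapingStaedtler.html")   # file name assumed
# Here we build a tiny page inline so the sketch runs anywhere.
staedtler_page <- minimal_html('
  <div class="product"><span class="name">Staedtler 925</span></div>')

class(staedtler_page)   # an xml_document you can query with CSS selectors
```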

Task 4

Use Selector Gadget to find the CSS Selector for the elements that contain the pen name.

Scrape the HTML text from these elements, store it in a vector, and clean it of unwanted strings.

Do the same for the prices and for the brand name.
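The steps above can be sketched as follows. The CSS selectors (".name", ".price", ".brand") and the inline page are placeholders: use SelectorGadget on the real file to find the actual selectors, and read_html() to load the real page.

```r
library(rvest)

# Placeholder page standing in for the real scrapingStaedtler file
page <- minimal_html('
  <p class="brand">Staedtler</p>
  <div><span class="name">Pen: Mars Lumograph</span>
       <span class="price">Price: 2.10</span></div>
  <div><span class="name">Pen: Noris</span>
       <span class="price">Price: 0.80</span></div>')

# html_elements() selects by CSS selector (html_nodes() in older rvest);
# html_text2() extracts the visible text with whitespace tidied up.
pen_names <- page |> html_elements(".name")  |> html_text2()
prices    <- page |> html_elements(".price") |> html_text2()
brand     <- page |> html_elements(".brand") |> html_text2()

# Clean unwanted strings, e.g. strip the "Pen: " / "Price: " labels
pen_names <- gsub("^Pen: ", "", pen_names)
prices    <- gsub("^Price: ", "", prices)
```

On the real page the unwanted strings will differ, but gsub() (or stringr's str_remove()) with a pattern matched to what you see is the same idea.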

Task 5

Combine the vectors you created into one data frame.
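Combining equal-length vectors into a data frame is one base R call. The example values below stand in for the vectors you scraped in Task 4.

```r
# Example values standing in for your scraped vectors
pen_names <- c("Mars Lumograph", "Noris")
prices    <- c(2.10, 0.80)
brand     <- "Staedtler"   # a single value is recycled to every row

pens <- data.frame(brand = brand, name = pen_names, price = prices)
pens
```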

Task 6

Write a short explanation of what you did, including what you found easy or difficult.

Write your explanation here:

Task 7

Read (import) the HTML file scrapingPelikan. It is in this repo.

Task 8

Scrape the pen names and prices and combine them into a vector.
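Tasks 7 and 8 repeat the Staedtler workflow on the Pelikan file. In this sketch the file name, selectors, and page content are all placeholder assumptions; one way to "combine them into a vector" is to paste each name with its price.

```r
library(rvest)

# For the assignment:  pelikan_page <- read_html("scrapingPelikan.html")
# (file name assumed). Here we use a placeholder page so the sketch runs.
pelikan_page <- minimal_html('
  <li><b class="name">Souveran M400</b> <i class="price">310.00</i></li>
  <li><b class="name">Classic M200</b> <i class="price">125.00</i></li>')

pelikan_names  <- pelikan_page |> html_elements(".name")  |> html_text2()
pelikan_prices <- pelikan_page |> html_elements(".price") |> html_text2()

# paste() combines the two vectors element by element into one vector
pens_vector <- paste(pelikan_names, pelikan_prices, sep = " - ")
pens_vector
```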

Task 9

Write a short explanation of what you did, including what you found easy or difficult.

Write your explanation here:

Note

You might not be able to do all tasks, but you should give it a try. Please remember to work by yourself; do not work with others.