/test2.0


Web scraping is an automated method for obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications.

urllib.request is a Python module for fetching URLs (Uniform Resource Locators).
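As a minimal sketch of fetching a page with urllib.request: the helper name `fetch_text`, the UTF-8 default, and the timeout value are assumptions made for illustration, not part of this repository. The example uses a `data:` URL so that it runs without network access; an ordinary `http://` or `https://` URL can be passed the same way.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8", timeout=10):
    """Fetch a URL and return the response body decoded as text.

    Hypothetical helper: the encoding is assumed rather than detected.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


# A data: URL keeps the example runnable offline.
sample_html = "<html><body><h1>Hello</h1></body></html>"
page = fetch_text("data:text/html," + quote(sample_html))
print(page)
```

For real sites the encoding usually comes from the response headers or the page itself, so hard-coding UTF-8 is a simplification.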

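Beautiful Soup is a Python library for parsing HTML and XML documents and pulling data out of them. It does not download pages itself, so it is typically used on markup that has already been fetched, for example with urllib.request.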

  • Beautiful Soup lets us pick out specific elements on a web page, such as <div>, <ul>, <table>, <a>, and other tags.
  • It lets us target elements with specific attributes, for example all the <div class="main"> elements, or the elements matching <img width="300">.
  • It also provides handy functions for extracting the text or hyperlinks within a given element.
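The three points above can be sketched as follows. This assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`; the HTML snippet and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for a downloaded page.
html = """
<div class="main"><a href="/a">First</a><img src="x.png" width="300"></div>
<div class="side"><a href="/b">Second</a></div>
<ul><li>one</li><li>two</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Select elements by tag name.
items = [li.get_text() for li in soup.find_all("li")]

# Select elements by attribute value.
main_divs = soup.find_all("div", class_="main")
wide_images = soup.find_all("img", attrs={"width": "300"})

# Extract hyperlinks and link text.
links = [a["href"] for a in soup.find_all("a")]
link_text = [a.get_text() for a in soup.find_all("a")]

print(items, links, link_text)
```

`class_` has a trailing underscore because `class` is a reserved word in Python.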

Python has a built-in module named re for working with RegEx.

A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example, the pattern ^a...s$ matches any five-character string that starts with a and ends with s: ^ anchors the start, each . matches any single character except a newline, and $ anchors the end.
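A short sketch of the ^a...s$ pattern with the re module; the candidate strings are made up for illustration.

```python
import re

pattern = re.compile(r"^a...s$")

# Invented test strings: only five-character strings that start with
# "a" and end with "s" should match.
candidates = ["abyss", "alias", "a1-2s", "abs", "Alias", "an abacus"]

matches = [word for word in candidates if pattern.search(word)]
print(matches)  # ['abyss', 'alias', 'a1-2s']
```

Note that "a1-2s" matches because . accepts any character, not only letters, and "Alias" does not match because the pattern is case-sensitive by default.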