For my final project in CS50's Introduction to Programming with Python, an online course from Harvard / edX, I scrape mineral data from Wikipedia using BeautifulSoup, requests and regular expressions.
I wanted a list with mineral data including chemistry and Strunz class that can be used with a free license. I couldn't find anything like that on the web.
The script crawls the links in https://en.wikipedia.org/wiki/List_of_minerals, collects the following data (as long as it can find it), cleans it and saves it to minerals.csv:
- name (title on Wikipedia)
- url (on Wikipedia)
- category
- chemistry as plain text
- chemistry as html
- IMA Symbol
- Strunz class
- crystal system
- crystal class (as text, point group, H-M symbol, respectively)
- color
- cleavage
- Mohs scale
- streak
- Specific gravity
- luster
- habit
- varieties
- summary (first paragraph on Wikipedia)
The second output file, varieties.csv, contains the basic info about varieties found on the list of minerals.
The script cleans the data slightly:
- Remove footnotes
- Remove linebreaks
- Remove links, any <style> and <span> tags and any attributes class="foo" and style="bar" in the "chemistry html" field
- Remove {\displaystyle ... } from alt attribute of img
- If "crystal class" data was within "crystal system" (i.e. not in a separate table cell), move it to "crystal class".
- Move point group symbol / H-M symbol from "crystal class" into separate columns. On Wikipedia, this is usually in the form of "Prismatic (2/m)" or "4/mmm - Ditetragonal dipyramidal". I don't catch deviations.
- On Wikipedia, Strunz class may contain extra info such as "9.DG.05 (10 ed) 8/F.18.40 (8 ed)". Only return valid Strunz classes, using the latest edition.
- For most data the script relies on the existence of an info-box: on Wikipedia pages without an info-box, most data fields will be empty.
- Sometimes Wikipedia uses svg images for formulas or crystal class. The script only gets the alt attribute of the img, which might be useless or broken.
- Fields with numbers may be cluttered with additional text and can't be simply converted to int. Examples: Mohs scale of "2 - 3 - Gypsum-Calcite" or "3+1⁄2", gravity of "3.859 calculated; 3.8–3.9 measured".
- Beware of "cubic or tetragonal" etc. in crystal class.
- Messy data: Crystal class may be "Ditetragonal dipyramidal 4/mmm (4/m 2/m 2/m) -" or "Unknown space group" or even "aluminium arsenite".
- Major changes on Wikipedia might break the script.
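Two of the cleaning rules above (splitting the crystal class and picking the latest Strunz class) can be sketched with regular expressions. This is only an illustration, not the code in project.py: the function names `split_crystal_class` and `latest_strunz` are hypothetical, and the pattern assumes that the latest Strunz edition uses codes shaped like "9.DG.05".

```python
import re

# Assumed shape of a latest-edition Strunz code, e.g. "9.DG.05".
STRUNZ_PATTERN = re.compile(r"\b\d{1,2}\.[A-Z]{2}\.\d{1,2}[a-z]?\b")


def split_crystal_class(text):
    """Return (name, symbol) for the two usual layouts, else (text, "")."""
    text = text.strip()
    # Layout 1: "Prismatic (2/m)"
    match = re.fullmatch(r"(?P<name>[^()]+?)\s*\((?P<symbol>[^()]+)\)", text)
    if match:
        return match.group("name"), match.group("symbol")
    # Layout 2: "4/mmm - Ditetragonal dipyramidal"
    match = re.fullmatch(r"(?P<symbol>[\dm/\-]+)\s+[-–]\s+(?P<name>.+)", text)
    if match:
        return match.group("name"), match.group("symbol")
    # Anything else (e.g. "Unknown space group") is returned unchanged.
    return text, ""


def latest_strunz(text):
    """Return the first code matching the assumed pattern, or ""."""
    match = STRUNZ_PATTERN.search(text)
    return match.group(0) if match else ""


print(split_crystal_class("Prismatic (2/m)"))
print(latest_strunz("9.DG.05 (10 ed) 8/F.18.40 (8 ed)"))
```

Values that match neither layout are passed through untouched, which mirrors the limitation noted above that deviations are not caught.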
A version of the resulting minerals.csv is included in the repository. Feel free to use it under the terms of Creative Commons Attribution-ShareAlike 3.0 Unported License, Data © Wikipedia editors and contributors.
- BeautifulSoup
- requests
Run in the terminal:
python project.py
For unit tests, run:
pytest test_project.py
Note that the file andradite.html is used by the tests.
The source code is under MIT license, the scraped data is © Wikipedia editors and contributors, Creative Commons Attribution-ShareAlike 3.0 Unported License.
Parses Wikipedia's List of Minerals with BeautifulSoup, calls extract_minerals() or extract_varieties() to extract links, calls get_mineral() on each mineral link and saves results to CSV.
Helper function to be used on Wikipedia's List of Minerals to decide whether extract_minerals() or extract_varieties() should be called.
Returns True if the ul is preceded by a dl tag. On the Wikipedia list of minerals, the lists of varieties are preceded by:
<dl><dd>Varieties that are not valid species:</dd></dl>
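A minimal sketch of such a check, assuming the `<dl>` is the nearest preceding sibling tag of the `<ul>`. The function name `preceded_by_dl` and the sample HTML are made up for illustration and are not taken from project.py.

```python
from bs4 import BeautifulSoup


def preceded_by_dl(ul):
    """Return True if the nearest preceding sibling tag of `ul` is a <dl>."""
    # Passing True matches any tag, so whitespace text nodes are skipped.
    previous = ul.find_previous_sibling(True)
    return previous is not None and previous.name == "dl"


# Illustrative markup, loosely modelled on the list page.
html = """
<h2>A</h2>
<ul><li><a href="/wiki/Andradite">Andradite</a></li></ul>
<dl><dd>Varieties that are not valid species:</dd></dl>
<ul><li><a href="/wiki/Demantoid">Demantoid</a></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")
minerals_ul, varieties_ul = soup.find_all("ul")

print(preceded_by_dl(minerals_ul))
print(preceded_by_dl(varieties_ul))
```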
Extract all links to mineral Wikipedia pages from an HTML <ul> list and return them as a Python list.
Extract varieties data from Wikipedia's list of minerals.
Get the HTML with requests, parse it with BeautifulSoup, process it by calling mineral_data() and return the mineral as a dict.
Get the data from the HTML of a Wikipedia mineral page, notably from the info box with a call to get_infobox_value(). Also clean the data.
Takes an HTML tag such as <th>, searches for the corresponding <td> tag and returns the value within this tag.
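The lookup described here could look roughly like the sketch below. It assumes the `<td>` is a sibling of the `<th>` in the same table row. The name `infobox_value` and the sample table are hypothetical, and the real get_infobox_value() may differ.

```python
from bs4 import BeautifulSoup


def infobox_value(th):
    """Return the text of the <td> following `th` in its row, or ""."""
    td = th.find_next_sibling("td")
    if td is None:
        # Section header rows in an info box have no value cell.
        return ""
    return td.get_text(" ", strip=True)


# Illustrative info box fragment.
html = """
<table class="infobox">
  <tr><th>Category</th><td>Nesosilicate</td></tr>
  <tr><th>Crystal system</th><td><a href="/wiki/Cubic">Cubic</a></td></tr>
  <tr><th>Identification</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
values = {th.get_text(strip=True): infobox_value(th) for th in soup.find_all("th")}
print(values)
```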
Get varieties from infobox, return list.
Get cleaned version of the chemistry as HTML.
Helper function used by get_chemistry_html().
Return an empty string if the tag has class="reference", return the text of an <a> tag, else decode the tag and return the result.
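As a rough sketch of these three rules, applied to each child of a table cell: the name `clean_node` and the sample markup are assumptions for illustration, and plain text nodes are simply passed through.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def clean_node(node):
    """Apply the three rules to one child node of the chemistry cell."""
    if not isinstance(node, Tag):
        return str(node)  # plain text is kept as it is
    if "reference" in (node.get("class") or []):
        return ""  # drop footnote markers
    if node.name == "a":
        return node.get_text()  # keep the link text, drop the link
    return node.decode()  # keep other tags such as <sub> as HTML


# Illustrative cell content for andradite.
html = (
    '<td>Ca<sub>3</sub><a href="/wiki/Iron">Fe</a><sub>2</sub>'
    '(SiO<sub>4</sub>)<sub>3</sub><sup class="reference">[1]</sup></td>'
)
td = BeautifulSoup(html, "html.parser").td
cleaned = "".join(clean_node(child) for child in td.children)
print(cleaned)
```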
Return cleaned version of first paragraph of the Wikipedia article.
Take a list of dictionaries and turn it into a string that can be used in CSV files. Used for varieties.
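One possible way to flatten such a list into a single CSV cell is sketched below. The output format (" | " between values, "; " between items), the keys "name" and "url", the function name `dicts_to_cell` and the sample data are all assumptions, since the exact format used by the script is not described here.

```python
import csv
import io


def dicts_to_cell(items, keys=("name", "url")):
    """Join the chosen values of each dict into one string for a CSV cell."""
    parts = []
    for item in items:
        values = [str(item.get(key, "")).strip() for key in keys]
        # Skip empty values so an item without a url has no trailing separator.
        parts.append(" | ".join(value for value in values if value))
    return "; ".join(parts)


# Illustrative sample data.
varieties = [
    {"name": "Demantoid", "url": "https://en.wikipedia.org/wiki/Demantoid"},
    {"name": "Melanite", "url": ""},
]
cell = dicts_to_cell(varieties)

# Round trip through the csv module to confirm the string stays in one cell.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Andradite", cell])
row = next(csv.reader(io.StringIO(buffer.getvalue())))
print(row)
```

Writing through the csv module takes care of quoting, so separators inside the cell do not need manual escaping.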