Applied-Machine-Learning-using-Python---Assignment-2
Assignment 2
In this assignment you'll explore the relationship between model complexity and generalization performance by adjusting key parameters of various supervised learning models. Part 1 of this assignment looks at regression and Part 2 looks at classification.
FOR CLASSIFICATION:
Here's an application of machine learning that could save your life! For this section of the assignment we will be working with the UCI Mushroom Data Set stored in readonly/mushrooms.csv. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:
Attribute Information:
cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises?: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
Every value in the mushrooms dataset is currently stored as a single-letter string code. sklearn estimators expect numeric input, so these values must be converted before training. We'll use pd.get_dummies to convert each categorical variable into a set of indicator (0/1) variables, one per category value.
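As a minimal sketch of that encoding step, the example below runs pd.get_dummies on a tiny hand-made frame rather than on readonly/mushrooms.csv. The rows are invented for illustration, and the "class" column name with its e/p codes is an assumption about how the CSV labels the target. The helper encode_mushrooms is hypothetical and not part of the assignment code.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of readonly/mushrooms.csv.
# Assumption: the target column is named "class" and uses p=poisonous, e=edible.
sample = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "a", "n", "p"],
    "gill-size": ["n", "b", "b", "n"],
})


def encode_mushrooms(df, target="class", poisonous_code="p"):
    """Return (X, y): indicator-encoded features and a 0/1 target."""
    # Target: 1 where the mushroom is poisonous, 0 otherwise.
    y = (df[target] == poisonous_code).astype(int)
    # Features: one indicator column per (attribute, value) pair,
    # named "<attribute>_<value>", e.g. "odor_f".
    X = pd.get_dummies(df.drop(columns=[target]), dtype=int)
    return X, y


X, y = encode_mushrooms(sample)
print(X)
print(y.tolist())
```

Each original attribute becomes as many columns as it has distinct values in the data, so every row has exactly one 1 per attribute. With the real file, the frame would presumably come from pd.read_csv on the path above instead of the inline sample.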