Gender-Prediction-using-Sound

The same name can be spelled in many ways, for example, Marc and Mark. Sound can therefore be a better way to match names than spelling. In this project, I will use the Python package Fuzzy to find out the genders of authors who have appeared on the New York Times Best Seller list for Children's Picture Books.

1. Sound it out!

Grey and Gray. Colour and Color. Words like these have been the cause of many heated arguments between Brits and Americans.
One way to tackle this challenge is to write a program that checks if two strings sound the same, instead of checking for equivalence in spellings. We'll do that here using fuzzy name matching.

import fuzzy

print(fuzzy.nysiis('gray'))
>>> GRY

fuzzy.nysiis('colour') == fuzzy.nysiis('color')
>>> True
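
To make the idea of a phonetic key concrete without installing anything, here is a minimal sketch of American Soundex. Note that this is a different and simpler algorithm than the NYSIIS code returned by fuzzy.nysiis, so its codes will not match the ones shown in this project; the function name soundex is only an illustration.

```python
import string

# Soundex digit for each consonant group (vowels, H, W and Y get no digit).
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the 4-character American Soundex code of word ('' if it has no letters)."""
    letters = [ch for ch in word.upper() if ch in string.ascii_uppercase]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if code and code != previous:
            result += code
        # H and W are transparent: they do not separate repeated digits.
        if ch not in "HW":
            previous = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Marc"), soundex("Mark"))       # M620 M620
print(soundex("colour") == soundex("color"))  # True
```

Two spellings count as a match when their codes are equal, which is the same comparison made with fuzzy.nysiis above.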

2. Authoring the authors

Let's begin by reading in the data on the best-selling authors from 2008 to 2017, and extracting each author's first name.

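import pandas as pd

# Load the yearly best seller data (the file is semicolon-delimited).
author_df = pd.read_csv('datasets/nytkids_yearly.csv', delimiter=';')

# The first whitespace-separated token of each author name is the first name.
first_name = [name.split()[0] for name in author_df['Author']]
author_df['first_name'] = first_name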

author_df.head()

	Year	Book Title	Author	Besteller this year	first_name
0	2017	DRAGONS LOVE TACOS	Adam Rubin	49	Adam
1	2017	THE WONDERFUL THINGS YOU WILL BE	Emily Winfield Martin	48	Emily
2	2017	THE DAY THE CRAYONS QUIT	Drew Daywalt	44	Drew
3	2017	ROSIE REVERE, ENGINEER	Andrea Beaty	38	Andrea
4	2017	ADA TWIST, SCIENTIST	Andrea Beaty	28	Andrea
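
One caveat with name.split()[0] is that it raises an IndexError for an empty or whitespace-only name. The sketch below shows one way to guard against that; first_token is a hypothetical helper, not part of the original project, and the empty string in the sample list is made up to exercise the edge case.

```python
def first_token(full_name):
    """Return the first whitespace-separated token of full_name, or '' if there is none."""
    tokens = full_name.split()
    return tokens[0] if tokens else ""


# Three names from the table above plus an empty entry (hypothetical).
authors = ["Adam Rubin", "Emily Winfield Martin", "  Drew Daywalt", ""]
first_names = [first_token(name) for name in authors]
print(first_names)  # ['Adam', 'Emily', 'Drew', '']
```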

3. Time to bring on the phonics!

When we were young children, we were taught to read using phonics: sounding out the letters that compose words. So let's relive history and do that again, but using Python this time.

# Compute the NYSIIS phonetic code of every first name.
nysiis_name = [fuzzy.nysiis(name) for name in author_df['first_name']]

author_df['nysiis_name'] = nysiis_name

author_df.head()

	Year	Book Title	Author	Besteller this year	first_name	nysiis_name
0	2017	DRAGONS LOVE TACOS	Adam Rubin	49	Adam	ADAN
1	2017	THE WONDERFUL THINGS YOU WILL BE	Emily Winfield Martin	48	Emily	ENALY
2	2017	THE DAY THE CRAYONS QUIT	Drew Daywalt	44	Drew	DR
3	2017	ROSIE REVERE, ENGINEER	Andrea Beaty	38	Andrea	ANDR
4	2017	ADA TWIST, SCIENTIST	Andrea Beaty	28	Andrea	ANDR

4. The inbetweeners

We'll use babynames_nysiis.csv, a dataset that is derived from the Social Security Administration’s baby name data, to identify author genders. The dataset contains unique NYSIIS versions of baby names, and also includes the percentage of times the name appeared as a female name (perc_female) and the percentage of times it appeared as a male name (perc_male).

babies_df = pd.read_csv('datasets/babynames_nysiis.csv', delimiter=';')

# Label each NYSIIS code by its majority gender; 'N' marks an exact tie.
gender = []
for perc_female, perc_male in zip(babies_df['perc_female'], babies_df['perc_male']):
    if perc_male > perc_female:
        gender.append('M')
    elif perc_male < perc_female:
        gender.append('F')
    else:
        gender.append('N')

babies_df['gender'] = gender

babies_df.head()

	babynysiis	perc_female	perc_male	gender
0	NaN	62.50	37.50	F
1	RAX	63.64	36.36	F
2	ESAR	44.44	55.56	M
3	DJANG	0.00	100.00	M
4	PARCAL	25.00	75.00	M
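
To illustrate what the perc_female and perc_male columns mean, the sketch below computes them from raw counts per NYSIIS code. This is only an assumption about how such a table could be derived, not a description of how babynames_nysiis.csv was actually built, and the counts are made up (chosen so that the percentages reproduce three of the rows shown above).

```python
from collections import defaultdict

# Hypothetical (nysiis_code, sex, count) records; the counts are invented for illustration.
records = [
    ("RAX", "F", 7), ("RAX", "M", 4),
    ("ESAR", "F", 4), ("ESAR", "M", 5),
    ("DJANG", "M", 3),
]

# Sum the counts per code and sex.
totals = defaultdict(lambda: {"F": 0, "M": 0})
for code, sex, count in records:
    totals[code][sex] += count

# Convert the counts into (perc_female, perc_male), rounded to two decimals.
percentages = {}
for code, counts in totals.items():
    total = counts["F"] + counts["M"]
    percentages[code] = (round(100 * counts["F"] / total, 2),
                         round(100 * counts["M"] / total, 2))

for code, (perc_female, perc_male) in percentages.items():
    print(code, perc_female, perc_male)
```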

5. Playing matchmaker

Now that we have identified the likely genders of different names, let's find author genders by searching for each author's name in the babies_df DataFrame, and extracting the associated gender.

def locate_in_list(a_list, element):
    # Return the index of element in a_list, or -1 if it is absent.
    loc_of_name = a_list.index(element) if element in a_list else -1
    return loc_of_name

# Build the list of known NYSIIS codes once instead of on every iteration.
baby_codes = list(babies_df['babynysiis'])

author_gender = []
for name in author_df['nysiis_name']:
    nloc = locate_in_list(baby_codes, name)
    if nloc == -1:
        author_gender.append('Unknown')
    else:
        author_gender.append(babies_df['gender'][nloc])

author_df['author_gender'] = author_gender

author_df['author_gender'].value_counts()

F	395
M	191
Unknown	9
N	8
Name: author_gender, dtype: int64
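
The matching step can also be sketched as a dictionary lookup, which avoids scanning a list for every author. The dictionary below is a small stand-in for babies_df built from the rows shown earlier, and "ZZZZ" is a hypothetical code used only to demonstrate the fallback.

```python
# Stand-in for babies_df: NYSIIS code -> gender label.
gender_by_code = {"RAX": "F", "ESAR": "M", "DJANG": "M", "PARCAL": "M"}

# Codes to look up; "ZZZZ" is a made-up code that is not in the table.
author_codes = ["ESAR", "ZZZZ", "RAX"]

# dict.get returns the fallback 'Unknown' when a code is missing.
author_gender = [gender_by_code.get(code, "Unknown") for code in author_codes]
print(author_gender)  # ['M', 'Unknown', 'F']
```

With the real data, the dictionary could presumably be built with dict(zip(babies_df['babynysiis'], babies_df['gender'])). Because the codes are described as unique, this should give the same labels as the list search.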

6. Tally up

The results above show more entries by female authors (395) than by male authors (191) on the New York Times Best Seller list. Our dataset spans 2008 to 2017. Let's find out whether this has changed over time.

years = sorted(author_df.Year.unique())

males_by_yr = []
females_by_yr = []
unknown_by_yr = []

for year in years:
    males_by_yr.append(len(author_df[(author_df['author_gender']=='M') & (author_df['Year']==year)]))
    females_by_yr.append(len(author_df[(author_df['author_gender']=='F') & (author_df['Year']==year)]))
    unknown_by_yr.append(len(author_df[(author_df['author_gender']=='Unknown') & (author_df['Year']==year)]))

males_by_yr 

>>> [8, 19, 27, 21, 21, 11, 21, 18, 25, 20]
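
The same per-year tally can be sketched with collections.Counter, counting each (year, gender) pair in a single pass. The rows below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical (year, author_gender) pairs standing in for the rows of author_df.
rows = [(2016, "F"), (2016, "M"), (2017, "F"), (2017, "F"), (2017, "Unknown")]

# Count how often each (year, gender) combination occurs.
counts = Counter(rows)
years = sorted({year for year, _ in rows})

# A Counter returns 0 for combinations that never occur.
males_by_yr = [counts[(year, "M")] for year in years]
females_by_yr = [counts[(year, "F")] for year in years]
unknown_by_yr = [counts[(year, "Unknown")] for year in years]

print(years)          # [2016, 2017]
print(males_by_yr)    # [1, 0]
print(females_by_yr)  # [1, 2]
print(unknown_by_yr)  # [0, 1]
```

On the real DataFrame, pd.crosstab(author_df['Year'], author_df['author_gender']) should produce a comparable year-by-gender table in one call.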

7. Foreign-born authors?

Our gender data comes from the Social Security applications of individuals born in the US. Hence, one possible explanation for the "Unknown" genders associated with some author names is that these authors were foreign-born. The chart below compares the yearly counts of male and female authors.

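import numpy as np
import matplotlib.pyplot as plt

# Offset the female bars so the two series sit side by side for each year.
years_shifted = list(np.array(years) + 0.25)

plt.bar(years, males_by_yr, width=0.25, color='lightblue', label='male')
plt.bar(years_shifted, females_by_yr, width=0.25, color='pink', label='female')

plt.xlabel('years')
plt.legend()
plt.show()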

shukkkur/Gender-Prediction-using-Sound