tue-mdse/genderComputer

Country stats


Where does the data in countryStats.csv come from, and what does it mean? I've been adding a few additional countries. In some cases names overlap between countries, and I get a failure when a country has no entry in this file, so I'd like to update it as well.

These are the numbers of Stack Overflow participants from each country at the time we worked on genderComputer. They are used in genderComputer.py during the initialisation phase:

    # Excerpt from the initialisation code; relies on `import csv` and `import os`.
    # Distribution of Stack Overflow users across countries.
    self.countryStats = {}
    total = 0.0
    with open(os.path.join(self.dataPath, 'countryStats.csv'), 'r') as fd:
        reader = csv.reader(fd, delimiter=';', dialect=csv.excel)
        for row in reader:
            country = row[0]
            numUsers = float(row[1])
            total += numUsers
            self.countryStats[country] = numUsers
    # Normalise the raw counts so each country holds its share of all users.
    for country in self.countryStats.keys():
        self.countryStats[country] = self.countryStats[country] / total
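
For anyone adding countries, the same normalisation can be tried in isolation. The following is a minimal sketch, not the genderComputer code itself: the function name `load_country_stats` and the sample rows are hypothetical, and it assumes the same `country;count` layout that the loop above reads.

```python
import csv
import io


def load_country_stats(fd):
    """Read `country;count` rows and return each country's share of the total."""
    reader = csv.reader(fd, delimiter=';')
    counts = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        counts[row[0]] = float(row[1])
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to normalise
    return {country: n / total for country, n in counts.items()}


# Made-up sample rows, not the real countryStats.csv contents.
sample = io.StringIO("Netherlands;300\nBelgium;100\n")
stats = load_country_stats(sample)
print(stats["Netherlands"])        # 0.75
print(stats.get("Atlantis", 0.0))  # 0.0 for a country with no entry
```

Looking a country up with `.get(country, 0.0)` rather than indexing directly is one way to avoid a failure when a newly added country has no row in the file, although adding the row is the more complete fix.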

Thanks! I had some idea of how it was used, but I was curious where the numbers came from. My understanding is that if a country is not explicitly specified and a name appears in the lists for multiple countries, then the country with the larger share of users will break the tie, correct?

I am not sure, but I guess that it comes from applying the countryNameManager to the Stack Overflow data.
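
The tie-breaking reading raised earlier in the thread can be sketched as follows. This only illustrates that understanding and is not a confirmed description of what genderComputer does: `resolve_gender`, the `gender_by_country` layout, and all values are hypothetical.

```python
def resolve_gender(name, gender_by_country, country_stats):
    """Pick the gender backed by the largest total country share.

    gender_by_country maps country -> {name: gender} (hypothetical layout).
    Countries missing from country_stats get weight 0.0 instead of raising KeyError.
    """
    weights = {}
    for country, names in gender_by_country.items():
        gender = names.get(name)
        if gender is None:
            continue  # this country's list does not contain the name
        share = country_stats.get(country, 0.0)
        weights[gender] = weights.get(gender, 0.0) + share
    if not weights:
        return None  # name not found in any country list
    return max(weights, key=weights.get)


# Made-up data: "Andrea" is listed as male in Italy and female in Germany.
gender_by_country = {
    "Italy": {"Andrea": "male"},
    "Germany": {"Andrea": "female"},
}
country_stats = {"Italy": 0.25, "Germany": 0.75}
print(resolve_gender("Andrea", gender_by_country, country_stats))  # female
```

With these made-up shares, the country with more users decides the ambiguous name, which is the behaviour the question describes.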