Re-Format Classification for Discrete Features
As an example of the current implementation, if we are converting the "Purpose of Loan" data point into a number that SGD can interpret, we currently use a dictionary with the following setup:
{"":-1, "credit_card":0, "car":1, "small_business":2, "other":3, "wedding":4, "debt_consolidation":5, "home_improvement":6, "major_purchase":7, "medical":8, "moving":9, "vacation":10, "house":11, "renewable_energy":12, "educational":13}
The purpose of the loan is read from the data point, and is then converted via the above dictionary into a numerical value.
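A minimal sketch of the lookup described above, for reference. The mapping is copied from this issue; the names `PURPOSE_TO_INT` and `encode_purpose` are hypothetical and may not match the actual code.

```python
# Mapping copied from the issue text. The names used here are
# illustrative assumptions, not the project's real identifiers.
PURPOSE_TO_INT = {
    "": -1, "credit_card": 0, "car": 1, "small_business": 2, "other": 3,
    "wedding": 4, "debt_consolidation": 5, "home_improvement": 6,
    "major_purchase": 7, "medical": 8, "moving": 9, "vacation": 10,
    "house": 11, "renewable_energy": 12, "educational": 13,
}


def encode_purpose(purpose):
    """Convert a loan-purpose string into its integer code."""
    return PURPOSE_TO_INT[purpose]
```

With this encoding, `encode_purpose("vacation")` is ten times `encode_purpose("car")`, even though nothing about the two purposes justifies that ratio.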
The problem with this system is that the integer codes impose an arbitrary ordering: there is no reason to think that loan purposes lie on a linear scale (for example, that "car" sits between "credit_card" and "small_business"). A linear classifier treats the code as a magnitude, so our current conversion from loan purpose to a numerical value makes it difficult for the classifier to learn anything meaningful from this feature.
A new implementation would replace this one feature with a 14-element vector, so that 14 binary features take its place in our feature vector. Each value would be a binary flag indicating whether or not that specific loan had the purpose corresponding to that index in the vector.
For example, the vector may be formatted with the following index correspondences:
[credit_card, car, small_business, other, wedding, debt_consolidation, home_improvement, major_purchase, medical, moving, vacation, house, renewable_energy, educational]
In this case, a vector of the form [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] would indicate that the borrower intended on using their loan for a small business.
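The proposed encoding could be sketched roughly as follows. The index order is the one listed above; the names `PURPOSES` and `one_hot_purpose` are hypothetical, and mapping an empty or unrecognized purpose to an all-zero vector is an assumption, since this issue does not say how the `""` case should be handled.

```python
# Index order copied from the issue. Names are illustrative assumptions.
PURPOSES = [
    "credit_card", "car", "small_business", "other", "wedding",
    "debt_consolidation", "home_improvement", "major_purchase", "medical",
    "moving", "vacation", "house", "renewable_energy", "educational",
]


def one_hot_purpose(purpose):
    """Return a 14-element binary vector with a 1 at the purpose's index.

    Assumption: an empty or unknown purpose yields all zeros.
    """
    vector = [0] * len(PURPOSES)
    if purpose in PURPOSES:
        vector[PURPOSES.index(purpose)] = 1
    return vector


# The 14 values would then be spliced into the full feature vector
# in place of the single integer-coded feature.
small_business_vector = one_hot_purpose("small_business")
```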
Because the classifier learns a separate weight for each purpose, this implementation does not assume any ordering among the purposes and so is not susceptible to the problem described above.