Python code for applying missing data imputation methods.

Imputation Methods for Missing Data

This is a basic Python program that reads a dataset, finds missing data, and applies imputation methods to recover the data with as little error as possible.

About

This code was written mainly for a specific data set, and it takes a deliberately simple route to keep the implementation as short as possible. Since the debug data set was not very suitable for this kind of code, some hard-coding was necessary.

Initialization needs only the file name and the separator used in the file. Since the debug file was not readable with csv-reader functions, the program reads the file as a string and splits it with the given separator. Below are the settings needed for setup.

sep = ','  # Separator
fileName = "kddn"  # File name
fileNameLoss = fileName+".5loss"  # File name with lost data (Used 5loss because my data was missing 5%)
createOutputFile = True
calculateMSE = True

File import is done with Python's with open statement. The program reads the file line by line and imports the values into a list. If the data contains strings or anything else that can't be converted to float, the program assigns each such value a numerical id to keep calculations simple. The list is then converted into a NumPy array to make calculations faster. While importing, the program also records the indexes of the missing values and generates a non-missing version of the imported file (rows with missing data are skipped), which makes later calculations easier. A minimal sketch of this import routine follows.
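
The sketch below is illustrative rather than the repository's exact code: the read_dataset name and the 0.0 placeholder for missing cells are assumptions, and it relies on the isfloat and give_id helpers described under Functions Used below.

import numpy as np

def read_dataset(fileName, sep):
    imported, missing = [], []
    with open(fileName) as f:
        for i, line in enumerate(f):
            row = []
            for j, cell in enumerate(line.strip().split(sep)):
                if cell == '':  # Empty cell: remember its index, insert a placeholder
                    missing.append((i, j))
                    row.append(0.0)
                elif isfloat(cell):  # Numeric value: convert directly
                    row.append(float(cell))
                else:  # String value: replace with a numerical id
                    row.append(give_id(cell))
            imported.append(row)
    # Rows without any missing cell form the non-missing copy
    missRows = {i for i, _ in missing}
    importedNM = [r for i, r in enumerate(imported) if i not in missRows]
    return np.array(imported), missing, np.array(importedNM)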

After importing, there are four imputation methods available in this code, described under Imputation Methods below.

The program loops over every element of missing with:

for idx, v in enumerate(missing):
    i, j = v  # Gets the row and column index of the missing element

and imputes each element with one of the methods below. After every missing value has been imputed, the program calculates the mean squared error and prints it out, then starts writing the output file. Put together, the main flow might look like the sketch below.
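
In this sketch the impute_least_squares and write_output names are hypothetical stand-ins, not functions from the repository:

# Impute every missing cell, then report the error and write the result.
for idx, v in enumerate(missing):
    i, j = v
    impute_least_squares(i, j)  # or whichever of the four methods is selected

if calculateMSE:
    print("Error:", mse())
if createOutputFile:
    write_output(imported)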

Functions Used

  • elapsedStr(): Function that calculates the elapsed time and returns it as a string. Needs the timeit module and the global start time tT initialized first.
t = abs(tT - timeit.default_timer())  # Seconds elapsed since tT was set
h = int(t / 3600)  # Whole hours
m = int((t - 3600 * h) / 60)  # Whole minutes
s = round((t - 3600 * h) - 60 * m)  # Remaining seconds
  • isfloat(s): Function that checks whether a value can be cast to float. Returns True if it can.
try:
    float(s)
    return True
except (TypeError, ValueError):  # Not castable to float
    return False
  • give_id(v): Function that assigns ids to strings, which makes numerical calculations easier. Needs the global strID counter and the strings dictionary (see the usage sketch after this list).
if v in strings:
    return strings[v]  # Already registered: return the existing id
else:
    strings[v] = strID  # New string: register it under the next free id
    strID += 1
    return strID - 1
  • get_id(v): Function that returns the string for a given id. Needs the global strings dictionary.
v = round(v)  # Ids are stored as integers
return next((st for st, k in strings.items() if k == v), None)
  • mse(): Function that calculates the mean squared error of the imputation; the code returns the root of the averaged error, where miss is the number of missing values.
total = 0.0
for _, v in enumerate(missing):
    i, j = v
    x = imported[i][j]  # Imputed value
    y = original[i][j]  # Original value
    ...
    total += abs(x - y)  # Adds everything to the grand total
return math.sqrt(total/miss)  # Returns the root of the average
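
As a quick illustration of the id helpers, here is a hypothetical round trip (assuming give_id and get_id are defined as above, with the appropriate global declarations):

strings = {}  # Global string -> id map
strID = 0  # Global counter for the next free id

print(give_id("tcp"))  # 0 (new string, gets a fresh id)
print(give_id("udp"))  # 1
print(give_id("tcp"))  # 0 (already registered)
print(get_id(0))  # tcp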

Imputation Methods

For each code example below, imported is the data set and i, j is the index of the missing value that was found.

Least Squares Data Imputation

This method imputes the missing value with the least squares formula and rewrites the data.

B = np.dot(np.dot(np.linalg.pinv(np.dot(nonZeroT, nonZero)), nonZeroT), tagSet)  # ß'=(Xᵀ.X)⁺.Xᵀ.y
...
sumB = sum([b*imported[i][idx] for idx, b in enumerate(B) if idx != j])  # Dot product of B and the row, skipping column j
imported[i][j] = (tagSet[i] - sumB) / B[j]  # Then solves x for ß'[j].x + sumB = y[i]
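
A self-contained toy version of the same idea (the data and names here are illustrative, not from the repository):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 4))  # Complete rows (stands in for nonZero)
y = X @ np.array([1.0, 2.0, 3.0, 4.0])  # Tag column (stands in for tagSet)

B = np.linalg.pinv(X.T @ X) @ X.T @ y  # ß' = (Xᵀ.X)⁺.Xᵀ.y

i, j = 2, 1  # Pretend X[i][j] is missing
sumB = sum(b * X[i][idx] for idx, b in enumerate(B) if idx != j)
X[i][j] = (y[i] - sumB) / B[j]  # Recovers the original value exactly here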

Naive Bayes Imputation

This method uses Naive Bayes-style frequency counting in tandem with the tags: it imputes the most frequent element in the missing value's column among the rows that share the same tag.

tagMiss = tagList[i]  # Missing data's tag
currentColumn = [r[j] for k, r in enumerate(importedNM) if tagListNM[k] == tagMiss]  # Gets the whole column, restricted to rows with a matching tag
imported[i][j] = Counter(currentColumn).most_common(1)[0][0]  # Imputes the most common one
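
For example, with toy data (illustrative, not from the repository):

from collections import Counter

importedNM = [[1, 5], [2, 5], [1, 7], [1, 5]]  # Complete rows
tagListNM = ['a', 'b', 'a', 'a']  # Tags of the complete rows
tagList = ['a', 'b', 'a', 'a', 'a']  # Tags of all rows
i, j = 4, 1  # Row 4 is missing column 1

tagMiss = tagList[i]  # 'a'
currentColumn = [r[j] for k, r in enumerate(importedNM) if tagListNM[k] == tagMiss]  # [5, 7, 5]
print(Counter(currentColumn).most_common(1)[0][0])  # 5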

Hot Deck Imputation

This most common method computes the Euclidean distance from each row to the missing value's row and uses a kHD value (default: 20) to decide how many of the closest rows should contribute; the missing value is then the most common element among them. In other words, it imputes the geometrically closest rows' most common data.

euclidean.sort(key=lambda l: l[0])  # Sorts the [distance, index] list so the closest rows come first
lst = [imported[euclidean[r][1]][j] for r in range(kHD)]  # Gets column j of the first kHD of those rows
imported[i][j] = Counter(lst).most_common(1)[0][0]  # Imputes the most common element from the above list
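
A self-contained sketch of the whole procedure (the hot_deck_impute name is hypothetical, and rows are assumed to be complete):

import math
from collections import Counter

def hot_deck_impute(imported, i, j, kHD=20):
    euclidean = []
    for k, row in enumerate(imported):
        if k == i:
            continue  # Skip the row being imputed
        # Distance to row i, ignoring the missing column j
        d = math.sqrt(sum((row[c] - imported[i][c]) ** 2 for c in range(len(row)) if c != j))
        euclidean.append([d, k])
    euclidean.sort(key=lambda l: l[0])  # Closest rows first
    lst = [imported[e[1]][j] for e in euclidean[:kHD]]  # Column j of the kHD nearest rows
    return Counter(lst).most_common(1)[0][0]  # Most common value among them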

Imputation with Most Frequent Element

This impractical method is just there to add some spice and to allow comparison against the other methods' results. It imputes the most common element of that column, regardless of anything else. Fast, but highly unreliable.

currentColumn = [r[j] for r in importedNM]
imported[i][j] = Counter(currentColumn).most_common(1)[0][0]

Bugs

Bug reports and code recommendations are always appreciated.

There are also lots of TODOs in the code; I'll get to fixing them later.