/IOT-Geolocalisation-ML

IOT : Geolocalisation with ML - Project carried out with SigFox for the Post-Master's Degree at Télécom Paris (MS BGD 2020)


Geolocalisation Data Challenge

Objective

A company that produces connected devices needs to locate them. GPS would work, but predicting the position from data is cheaper. The prediction is based on message reception information.

Context

This project was part of the curriculum of our Post-Master's Degree in Big Data at Télécom Paris. Full curriculum and details on this degree here.

Team Members

Name Github
Gaël Savouré savoga
Thomas Rivière t-riviere
Hiroto Yamakakawa yamhiroto

Input

The input consists of several pieces of information about each message:

[input columns image]

Note: rssi (Received Signal Strength Indicator) is an estimation of the signal power level received by the device.

Note: we won't use the columns nseq and time_ux.

Output

The output is the location (latitude and longitude) of each device:

[output columns image]

Materials

For this challenge, we have:

  • Training samples and their corresponding predictions
  • Test samples for which we must produce the associated predictions

Project structure

Load data

Three dataframes are used:

  • training samples
  • their corresponding predictions
  • test samples

Data exploration

Map

We first display the devices (from the message data) and the bases on a map. We notice that some bases lie very far from the devices; these are outliers.

We then display the same map without outliers. Arbitrary bounds were chosen: we keep latitudes between 43 and 65 and longitudes between -104 and -65.
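The bounding-box filter above can be sketched with pandas; the column names bs_lat and bs_lng are assumptions for illustration, not necessarily those of the actual dataset:

```python
import pandas as pd

# Toy base-station coordinates (column names are hypothetical).
bases = pd.DataFrame({
    "bs_lat": [44.0, 60.0, 12.0, 50.0],
    "bs_lng": [-70.0, -100.0, 3.0, -80.0],
})

# Keep only bases inside the arbitrary bounding box:
# latitude in [43, 65], longitude in [-104, -65].
mask = bases["bs_lat"].between(43, 65) & bases["bs_lng"].between(-104, -65)
bases_no_outliers = bases[mask]
print(len(bases_no_outliers))  # 3
```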

Distribution

Distributions are plotted to get a view of how distances are spread. The function dist__calculation computes the distance from latitude and longitude.
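A plausible sketch of such a distance computation is the haversine formula (the actual dist__calculation implementation may differ):

```python
import numpy as np

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

# One degree of longitude at the equator is about 111 km.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0)))  # 111
```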

Preprocessing

We remove outliers from the train set. After several tests, removing devices farther than 10 km from the base gave much better results. Besides, it is reasonable to keep only devices with small distances to the base.

We then build the feature matrix. Rows are the message IDs; columns are all the bases from the train set, each repeated three times: once for rssi, once for base latitude and once for base longitude.

The associated target values need the same format, so we group latitude and longitude by message ID (messid), using the mean.
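The two steps above can be sketched with pandas; the column names (messid, bsid, rssi, bs_lat, bs_lng) are assumptions for illustration:

```python
import pandas as pd

# Toy messages: one row per (message, base) reception.
df = pd.DataFrame({
    "messid": ["m1", "m1", "m2"],
    "bsid":   [101, 102, 101],
    "rssi":   [-120.0, -115.0, -118.0],
    "bs_lat": [45.1, 45.3, 45.1],
    "bs_lng": [-73.5, -73.6, -73.5],
})

# One row per message, one column per (feature, base) pair:
# the rssi / base-latitude / base-longitude columns repeat per base.
X = df.pivot_table(index="messid", columns="bsid",
                   values=["rssi", "bs_lat", "bs_lng"])
print(X.shape)  # (2, 6)

# Targets grouped by message id, using the mean.
pos = pd.DataFrame({"messid": ["m1", "m1", "m2"],
                    "lat": [45.2, 45.2, 45.0],
                    "lng": [-73.55, -73.55, -73.5]})
y = pos.groupby("messid")[["lat", "lng"]].mean()
```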

Prediction

Linear regression

"Cross validation"

We use the scikit-learn function cross_val_predict to predict latitude and longitude. This function takes the number of folds cv as a parameter. From the documentation: "For each element in the input, the prediction that was obtained for that element when it was in the test set". In other words, it is not a scoring function; it simply returns out-of-fold predictions as the folds rotate.

Suppose we have the following training set:

[1] X1 -> Y1

[2] X2 -> Y2

[3] X3 -> Y3

[4] X4 -> Y4

We now perform cross_val_predict(cv=2)

[1] X1 -> Y1 (training)

[2] X2 -> Y2 (training)

[3] X3 -> Y3' (prediction)

[4] X4 -> Y4' (prediction)

[1] X1 -> Y1' (prediction)

[2] X2 -> Y2' (prediction)

[3] X3 -> Y3 (training)

[4] X4 -> Y4 (training)

At the end we have [Y1', Y2', Y3', Y4']
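A minimal runnable sketch of this, on synthetic data (the real notebook predicts latitude and longitude from the feature matrix):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Each y_pred[i] was produced by the fold in which sample i was held out.
y_pred = cross_val_predict(LinearRegression(), X, y, cv=2)
print(y_pred.shape)  # (100,)
```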

It appears that linear regression produces outliers (latitude < -90 or > 90). We therefore remove this corrupted data:

indexes_to_remove = np.where((y_pred_lat > 90) | (y_pred_lat < -90))[0]

Performance measure

Next we plot the cumulative error distribution. This is simply the cumulative sum of errors divided by the total sum of errors:

plt.plot(base[:-1]/1000, cumulative / float(np.sum(values)) * 100.0, c='blue')

We look at the error at the 80th percentile, which is around 7.5 km on the figure.
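Reading a percentile off the curve can also be done numerically; the error values below are made up for illustration:

```python
import numpy as np

# Hypothetical per-message distance errors in km.
errors_km = np.array([0.5, 1.2, 3.0, 7.0, 8.0, 2.5, 6.0, 9.5, 4.0, 5.5])

# Error at the 80th percentile of the error distribution.
p80 = np.percentile(errors_km, 80)
print(p80)
```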

Random forests

Cross validation Leave One Device Out

The Leave One Device Out strategy splits the train set by device, so that training and prediction are performed on distinct devices.
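This splitting strategy can be sketched with scikit-learn's LeaveOneGroupOut, using the device id as the group (the device ids below are made up):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)
# Hypothetical device id per message: a split never mixes a device
# between the train and test sides.
groups = np.array(["d1", "d1", "d2", "d2", "d3", "d3"])

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
print(logo.get_n_splits(groups=groups))  # 3, one split per device
```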

Performance measure (2)

The performance measure is the same as for linear regression, but we repeat it multiple times (once per device) and take the mean as the final score.

We note that random forests gave significantly better results than linear regression.

Postprocessing

The training bases and the test bases differ, so we can't build the same feature matrix as above (the columns would be different). We decided to keep the same structure as in the training phase, i.e. the same bases. Our rationale is that we can't predict using new bases: we never trained on them and don't know their signal reliability. Note that after building the structure, the double loop can take some time.
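Aligning the test columns to the training bases can be sketched with a pandas reindex (base identifiers here are made up; the notebook uses a double loop instead):

```python
import pandas as pd

# Bases seen during training.
train_bases = [101, 102, 103]

# Raw test features: base 104 was never seen in training.
X_test_raw = pd.DataFrame({101: [-120.0], 104: [-110.0]}, index=["m9"])

# Keep only the training bases: unseen bases are dropped, missing
# training bases become NaN, so the columns match the training matrix.
X_test = X_test_raw.reindex(columns=train_bases)
print(list(X_test.columns))  # [101, 102, 103]
```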

Limits and Improvements

We arbitrarily removed outliers that hurt our models' scores. However, outliers may also be present in the test sample and degrade the predictions.

We could try several other models, in particular XGBoost, which usually gives good results.