covartech/PRT

Making predictions with PRT

Mauro82 opened this issue · 8 comments

Hello everybody and many many thanks to all those people who have developed the very useful Pattern Recognition Toolbox!
I have just started playing with PRT and I have a question about making predictions.
In more detail: once I have loaded my dataset, preprocessed the data, created and trained the classifier, and performed cross-validation, which commands should I use to make a prediction (say, for a classification problem with 2 classes) for a new vector x that does not belong to the training set?
Thank you very much and best regards,

Mauro

Hi Mauro,

So you have developed an algorithm and evaluated it using cross-validation and you need to run the existing algorithm on new data?

testOutput = algorithm.run(newTestingDataSet)

The output of the algorithm will be contained in testOutput.X
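For a brand-new, unlabeled observation, one way to do this (a sketch, with a hypothetical matrix xNew standing in for your new data) is to wrap it in a prtDataSetClass first; no targets are needed at run time:

```matlab
% Sketch: scoring new, unlabeled observations.
% xNew is a hypothetical nObservations-by-nFeatures matrix of new data.
xNew = [0.1 0.5 0.4];                       % one new observation
newTestingDataSet = prtDataSetClass(xNew);  % no labels required
testOutput = algorithm.run(newTestingDataSet);
testOutput.X                                % decision statistics per observation
```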

Does that answer your question?

Kenny

Hello Kenny,

thank you very much for your very fast and kind response.
Actually I have two problems.
My starting point is a training set (X_Train, t_Train) and a test set (X_Test, t_Test), where the t vectors are the target variables.
I preprocessed both of them with the following commands:

For the training set:
TrainingSet = prtDataSetClass(X_Train, t_Train);
Pca_Train = prtPreProcPca;
Pca_Train = Pca_Train.train(TrainingSet);
dsPca_Train = Pca_Train.run(TrainingSet);

And for the test set:
TestSet = prtDataSetClass(X_Test, t_Test);
Pca_Test = prtPreProcPca;
Pca_Test = Pca_Test.train(TestSet);
dsPca_Test = Pca_Test.run(TestSet);

Then I created a Relevance Vector Machine classifier, trained it on the training set, and tested it on the test set with the following commands:

rvmClassifier = prtClassRvm;
rvmClassifier = rvmClassifier.train(dsPca_Train);
yOutTest = rvmClassifier.run(dsPca_Test);
[PF, PD, THRESHOLDS, AUC] = prtScoreRoc(yOutTest);

I plotted the ROC curves but, unfortunately, they are not satisfactory at all.
Then I joined the training and test sets into a single dataset and performed cross-validation on it:

X_DS = cat(1, X_Train, X_Test);
t_DS = cat(1, t_Train, t_Test);
DataSet = prtDataSetClass(X_DS, t_DS);
Pca_DS = prtPreProcPca;
Pca_DS = Pca_DS.train(DataSet);
dsPca_DS = Pca_DS.run(DataSet);
yOutCross = rvmClassifier.kfolds(dsPca_DS, 10);
[PF, PD, THRESHOLDS, AUC] = prtScoreRoc(yOutCross);

This time the ROC curves are improved.
What I am wondering is this:

  1. Is my approach correct? And
  2. if I have a new vector x not belonging to the dataset, and WITHOUT knowing its target variable, how can I predict its target variable t?

Thank you very much again and best regards,

Mauro

Mauro,

There is nothing wrong with the approach. The issue is likely that your algorithm needs to be tweaked (number of principal components, etc.) or there is a mismatch between your training and testing data. You will have to explore those possibilities.

The predicted target labels are obtained from the output of the classifier. Below is a modified version of your code that I think may help explain things. The primary change is that the data sets have been changed to example datasets from the PRT; note in particular the final plot.

I hope that helps.
Kenny

TrainingSet = prtDataGenUnimodal;
TestSet = prtDataGenUnimodal;

Pca_Train = prtPreProcPca;
Pca_Train = Pca_Train.train(TrainingSet);
dsPca_Train = Pca_Train.run(TrainingSet);

Pca_Test = prtPreProcPca;
Pca_Test = Pca_Test.train(TestSet);
dsPca_Test = Pca_Test.run(TestSet);

rvmClassifier = prtClassRvm;
rvmClassifier = rvmClassifier.train(dsPca_Train);
yOutTest = rvmClassifier.run(dsPca_Test);
[PF, PD, THRESHOLDS, AUC] = prtScoreRoc(yOutTest);

figure
plot(PF,PD);

figure
plot(1:TestSet.nObservations, TestSet.Y, 1:TestSet.nObservations, yOutTest.X)

I just realized another error in your code that could be causing problems. You are re-estimating the PCA vectors from the testing data. This can cause a large mismatch between the training and testing data after the two separate applications of PCA. The solution is to build a prtAlgorithm out of PCA and the RVM, then train and run that algorithm. See the revised code below.

Kenny

TrainingSet = prtDataGenUnimodal;
TestSet = prtDataGenUnimodal;

algo = prtPreProcPca('nComponents',2) + prtClassRvm;
trainedAlgo = algo.train(TrainingSet);

yOutTest = trainedAlgo.run(TestSet);
[PF, PD, THRESHOLDS, AUC] = prtScoreRoc(yOutTest);

figure
plot(PF,PD);

figure
plot(1:TestSet.nObservations, TestSet.Y, 1:TestSet.nObservations, yOutTest.X)

Hi Kenny, and thank you very much again for your precious advice and for the time you dedicated to answering my questions.
I have carefully read your posts and I have tried to execute your code.
The ROC curves now are very close to 1, but I have some questions again.
My first question is about the prtDataGenUnimodal command.
In fact, reading the online documentation, it seems that this command randomly generates a dataset. In my situation, however, the training set (and also the test set) is stored in two mat files: one with the x vectors (where each row is a measurement and each column is a feature) and one with a single column containing the associated target variables.
Once I load all these data into matrix X_Train and vector t_Train, I use the command

TrainingSet = prtDataSetClass(X_Train, t_Train);

My question is: if I replace this command with TrainingSet = prtDataGenUnimodal; (and with
TestSet = prtDataGenUnimodal; for the test set), how will I be able to train the classifier on my real training set coming from those mat files?

My second question is about the nComponents setting.
What exactly does this option do?
Does it reduce the number of features or the quality of the data?
In fact, my data has only 3 features.

My last question is again on the predictions.
Looking at your code, you compare the known targets of the test set with the values predicted by the classifier.
But what can I do if I have a new vector x, not belonging to the dataset, and I want to know its predicted target value?
Let's say, for instance, that the new vector is

x = [0.1, 0.5, 0.4];
I think the predicted value can be obtained by executing something like

t_Predicted = trainedAlgo.run(x);
t_Predicted.X

Once again, thank you very much for your very precious help.

Mauro

Hi Mauro,

A few things -

  1. Re: prtDataGenUnimodal - those were used in Kenny's code as examples, to show that the code and algorithms work as expected, since we don't have your data. You can use them to check that your code works, but for real results you should run on your own data sets.

  2. nComponents controls how many components to estimate in the PCA pre-processing. See the help for PCA, or the wiki article on PCA, for more information about how the number of components can affect processing. If your data is 3-dimensional, you can use 1-, 2-, or 3-dimensional PCA. The benefits of one or another depend on your specific data set.

  3. Re: predictions. If you have a data set, but only have X and no Y data, you can make a prtDataSet like so:

dataNoLabels = prtDataSetClass(X); % no labels
yOutNoLabels = algo.run(dataNoLabels); % run on this data

So your code might look like:

TrainingSet = prtDataSetClass(X_Train, t_Train);
TestSet = prtDataSetClass(X_Test, t_Test);
ValidationSet = prtDataSetClass(X_Test); %No labels

algo = prtPreProcPca + prtClassRvm; %combine PCA & RVM
algo = algo.train(TrainingSet);
yOutTest = algo.run(TestSet);
yOutValidation = algo.run(ValidationSet);

[pf,pd,thresholds,auc] = prtScoreRoc(yOutTest);

ValidationTargetEstimates = yOutValidation.X; %do whatever you want with these.
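To see how the nComponents choice in item 2 plays out in practice, here is a rough sketch (using the PRT's built-in example data, not your mat files) that sweeps the number of components and compares the cross-validated AUCs:

```matlab
% Sketch: sweeping the number of PCA components (PRT example data).
ds = prtDataGenUnimodal;  % built-in 2-feature, 2-class example dataset
for nComp = 1:2
    algo = prtPreProcPca('nComponents', nComp) + prtClassRvm;
    yOut = algo.kfolds(ds, 10);           % 10-fold cross-validation
    [~, ~, ~, auc] = prtScoreRoc(yOut);   % keep only the AUC
    fprintf('nComponents = %d, AUC = %.3f\n', nComp, auc);
end
```

With your 3-feature data the loop would run over nComp = 1:3; whichever setting gives the best cross-validated AUC is a reasonable choice.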

Hope this helps.

-Pete

Dear Kenny and Pete,
thank you very much for your enormous help!
Honestly I am very grateful to both of you.
Now I have got it!
So, with the command algo = prtPreProcPca + prtClassRvm, everything given to the algorithm is first preprocessed by PCA.
I have tried the code and I have also added the following commands for the cross validation:

X_DS = cat(1, X_Train, X_Test);
t_DS = cat(1, t_Train, t_Test);
DataSet = prtDataSetClass(X_DS, t_DS);
yOutCross = algo.kfolds(DataSet, 10);

If I am not wrong it should work correctly.
I have overlapped the ROC curves and the cross validation results are a little bit better than the single test.
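For reference, one way to overlay the two ROC curves on a single figure (assuming the yOutTest and yOutCross variables from the snippets above) is:

```matlab
% Sketch: overlaying the single-split and cross-validated ROC curves
[pfTest, pdTest]   = prtScoreRoc(yOutTest);
[pfCross, pdCross] = prtScoreRoc(yOutCross);
figure; hold on
plot(pfTest, pdTest);
plot(pfCross, pdCross);
legend('Single train/test split', '10-fold cross-validation');
xlabel('Probability of false alarm'); ylabel('Probability of detection');
```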

I still have one more question to ask you.
Once a classifier has been built, trained, and tested, and it has been shown to provide good predictions, is there a way to read the values of its internal parameters?

But, again, thank you so so much for your very precious help!

Mauro

Hi Mauro,

Yes - for simple classifiers, you can look at the parameters directly; since you've created an "algorithm" containing both PCA and RVM parts, you need to look in

algo.actionCell{1} % PCA
and
algo.actionCell{2} % RVM

For example:

ds = prtDataGenUnimodal;
algo = prtPreProcPca + prtClassRvm; %combine PCA & RVM
algo = algo.train(ds);

algo.actionCell{1} %Display the parts of the PCA
ans =
prtPreProcPca with properties:
name: 'Principal Component Analysis'
nameAbbreviation: 'PCA'
nComponents: 2.00
means: [0.45 0.46]
pcaVectors: [2x2 double]
trainingTotalVariance: 6.16
totalVariance: 6.16
totalVarianceCumulative: [5.48 6.16]
totalPercentVarianceCumulative: [0.89 1.00]
isSupervised: 0
isCrossValidateValid: 1
verboseStorage: 1
showProgressBar: 1
isTrained: 1
dataSetSummary: [1x1 struct]
dataSet: [1x1 prtDataSetClass]
userData: [1x1 struct]

algo.actionCell{2} % the trained RVM
ans =
prtClassRvm with properties:
name: 'Relevance Vector Machine'
nameAbbreviation: 'RVM'
isNativeMary: 0
kernels: [1x1 prtKernelSet]
verboseText: 0
verbosePlot: 0
learningMaxIterations: 1000.00
learningConvergedTolerance: 0.00
learningRelevantTolerance: 0.00
beta: [401x1 double]
sparseBeta: [5x1 double]
sparseKernels: [1x1 prtKernelSet]
learningConverged: 1
twoClassParadigm: 'binary'
internalDecider: []
isSupervised: 1
isCrossValidateValid: 1
verboseStorage: 1
showProgressBar: 1
isTrained: 1
dataSetSummary: [1x1 struct]
dataSet: [1x1 prtDataSetClass]
userData: [1x1 struct]

I hope this helps!

We'll go ahead and close this issue if that's OK.

-Pete