import csv for regression

Question

axiqia opened this issue 6 years ago · 6 comments

I want to use the Random Forest algorithm to solve a regression problem, but the only tutorial is a classification example. So I tried the code below as a test:

 RegressionDataset data;  
 importCSV(data, "/data/C.csv", LAST_COLUMN, ' ');

When I ran it, I got this error:

terminate called after throwing an instance of 'shark::Exception'
  what():  [importCSVReaderSingleValues] problems parsing file (2)
[1]    19082 abort (core dumped)  ./ExampleProject

I have read the other regression tutorials, and I found that all of them use the importCSV below to load the labels and the data separately.

void importCSV(
	Data<T>& data,
	std::string fn,
	char separator = ',',
	char comment = '#',
	std::size_t maximumBatchSize = Data<T>::DefaultBatchSize,
	std::size_t titleLines = 0
)

What should I do to solve the problem? Is there something I missed?

Shark version 3.1.0
Thank you.

Answer 1 · 2018-10-20T17:15:41.000Z

The error means that the file cannot be parsed using the options you supplied (e.g. you specify that entries are separated by a space, not ','). Without seeing the actual file I have no way to tell you what is wrong.

Answer 2 · 2018-10-21T02:56:57.000Z

Thank you for your quick reply, and sorry for not giving the details. I used the sample data set C.csv for the test. Below are the first few lines.

14.1 0.7 7.4 5.4 5.4 4 1.3 5.4 8.7 6.7 3.4 1.3 4 1.3 3.4 8.7 6.7 8.1 1.3 2.7 0
12.5 0.7 8.3 2.1 4.2 6.9 1.4 5.6 10.4 9 2.1 6.9 1.4 4.2 2.8 4.2 4.2 9 1.4 2.8 0
19 0.7 4.9 5.6 7 11.3 1.4 1.4 7 6.3 3.5 4.2 2.1 3.5 0.7 8.5 4.2 4.9 2.8 0.7 0
20.4 0.7 4.1 2 2.7 13.6 3.4 5.4 8.2 7.5 3.4 2 2 4.8 2 6.8 0.7 6.8 1.4 2 0
5 11 0 3.9 9.1 3.9 7.1 7.1 5.8 12.3 12.3 1.9 1.3 2.6 3.2 2.6 3.9 3.2 5.2 1.3 1.9 0

As you can see, the entries are separated by spaces, and the separator I passed is ' ', so this confused me.

Answer 3 · 2018-10-21T07:39:06.000Z

Thank you for your hint. I found out where the mistake was. I stepped through my code into the importCSV function and saw that the separator parameter was always set to ','. Then I found that importCSV has three overloads:

/// \brief Import a Dataset from a csv file
void importCSV(
	Data<T>& data,
	std::string fn,
	char separator = ',',
	char comment = '#',
	std::size_t maximumBatchSize = Data<T>::DefaultBatchSize,
	std::size_t titleLines = 0
)

/// \brief Import a labeled Dataset from a csv file
template<class T>
void importCSV(
	LabeledData<blas::vector<T>, unsigned int>& data,
	std::string fn,
	LabelPosition lp,
	char separator = ',',
	char comment = '#',
	std::size_t maximumBatchSize = LabeledData<RealVector, unsigned int>::DefaultBatchSize
)

/// \brief Import a labeled Dataset from a csv file
template<class T>
void importCSV(
	LabeledData<blas::vector<T>, blas::vector<T> >& data,
	std::string fn,
	LabelPosition lp,
	std::size_t numberOfOutputs = 1,
	char separator = ',',
	char comment = '#',
	std::size_t maximumBatchSize = LabeledData<RealVector, RealVector>::DefaultBatchSize
)

I realized I had to specify the numberOfOutputs parameter. Because I passed ' ' as the fourth argument, it was taken as numberOfOutputs and the separator kept its default ','. The brief descriptions don't explain the difference between the last two functions at all.
Why not design a unified interface? Also, a regression example in the documentation would help new users like me a lot.
Thank you again.
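
For readers who hit the same problem, here is a minimal sketch of why the original call compiled but used the wrong separator. It does not use Shark at all: `importCSVSketch`, `BoundArgs`, and the local `LabelPosition` enum are hypothetical stand-ins that only copy the parameter order of the regression overload quoted above, and the function parses nothing. It shows that a `char` passed in the fourth position converts silently to `std::size_t`, so it lands in `numberOfOutputs`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for Shark's LabelPosition enum (assumption, not the real type).
enum LabelPosition { FIRST_COLUMN, LAST_COLUMN };

// Records which value ended up in which parameter.
struct BoundArgs {
    std::size_t numberOfOutputs;
    char separator;
};

// Stand-in with the same parameter order as the regression overload quoted
// in the thread. It parses nothing; it only reports how arguments were bound.
BoundArgs importCSVSketch(
    const std::string& /*fn*/,
    LabelPosition /*lp*/,
    std::size_t numberOfOutputs = 1,
    char separator = ',',
    char /*comment*/ = '#'
) {
    return BoundArgs{numberOfOutputs, separator};
}

// The call shape from the question: ' ' is the fourth argument.
BoundArgs callFromQuestion() {
    return importCSVSketch("/data/C.csv", LAST_COLUMN, ' ');
}

// The corrected call shape: numberOfOutputs is passed explicitly.
BoundArgs correctedCall() {
    return importCSVSketch("/data/C.csv", LAST_COLUMN, 1, ' ');
}

// Checks both bindings with assertions.
void checkBindings() {
    BoundArgs wrong = callFromQuestion();
    // ' ' was converted to an integer and taken as numberOfOutputs.
    assert(wrong.numberOfOutputs == static_cast<std::size_t>(' '));
    // The separator silently kept its default.
    assert(wrong.separator == ',');

    BoundArgs fixed = correctedCall();
    assert(fixed.numberOfOutputs == 1);
    assert(fixed.separator == ' ');
}
```

With the real library, the corresponding fix would presumably be to pass the number of outputs before the separator, e.g. `importCSV(data, "/data/C.csv", LAST_COLUMN, 1, ' ');`, following the signature quoted above.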

Answer 4 · 2018-10-21T07:44:50.000Z

A unified interface does not make sense.

The first version does not have a label, so it would be confusing to have to specify a label position.
The second version is for class labels. There can only be one column for those, so there is no need for a number of outputs.

The third version is for regression, where we can have vector-valued labels.

We are still working on making the tutorials better; I will try to include that in a future Data section.

Answer 5 · 2018-10-21T07:57:21.000Z

Yes, I now understand the difference among the three versions :). Maybe the comments should be as clear as your explanation.
And an error message like the one below

'shark::Exception'
  what():  [importCSVReaderSingleValues] problems parsing file (2)

is not really helpful to me. Is there a document where users can look up possible causes?
Thank you very much.

Answer 6 · 2018-10-22T08:02:41.000Z

Hi,

there is no such document, unfortunately. We base our parser on Boost.Spirit, and it is a bit tough to get the exact reason out of it. We just check whether the parser could read everything (and that it succeeded with what it read). It is possible to add this, and we would be happy to take a pull request (based on the current 4.1 branch), but we have no time to do it ourselves.
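
Until the parser reports more detail, one possible workaround is to pre-check the file before handing it to importCSV. The sketch below is not part of Shark and does not reproduce its parser. The helper names `countFields`, `firstInconsistentLine`, and `firstInconsistentLineInText` are made up for illustration, and skipping lines that start with the comment character is an assumption about how such files are laid out. It only reports the first line whose field count differs from the first data line, which is one common cause of a parse failure.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts fields on one line. Every occurrence of `separator` is a field
// boundary, so doubled or trailing separators count as extra (empty) fields.
std::size_t countFields(const std::string& line, char separator) {
    std::size_t fields = 1;
    for (char c : line) {
        if (c == separator) ++fields;
    }
    return fields;
}

// Returns the 1-based number of the first data line whose field count differs
// from the first data line, or 0 if all data lines agree.
// Empty lines and lines starting with `comment` are skipped (assumption).
std::size_t firstInconsistentLine(std::istream& in, char separator, char comment = '#') {
    assert(separator != '\n');  // a newline cannot separate fields within a line
    std::string line;
    std::size_t lineNumber = 0;
    std::size_t expected = 0;  // 0 means "no data line seen yet"
    while (std::getline(in, line)) {
        ++lineNumber;
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty() || line[0] == comment) continue;
        std::size_t fields = countFields(line, separator);
        if (expected == 0) {
            expected = fields;
        } else if (fields != expected) {
            return lineNumber;
        }
    }
    return 0;
}

// Convenience wrapper for checking text that is already in memory.
std::size_t firstInconsistentLineInText(const std::string& text, char separator) {
    std::istringstream in(text);
    return firstInconsistentLine(in, separator);
}
```

To check a file, open it with `std::ifstream` and pass the stream to `firstInconsistentLine` together with the separator you intend to give importCSV. A nonzero result is the line to inspect first. A result of 0 only means the field counts agree, not that the file will parse.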