import csv for regression
axiqia opened this issue · 6 comments
I want to use the Random Forest algorithm to solve a regression problem, but the only tutorial is a classification example. So I tried the code below as a test:
RegressionDataset data;
importCSV(data, "/data/C.csv", LAST_COLUMN, ' ');
When I ran it, I got this error:
terminate called after throwing an instance of 'shark::Exception'
what(): [importCSVReaderSingleValues] problems parsing file (2)
[1] 19082 abort (core dumped) ./ExampleProject
I have read the other regression tutorials, and I found that all of them use the importCSV below to load the labels and the data separately.
template<class T>
void importCSV(
Data<T>& data,
std::string fn,
char separator = ',',
char comment = '#',
std::size_t maximumBatchSize = Data<T>::DefaultBatchSize,
std::size_t titleLines = 0
)
What should I do to solve the problem? Is there something I missed?
Shark version 3.1.0
Thank you.
Thank you for your quick reply, and sorry for not giving the details. I used the sample data set C.csv for the test. Below are the first few lines.
14.1 0.7 7.4 5.4 5.4 4 1.3 5.4 8.7 6.7 3.4 1.3 4 1.3 3.4 8.7 6.7 8.1 1.3 2.7 0
12.5 0.7 8.3 2.1 4.2 6.9 1.4 5.6 10.4 9 2.1 6.9 1.4 4.2 2.8 4.2 4.2 9 1.4 2.8 0
19 0.7 4.9 5.6 7 11.3 1.4 1.4 7 6.3 3.5 4.2 2.1 3.5 0.7 8.5 4.2 4.9 2.8 0.7 0
20.4 0.7 4.1 2 2.7 13.6 3.4 5.4 8.2 7.5 3.4 2 2 4.8 2 6.8 0.7 6.8 1.4 2 0
5 11 0 3.9 9.1 3.9 7.1 7.1 5.8 12.3 12.3 1.9 1.3 2.6 3.2 2.6 3.9 3.2 5.2 1.3 1.9 0
As you can see, the entries are separated by spaces, and the separator I passed is ' ', so the error confused me.
The error means that the file cannot be parsed using the options you supplied (e.g. you specify that entries are separated by spaces, not ','). Without seeing the actual file I have no way to tell you what is wrong.
Thank you for the hint. I found the mistake. I stepped through my code line by line and into the importCSV function, and the separator parameter was always set to ','. Then I found that importCSV has three overloads:
/// \brief Import a Dataset from a csv file
template<class T>
void importCSV(
Data<T>& data,
std::string fn,
char separator = ',',
char comment = '#',
std::size_t maximumBatchSize = Data<T>::DefaultBatchSize,
std::size_t titleLines = 0
)
/// \brief Import a labeled Dataset from a csv file
template<class T>
void importCSV(
LabeledData<blas::vector<T>, unsigned int>& data,
std::string fn,
LabelPosition lp,
char separator = ',',
char comment = '#',
std::size_t maximumBatchSize = LabeledData<RealVector, unsigned int>::DefaultBatchSize
)
/// \brief Import a labeled Dataset from a csv file
template<class T>
void importCSV(
LabeledData<blas::vector<T>, blas::vector<T> >& data,
std::string fn,
LabelPosition lp,
std::size_t numberOfOutputs = 1,
char separator = ',',
char comment = '#',
std::size_t maximumBatchSize = LabeledData<RealVector, RealVector>::DefaultBatchSize
)
I realized I had to specify the numberOfOutputs parameter. The brief descriptions don't explain the difference between the last two functions at all.
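A minimal sketch of why the original four-argument call appears to have failed, assuming the regression overload quoted above is the one selected for a RegressionDataset. The function `bindLikeImportCSV`, the struct `BoundArguments`, and the local `LabelPosition` enum are hypothetical stand-ins that only copy the parameter order of that overload; they are not Shark code. The point is that `' '` in the fourth position converts silently to `std::size_t` and becomes `numberOfOutputs`, while `separator` keeps its default `','`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for illustration only; Shark defines its own LabelPosition.
enum LabelPosition { FIRST_COLUMN, LAST_COLUMN };

// Records which value ended up in which parameter.
struct BoundArguments {
    std::size_t numberOfOutputs;
    char separator;
};

// Hypothetical function with the same parameter order as the regression
// overload quoted above (file name, label position, number of outputs,
// separator, comment). It parses nothing.
BoundArguments bindLikeImportCSV(
    std::string const& fn,
    LabelPosition lp,
    std::size_t numberOfOutputs = 1,
    char separator = ',',
    char comment = '#'
) {
    (void)fn; (void)lp; (void)comment;
    return BoundArguments{numberOfOutputs, separator};
}

inline void demonstrateBinding() {
    // Call shape from the original post: ' ' lands in numberOfOutputs.
    BoundArguments wrong = bindLikeImportCSV("/data/C.csv", LAST_COLUMN, ' ');
    assert(wrong.numberOfOutputs == static_cast<std::size_t>(' '));
    assert(wrong.separator == ',');

    // Passing numberOfOutputs explicitly lets ' ' reach the separator.
    BoundArguments fixed = bindLikeImportCSV("/data/C.csv", LAST_COLUMN, 1, ' ');
    assert(fixed.numberOfOutputs == 1);
    assert(fixed.separator == ' ');
}
```

By the same reasoning, the corrected call for the code in the first post would presumably be importCSV(data, "/data/C.csv", LAST_COLUMN, 1, ' ');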
Why not design a unified interface? Also, a regression example in the documentation would help new users like me a lot.
Thank you again.
A unified interface does not make sense.
The first version does not have a label, so it would be confusing to have to specify a label position.
The second version is for class labels. There can only be one column for that, so there is no need for a number of outputs.
The third version is for regression, where we can have vectorial labels.
We are still working on making the tutorials better. I will try to include that in a future Data section.
Yeah, I now understand the difference among the three versions :). Maybe the comments should be as clear as what you said.
Also, error messages like the one below
'shark::Exception'
what(): [importCSVReaderSingleValues] problems parsing file (2)
are not really helpful to me. Is there a document where users can look up possible causes?
Thank you very much.
Hi,
there is no document, unfortunately. We base our parser on boost.spirit and it is a bit tough to get the exact reason out. We just check whether the parser could read everything (and that it succeeded with what it read). It is possible to add this, and we would be happy to take a pull request (based on the current 4.1 branch), but have no time to do it ourselves.
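Until the parser reports more detail, a file can be sanity-checked before it is handed to importCSV. The sketch below is a standalone helper, not part of Shark: `countFields` and `firstInconsistentLine` are hypothetical names, and the rule they apply (every data line should have as many non-empty fields as the first one) is an assumption about what a well-formed file looks like, not a description of what the library checks.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts the non-empty tokens obtained by splitting
// `line` at `separator`. Empty tokens (e.g. from doubled separators) are
// ignored, which is a simplification for space-separated files.
std::size_t countFields(std::string const& line, char separator) {
    std::istringstream stream(line);
    std::string token;
    std::size_t count = 0;
    while (std::getline(stream, token, separator)) {
        if (!token.empty()) ++count;
    }
    return count;
}

// Hypothetical helper: returns the 1-based number of the first line whose
// field count differs from the first data line, or 0 if all lines agree.
// Empty lines and lines starting with `comment` are skipped.
std::size_t firstInconsistentLine(std::istream& in, char separator, char comment = '#') {
    std::string line;
    std::size_t lineNumber = 0;
    std::size_t expectedFields = 0;
    while (std::getline(in, line)) {
        ++lineNumber;
        // Drop a trailing carriage return from Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty() || line[0] == comment) continue;
        std::size_t fields = countFields(line, separator);
        if (expectedFields == 0) {
            expectedFields = fields;
        } else if (fields != expectedFields) {
            return lineNumber;
        }
    }
    return 0;
}
```

Opening the file with std::ifstream and calling firstInconsistentLine(file, ' ') points at the first suspicious line. If countFields returns 1 for a typical line, the separator passed is probably not the one used in the file.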