time-series-machine-learning/tsml-java

TSReader can't read data files that have new line characters in form \r\n

c-eg opened this issue · 10 comments

c-eg commented

TSReader cannot read in data files when the new line format is CRLF (\r\n). It only works if the new line format is LF (\n). I believe it is an issue with StreamTokenizer.

I've found that the problem line is 255 in TSReader (getLastToken). The ttype value is -3 (TT_WORD). Then when it reaches the \r of the first line in the .ts file, it becomes 13. However from the documentation, the ttype value apparently cannot be 13. I've debugged the tsml code and it isn't setting it to 13 anywhere, hence why I believe the issue to be with StreamTokenizer itself.

The exact code where it sets the ttype value to 13 is m_Tokenizer.nextToken(). Before this code, ttype was -3, and after it becomes 13.

c-eg commented

@ABostrom This issue is creeping up again as when the .ts files are uploaded onto the GitHub repository they are auto formatted to CRLF and cannot be changed to LF. Which means that the Data Handling examples using the .ts files from the repository in /experiments will not work.

I reckon i need to do some nasty hack, as its a problem with the Java class.

c-eg commented

Unfortunately that might be the only option, as refactoring TSReader to replace StreamTokenizer sounds like a lot of work...

Adding on to this, I had two issues when using TSReader recently.

  1. Blank lines between items in the header breaks the reader. I dont know if this should work, but some files on the beast have this.
  2. Values in scientific format arent read in corrently i.e. 2.61E+02, 2.61E+02: 0.0. Again i dont know if files should be in this format, but currently it doesnt handle it at all and splits each value into multiple ones (loading a file or series length 1725 lead to an Instances object of length 3000+).

@MatthewMiddlehurst I believe I have fixed issue 2.

Maybe it hasn't been merged in properly in to your local version? Or mine hasn't made it into the main branch.

@ABostrom Unsure when your fix would have been pushed, but this was on the HC2 branch which is usually kept up to date with master. As far as I am aware no other changes to the reader were made on the branch.

//Nasty hack to deal with Exponents not being supported in tokenizer - Look away! //Aaron 15/02/2021 if(m_Tokenizer.sval != null && m_Tokenizer.sval.contains("E")){ val = Double.parseDouble(m_Tokenizer.nval + m_Tokenizer.sval); //remove the value we just added. as it was a partial. timeSeries.remove(timeSeries.size()-1); }

In the TSReader there is a nasty hack to deal with this issue. As i was confident I had fixed it. dated 15th of Feb this year.

Issue was the sktime .ts writer wrote some lines using scientific while the majority were normal. Don’t know if that would break everything, but I definitely remember having to replace those lines around 1st June as it wouldn’t read in correctly.

Master was merged into the HC2 branch in May so it would have that commit.

Right. I'm going to look at how painful this is going to be to rewrite. Bloody tokenizer is crap.

c-eg commented

@ABostrom is this fixed now in #519? If so, I'll close the issue