Feature request: Differentiate between empty string and null value
StefRe opened this issue · 5 comments
readAllWithHeader
yields a List<Map<String,String>>
and hence empty columns are being read as empty strings, so that we get ""
for both col1
and col2
in the following example:
"col1","col2"
"",
I'd really like to get null
for col2
here (this of course only makes sense if all strings are quoted, otherwise it wouldn't be clear how to interpret empty columns). I understand that you can't change the result to List<Map<String,String?>>
now, but maybe you could add a nullCode
option for reading as it already exists for writing. The default value is an empty string ""
(=current behavior). I could then simply do
val nullCode = "NULL"
val rows = csvReader(nullCode=nullCode).readAllWithHeader(inputStream)
.map { row -> row.mapValues { col -> if (col.value == nullCode) null else col.value } }
At first glance it seems that it only requires to change
https://github.com/doyaaaaaken/kotlin-csv/blob/c23a51b98fbbcb1348b3928d561f1f5e11fba965/src/commonMain/kotlin/com/github/doyaaaaaken/kotlincsv/parser/ParseStateMachine.kt#L36-L48
to
delimiter -> {
field.append(nullCode)
flushField()
state = ParseState.DELIMITER
}
'\n', '\u2028', '\u2029', '\u0085' -> {
field.append(nullCode)
flushField()
state = ParseState.END
}
'\r' -> {
if (nextCh == '\n') pos += 1
field.append(nullCode)
flushField()
state = ParseState.END
}
but I didn't check it thoroughly.
@StefRe
Thank you for your feedback!!
I'll think about this on weekend, thanks.
@StefRe
Thank you for your input.
In your case, you want to regard ""
as empty string and `` as null value.
It's maybe possible to have a case like the one you encountered.
However, I looked at the RFC 4180 and considered carefully, then I've felt that your case is a special case in terms of the CSV specification.
I'm very sorry, but I'll wait to deal with it until I get same feedback from others.
Thanks for your reply.
RFC 4180 just describes the file format with empty fields being explicitly allowed:
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
However, it doesn't say anything about how to interpret the data, specifically it doesn't mention if an empty field should be an empty string or null
.
For my use case (a mock ResultSet
backed by csv data) it is essential to differentiate between an empty string ""
and null
.
OpenCSV ("com.opencsv:opencsv:5.4"
) provides fine-grained configuration of what the parser should consider null
:
EMPTY_SEPARATORS - two sequential separators are null. EMPTY_QUOTES - two sequential quotes are null BOTH - both are null NEITHER - default. Both are considered empty string.
and allows to do exactly what I need:
import com.opencsv.CSVReaderHeaderAware
import com.opencsv.CSVReaderHeaderAwareBuilder
import com.opencsv.enums.CSVReaderNullFieldIndicator
val csv = """
"empty col","null col"
"",
""".trimIndent()
fun main() {
csv.byteInputStream().bufferedReader().use { br ->
(CSVReaderHeaderAwareBuilder(br)
.withFieldAsNull(CSVReaderNullFieldIndicator.EMPTY_SEPARATORS)
.build() as CSVReaderHeaderAware)
.use { cr ->
val csvData = cr.readMap()
assert(csvData["empty col"] == "")
assert(csvData["null col"] == null)
}
}
}
So I can switch to OpenCSV but it's a bit heavy-weight as it does far more than I actually need so I'd prefer to stick with your lib for Kotlin projects.
@StefRe
Well, thank you for the good suggestions.
I'm convinced of your opinion.
Especially that configuration of OpenCSV seems to be also applicable for kotlin-csv.
I'd like to consider this topic as a feature request and try to implement it.
I really want this feature...