quoted string saved as "inf"
I have a csv file with:
"hash_value","hash_fail"
"75f0f686118f3771c2cdec71a1cfc0c8","65e1337962"
in it.
When this gets bulk inserted, the hash_fail value is stored as string "inf":
GRAPH.QUERY testme "MATCH (h:hash_fail) RETURN h"
1) 1) "h"
2) 1) 1) 1) 1) "id"
2) (integer) 0
2) 1) "labels"
2) 1) "hash_fail"
3) 1) "properties"
2) 1) 1) "hash_value"
2) "75f0f686118f3771c2cdec71a1cfc0c8"
2) 1) "hash_fail"
2) "inf"
The only time it has imported correctly is by setting the quote level to 3.
(redisgraph-bulk-loader) ➜ redisgraph-bulk-loader git:(master) ✗ redis-cli -p 6381 GRAPH.DELETE testme
"Graph removed, internal execution time: 0.022300 milliseconds"
(redisgraph-bulk-loader) ➜ redisgraph-bulk-loader git:(master) ✗ python bulk_insert.py testme -h redis-graph -p 6381 -n "hash_fail.csv" -q 3
1 nodes created with label 'hash_fail'
Construction of graph 'testme' complete: 1 nodes created, 0 relations created in 0.023432 seconds
(redisgraph-bulk-loader) ➜ redisgraph-bulk-loader git:(master) ✗ redis-cli -p 6381 GRAPH.QUERY testme "MATCH (h:hash_fail) RETURN h"
1) 1) "h"
2) 1) 1) 1) 1) "id"
2) (integer) 0
2) 1) "labels"
2) 1) "hash_fail"
3) 1) "properties"
2) 1) 1) "\"hash_value\""
2) "\"75f0f686118f3771c2cdec71a1cfc0c8\""
2) 1) "\"hash_fail\""
2) "\"65e1337962\""
3) 1) "Query internal execution time: 0.540700 milliseconds"
(redisgraph-bulk-loader) ➜ redisgraph-bulk-loader git:(master) ✗ redis-cli -p 6381 GRAPH.DELETE testme
Every other variation of the -q parameter stores the hash_fail value as "inf".
Also note: I've only found that the specific value above (i.e., the hash_fail value) fails. Other values, e.g. "d18f044cfd0d9c9a8a7326ddda030106","b8265e95a6", work fine.
This should be fixed by #16, which was just merged!
Sorry for the strange bug - if it makes up for the confusion at all, this was actually kind of funny? In both the bulk loader and the module, "inf" is a float, not a string! 65e1337962, alternatively written as 65 * 10^1337962, is a number sufficiently large that both languages agree it might as well be infinite. Henceforth, I'm restricting RedisGraph to the realm of finite mathematics!
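The overflow is easy to reproduce in plain Python, independently of the bulk loader (a minimal sketch; the hash values are just the ones from this issue):

```python
import math

# Python's float() reads "65e1337962" as scientific notation:
# 65 * 10^1337962, which overflows double precision to infinity.
value = float("65e1337962")
print(value)               # inf
print(math.isinf(value))   # True

# A hash fragment that doesn't happen to look like an exponent
# fails float() parsing entirely, so it stays a string:
try:
    float("b8265e95a6")
except ValueError:
    print("not a number")
```

This is also why the other example values imported fine: only a digits-`e`-digits shape parses as a float.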
Haha, yeah, I wondered if there was something like that going on. Thanks!
I've confirmed the fix. Thanks again. We love and appreciate the work you're doing.
Our pleasure! Sorry the experience has been a bit mixed lately.
As a user of this tool, would you like the ability to electively enforce schema-like typing for different fields? We currently have the --fields argument that allows for this to some degree, but it's not the most intuitive. A lot of recent issues have involved Python's different behaviors across quote-parsing levels and type conversions, so if your data is consistent, it might be easier to label a column as solely containing floats, strings, etc., and process it accordingly.
We do something like that already when loading the source data from a flat file -- we give pandas a mapping file of data types. That specific file may refer to different field names by the time we are bulk importing. All this to say, we probably could/would use it if it guaranteed data types.
So far I've been steering clear of the --fields argument for a couple reasons.
First, the documentation on the expected format of the field type data was not very clear to me. Without examples, I was not sure what the correct format for specifying a node's field types should look like, whether it was necessary to specify every node/field-name combination, or whether some could be left blank so that only the fields differing from the default needed specifying. Perhaps some examples of correct usage would be useful here. Currently, the documentation only says this about it:
json to set explicit types for each field, format {<field>:<type>[, ...]} where type can be 0 (null), 1 (bool), 2 (numeric), 3 (string)
Our current import process is code-driven and we do have a data model, so it's at least possible to generate the field definitions at this step if the mechanism to do so exists. Up to now, I've been relying on automated type detection (i.e., pandas) to detect the fields that are strings and quote them while saving to CSV (in our case, pandas is used to transform the data, and we save various pandas DataFrames to CSV to serve as input to the importer). But if we need to do more to inform the bulk importer of the proper types, we could implement it that way.
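For reference, the quote-strings-on-save step described above can be sketched with the standard library's csv module (a minimal sketch using the values from this issue; csv.QUOTE_NONNUMERIC quotes every non-numeric field on write, which is what gives a reader a chance to distinguish the string "65e1337962" from a number):

```python
import csv
import io

# Write string columns quoted so downstream tools can tell them
# apart from bare numeric fields.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(["hash_value", "hash_fail"])
writer.writerow(["75f0f686118f3771c2cdec71a1cfc0c8", "65e1337962"])

print(buf.getvalue())
# "hash_value","hash_fail"
# "75f0f686118f3771c2cdec71a1cfc0c8","65e1337962"
```

As this issue showed, though, quoting alone isn't enough if the consumer still runs its own float() conversion on the field contents, which is why an explicit per-column type declaration is the more robust option.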
The --enforce-schema flag and associated header format are now the recommended method for avoiding type inference issues.