KnowageLabs/Knowage-Server

python data set : to be able to recognize date and timestamp attributes from Pandas dataFrame

Closed this issue · 7 comments

Describe the bug
We wanted to use python data set in place of REST data set, to be able to perform post treatment on retrieved data, it works OK, but there no way to get date or timestamp attribute, the mapping done is only a number or a string, if we set the type to date or timestamp in the dataFrame, an error occurs when doing a PREVIEW : "Data is neither a JSON object nor a JSON array", (exception coming from JSONPathDataReader)

=> mentioned the problem in Knowage QA as question How to get timestamp attribute from a python data set, using Pandas, with source dateTime attributes in ISO 8601

To Reproduce
Steps to reproduce the behavior:

  1. create a python data set
  2. build a simple dataFrame with some timestamp or date column inside, set the type of the column to timestamp or date, return it as the result (see python code in Additional context).
  3. click PREVIEW
  4. an error "Data is neither a JSON object nor a JSON array" is returned

Expected behavior
to be able to have date and timestamp attributes recognized from the dataFrame in a python data set.

environment

  • Knowage 7.4.5 with helm chart

Additional context
python code to reproduce the problem :

import pandas as pd
import numpy as np
import datetime


json = '[{"id": "truc:17","type": "truc","dateObserved": "2021-08-26T15:02:44.00Z"}]'

df = pd.read_json(json)


df = df.replace({np.nan: ""})

df['dateObserved'] = pd.to_datetime(df['dateObserved'])

data = df

"data" is the output dataFrame

screenshot of the error :

Capture d’écran 2021-08-31 à 11 20 12

for @HugoT852 information

Despite the bug, there is a missing piece, since there's no way to retrieve date or timestamp attributes...
I see 3 different solutions

  1. either an interface is provided to declare jsonPath attributes, like for REST dataset, in order to have the right metadata
  2. or an additional information about types could be sent from the script, beside the dataFrame, to inform about types of columns, and then perform the correct mapping, problem the data set metadata, won't be aligned...
  3. automatic rules based on the name of the dataFrame column containing date or time, and format verification, which set the right type, to build jsonPath attributes, if done at data set PREVIEW, metadata should be prepared, and saved with the data set as jsonPath attributes.

To my opinion the first solution seems the best, with an interface like the "JSON path Attributes" one in REST data set configuration, which allows the user to declare the mapping of some of the dataFrame columns, and then build the necessary metadata for the python dataset.

and then instead of "JSON path Attributes", the title of the box should be : "dataFrame Columns Mapping"

Hello,
thank you for your report.

I have released a patch that should fix the issue.
It will be available in version 8.0.

Bye,
Marco

I just saw your most recent comment.

The provided solution is based on type inference, which means that if a dataFrame column is of type pandas.Timestamp, it will be mapped as a timestamp in Knowage, otherwise it will be treated as a number/string.

Unfortunately it looks like pandas doesn't work with dates, but treats everything as a timestamp. I did some research about it but didn't find anything useful.

Of course feel free to contribute if you wish.

Bye,
Marco

Hello Marco,
OK for inference on type, but where do you catch the type of dataFrame columns ?, it should be in python wrapper that the column's types have to be sent as metadata besides the data itself, in the json response ?

Pandas works with both date/datetime and timestamp but as datetime64[ns] type see : https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

So you should infer on timestamp, datetime* and date types

Contributing, we have the will to !, but since we work with Docker and Kubernetes even for development, and GitLab for CI/CD, laying down everything together for building Knowage with source, to deployment is not so simple to pipeline...

however we think seriously about it...

The Python engine serializes the DataFrame into JSON format, in order to transmit it over the network. The serialized DataFrame is then received by Knowage that parses the JSON and infers column metadata by using a trial and error mechanism.

Regarding pandas and datetime, I have tried to create a dataframe as follows
df = pd.DataFrame({'id': [1,2],'str': ['a', 'b'],'date': [np.datetime64("2010-01-01"), np.datetime64("2020-02-02")]})
but the result type of the date column is still pandas.Timestamp

Hello Marco, From dataFrame side df.dtypes give always datetime64[ns], but you probably catch the type by reflection !?
I see that you're working on the problem, thanks a lot !