Trajectory table not visible in staging.
Closed this issue · 37 comments
It does work on dev against the Durham dataset.
Here's what I see in the admin dashboard logs
...
'data.section': '644ac9d8e7d077320f6b250a', 'data.filter': nan, 'data.floor': nan, 'data.sensed_speed': nan, 'data.vaccuracy': nan, 'data.accuracy': nan, 'data.elapsedRealtimeNanos': nan, 'data.local_dt_year': nan, 'data.local_dt_month': nan, 'data.local_dt_day': nan, 'data.local_dt_hour': nan, 'data.local_dt_minute': nan, 'data.local_dt_second': nan, 'data.local_dt_weekday': nan, 'data.local_dt_timezone': nan, 'data.user_id': nan, 'data.metadata_write_ts': nan}], export_format='csv', page_current=0, page_size=50, filter_options={'case': 'sensitive'}, sort_action='native', sort_mode='single', style_table={'overflowX': 'auto'}, style_cell={'textAlign': 'left'})`
--
In general, Dash properties can only be
dash components, strings, dictionaries, numbers, None,
or lists of those.
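Given that constraint, one way to see the failure mode is to coerce every cell to a Dash-safe type before building the table. This is only an illustrative sketch (the `to_dash_safe` helper is hypothetical, not the dashboard's actual code):

```python
import math

def to_dash_safe(records):
    """Coerce each cell to a type Dash accepts (str, number, bool, None).
    Anything else (e.g. binary UUIDs, ObjectIds) is stringified; NaN -> None."""
    safe = []
    for row in records:
        clean = {}
        for key, value in row.items():
            if isinstance(value, float) and math.isnan(value):
                clean[key] = None
            elif value is None or isinstance(value, (str, int, float, bool)):
                clean[key] = value
            else:
                clean[key] = str(value)
        safe.append(clean)
    return safe
```

Passing the output of `df.to_dict('records')` through a function like this would avoid handing Dash a value type it cannot serialize.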
@achasmita do you see nan values in the trajectory table in your testing?
> do you see nan values in the trajectory table in your testing?
All those columns are empty in my table, so it is working fine on my side.
and I do not see nan values either
maybe the nan values are just not displayed. Can you print out the last 5-6 rows to see if they do in fact have NaN values?
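For reference, a quick pandas check along those lines (assuming the trajectory DataFrame is called `df`; `nan_report` is a hypothetical helper) could be:

```python
import pandas as pd

def nan_report(df, n=6):
    """Print the last n rows and return the per-column NaN counts."""
    print(df.tail(n))
    return df.isna().sum()
```

Columns with a non-zero count in the returned Series contain NaN values even if the table widget renders them as blank.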
poking around on staging, this is what I see. If I try to load all trajectories, it takes a long time to load.
If I filter to only Sept (1/9 to 23/9), though, I get the same data as the previously loaded tab
UUID table
Demographic tab
Trajectory tab (still shows demographics)
UUID tab
Trajectory tab (still shows UUID)
Is this what you meant by "trajectory table not visible"? Again, detailed reporting would be good.
@achasmita is the trajectory table being filtered by the time range at the top of the dashboard?
I don't remember seeing that during the review.
> is the trajectory table being filtered by the time range at the top of the dashboard?
No it is not filtered by time range
> Is this what you meant by "trajectory table not visible"?
Yes, it is loading the previously loaded table instead of the trajectory table.
> No it is not filtered by time range
We should make sure that it is filtered by the time range since it is also time dependent, just like the trip table. That would also make it easier to debug issues like this since we can load the table in sections and find the time range with the invalid data.
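As a sketch of that filtering (column and function names here are assumptions, not the dashboard's actual code), restricting the trajectory DataFrame to the dashboard's date range might look like:

```python
import pandas as pd

def filter_by_range(df, start_ts, end_ts, ts_col="data.ts"):
    """Keep only rows whose timestamp falls in [start_ts, end_ts)."""
    mask = (df[ts_col] >= start_ts) & (df[ts_col] < end_ts)
    return df[mask]
```

Loading the table in sections this way would also make it possible to binary-search for the time range containing the invalid data.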
I found that there are some extra columns in this dataset compared to the previous dataset, so I am excluding these columns to load the table for now.
EXCLUDED_TRAJECTORIES_COLS = [
'data.local_dt_year',
'data.local_dt_month',
'data.local_dt_day',
'data.local_dt_hour',
'data.local_dt_minute',
'data.local_dt_second',
'data.local_dt_weekday',
'data.local_dt_timezone',
'data.user_id',
'data.metadata_write_ts',
]
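If it helps, the exclusion can be applied defensively with pandas' `errors='ignore'`, so the drop stays safe on datasets (like Durham) that never had these columns. This is a sketch; `drop_excluded` is a hypothetical helper:

```python
import pandas as pd

EXCLUDED_TRAJECTORIES_COLS = [
    'data.local_dt_year', 'data.local_dt_month', 'data.local_dt_day',
    'data.local_dt_hour', 'data.local_dt_minute', 'data.local_dt_second',
    'data.local_dt_weekday', 'data.local_dt_timezone',
    'data.user_id', 'data.metadata_write_ts',
]

def drop_excluded(df):
    """Drop the excluded columns; missing columns do not raise."""
    return df.drop(columns=EXCLUDED_TRAJECTORIES_COLS, errors='ignore')
```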
Most of the above columns have NaN values. I am still checking the column values. Also, there are already similar columns in the table:
'data.local_dt.year',
'data.local_dt.month',
'data.local_dt.day',
'data.local_dt.hour',
'data.local_dt.minute',
'data.local_dt.second',
'data.local_dt.weekday',
'data.local_dt.timezone',
That is so bizarre. Let me see what I can find from the raw data
so the stage database has a lot of locations
>>> import emission.core.get_database as edb
>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/recreated_location"})
322546
and most of them have data.local_dt.year
>>> edb.get_analysis_timeseries_db().count_documents({"data.local_dt.year": {"$exists": True}})
322535
there are only a handful of entries for data.local_dt_year
>>> edb.get_analysis_timeseries_db().count_documents({"data.local_dt_year": {"$exists": True}})
11
Looking more closely at those entries, most of them are from Apr of this year, although there are some from as late as June
>>> import pandas as pd
>>> all_broken_data = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({"data.local_dt_year": {"$exists": True}})))
>>> all_broken_data[["metadata.write_fmt_time"]]
metadata.write_fmt_time
0 2023-04-06T12:35:19.462114-07:00
1 2023-04-13T14:19:07.463665-07:00
2 2023-04-13T14:19:40.625654-07:00
3 2023-04-13T14:19:25.698613-07:00
4 2023-04-13T14:26:30.323517-07:00
5 2023-04-13T14:26:35.589886-07:00
6 2023-04-13T14:26:52.780819-07:00
7 2023-04-13T14:30:46.099731-07:00
8 2023-05-04T18:24:49.723777-07:00
9 2023-05-19T10:18:11.300575-07:00
10 2023-06-13T13:19:53.900964-07:00
And all of them have data.local_dt_year
>>> all_broken_data[["metadata.write_fmt_time", "data.local_dt_year"]]
metadata.write_fmt_time data.local_dt_year
0 2023-04-06T12:35:19.462114-07:00 2022
1 2023-04-13T14:19:07.463665-07:00 2022
2 2023-04-13T14:19:40.625654-07:00 2022
3 2023-04-13T14:19:25.698613-07:00 2022
4 2023-04-13T14:26:30.323517-07:00 2022
5 2023-04-13T14:26:35.589886-07:00 2022
6 2023-04-13T14:26:52.780819-07:00 2022
7 2023-04-13T14:30:46.099731-07:00 2023
8 2023-05-04T18:24:49.723777-07:00 2016
9 2023-05-19T10:18:11.300575-07:00 2023
10 2023-06-13T13:19:53.900964-07:00 2023
But do not have data.local_dt.year
>>> "data.local_dt.year" in all_broken_data.columns
False
@achasmita So it's pretty clear what is going on. Can you summarize the problem, and outline a principled solution? Given that this is a data issue, we should change the code to be robust to the data. I would prefer to not have a bunch of hardcoded workarounds determined by trial and error
I began by excluding the additional columns present in this table compared to the previous table, as most of them contained NaN values. Now I have removed them from the excluded list, and the table is functioning well, with the exception of the `data.user_id` column. So currently I have only the `data.user_id` column excluded.
@achasmita ok... that is not what I was expecting. But that doesn't explain why the table worked for Durham but not for stage. Does Durham not have `data.user_id`?
Yes, there was no `data.user_id` column in Durham. First I tried replacing NaN values with `" "`, but NaN values were not the problem: the column was breaking because it includes some binary data. Now all the columns are working.
@achasmita Interesting - let me poke at the data again to see where those entries came from. I am glad to see that you were able to identify and work around it through manual inspection. But this indicates that dash is not robust to binary data in the tables. What if there is some other column in some other partner deployment that has one binary value? Our trajectory load will break.
Please look through the pandas documentation - let's see if there is a more principled fix that makes the system robust to bad data. If it takes too long to figure out a workaround, I am fine with filing an issue and putting it into the next release. However, if it is a simple fix, it would be great to include it now.
Re-running my previous commands, I see
>>> edb.get_analysis_timeseries_db().count_documents({"data.user_id": {"$exists": True}})
11
>>> edb.get_analysis_timeseries_db().count_documents({"data.user_id": {"$exists": False}})
1131337
It looks like these are basically the same incorrect 11 entries - every row in `all_broken_data` has `data.user_id`.
@achasmita as a hint - the dtype of `data.user_id` is `object`. Maybe that can give us a clue as to how to identify binary values....
>>> all_broken_data['data.user_id']
Name: data.user_id, dtype: object
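One way to act on that hint (a sketch, not the eventual fix; `non_string_object_cols` is a hypothetical helper) is to flag object-dtype columns that contain any non-string value:

```python
import pandas as pd

def non_string_object_cols(df):
    """Return object-dtype columns containing any non-string, non-null value
    (e.g. binary UUIDs), which are the values the table cannot render."""
    flagged = []
    for col in df.select_dtypes(include="object").columns:
        if df[col].map(lambda v: v is not None and not isinstance(v, str)).any():
            flagged.append(col)
    return flagged
```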
> Please look through the pandas documentation - let's see if there is a more principled fix that makes the system robust to bad data.
I will look into it.
@achasmita It looks like both strings and objects are marked with dtype object
>>> all_broken_data.dtypes
_id object
user_id object
metadata.key object
metadata.platform object
metadata.write_ts float64
metadata.time_zone object
metadata.write_local_dt.year int64
metadata.write_local_dt.month int64
metadata.write_local_dt.day int64
metadata.write_local_dt.hour int64
metadata.write_local_dt.minute int64
metadata.write_local_dt.second int64
metadata.write_local_dt.weekday int64
metadata.write_local_dt.timezone object
metadata.write_fmt_time object
data.accuracy float64
data.altitude float64
data.elapsedRealtimeNanos int64
data.filter object
data.fmt_time object
data.latitude float64
data.longitude float64
data.sensed_speed float64
data.ts float64
data.loc.type object
data.loc.coordinates object
data.local_dt_year int64
data.local_dt_month int64
data.local_dt_day int64
data.local_dt_hour int64
data.local_dt_minute int64
data.local_dt_second int64
data.local_dt_weekday int64
data.local_dt_timezone object
data.user_id object
data.metadata_write_ts float64
data.distance int64
data.speed int64
data.heading int64
data.idx int64
data.mode int64
data.section object
It seems like a fairly naive fix is, instead of hardcoding the binary columns (op-admin-dashboard/utils/db_utils.py, line 129 in 35fc91d), to just `apply(str)` to any column of dtype `object`. Calling `str()` on a string won't hurt anything...
>>> str('abc')
'abc'
I used your method, so now we don't need to maintain a separate list of binary columns.
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].apply(str)
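As a quick sanity check of that loop, here is a hypothetical frame containing a binary UUID value alongside a numeric column:

```python
import uuid
import pandas as pd

# hypothetical example data: one binary UUID cell, one float cell
df = pd.DataFrame({"data.user_id": [uuid.uuid4()], "data.ts": [1694563200.0]})

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].apply(str)

# the UUID cell is now a plain string; numeric columns are untouched
```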
submit a PR, I will merge and deploy
@achasmita while making the change to the PR, please also remember to filter the trajectory by the date range
#70 (comment)
@achasmita I just checked the trajectories, and I wonder if we could simplify the column list a bit by default.

1. Since we have `data.latitude` and `data.longitude`, we don't need any of the `data.loc` columns.
2. Since we have the `fmt_time`, we don't need any of the `data.local_dt` columns.
3. We should convert `data.mode` to a string based on the enum in `emission.core.wrapper`.
4. For all programs other than the following, we should exclude the `user_id` field from the table. Can you write a script that will read each JSON file to be modified, add the `data_trajectories_columns_exclude: ['user_id']` field to each of them, and write the modified file? You can then do a giant commit that changes all of them.
5. The list of programs that should not exclude the `user_id` is: `ca-ebike`, `stage-program`, `stage-study`, `uprm-civic`, `uprm-nicr`, `usaid-laos-ev`, `uue`, `washingtoncommons`. You should add `data_trajectories_columns_exclude: ['user_id']` to all other programs.
Please let me know when this is done so that I can move to production.
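The script in item 4 could be sketched like this (a sketch only: the `add_exclude_field` helper is hypothetical, and the `<program>.nrel-op.json` naming is inferred from the file names mentioned later in this thread):

```python
import json
from pathlib import Path

# Programs that keep user_id in the trajectory table (from the list above)
KEEP_USER_ID = {"ca-ebike", "stage-program", "stage-study", "uprm-civic",
                "uprm-nicr", "usaid-laos-ev", "uue", "washingtoncommons"}

def add_exclude_field(config_dir):
    """Add data_trajectories_columns_exclude: ['user_id'] to every program
    config except those in KEEP_USER_ID, rewriting each file in place."""
    for path in Path(config_dir).glob("*.nrel-op.json"):
        program = path.name.split(".")[0]
        if program in KEEP_USER_ID:
            continue
        config = json.loads(path.read_text())
        config["data_trajectories_columns_exclude"] = ["user_id"]
        path.write_text(json.dumps(config, indent=2) + "\n")
```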
As an aside, I only see 5000 pages in the stage trajectory table. Does this actually represent the full set of entries? It almost seems like the list is being cut off - I wonder if it is limited by the max number of entries for each query.
https://github.com/search?q=repo%3Ae-mission%2Fe-mission-server+result_limit&type=code
We should fix this in the scalability changes as well.
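One possible shape for that scalability fix (a sketch assuming a pymongo-style collection, not what the server currently does) is to page on `_id` so no single query hits a server-side result limit:

```python
def fetch_all_entries(collection, query, batch_size=5000):
    """Yield every matching document by paging on _id,
    so each query stays under batch_size results."""
    last_id = None
    while True:
        page_query = dict(query)
        if last_id is not None:
            page_query["_id"] = {"$gt": last_id}
        batch = list(collection.find(page_query).sort("_id", 1).limit(batch_size))
        if not batch:
            return
        yield from batch
        last_id = batch[-1]["_id"]
```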
@shankari I'm kind of confused by the wording:

> 5. The list of programs that should not exclude the `user_id` is: `ca-ebike`, `stage-program`, `stage-study`, `uprm-civic`, `uprm-nicr`, `usaid-laos-ev`, `uue`, `washingtoncommons`. You should add `data_trajectories_columns_exclude: ['user_id']` to all other programs.

Does this mean that all JSON files get the `data_trajectories_columns_exclude: ['user_id']`? Or just the JSON files mentioned above? I tried searching for those names as keywords in all the files, but only the file names matched (ca-ebike.nrel-op.json, stage-study.nrel-op.json, stage-program.nrel-op.json, etc.), hence the confusion.
> Does this mean that all JSON files get the `data_trajectories_columns_exclude: ['user_id']`?

There is one config per program. I have provided the list of configs that need to be modified. Why would all files be changed?

> I tried searching as "keywords" in all those files but only the file names matched

I don't understand your question - why are you searching for keywords?
I was just confused by the wording, that's all. I understood that those JSON files were the point of concern, but then this part:

> 5. ... to all other programs

confused me.
> to all other programs

The programs that are not listed are the other programs.
closing this since this has been resolved with a manual change from @niccolopaganini and the script is being tracked in a separate PR e-mission/nrel-openpath-deploy-configs#40