e-mission/op-admin-dashboard

Trajectory table not visible in staging.

Closed this issue · 37 comments

It does work on dev against the Durham dataset.

Here's what I see in the admin dashboard logs

...
'data.section': '644ac9d8e7d077320f6b250a', 'data.filter': nan, 'data.floor': nan, 'data.sensed_speed': nan, 'data.vaccuracy': nan, 'data.accuracy': nan, 'data.elapsedRealtimeNanos': nan, 'data.local_dt_year': nan, 'data.local_dt_month': nan, 'data.local_dt_day': nan, 'data.local_dt_hour': nan, 'data.local_dt_minute': nan, 'data.local_dt_second': nan, 'data.local_dt_weekday': nan, 'data.local_dt_timezone': nan, 'data.user_id': nan, 'data.metadata_write_ts': nan}], export_format='csv', page_current=0, page_size=50, filter_options={'case': 'sensitive'}, sort_action='native', sort_mode='single', style_table={'overflowX': 'auto'}, style_cell={'textAlign': 'left'})`
--
In general, Dash properties can only be
dash components, strings, dictionaries, numbers, None,
or lists of those.
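
(For context, a minimal sketch of this failure mode, with hypothetical values: Dash table props have to be JSON-serializable, so a raw bytes/Binary value in the records is rejected with the error quoted above.)

import pandas as pd
from dash import dash_table

# Hypothetical values; the binary payload stands in for whatever Mongo returned
df = pd.DataFrame({
    "data.section": ["644ac9d8e7d077320f6b250a"],
    "data.user_id": [b"\x01\x02\x03"],
})

# Rendering a layout with these records produces the
# "Dash properties can only be ..." message seen in the logs above.
table = dash_table.DataTable(data=df.to_dict("records"), page_size=50)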

@achasmita do you see nan values in the trajectory table in your testing?


All those columns are empty on my table, so it is working fine on my side.
Screen Shot 2023-09-28 at 3 55 39 PM

and I do not see nan values either

Maybe the NaN values are just not displayed. Can you print out the last 5-6 rows to see if they do in fact have NaN values?
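
(For reference, a quick way to check this, assuming the trajectory DataFrame is in a variable named df; the names are placeholders:)

>>> df.tail(6)                  # print the last few rows
>>> df.tail(6).isna().any()     # which columns contain NaN in those rows
>>> df.isna().sum()             # NaN count per column over the whole table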

There are actually NaN values in the previous dataset:
Screen Shot 2023-09-28 at 5 01 11 PM

@achasmita

poking around on staging, this is what I see. If I try to load all trajectories, it takes a long time to load.
If I filter to only Sept (1/9 to 23/9), though, I get the same data as the previously loaded tab

UUID table

Screenshot 2023-09-28 at 5 24 44 PM

Demographic tab

Screenshot 2023-09-28 at 5 26 56 PM

Trajectory tab (still shows demographics)

Screenshot 2023-09-28 at 5 29 18 PM

UUID tab

Screenshot 2023-09-28 at 5 30 30 PM

Trajectory tab (still shows UUID)

Screenshot 2023-09-28 at 5 31 51 PM

Is this what you meant by "trajectory table not visible"? Again, detailed reporting would be good.

@achasmita is the trajectory table being filtered by the time range at the top of the dashboard?
I don't remember seeing that during the review.


No, it is not filtered by time range

Is this what you meant by "trajectory table not visible"?

Yes, it is loading the previously loaded table instead of the trajectory table.

@achasmita

"No, it is not filtered by time range"

We should make sure that it is filtered by the time range since it is also time dependent, just like the trip table. That would also make it easier to debug issues like this since we can load the table in sections and find the time range with the invalid data.
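
(A sketch of what that filtering could look like, assuming the date picker hands the callback start/end date strings and the trajectory DataFrame keeps the data.ts epoch column from the logs; names and timezone handling are placeholders:)

import pandas as pd

def filter_trajectories_by_range(df, start_date, end_date):
    # start_date / end_date are 'YYYY-MM-DD' strings from the dashboard's date picker
    start_ts = pd.Timestamp(start_date).timestamp()
    end_ts = (pd.Timestamp(end_date) + pd.Timedelta(days=1)).timestamp()  # include the end day
    return df[(df['data.ts'] >= start_ts) & (df['data.ts'] < end_ts)]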

I found that there are some extra columns in this dataset compared to the previous dataset, so excluding these columns loads the table for now.

EXCLUDED_TRAJECTORIES_COLS = [
    'data.local_dt_year',
    'data.local_dt_month',
    'data.local_dt_day',
    'data.local_dt_hour',
    'data.local_dt_minute',
    'data.local_dt_second',
    'data.local_dt_weekday',
    'data.local_dt_timezone',
    'data.user_id',
    'data.metadata_write_ts',
]
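
(If we do keep an exclude list like this, the drop should tolerate datasets that never had these columns, e.g. Durham; a small sketch:)

# Drop the excluded columns only where they are actually present in this dataset
df = df.drop(columns=EXCLUDED_TRAJECTORIES_COLS, errors='ignore')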

Most of the above columns have NaN values. I am still checking the column values. Also, there are already similar columns in the table:

'data.local_dt.year',
'data.local_dt.month',
'data.local_dt.day',
'data.local_dt.hour',
'data.local_dt.minute',
'data.local_dt.second',
'data.local_dt.weekday',
'data.local_dt.timezone',

Screen Shot 2023-09-28 at 7 15 39 PM
For now, I am checking the first five and last five rows of the excluded columns.

"Most of the above columns have NaN values. I am still checking the column values. Also, there are already similar columns in the table."

That is so bizarre. Let me see what I can find from the raw data

so the stage database has a lot of locations

>>> import emission.core.get_database as edb

>>> edb.get_analysis_timeseries_db().count_documents({"metadata.key": "analysis/recreated_location"})
322546

and most of them have data.local_dt.year

>>> edb.get_analysis_timeseries_db().count_documents({"data.local_dt.year": {"$exists": True}})
322535

there are only a handful of entries for data.local_dt_year

>>> edb.get_analysis_timeseries_db().count_documents({"data.local_dt_year": {"$exists": True}})
11

Looking more closely at those entries, most of them are from Apr of this year, although there are some from as late as June

>>> import pandas as pd
>>> all_broken_data = pd.json_normalize(list(edb.get_analysis_timeseries_db().find({"data.local_dt_year": {"$exists": True}})))
>>> all_broken_data[["metadata.write_fmt_time"]]
             metadata.write_fmt_time
0   2023-04-06T12:35:19.462114-07:00
1   2023-04-13T14:19:07.463665-07:00
2   2023-04-13T14:19:40.625654-07:00
3   2023-04-13T14:19:25.698613-07:00
4   2023-04-13T14:26:30.323517-07:00
5   2023-04-13T14:26:35.589886-07:00
6   2023-04-13T14:26:52.780819-07:00
7   2023-04-13T14:30:46.099731-07:00
8   2023-05-04T18:24:49.723777-07:00
9   2023-05-19T10:18:11.300575-07:00
10  2023-06-13T13:19:53.900964-07:00

And all of them have data.local_dt_year

>>> all_broken_data[["metadata.write_fmt_time", "data.local_dt_year"]]
             metadata.write_fmt_time  data.local_dt_year
0   2023-04-06T12:35:19.462114-07:00                2022
1   2023-04-13T14:19:07.463665-07:00                2022
2   2023-04-13T14:19:40.625654-07:00                2022
3   2023-04-13T14:19:25.698613-07:00                2022
4   2023-04-13T14:26:30.323517-07:00                2022
5   2023-04-13T14:26:35.589886-07:00                2022
6   2023-04-13T14:26:52.780819-07:00                2022
7   2023-04-13T14:30:46.099731-07:00                2023
8   2023-05-04T18:24:49.723777-07:00                2016
9   2023-05-19T10:18:11.300575-07:00                2023
10  2023-06-13T13:19:53.900964-07:00                2023

But they do not have data.local_dt.year

>>> "data.local_dt.year" in all_broken_data.columns
False

@achasmita So it's pretty clear what is going on. Can you summarize the problem and outline a principled solution? Given that this is a data issue, we should change the code to be robust to the data. I would prefer not to have a bunch of hardcoded workarounds determined by trial and error.

I began by excluding the additional columns present in this table compared to the previous table, as most of them contained NaN values. Now, I have removed them from the excluded list and the table is functioning well, with the exception of the data.user_id column. So, currently only the data.user_id column is excluded.


@achasmita ok... that is not what I was expecting. But that doesn't explain why the table worked for Durham but not for stage. Does Durham not have data.user_id?

Yes, there was no data.user_id column in Durham. First I tried replacing the NaN values with " ", but the NaN values were not the problem: the column includes some binary data. After handling that, all the columns are working.

There are no more excluded columns.
Screen Shot 2023-09-28 at 8 39 49 PM
Screen Shot 2023-09-28 at 8 40 45 PM
Screen Shot 2023-09-28 at 8 41 18 PM
Screen Shot 2023-09-28 at 8 41 55 PM
Screen Shot 2023-09-28 at 8 42 18 PM

@achasmita Interesting - let me poke at the data again to see where those entries came from. I am glad to see that you were able to identify and work around it through manual inspection. But this indicates that Dash is not robust to binary data in the tables. What if there is some other column in some other partner deployment that has one binary value? Our trajectory load will break.

Please look through the pandas documentation - let's see if there is a more principled fix that makes the system robust to bad data. If it takes too long to figure out a workaround, I am fine with filing an issue and putting it into the next release. However, if it is a simple fix, it would be great to include it now.

Re-running my previous commands, I see

>>> edb.get_analysis_timeseries_db().count_documents({"data.user_id": {"$exists": True}})
11
>>> edb.get_analysis_timeseries_db().count_documents({"data.user_id": {"$exists": False}})
1131337

It looks like these are basically the same 11 incorrect entries - every row in all_broken_data has data.user_id.

@achasmita as a hint - the dtype of data.user_id is object. Maybe that can give us a clue as to how to identify binary values....

>>> all_broken_data['data.user_id']
Name: data.user_id, dtype: object
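
(One way to act on that hint is to look at the Python types actually stored in the object columns; a sketch against all_broken_data:)

>>> all_broken_data['data.user_id'].map(type).value_counts()
>>> # list object-dtype columns that hold something other than plain strings
>>> for col in all_broken_data.select_dtypes(include='object').columns:
...     value_types = set(all_broken_data[col].dropna().map(type))
...     if value_types - {str}:
...         print(col, value_types)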


I will look into it.

@achasmita It looks like both strings and objects are marked with dtype object

>>> all_broken_data.dtypes
_id                                  object
user_id                              object
metadata.key                         object
metadata.platform                    object
metadata.write_ts                   float64
metadata.time_zone                   object
metadata.write_local_dt.year          int64
metadata.write_local_dt.month         int64
metadata.write_local_dt.day           int64
metadata.write_local_dt.hour          int64
metadata.write_local_dt.minute        int64
metadata.write_local_dt.second        int64
metadata.write_local_dt.weekday       int64
metadata.write_local_dt.timezone     object
metadata.write_fmt_time              object
data.accuracy                       float64
data.altitude                       float64
data.elapsedRealtimeNanos             int64
data.filter                          object
data.fmt_time                        object
data.latitude                       float64
data.longitude                      float64
data.sensed_speed                   float64
data.ts                             float64
data.loc.type                        object
data.loc.coordinates                 object
data.local_dt_year                    int64
data.local_dt_month                   int64
data.local_dt_day                     int64
data.local_dt_hour                    int64
data.local_dt_minute                  int64
data.local_dt_second                  int64
data.local_dt_weekday                 int64
data.local_dt_timezone               object
data.user_id                         object
data.metadata_write_ts              float64
data.distance                         int64
data.speed                            int64
data.heading                          int64
data.idx                              int64
data.mode                             int64
data.section                         object

It seems like a fairly naive fix is, instead of hardcoding the binary columns

for col in constants.BINARY_TRAJECTORIES_COLS:

we could just apply(str) to any column of type 'object'. Calling str() on a string won't hurt anything...

>>> str('abc')
'abc'

I used your method so now we don't need to have separate binary columns.

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = df[col].apply(str)
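
(For what it's worth, the same fix can be written without the explicit loop; this should behave identically here:)

obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].astype(str)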

submit a PR, I will merge and deploy

@achasmita while making the change to the PR, please also remember to filter the trajectory by the date range
#70 (comment)

@achasmita I just checked the trajectories, and I wonder if we could simplify the column list a bit by default.

  1. Since we have data.latitude and data.longitude, we don't need any of the data.loc columns
  2. Since we have the fmt_time, we don't need any of the data.local_dt columns
  3. We should convert the data.mode to a string based on the enum in emission.core.wrapper
  4. For all programs other than the following, we should exclude the user_id field from the table. Can you write a script that will read each JSON file to be modified, add the data_trajectories_columns_exclude: ['user_id'] field to each of them, and write the modified file? (A sketch follows after this list.) You can then do a giant commit that changes all of them.
  5. The list of programs that should not exclude the user_id is: ca-ebike, stage-program, stage-study, uprm-civic, uprm-nicr, usaid-laos-ev, uue, washingtoncommons. You should add data_trajectories_columns_exclude: ['user_id'] to all other programs.

Please let me know when this is done so that I can move to production.
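
(A sketch of what the script in item 4 could look like; the config directory layout and the placement of the new key are assumptions, so adjust them to the actual structure of nrel-openpath-deploy-configs:)

import json
from pathlib import Path

# Programs that should keep user_id (item 5); every other config gets the exclude field
SKIP_PROGRAMS = {"ca-ebike", "stage-program", "stage-study", "uprm-civic",
                 "uprm-nicr", "usaid-laos-ev", "uue", "washingtoncommons"}

# Assumption: configs live as <program>.nrel-op.json files in one directory,
# and the exclude list is a top-level key; adjust if the real layout differs
for path in Path("configs").glob("*.nrel-op.json"):
    program = path.name.replace(".nrel-op.json", "")
    if program in SKIP_PROGRAMS:
        continue
    config = json.loads(path.read_text())
    config["data_trajectories_columns_exclude"] = ["user_id"]
    path.write_text(json.dumps(config, indent=2) + "\n")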

As an aside, I only see 5000 pages in the stage trajectory table. Does this actually represent the full set of entries? It almost seems like the list is being cut off - at 50 rows per page, 5000 pages is only 250,000 rows, which is fewer than the 322,546 recreated_location entries counted above - I wonder if it is the max number of entries for each query.
https://github.com/search?q=repo%3Ae-mission%2Fe-mission-server+result_limit&type=code

We should fix this in the scalability changes as well.

@shankari kind of confused by the wording

5. The list of programs that should not exclude the user_id is: ca-ebike, stage-program, stage-study, uprm-civic, uprm-nicr, usaid-laos-ev, uue, washingtoncommons. You should add data_trajectories_columns_exclude: ['user_id'] to all other programs

Does this mean that all JSON files get the

data_trajectories_columns_exclude: ['user_id']

?

Or just the JSON files mentioned above?


I tried searching for those as keywords in all those files, but only the file names matched

(ca-ebike.nrel-op.json, stage-study.nrel-op.json, stage-program.nrel-op.json, etc.)

hence the confusion

"Does this mean that all JSON files get the data_trajectories_columns_exclude: ['user_id']?"

There is one config per program. I have provided the list of configs that need to be modified. Why would all files be changed?

"I tried searching for those as keywords in all those files, but only the file names matched"

I don't understand your question, why are you searching for keywords?

I was just confused with the wording, that's all. I understood that those JSON files were the point of concern but then this part:

5. to all other programs

confused me

"to all other programs" - the programs that are not listed are the other programs.

Closing this since it has been resolved with a manual change from @niccolopaganini, and the script is being tracked in a separate PR: e-mission/nrel-openpath-deploy-configs#40