long stop id causes casting error
Georgea75 opened this issue · 4 comments
Description
When building PTInteraction benchmarks for project monty the long stop ids are generating a casting error when they are saved to the csv and then reloaded again to generate the benchmark files. This casting error turns an ID such as 7857884190914906000 into its scientific notation form 7.857884190914906e+18, when this form is casted as a string it retains the scientific notation form ("7.857884190914906e+18") and therefore the stop IDs are not merging, as "7.857884190914906e+18" != "7857884190914906000". This is results in an empty benchmark.
Quick fix
To test the benchmark for monty I have added the following which forces pandas to read the ids as string which means it never converts it to standard form.
line 731 results_df = pd.read_csv(path, index_col=0)
-> results_df = pd.read_csv(path, index_col=0,dtype=str)
In order to allow for the counters to still sum when need to force the counts back to a numeric value so I added
line 732 results_df[[str(h) for h in range(24)]] = results_df[[str(h) for h in range(24)]].apply(pd.to_numeric)
This solved the issue for testing purposes but I imagine the problem could be more wide spread so a better solution is most likely needed.
Other issues
Python's int size simplifies the id if you load it in as an int so either keeping it as a float or string is necessary.
Potential fixes
- Fix all stop ids to a smaller length: I am unsure of the implications of this.
- Enforce stop ids to be strings to ensure they are never loaded as numeric types, i.e, by adding _stop to the end of the idea. I think this is why we did not encounter this issue with link ids as the take the form stopid - stopid so the hyphen forces it to be a string.
to be pedantic - where does the casting occur:
- elara.input reads the schedule input from xml, so the indexes in elara at this point will be string I'm pretty sure (NOT CHECKED THOUGH).
- I like string ids.
- elara.event_handlers does it's thing and outputs a big csv of required counts. But I still expect the output indices to be string type (NOT CHECKED). Therefore i do not expect pandas to have done any scientic notation. Therefore I expect the csv indices to be strings.
- elara.benchmark read in the bm data as a dict from json. So they pressumably get nice integers if we want or strings if we prefer - I gather you are using strings.
- However, elara.benchmark pd.read_csvs in the csv of sim results (from elara.event_handlers) and somehow gets something that isn't string???
- One of my above steps must be wrong.
For your solution, you can more simply specify the datatype for individual columns using `pd.read_csv(path, dtype={"id":str}) I think. Based on my logic above i am happy for you to do this as I am expecting a str regardless. But my logic is wrong somewhere so who knows.
Ultimately happy for you to make changes if the tests still pass.
If we were to be very careful we could add in a test that represents this problem. But we would have to build new test data and so on. So maybe not unless you are twiddling your thumbs.
I will also talk to Kasia about avoiding massive integers and being consistent with type, ideally strings.
Hey, a response to your points:
- Yes, this is my understanding as well.
- Yes, at this point the CSV has the correct IDs.
- Yes, this is correct and the Ids are valid at read from the benchmark.
- Correct, this is where the error occurs:
results_df = pd.read_csv(path, index_col=0)
results_df = results_df.groupby(results_df.index).sum()
results_df = results_df[[str(h) for h in range(24)]]
results_df.index = results_df.index.map(str)
results_df.index.name = 'stop_id'
What is happening line by line
results_df = pd.read_csv(path, index_col=0)
loads the index as a numpy.float64. At this point python displays the float as 1.211729924256132e+19. The type is inferred as float as the data in the csv takes the form 12117299242561318912. Maybe "12117299242561318912" would fix this?
Next line of interest is
results_df.index = results_df.index.map(str)
this converts the index which is numpy.float64 to str , which generates "1.211729924256132e+19"
Therefore, it finds no matches as "1.211729924256132e+19" != "12117299242561318912"
I have changed my solution to use dtype={0:str} as you suggested :) Very happy to write a test as well. But would we have to make a test to cover this issue for each benchmark?
Action
Do you want me to make a branch change all cases of read_csv from the csv dump to dtype={0:str} as this issue may occur for every benchmark type? Or should I just make the change for PTInteraction? Or can I just make the changes to the current new-zealand-branch (At this point this branch would cover several things, adding each of the four new nz-benchmarks as well as this load change)
awesome explanation.
I have trouble reproducing, eg:
In [96]: df = pd.DataFrame(["99999999999999999999999999999999999"]*5, columns=["a"])
In [97]: df
Out[97]:
a
0 99999999999999999999999999999999999
1 99999999999999999999999999999999999
2 99999999999999999999999999999999999
3 99999999999999999999999999999999999
4 99999999999999999999999999999999999
In [98]: df.a
Out[98]:
0 99999999999999999999999999999999999
1 99999999999999999999999999999999999
2 99999999999999999999999999999999999
3 99999999999999999999999999999999999
4 99999999999999999999999999999999999
Name: a, dtype: object
In [99]: df.to_csv(path)
In [100]: df = pd.read_csv(path)
In [101]: df.a
Out[101]:
0 99999999999999999999999999999999999
1 99999999999999999999999999999999999
2 99999999999999999999999999999999999
3 99999999999999999999999999999999999
4 99999999999999999999999999999999999
Name: a, dtype: object
But i trust you and i like strings so please:
- force string index - please do this for every bm - i think it's a good test
- don't worry about new tests
- happy for you to include on your NZ branch
Excellent, I will make the changes today