[BUG] Loc still has some bugs
willeppy opened this issue · 5 comments
Issue 1
df = data.cars()
t = df.head(50)
df has correct history but t has a loc and slice included instead of just head
Issue 2
df = data.cars()
t = df.loc[55:100]
df has correct history, but t has a slice included in addition to the loc call. However, we still need to override slice so that if I select without using loc, like df[200:500], this still gets logged even though loc wasn't called.
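A minimal sketch of the kind of override described here. LoggedFrame and its history list are hypothetical names for illustration, and the sketch wraps a DataFrame rather than subclassing it the way Lux does. It only shows that a plain slice can be logged in __getitem__ without going through loc:

```python
import pandas as pd


class LoggedFrame:
    """Hypothetical wrapper: a DataFrame plus a list of logged events."""

    def __init__(self, data, history=None):
        self.data = data
        # Copy the list so parent and child never share one history object.
        self.history = list(history or [])

    def __getitem__(self, key):
        result = self.data[key]
        if isinstance(key, slice):
            # Plain df[a:b] never touches .loc, so record the event here.
            event = ("slice", key.start, key.stop)
            return LoggedFrame(result, self.history + [event])
        # Column selection and other keys pass through unlogged in this sketch.
        return result


df = LoggedFrame(pd.DataFrame({"x": range(600)}))
t = df[200:500]
print(t.history)
```

Only the child t picks up the slice event; the history of df is left unchanged.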
I think you may have already fixed these, but I just merged in the loc_log_support branch and these issues were still present in master, so I think the fixes didn't get pushed. I went ahead and merged so the rest of the loc stuff is in master. @Acornagain
New bug:
# cell 1
import lux
import pandas as pd
# cell 2
df = pd.read_csv("../data/movies-sample.csv")
# cell 3
df = df[~df.Worldwide_Gross.isna()] # any sort of filter here breaks the next step. e.g. df = df[df.Worldwide_Gross > df.Worldwide_Gross.median()]
# cell 4
df.Worldwide_Gross
When df.Worldwide_Gross is called, it leads to a KeyError for "Title" somewhere in the loc code.
It turned out that once I was able to work directly on your repo, I stopped updating the original loc_log_supporrts branch of my fork and only updated the branch in your repo, while the pull request still used the original branch.
Besides, the first two bugs have been fixed on the branch in your repo, and I just created a pull request and merged it into the master branch.
However, the third bug is not due to the loc/iloc function.
The problem arises in the following steps:
- First, because of the filter call df = df[~df.Worldwide_Gross.isna()], the last event of df is filter. The new dataframe copies its history, so the last event of df.Worldwide_Gross is also filter. Then, according to the implicit_plotter logic, the filter visualization is triggered.
- Inside the function process_filter, it assumes that the child dataframe and parent dataframe share the same set of columns and tries to draw the filter view for each column. However, this fails because the child dataframe in this case has only one column. In your case, the error appeared because the Title column was not available in the child dataframe.
I fixed this by only requiring the filter view for columns shared by both the parent and child dataframe.
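The idea of the fix can be sketched roughly as follows. This is not the actual process_filter code. shared_columns is a hypothetical helper, and the small frame below stands in for the movies data:

```python
import pandas as pd


def shared_columns(parent, child):
    """Return the parent's columns that also exist in the child, in parent order."""
    if isinstance(child, pd.Series):
        # A single selected column only carries its own name.
        child_cols = [child.name]
    else:
        child_cols = list(child.columns)
    return [col for col in parent.columns if col in child_cols]


parent = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Worldwide_Gross": [1.0, None, 3.0]}
)
# Filter, then keep a single column, as in cells 3 and 4 above.
child = parent[~parent.Worldwide_Gross.isna()].Worldwide_Gross

cols = shared_columns(parent, child)
print(cols)
```

Iterating over cols instead of every parent column means Title is never looked up in the child, which is where the KeyError came from.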
In addition, another bug (not an error) appeared after I fixed this one, and I think it is related to the parent relationship.
To make this clear, I modified your code slightly.
# cell 1
import lux
import pandas as pd
# cell 2
df = pd.read_csv("../data/movies-sample.csv")
# cell 3
newdf = df[~df.Worldwide_Gross.isna()] # any sort of filter here breaks the next step. e.g. newdf = df[df.Worldwide_Gross > df.Worldwide_Gross.median()]
# cell 4
newdf.Worldwide_Gross
When we call df.Worldwide_Gross after the filter, I think what we need is the filter view of the column Worldwide_Gross between df and newdf. However, when calling newdf.Worldwide_Gross, the parent of this new dataframe is set to newdf (because of the column reference logic) instead of df. Therefore, we will see an all-true filter view.
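A small illustration of why the parent choice matters. It assumes the filter view boils down to marking which parent rows survive in the child by index label. That is an assumption about the logic, not Lux's actual implementation, and filter_mask is a hypothetical helper:

```python
import pandas as pd


def filter_mask(parent, child):
    """Mark which parent rows survive in the child, matching on index labels."""
    return parent.index.isin(child.index)


df = pd.DataFrame({"Worldwide_Gross": [10.0, None, 30.0, None]})
newdf = df[~df.Worldwide_Gross.isna()]
col = newdf.Worldwide_Gross

# Parent set to newdf (the current behaviour): every row is kept.
mask_vs_newdf = filter_mask(newdf, col)
# Parent set to the original df: the dropped rows show up as False.
mask_vs_df = filter_mask(df, col)

print(mask_vs_newdf.tolist())
print(mask_vs_df.tolist())
```

With newdf as the parent the mask carries no information, while comparing against df recovers which rows the filter removed.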