cmudig/solas

[BUG] Loc still has some bugs

willeppy opened this issue · 5 comments

Issue 1

df = data.cars()
t = df.head(50)

df has correct history but t has a loc and slice included instead of just head

Issue 2

df = data.cars()
t = df.loc[55:100]

df has correct history, but t has a slice logged in addition to the loc call. However, we also need to override slice so that selecting without loc, as in df[200:500], still gets logged even though loc wasn't called.
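To make the expected behavior in Issues 1 and 2 concrete, here is a rough sketch in plain Python rather than the actual solas code. LoggedFrame, _take, and _derive are hypothetical names made up for this example. The idea is that every public entry point logs exactly one event and shares a single unlogged internal selection helper, so head and loc don't pick up an extra slice event while a bare slice still gets logged.

```python
class LoggedFrame:
    """Toy stand-in for a history-tracking dataframe (hypothetical, not the solas API)."""

    def __init__(self, rows, history=()):
        self.rows = list(rows)
        self.history = list(history)  # copied, so children never mutate the parent's log

    def _take(self, key):
        # Internal, unlogged row selection shared by all public entry points.
        return LoggedFrame(self.rows[key], self.history)

    def _derive(self, key, event):
        # Select rows, then log exactly one event on the new frame.
        child = self._take(key)
        child.history.append(event)
        return child

    def __getitem__(self, key):
        # Direct slicing such as df[200:300] is logged even though loc was not used.
        if isinstance(key, slice):
            return self._derive(key, "slice")
        raise TypeError("this sketch only supports slices")

    def head(self, n=5):
        return self._derive(slice(None, n), "head")

    @property
    def loc(self):
        return _Loc(self)


class _Loc:
    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, key):
        # Label slices include the end point, as with pandas .loc;
        # labels equal positions in this toy model.
        stop = None if key.stop is None else key.stop + 1
        return self._frame._derive(slice(key.start, stop), "loc")


df = LoggedFrame(range(400))
print(df.head(50).history)     # ['head']
print(df.loc[55:100].history)  # ['loc']
print(df[200:300].history)     # ['slice']
```

The real implementation has to hook into pandas internals, so the details will differ; the sketch only pins down which events should end up in each history.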

I think you may have already fixed these, but I just merged in the loc_log_support branch and these issues were still present in master, so I think the fixes didn't get pushed. I went ahead and merged so the rest of the loc stuff is in master @Acornagain

New bug:

# cell 1
import lux
import pandas as pd

# cell 2 
df = pd.read_csv("../data/movies-sample.csv")

# cell 3 
df = df[~df.Worldwide_Gross.isna()] # any sort of filter here breaks the next step. e.g. df = df[df.Worldwide_Gross > df.Worldwide_Gross.median()]

# cell 4
df.Worldwide_Gross

When df.Worldwide_Gross is called it leads to a KeyError for "Title" in the loc code somewhere

It turns out that once I was able to work directly on your repo, I stopped updating the original loc_log_support branch on my fork and only updated the one in your repo, while the pull request still pointed at the original branch.

Besides, the first two bugs have already been fixed on the branch in your repo, and I just created a pull request and merged it into master.

However, the third bug is not due to the loc/iloc function.

The problem arises in the following steps:

  1. First, because of the filter call df = df[~df.Worldwide_Gross.isna()], the last event of df is a filter. The new dataframe copies that history, so the last event of df.Worldwide_Gross is also a filter. According to the implicit_plotter logic, the filter visualization is then triggered.
  2. Inside the function process_filter, the code assumes that the child dataframe and the parent dataframe share the same set of columns and tries to draw the filter view for each column. This fails because the child dataframe in this case has only one column. In your case, the error appeared because the Title column was not available in the child dataframe.

I fixed this by only requiring the filter view for columns shared by both the parent and child dataframe.
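Roughly, the fix looks like the following sketch. This is simplified plain Python and not the actual process_filter code; shared_columns is a hypothetical helper name, and the column lists just reuse the two names from the example above.

```python
def shared_columns(parent_columns, child_columns):
    """Return the columns present in both frames, in the parent's order."""
    child = set(child_columns)
    return [col for col in parent_columns if col in child]


# The parent still has every column, while the child (df.Worldwide_Gross)
# has only one, so iterating over the parent's columns would hit "Title".
parent_columns = ["Title", "Worldwide_Gross"]
child_columns = ["Worldwide_Gross"]

for col in shared_columns(parent_columns, child_columns):
    # Only columns that exist in both frames get a filter view.
    print("draw filter view for", col)
```

With real dataframes, parent.columns.intersection(child.columns) gives the same set of names.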

In addition, another bug (not an error) appears after this fix, and I think it is related to the parent relationship.

To make it clear, I modified your code slightly.

# cell 1
import lux
import pandas as pd

# cell 2 
df = pd.read_csv("../data/movies-sample.csv")

# cell 3 
newdf = df[~df.Worldwide_Gross.isna()] # any sort of filter here breaks the next step. e.g. newdf = df[df.Worldwide_Gross > df.Worldwide_Gross.median()]

# cell 4
newdf.Worldwide_Gross

When we access the Worldwide_Gross column after the filter, I think what we want is the filter view of Worldwide_Gross between df and newdf. However, when calling newdf.Worldwide_Gross, the parent of the resulting dataframe is set to newdf (because of the column reference logic) instead of df. As a result, every row of the parent also appears in the child, and we see an all-true filter view.
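Here is a toy model of the parent problem, again in plain Python and not the actual solas classes. Frame, filter_mask, and filter_source are hypothetical names, and walking up past column references is just one possible way to find the right parent, not necessarily how it should be done in the codebase.

```python
class Frame:
    """Toy frame that tracks only row labels, its parent, and the producing operation."""

    def __init__(self, index, parent=None, op=None):
        self.index = list(index)
        self.parent = parent
        self.op = op  # "filter", "col_ref", or None for a root frame

    def filter(self, keep):
        return Frame([i for i in self.index if keep(i)], parent=self, op="filter")

    def column(self):
        # Column reference: same rows, parent is the frame it was taken from.
        return Frame(self.index, parent=self, op="col_ref")


def filter_mask(child, parent):
    """True for every parent row that survives in the child."""
    kept = set(child.index)
    return [i in kept for i in parent.index]


def filter_source(frame):
    """Walk up past column references to the frame the filter was applied to."""
    node = frame
    while node.parent is not None and node.op != "filter":
        node = node.parent
    return node.parent if node.op == "filter" else None


df = Frame(range(6))
newdf = df.filter(lambda i: i % 2 == 0)
col = newdf.column()  # plays the role of newdf.Worldwide_Gross

print(filter_mask(col, col.parent))          # [True, True, True]
print(filter_mask(col, filter_source(col)))  # [True, False, True, False, True, False]
```

The first mask is the all-true view we currently get, because the direct parent is newdf; the second is the view against df, which is what I would expect to see.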