Can we avoid having a cell with a list?
jbesomi opened this issue · 18 comments
As we know, it's really not recommended to store a list in a Pandas cell. TokenSeries and VectorSeries, two of the core ideas of (current) Texthero, do exactly that. Can this be avoided?
Need to discuss:
- Alternatives using sub-columns (it's still a MultiIndex). Understand how complex and flexible this solution is. In 99% of cases, the standard Pandas/Texthero user does not really know how to work with a MultiIndex ...
- Can we just use RepresentationSeries? Probably not, as we cannot merge it into a DataFrame with a single index. Are there other alternatives than data alignment with reindex (too complicated)?
As discussed, @mk2510 and I are fans of the sub-column approach. In short, instead of
nmf
0 [0.0, 0.0, 0.548022198777919]
1 [1.6801770114633772, 0.0, 0.0]
2 [1.678910685857782, 0.0, 0.00015405496961668698]
3 [0.0, 2.1819904843216302, 0.0]
dtype: object
which is really inefficient and hard for users to understand, we put the MultiIndex in the columns and get this:
nmf
nmf1 nmf2 nmf3
0 -1.319228 -0.388168 0.483747
1 -0.080735 -2.190099 0.594459
2 -0.132196 0.786491 0.715910
3 -0.233909 -1.365364 0.156663
4 -0.207253 0.880211 -0.156841
See this colab notebook for a short example and some more thoughts. (We don't really talk about how this will be implemented in the notebook; if someone's interested in that or wants to help us (we'll have to change some Pandas source code), just comment below and we can prepare another notebook with some pseudocode of how we want to integrate this.)
Here's another, slightly more detailed summary of the subcolumns approach:
New Hero VectorDF
When calculating any kind of mathematical representation of our documents (e.g. through embed, tfidf, pca, ...), we need to store these in a Pandas DataFrame / Series. Before the new VectorDF, we have done this by simply storing lists in cells, e.g. like this:
nmf
0 [0.0, 0.0, 0.548022198777919]
1 [1.6801770114633772, 0.0, 0.0]
2 [1.678910685857782, 0.0, 0.00015405496961668698]
3 [0.0, 2.1819904843216302, 0.0]
dtype: object
This has two main disadvantages:
- Performance: storing complex structures like lists is the opposite of what Pandas is designed to do (and where it excels). Essentially, Pandas _does not understand_ what we're storing in the `nmf` column (which is why the dtype is `object`).
- User Experience: the Series above just does not look nice for users; it provides no intuition as to what it's about.
For the reasons above (and some other small ones), we switch to the new VectorDF. Now, we separate every entry of the vector/list into its own subcolumn. Example:
nmf
nmf1 nmf2 nmf3
0 -1.319228 -0.388168 0.483747
1 -0.080735 -2.190099 0.594459
2 -0.132196 0.786491 0.715910
3 -0.233909 -1.365364 0.156663
4 -0.207253 0.880211 -0.156841
This preserves the atomicity in the cells that Pandas is designed for; we can now do vectorized operations on the columns etc.; no more object dtype! It's also much more intuitive for users: they can immediately see that what they get from hero.nmf looks like a matrix (of course it does not just look like one, it is a matrix - that's why it's called non-negative matrix factorization).
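To illustrate, here is a small mock of such a subcolumned output (random values, not a real NMF result) showing that vectorized operations now work directly:

```python
import numpy as np
import pandas as pd

# Mock of the subcolumned output described above (random values only).
nmf = pd.DataFrame(
    np.random.normal(size=(4, 3)),
    columns=pd.MultiIndex.from_product([["nmf"], ["nmf1", "nmf2", "nmf3"]]),
)

# Every cell is an atomic float, so the dtype is float64 (no more `object`),
# and column-wise vectorized operations work directly, e.g. row-wise L2 norms:
norms = np.sqrt((nmf["nmf"] ** 2).sum(axis=1))
```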
We believe this is a major step forward for texthero, delivering
performance enhancements and a cleaner and more intuitive UX.
Key challenges:
- Integration into the existing Pandas ecosystem.
We will need to ensure that users run into no problems working with the subcolumns, as Pandas has not done much to support this.
- Integration into Texthero:
- integration with other types. How will they fit in / be implemented?
- integration in the different modules.
It looks great!
We should just fully understand why the sub-column functionality as we see it does not work out of the box with the Pandas API.
Once we understand which changes we would need to make to the Pandas code to make it compatible with our vision, we should ask ourselves whether that's the right solution. We can also open a new issue on their GitHub and ask for clarification and advice. As you said, this is a major step forward for Texthero, and that's why it's even more important to do it right.
I believe we should start with the idea we want to find a solution that works without the need of changing the Pandas API.
Follow up question:
Can you clarify what changes to the Pandas API you think are necessary and why?
Yes, I'll explain some more and ask in the Pandas community. The gist of it is that we only need to make changes because users want to do this: df["pca"] = hero.pca(df["texts"]) (sorry for the bad example, of course you can't apply pca straight to the texts, but I hope it's clear; see the next comment).
Pandas Trouble
The only issue we see: integration with df["pca"] = hero.pca(df["texts"])
It's really important that users can seamlessly integrate Texthero's function output with their code. Let's assume a user has their documents in a DataFrame column df["texts"] that looks like this:
>>> df = pd.DataFrame(["Text of doc 1", "Text of doc 2", "Text of doc 3"], columns=["texts"])
>>> df
texts
0 Text of doc 1
1 Text of doc 2
2 Text of doc 3
Let's look at an example output that hero.pca could return with the new type:
>>> hero.pca(df["texts"])
pca
pca1 pca2
0 0.754675 1.868685
1 -1.861651 -0.048236
2 -0.797750 0.388400
(you can generate a mock output like this e.g. with pd.DataFrame(np.random.normal(size=(3, 2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]])))
That's a DataFrame. Great! Of course, users can just store this somewhere, e.g. df_pca = hero.pca(df["texts"]), and that works great. Accessing is then also as always: to get the pca values, they can just do df_pca.values and have the pca matrix right there!
However, what we see really often is users wanting to do this: df["pca"] = hero.pca(df["texts"]). This sadly does not work out of the box. The reason is that this subcolumn type is implemented internally through a MultiIndex in the columns. So we have
>>> df.columns
Index(['texts'], dtype='object')
>>> hero.pca(df["texts"]).columns
MultiIndex([('pca', 'pca1'), ('pca', 'pca2')])
Pandas cannot automatically combine these. So what we will do is this: calling df["pca"] = hero.pca(df["texts"]) is internally this: pd.DataFrame.__setitem__(self=df, key="pca", value=hero.pca(df["texts"])).
We will overwrite this method so that if self is not MultiIndexed yet and value is MultiIndexed, we transform self (so df here) to be MultiIndexed, and we can then easily integrate our column-MultiIndexed output from Texthero:
If df is MultiIndexed, we get the desired result through pd.concat([df, hero.pca(df["texts"])], axis=1).
Pseudocode (& real code): working on this atm
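A minimal sketch of the planned behaviour, written as a standalone helper under the assumption that the real change lives inside pandas' `__setitem__` (the name `assign_subcolumns` is ours for illustration, not Texthero or pandas API):

```python
import numpy as np
import pandas as pd

def assign_subcolumns(df, key, value):
    # Hypothetical helper mirroring the described __setitem__ override.
    if isinstance(value, pd.DataFrame) and isinstance(value.columns, pd.MultiIndex):
        if not isinstance(df.columns, pd.MultiIndex):
            # Lift the flat columns to a MultiIndex with an empty second
            # level so they can coexist with the subcolumned value.
            df = df.copy()
            df.columns = pd.MultiIndex.from_tuples([(c, "") for c in df.columns])
        # `key` is not needed here: the value carries its own top-level label.
        return pd.concat([df, value], axis=1)
    df = df.copy()
    df[key] = value
    return df

df = pd.DataFrame({"texts": ["doc 1", "doc 2", "doc 3"]})
pca = pd.DataFrame(
    np.random.normal(size=(3, 2)),
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)
result = assign_subcolumns(df, "pca", pca)
```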
Advantages / Why does this work?
- we don't destroy any pandas functionality, as currently calling `__setitem__` with a MultiIndexed value is just not possible, so our changes to Pandas do not break any Pandas functionality for users. We're only _expanding_ the functionality
- after multiindexing, users can still access their "normal" columns like before; e.g. `df["texts"]` will behave the same way as before even though the columns are now internally multiindexed as `MultiIndex([('pca', 'pca1'), ('pca', 'pca2'), ('texts', '')])`.
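That access behaviour can be checked with a small standalone snippet (assuming the column lift with an empty second level as described; this is not Texthero code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"texts": ["a", "b", "c"]})
# Lift to a column MultiIndex with an empty second level.
df.columns = pd.MultiIndex.from_tuples([("texts", "")])
pca = pd.DataFrame(
    np.zeros((3, 2)),
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)
df = pd.concat([df, pca], axis=1)

# Selecting a top-level label whose only second-level label is ""
# gives back a plain Series, so df["texts"] behaves as before.
texts = df["texts"]
```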
Great!
This solves our initial problem, i.e. assigning a DataFrame to a Series column (basically that's it, the source of all our problems).
Ok, so this code will work just by updating __setitem__? That's great!
1. df = pd.DataFrame(...)
2. df["text"] = pd.Series(["a","b","c"])
3. df["pca"] = hero.pca(df.text)
My question is: the two DataFrames at lines 1. and 3. will have different column indexes (is there a better name?), right? At line 1. (and 2. as well) the indexes will be flat, whereas in 3. they will be a MultiIndex. How do you deal with that? Is updating __setitem__ enough, and by passing value=pd.DataFrame does Pandas automatically create a MultiIndex on the columns? Magic.
Quick thought: calling it MultiIndexing when the action actually makes changes to the columns might be quite confusing for users (even if that's not our fault ...). We will need to figure out how to explain to users what's really happening under the hood in a clear fashion, and we might want to design an abstraction mechanism similar to the HeroSeries.
Regarding the TokenSeries discussion, I believe we can apply the same solution as for pca and so on (i.e. return a DataFrame). The reason is that, conceptually, we can represent it as a PandasRepresentation: the first index level being the document, the second level the token (token-1, token-2, ...), and the value being the token itself ("a", "b", ...). As we can represent it sparsely, we don't have the issue of memory consumption, and the main advantage is that we keep atomicity and can leverage parallelization (Dask will do the job, solving our second issue, i.e. how to parallelize things)
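A hedged sketch of that idea (the index names and token-position labels are illustrative, not Texthero API):

```python
import pandas as pd

# Tokens stored in a document/token-position MultiIndex instead of
# lists in cells (labels chosen for illustration only).
tokens = pd.Series(
    ["hi", "there", "bye", "now"],
    index=pd.MultiIndex.from_tuples(
        [(0, "token-1"), (0, "token-2"), (1, "token-1"), (1, "token-2")],
        names=["document", "token_position"],
    ),
)

# unstack gives one row per document and one subcolumn per token slot.
wide = tokens.unstack()
```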
At line 1. (and 2. as well) the indexes will be flatten, whereas in 3. they will be MultiIndex. How do you deal with that? Updating setitem is enough and by doing value=pd.DataFrame Pandas automatically create a MultiIndex on the columns? Magic.
It's not really magic, we do have to switch the DF to MultiIndex in the code. However, users won't notice this; they can still do df["texts"] and it will behave exactly like before.
we will need to figure out how we can explain to users what's really happening under the hood in a clear fashion and we might want to figure out an abstraction mechanism similar to the HeroSeries
Yes, we should explain. However, as the "normal" columns like "texts" still behave like before, this should be a fairly quick explanation with no "consequences" for users.
See HERE for a discussion on the Pandas GitHub. We were able to reduce the code we need to change to 11 lines (plus lots of comments). See here for the code and a few examples at the bottom.
Regarding the TokenSeries discussion, I believe we can apply ...
Yes, we'll think about that:
TODO:
- improve documentation for what we're doing
- test edge cases
- think about what you wrote above (TokenSeries, RepresentationSeries)
- integration into texthero
First thoughts about the RepresentationSeries integration with the new VectorDF (we're just using that name for now, might think of a better one in the future).
Options:
1. Keep RepresentationSeries as output from tfidf, count, term_frequency. Will need RepresentationSeries input support for the dim. reduction and clustering functions.
   - Pro: easy, we already have the code, good solution.
   - Con: no DataFrame integration; users will have to use flatten (which we'll change to be similar to unstack) for RepresentationSeries -> DataFrame.
2. Throw out RepresentationSeries completely and store the whole document-term matrix in a DataFrame, probably by default limiting max_number=300.
   - Pro: looks good integrated into the DataFrame; easier for the other functions, as dim reduction and clustering don't have to support another Series type.
   - Con: tradeoff between information and storage, as we don't profit from sparseness: users won't be able to get a whole representation of their large data.
3. Experimental: tfidf, count, term_frequency give us a sparse matrix. We could make it seem for users like it is a column in their DataFrame but internally only store pointers to the actual sparse matrix. They'd thus have the best of both worlds: profit from sparseness to get the whole document-term matrix and also keep everything in their DataFrame. See the attached photo for a mock-up.
Not a big fan of 2.
Big fan of trying out 3. (we'll try around with this today some more) as it's the most interesting from a software-engineering perspective and also best for the users, but there might be some roadblocks we don't yet see. If 3. doesn't work, 1. seems good.
Sketch of option 3:
Full fan of 3. The main question is: can we store a sparse Pandas DataFrame in a mixed (sparse and not sparse) MultiIndex DataFrame?
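A small experiment suggests the answer can be yes for plain sparse columns (the column names here are illustrative, and this does not yet cover the pointer indirection of option 3):

```python
import pandas as pd

# Dense column under a column MultiIndex (empty second level).
dense = pd.DataFrame({("text", ""): ["doc 1", "doc 2", "doc 3"]})

# Sparse document-term subcolumns using pandas' SparseArray.
sparse = pd.DataFrame({
    ("tfidf", "term1"): pd.arrays.SparseArray([0.0, 1.5, 0.0], fill_value=0.0),
    ("tfidf", "term2"): pd.arrays.SparseArray([0.7, 0.0, 0.0], fill_value=0.0),
})

# Dense and sparse columns coexist in one MultiIndexed DataFrame.
mixed = pd.concat([dense, sparse], axis=1)
```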
Look forward to seeing what you came up with!
(Not so positive) update:
We have sadly now noticed this (so what we're doing, not just with the sparse stuff but overall in this issue, is maybe not a viable solution after all):
So our main issue is that we want to
- store a matrix in a DataFrame that looks nice, so not just one list per cell but rather one entry per cell (which we can achieve through the approach above with "subcolumns")
- and allow users to place this in their DataFrame with df["pca"] = ....
The problem we're now facing with our implementation:
When inserting a big matrix, so a DF with maybe 1000 subcolumns, pandas starts acting weird due to its block manager. See HERE for a great introduction to the topic. We're basically looking for a way to performantly add many many columns to a DF.
Two things happen:
- Pandas tries to consolidate columns of the same dtype into "blocks", which requires copying data around. If we now insert 5000 new columns, all the data has to be copied instead of just referenced.
- Weirdly, when doing
>>> x = np.random.normal(size=(10000, 5000))
>>> df_x = pd.DataFrame(x)
>>> y = np.random.normal(size=(10000, 5000))
>>> df_y = pd.DataFrame(y, columns=np.arange(5000, 10000))
>>> df_x[df_y.columns] = df_y
internally, when looking at the blocks, pandas has one block for the first 5k columns and then one block for each single column of the next 5k columns, so 5k blocks (we can see this by looking at df_x._data).
So our actual issue seems to be the block manager, which is not designed for a use case with thousands of columns and forces pandas to copy data around.
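A small illustration of the consolidation behaviour (sizes are made up, and `._mgr` is internal pandas API that may change between versions, so we only comment on it here):

```python
import numpy as np
import pandas as pd

# Inserting columns one by one creates one new block per insertion
# until pandas consolidates (len(df._mgr.blocks) is typically > 1 here).
df = pd.DataFrame(np.random.normal(size=(1000, 10)))
for i in range(10, 20):
    df[i] = np.random.normal(size=1000)

# Building the wide frame in one concat lets pandas produce a
# consolidated result instead of many single-column blocks.
wide = pd.concat(
    [pd.DataFrame(np.random.normal(size=(1000, 10))),
     pd.DataFrame(np.random.normal(size=(1000, 10)), columns=range(10, 20))],
    axis=1,
)
```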
We're investigating this.
Another update:
We did not get much further. We will still:
- try around to maybe make our VectorDF solution more performant (although it seems unlikely this succeeds)
- benchmark our current implementation with lists against our new improved "VectorDF" solution -> then decide on which to use
- benchmark our current RepresentationSeries vs. the new sparse stuff in VectorDF -> then decide which to use
It could be the case that our subcolumn solution is just not viable, as pandas is not designed to handle insertion of a large number of columns. So our output of hero.pca(s) would be a beautiful subcolumned DataFrame, but inserting it into their own DF would take forever for users.
Seems like something odd is going on. I tried to replicate your example above, but had to make modifications to get it to work:
import itertools as it
x = np.random.normal(size=(10000, 5000))
df_x = pd.DataFrame(x)
y = np.random.normal(size=(10000, 5000))
df_y = pd.DataFrame(y, columns=pd.MultiIndex.from_tuples(it.product(np.arange(1000), [0, 1, 2, 3, 4])))
I believe this is the form of the inputs you're going for, let me know if this is wrong.
Then doing the operation df_x[df_y.columns] = df_y on my machine takes 14 seconds and produces a single block, as seen by running df_x._data:
BlockManager
Items: MultiIndex([( 0, ''),
( 1, ''),
( 2, ''),
( 3, ''),
( 4, ''),
( 5, ''),
( 6, ''),
( 7, ''),
( 8, ''),
( 9, ''),
...
(998, 0),
(998, 1),
(998, 2),
(998, 3),
(998, 4),
(999, 0),
(999, 1),
(999, 2),
(999, 3),
(999, 4)],
length=10000)
Axis 1: RangeIndex(start=0, stop=10000, step=1)
FloatBlock: slice(0, 10000, 1), 10000 x 10000, dtype: float64
@rhshadrach thanks a lot for the post. We are sorry, we hadn't linked our corrected pandas method here. We overrode the setter method to replace this action: df_x[df_y.columns] = df_y
However, you just nailed the current problem we are having.
on my machine takes 14 seconds and produces a single block as seen by running df_x._data:
the action to insert the data takes a lot of time. We had a small look at the BlockManager, but it didn't seem to be the problem, because even without rearranging the blocks, we still had a huge delay.
Thank you Henri, Max, and Richard for your great comments,
@mk2510 and @henrifroese, how do you explain that your solution takes forever and produces more blocks, whereas @rhshadrach's solution takes "only" 14 seconds on a single block? It might be related to the way you overrode the setter method, or am I missing something?
Hmm, maybe we're overengineering a little; for us, 14 seconds is forever. We will do many performance comparisons tomorrow; maybe we're actually not worse (or even better) than the current solution, we'll see.
All right! Just one thing: what do you mean by "14 seconds is forever"?
For the sake of comparison, you might measure how long it takes to assign a (normal) Series with the same number of values as the matrix to a (normal) Pandas DF. If the size is very big, it might be okay to wait a second.
Yes, exactly. In some tests late today we got results showing that our new implementation takes several times as long as the current Texthero implementation with lists in cells. We want to make sure we have solid, reproducible comparisons, so we'll spend more time on that.
We will compare
- Current implementation with lists
- new implementation from above
- "advanced" new implementation where we change some more stuff (slicing, value copying, block consolidation) in pandas to speed things up for our specific use case of "putting a matrix into a DF"
And later on compare with the extremely high-dimensional sparse stuff.
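A rough sketch of such a benchmark, with made-up sizes and names (this is not the actual Texthero benchmark, just the shape it could take):

```python
import timeit

import numpy as np
import pandas as pd

n_docs, n_dims = 1000, 100
matrix = np.random.normal(size=(n_docs, n_dims))
base = pd.DataFrame({"text": ["doc"] * n_docs})

def insert_as_lists():
    # Current approach: one list/array per cell, dtype object.
    df = base.copy()
    df["pca"] = list(matrix)
    return df

def insert_as_subcolumns():
    # Proposed approach: lift to a column MultiIndex, then concat.
    df = base.copy()
    df.columns = pd.MultiIndex.from_tuples([(c, "") for c in df.columns])
    value = pd.DataFrame(
        matrix, columns=pd.MultiIndex.from_product([["pca"], range(n_dims)])
    )
    return pd.concat([df, value], axis=1)

t_lists = timeit.timeit(insert_as_lists, number=10)
t_subcols = timeit.timeit(insert_as_subcolumns, number=10)
```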
After some more discussions, we have come to the following conclusions:
Conclusions:
- Keep VectorSeries and TokenSeries as-is. Reason: inserting into existing DataFrames is too expensive for users.
- Change the RepresentationSeries output of tfidf, count, term_frequency to the new sparse VectorDF that we will henceforth call DocumentTermDF. Reason: it looks nicer and is just as performant.
- All dimensionality reduction and clustering functions will support both DocumentTermDF and VectorSeries input.