Can we avoid having a cell with a list?
jbesomi opened this issue · 18 comments
As we know, it's really not recommended to store a list in a Pandas cell. TokenSeries and VectorSeries, two of the core ideas of (current) Texthero, do exactly that. Can this be avoided?
Need to discuss:
- Alternatives using sub-columns (it's still a MultiIndex). Understand how complex and flexible this solution is. In 99% of cases, the standard Pandas/Texthero user does not really know how to work with a MultiIndex ...
- Can we just use RepresentationSeries? Probably not, as we cannot merge it into a DataFrame with a single index. Are there other alternatives than data alignment with reindex (too complicated)?
As discussed, @mk2510 and I are fans of the sub-column approach. In short, instead of
nmf
0 [0.0, 0.0, 0.548022198777919]
1 [1.6801770114633772, 0.0, 0.0]
2 [1.678910685857782, 0.0, 0.00015405496961668698]
3 [0.0, 2.1819904843216302, 0.0]
dtype: object
which is really inefficient and hard for users to understand, we put the MultiIndex in the columns and get this:
nmf
nmf1 nmf2 nmf3
0 -1.319228 -0.388168 0.483747
1 -0.080735 -2.190099 0.594459
2 -0.132196 0.786491 0.715910
3 -0.233909 -1.365364 0.156663
4 -0.207253 0.880211 -0.156841
See this colab notebook for a short example and some more thoughts. (We don't really talk about how this will be implemented in the notebook; if someone's interested in that or wants to help us (we'll have to change some Pandas source code), just comment below and we can prepare another notebook with some pseudocode of how we want to integrate this.)
Here's another, slightly more detailed summary of the subcolumns approach:
New Hero VectorDF
When calculating any kind of mathematical representation of our documents (e.g. through embed, tfidf, pca, ...), we need to store these in a Pandas DataFrame / Series. Before the new VectorDF, we have done this by simply storing lists in cells, e.g. like this:
nmf
0 [0.0, 0.0, 0.548022198777919]
1 [1.6801770114633772, 0.0, 0.0]
2 [1.678910685857782, 0.0, 0.00015405496961668698]
3 [0.0, 2.1819904843216302, 0.0]
dtype: object
This has two main disadvantages:
- Performance: storing complex structures like lists is the opposite of what Pandas is designed to do (and where it excels). Essentially, Pandas _does not understand_ what we're storing in the `nmf` column (which is why the dtype is `object`).
- User Experience: the Series above just does not look nice for users; it provides no intuition as to what it's about.
For the reasons above (and some other small ones), we switch to the new VectorDF. Now, we separate every entry of the vector/list into its own subcolumn. Example:
nmf
nmf1 nmf2 nmf3
0 -1.319228 -0.388168 0.483747
1 -0.080735 -2.190099 0.594459
2 -0.132196 0.786491 0.715910
3 -0.233909 -1.365364 0.156663
4 -0.207253 0.880211 -0.156841
This preserves the atomicity in the cells that Pandas is designed for; we can now do vectorized operations on the columns etc.; no more object dtype! It's also much more intuitive for users: they can immediately see that what they get from hero.nmf looks like a matrix (of course it does not just look like one, it is a matrix - that's why it's called non-negative matrix factorization).
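To illustrate, here is a small mock of such a subcolumned output (random values, not a real NMF result) showing that vectorized operations now work directly:

```python
import numpy as np
import pandas as pd

# Mock of the subcolumned output described above (random values only).
nmf = pd.DataFrame(
    np.random.normal(size=(4, 3)),
    columns=pd.MultiIndex.from_product([["nmf"], ["nmf1", "nmf2", "nmf3"]]),
)

# Every cell is an atomic float, so the dtype is float64 (no more `object`),
# and column-wise vectorized operations work directly, e.g. row-wise L2 norms:
norms = np.sqrt((nmf["nmf"] ** 2).sum(axis=1))
```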
We believe this is a major step forward for texthero, delivering
performance enhancements and a cleaner and more intuitive UX.
Key challenges:
- Integration into the existing Pandas ecosystem.
We will need to ensure that users run into no problems working with the subcolumns, as Pandas has not done much to support this.
- Integration into Texthero:
- integration with other types. How will they fit in / be implemented?
- integration in the different modules.
It looks great!
We should just fully understand why the sub-column functionality as we see it does not work out of the box with the Pandas API.
Once we understand which changes we would need to make to the Pandas code to make it compatible with our vision, we should ask ourselves whether that's the right solution. We can also open a new issue on their GitHub and ask for clarification and advice. As you said, this is a major step forward for Texthero, and that's why it's even more important to do it right.
I believe we should start with the idea we want to find a solution that works without the need of changing the Pandas API.
Follow up question:
Can you clarify what changes to the Pandas API you think are necessary and why?
Yes, I'll explain some more and ask in the Pandas community. The gist of it is that we only need to make changes because users want to do this: df["pca"] = hero.pca(df["texts"]) (sorry for the bad example, of course you can't apply pca straight to the texts, but I hope it's clear; see the next comment).
Pandas Trouble
The only issue we see: integration with df["pca"] = hero.pca(df["texts"])
It's really important that users can seamlessly integrate Texthero's function output with their code. Let's assume a user has their documents in a DataFrame column df["texts"] that looks like this:
>>> df = pd.DataFrame(["Text of doc 1", "Text of doc 2", "Text of doc 3"], columns=["texts"])
>>> df
texts
0 Text of doc 1
1 Text of doc 2
2 Text of doc 3
Let's look at an example output that hero.pca could return with the new type:
>>> hero.pca(df["texts"])
pca
pca1 pca2
0 0.754675 1.868685
1 -1.861651 -0.048236
2 -0.797750 0.388400
(you can generate a mock output like this e.g. with pd.DataFrame(np.random.normal(size=(3, 2)), columns=pd.MultiIndex.from_product([['pca'], ["pca1", "pca2"]])))
That's a DataFrame. Great! Of course, users can just store this somewhere, e.g. df_pca = hero.pca(df["texts"]), and that works great. Accessing is then also as always: to get the pca values, they can just do df_pca.values and have the pca matrix right there!
However, what we see really often is users wanting to do this: df["pca"] = hero.pca(df["texts"]). This sadly does not work out of the box. The reason is that this subcolumn type is implemented internally through a MultiIndex in the columns. So we have
>>> df.columns
Index(['texts'], dtype='object')
>>> hero.pca(df["texts"]).columns
MultiIndex([('pca', 'pca1'), ('pca', 'pca2')])
Pandas cannot automatically combine these. So what we will do is this: calling df["pca"] = hero.pca(df["texts"]) is internally this: pd.DataFrame.__setitem__(self=df, key="pca", value=hero.pca(df["texts"])).
We will overwrite this method so that if self is not MultiIndexed yet and value is MultiIndexed, we transform self (so df here) to be MultiIndexed, and we can then easily integrate our column-MultiIndexed output from Texthero:
If df is MultiIndexed, we get the desired result through pd.concat([df, hero.pca(df["texts"])], axis=1).
Pseudocode (& real code): working on this atm
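A minimal sketch of the planned behaviour, written as a standalone helper under the assumption that the real change lives inside pandas' `__setitem__` (the name `assign_subcolumns` is ours for illustration, not Texthero or pandas API):

```python
import numpy as np
import pandas as pd

def assign_subcolumns(df, key, value):
    # Hypothetical helper mirroring the described __setitem__ override.
    if isinstance(value, pd.DataFrame) and isinstance(value.columns, pd.MultiIndex):
        if not isinstance(df.columns, pd.MultiIndex):
            # Lift the flat columns to a MultiIndex with an empty second
            # level so they can coexist with the subcolumned value.
            df = df.copy()
            df.columns = pd.MultiIndex.from_tuples([(c, "") for c in df.columns])
        # `key` is not needed here: the value carries its own top-level label.
        return pd.concat([df, value], axis=1)
    df = df.copy()
    df[key] = value
    return df

df = pd.DataFrame({"texts": ["doc 1", "doc 2", "doc 3"]})
pca = pd.DataFrame(
    np.random.normal(size=(3, 2)),
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)
result = assign_subcolumns(df, "pca", pca)
```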
Advantages / Why does this work?
- we don't destroy any pandas functionality, as currently calling `__setitem__` with a MultiIndexed value is just not possible, so our changes to Pandas do not break any Pandas functionality for users. We're only _expanding_ the functionality
- after multiindexing, users can still access their "normal" columns like before; e.g. `df["texts"]` will behave the same way as before even though the columns are now internally multiindexed as `MultiIndex([('pca', 'pca1'), ('pca', 'pca2'), ('texts', '')])`.
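That access behaviour can be checked with a small standalone snippet (assuming the column lift with an empty second level as described; this is not Texthero code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"texts": ["a", "b", "c"]})
# Lift to a column MultiIndex with an empty second level.
df.columns = pd.MultiIndex.from_tuples([("texts", "")])
pca = pd.DataFrame(
    np.zeros((3, 2)),
    columns=pd.MultiIndex.from_product([["pca"], ["pca1", "pca2"]]),
)
df = pd.concat([df, pca], axis=1)

# Selecting a top-level label whose only second-level label is ""
# gives back a plain Series, so df["texts"] behaves as before.
texts = df["texts"]
```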
Great!
This solves our initial problem, i.e. assigning a DataFrame to a Series column (basically that's it, the source of all our problems).
Ok, so this code will work just by updating __setitem__? That's great!
1. df = pd.DataFrame(...)
2. df["text"] = pd.Series(["a","b","c"])
3. df["pca"] = hero.pca(df.text)
My question is: the two DataFrames at lines 1. and 3. will have different column indexes (is there a better name?), right? At line 1. (and 2. as well) the indexes will be flat, whereas in 3. they will be a MultiIndex. How do you deal with that? Is updating __setitem__ enough, and by passing value=pd.DataFrame does Pandas automatically create a MultiIndex on the columns? Magic.
Quick thought: calling it MultiIndexing when the action actually makes changes to the columns might be quite confusing for users (even if that's not our fault ...). We will need to figure out how to explain to users what's really happening under the hood in a clear fashion, and we might want to design an abstraction mechanism similar to the HeroSeries.
Regarding the TokenSeries discussion, I believe we can apply the same solution as for pca and so on (i.e. return a DataFrame). The reason is that, conceptually, we can represent it as a PandasRepresentation: the first index level being the document, the second level the token (token-1, token-2, ...), and the value being the token itself ("a", "b", ...). As we can represent it sparsely, we don't have the issue of memory consumption, and the main advantage is that we keep atomicity and can leverage parallelization (Dask will do the job, solving our second issue, i.e. how to parallelize things)
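A hedged sketch of that idea (the index names and token-position labels are illustrative, not Texthero API):

```python
import pandas as pd

# Tokens stored in a document/token-position MultiIndex instead of
# lists in cells (labels chosen for illustration only).
tokens = pd.Series(
    ["hi", "there", "bye", "now"],
    index=pd.MultiIndex.from_tuples(
        [(0, "token-1"), (0, "token-2"), (1, "token-1"), (1, "token-2")],
        names=["document", "token_position"],
    ),
)

# unstack gives one row per document and one subcolumn per token slot.
wide = tokens.unstack()
```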
At line 1. (and 2. as well) the indexes will be flatten, whereas in 3. they will be MultiIndex. How do you deal with that? Updating setitem is enough and by doing value=pd.DataFrame Pandas automatically create a MultiIndex on the columns? Magic.
It's not really magic, we do have to switch the DF to MultiIndex in the code. However, users won't notice this; they can still do df["texts"] and it will behave exactly like before.
we will need to figure out how we can explain to users what's really happening under the hood in a clear fashion and we might want to figure out an abstraction mechanism similar to the HeroSeries
Yes, we should explain. However, as the "normal" columns like "texts" still behave like before, this should be a fairly quick explanation with no "consequences" for users.
See HERE for a discussion on the Pandas GitHub. We were able to reduce the code we need to change to 11 lines (plus lots of comments). See here for the code and a few examples at the bottom.
Regarding the TokenSeries discussion, I believe we can apply ...
Yes, we'll think about that:
TODO:
- improve documentation for what we're doing
- test edge cases
- think about what you wrote above (TokenSeries, RepresentationSeries)
- integration into texthero
First thoughts about the RepresentationSeries integration with the new VectorDF (we're just using that name for now, might think of a better one in the future).
Options:
1. Keep RepresentationSeries as output from tfidf, count, term_frequency. Will need RepresentationSeries input support for the dim. reduction and clustering functions.
   - Pro: easy, we already have the code, good solution.
   - Con: no DataFrame integration; users will have to use flatten (which we'll change to be similar to unstack) for RepresentationSeries -> DataFrame.
2. Throw out RepresentationSeries completely and store the whole document-term matrix in a DataFrame, probably by default limiting max_number=300.
   - Pro: looks good integrated into the DataFrame; easier for the other functions, as dim reduction and clustering don't have to support another Series type.
   - Con: tradeoff between information and storage, as we don't profit from sparseness: users won't be able to get a whole representation of their large data.
3. Experimental: tfidf, count, term_frequency give us a sparse matrix. We could make it seem for users like it is a column in their DataFrame but internally only store pointers to the actual sparse matrix. They'd thus have the best of both worlds: profit from sparseness to get the whole document-term matrix and also keep everything in their DataFrame. See the attached photo for a mock-up.
Not a big fan of 2.
Big fan of trying out 3. (we'll try around with this today some more) as it's the most interesting from a software-engineering perspective and also best for the users, but there might be some roadblocks we don't yet see. If 3. doesn't work, 1. seems good.
Sketch of option 3:
Full fan of 3. The main question is: can we store a sparse Pandas DataFrame in a mixed (sparse and not sparse) MultiIndex DataFrame?
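A small experiment suggests the answer can be yes for plain sparse columns (the column names here are illustrative, and this does not yet cover the pointer indirection of option 3):

```python
import pandas as pd

# Dense column under a column MultiIndex (empty second level).
dense = pd.DataFrame({("text", ""): ["doc 1", "doc 2", "doc 3"]})

# Sparse document-term subcolumns using pandas' SparseArray.
sparse = pd.DataFrame({
    ("tfidf", "term1"): pd.arrays.SparseArray([0.0, 1.5, 0.0], fill_value=0.0),
    ("tfidf", "term2"): pd.arrays.SparseArray([0.7, 0.0, 0.0], fill_value=0.0),
})

# Dense and sparse columns coexist in one MultiIndexed DataFrame.
mixed = pd.concat([dense, sparse], axis=1)
```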
Look forward to seeing what you came up with!
(Not so positive) update:
We have sadly now noticed this (so what we're doing, not just with the sparse stuff but overall in this issue, is maybe not a viable solution after all):
So our main issue is that we want to
- store a matrix in a DataFrame that looks nice, so not just one list per cell but rather one entry per cell (which we can achieve through the approach above with "subcolumns")
- and allow users to place this in their DataFrame with df["pca"] = ....
The problem we're now facing with our implementation:
When inserting a big matrix, so a DF with maybe 1000 subcolumns, pandas starts acting weird due to its block manager. See HERE for a great introduction to the topic. We're basically looking for a way to performantly add many many columns to a DF.
Two things happen:
- Pandas tries to consolidate columns of the same dtype into "blocks", which requires copying data around. If we now insert 5000 new columns, all the data has to be copied instead of just referenced.
- Weirdly, when doing
>>> x = np.random.normal(size=(10000, 5000))
>>> df_x = pd.DataFrame(x)
>>> y = np.random.normal(size=(10000, 5000))
>>> df_y = pd.DataFrame(y, columns=np.arange(5000, 10000))
>>> df_x[df_y.columns] = df_y
internally, when looking at the blocks, pandas has one block for the first 5k columns and then one block for each single column of the next 5k columns, so 5k blocks (we can see this by looking at df_x._data).
So our actual issue seems to be the block manager, which is not designed for a use case with thousands of columns and forces pandas to copy data around.
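A small illustration of the consolidation behaviour (sizes are made up, and `._mgr` is internal pandas API that may change between versions, so we only comment on it here):

```python
import numpy as np
import pandas as pd

# Inserting columns one by one creates one new block per insertion
# until pandas consolidates (len(df._mgr.blocks) is typically > 1 here).
df = pd.DataFrame(np.random.normal(size=(1000, 10)))
for i in range(10, 20):
    df[i] = np.random.normal(size=1000)

# Building the wide frame in one concat lets pandas produce a
# consolidated result instead of many single-column blocks.
wide = pd.concat(
    [pd.DataFrame(np.random.normal(size=(1000, 10))),
     pd.DataFrame(np.random.normal(size=(1000, 10)), columns=range(10, 20))],
    axis=1,
)
```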
We're investigating this.
Another update:
We did not get much further. We will still:
- try around to maybe make our VectorDF solution more performant (although it seems unlikely this succeeds)
- benchmark our current implementation with lists against our new improved "VectorDF" solution -> then decide on which to use
- benchmark our current RepresentationSeries vs. the new sparse stuff in VectorDF -> then decide which to use
It could be the case that our subcolumn solution is just not viable, as pandas is not designed to handle insertion of a large number of columns. So our output of hero.pca(s) would be a beautiful subcolumned DataFrame, but inserting it into their own DF would take forever for users.
Seems like something odd is going on. I tried to replicate your example above, but had to make modifications to get it to work:
import itertools as it
x = np.random.normal(size=(10000, 5000))
df_x = pd.DataFrame(x)
y = np.random.normal(size=(10000, 5000))
df_y = pd.DataFrame(y, columns=pd.MultiIndex.from_tuples(it.product(np.arange(1000), [0, 1, 2, 3, 4])))
I believe this is the form of the inputs you're going for, let me know if this is wrong.
Then doing the operation df_x[df_y.columns] = df_y on my machine takes 14 seconds and produces a single block, as seen by running df_x._data:
BlockManager
Items: MultiIndex([( 0, ''),
( 1, ''),
( 2, ''),
( 3, ''),
( 4, ''),
( 5, ''),
( 6, ''),
( 7, ''),
( 8, ''),
( 9, ''),
...
(998, 0),
(998, 1),
(998, 2),
(998, 3),
(998, 4),
(999, 0),
(999, 1),
(999, 2),
(999, 3),
(999, 4)],
length=10000)
Axis 1: RangeIndex(start=0, stop=10000, step=1)
FloatBlock: slice(0, 10000, 1), 10000 x 10000, dtype: float64
@rhshadrach thanks a lot for the post. We are sorry, we hadn't linked our corrected pandas method here. We overrode the setter method to replace this action: df_x[df_y.columns] = df_y
However, you just nailed the current problem we are having.
on my machine takes 14 seconds and produces a single block as seen by running df_x._data:
the action to insert the data takes a lot of time. We had a small look at the BlockManager, but it didn't seem to be the problem, because even without rearranging the blocks, we still had a huge delay.
Thank you Henri, Max, and Richard for your great comments,
@mk2510 and @henrifroese, how do you explain that your solution takes forever and produces more blocks, whereas @rhshadrach's solution takes "only" 14 seconds on a single block? It might be related to the way you overrode the setter method, or am I missing something?
Hmm, maybe we're overengineering a little; for us, 14 seconds is forever. We will do many performance comparisons tomorrow; maybe we're actually not worse (or even better) than the current solution, we'll see.
All right! Just one thing: what do you mean by "14 seconds is forever"?
For the sake of comparison, you might measure how long it takes to assign a (normal) Series with the same number of values as the matrix to a (normal) Pandas DF. If the size is very big, it might be okay to wait a second.
Yes, exactly. In some tests late today we got results showing that our new implementation takes several times as long as the current Texthero implementation with lists in cells. We want to make sure we have solid, reproducible comparisons, so we'll spend more time on that.
We will compare
- Current implementation with lists
- new implementation from above
- "advanced" new implementation where we change some more stuff (slicing, value copying, block consolidation) in pandas to speed things up for our specific use case of "putting a matrix into a DF"
And later on compare with the extremely high-dimensional sparse stuff.
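A rough sketch of such a benchmark, with made-up sizes and names (this is not the actual Texthero benchmark, just the shape it could take):

```python
import timeit

import numpy as np
import pandas as pd

n_docs, n_dims = 1000, 100
matrix = np.random.normal(size=(n_docs, n_dims))
base = pd.DataFrame({"text": ["doc"] * n_docs})

def insert_as_lists():
    # Current approach: one list/array per cell, dtype object.
    df = base.copy()
    df["pca"] = list(matrix)
    return df

def insert_as_subcolumns():
    # Proposed approach: lift to a column MultiIndex, then concat.
    df = base.copy()
    df.columns = pd.MultiIndex.from_tuples([(c, "") for c in df.columns])
    value = pd.DataFrame(
        matrix, columns=pd.MultiIndex.from_product([["pca"], range(n_dims)])
    )
    return pd.concat([df, value], axis=1)

t_lists = timeit.timeit(insert_as_lists, number=10)
t_subcols = timeit.timeit(insert_as_subcolumns, number=10)
```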
After some more discussions, we have come to the following conclusions:
Conclusions:
- Keep VectorSeries and TokenSeries as-is. Reason: inserting into existing DataFrames is too expensive for users.
- Change the RepresentationSeries output of tfidf, count, term_frequency to the new sparse VectorDF that we will henceforth call DocumentTermDF. Reason: it looks nicer and is just as performant.
- All dimensionality reduction and clustering functions will support both DocumentTermDF and VectorSeries input.