data-forge/data-forge-ts

Improve documentation for `DataFrame.concat` in relation to index

mz8i opened this issue · 5 comments

mz8i commented

I recently ran into a problem where I had some very unexpected aggregation results that only showed up when I was working on a concatenated dataframe.
Only after quite a bit of debugging I realised this was to do with how DataFrame.concat() works in relation to the index, in that it just, well, concatenates the index, too.

You could say the behaviour is justifiable, though you could argue that a DataFrame with duplicated index values is actually malformed...?

But in any case, from the perspective of a newcomer to the library, some improvements to the documentation could help avoid such mistakes:

  1. First of all, the Concatenation section of the guide doesn't mention the index at all. I think this behaviour should be explicitly mentioned, because a lot of users will be simply using the default zero-based index, in which case concatenating two dataframes will lead to unexpected results, for example when running generateSeries later on.

Consider the following case:

const df1 = new DataFrame({
   columnNames: ['Column1', 'Column2'],
   rows: [[1, 10], [2, 20]]
});

const df2 = new DataFrame({
   columnNames: ['Column1', 'Column2'],
   rows: [[3, 30], [4, 40]]
});

const dfc = DataFrame.concate([df1, df2]);
// the index is now [0, 1, 0, 1]...

dfc.generateSeries({
   NewColumn: row => row.Column1 * 2
});
// the results will not be what one expects because of the duplicated index values
  1. The resetIndex method is not mentioned in the guide at all, it's only to be found in the API Reference. Given the above seems to be a relatively common use case and also presumably a common pitfall, I'd suggest adding a sentence on how to deal with concatenation of zero-indexed dataframes:
const dfc = DataFrame.concat([df1, df2]).resetIndex();

Looking through the issues, I think #142 was also affected by this topic so it seems that more people would use better docs on this topic.

Thanks!

Please feel free to make some improvements to the docs and submit a PR!

mz8i commented

OK, will do!
Regarding the concatenation behaviour, what do you think about the current behaviour when concatenating two zero-based indices? Should it be left as-is? Are there any use cases for a dataframe with duplicate index values, or should it be considered invalid?
I'm wondering if the behaviour should at least be that a warning is printed if the index contains duplicates after the concatenation. And possibly the index gets automatically reset in such a case?

I don't think the duplicated index is a problem, in fact I rely on the index being preserved especially when the index is a date - I don't want to lose the date attached to each row.

Feel free to add a warning to the documentation.

mz8i commented

I saw you closed this recently @ashleydavis - sorry for being slow about submitting a PR, I just took a stab at the wording in #164

Thanks so much for your contribution! I really appreciate it.