Could statistical functions be added for use after group_by?

Question

linearhinos opened this issue 7 years ago · 4 comments

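After a group_by on the key, I would like a convenient way to compute the mean, compute the variance, or sort and then iterate over each group.
It would be good to have built-in functions for operations like these.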

Answer 1 · 2017-12-25T08:57:52.000Z

I don't understand what you mean.

Answer 2 · 2017-12-25T09:17:05.000Z

I mean: in addition to sum() and count(), could Bigflow support mean(), variance(), and other common statistical functions for PCollection?

Answer 3 · 2017-12-25T09:23:05.000Z

Actually, you can use:

def mean(p):
    return p.sum() / p.count()
    # this is syntactic sugar for p.sum().map(lambda s, c: s / c, p.count())

to implement mean in one line.

Then you can use it in apply_values, e.g.:

p.group_by_key()\
  .apply_values(mean)
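
To make the semantics concrete, here is a minimal plain-Python model of what the snippet above computes. It does not use Bigflow at all: `local_group_by_key` and `local_apply_values` are hypothetical stand-ins written only for this illustration, and the sample data is made up.

```python
from collections import defaultdict


def mean(values):
    # Same idea as the Bigflow version: sum divided by count.
    values = list(values)
    return sum(values) / float(len(values))


def local_group_by_key(pairs):
    # Hypothetical stand-in: collect the values of (key, value) pairs per key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def local_apply_values(groups, fn):
    # Hypothetical stand-in: run fn once per group, on that group's values only.
    return {key: fn(values) for key, values in groups.items()}


pairs = [("a", 1), ("a", 3), ("b", 10), ("b", 20), ("b", 30)]
per_key_mean = local_apply_values(local_group_by_key(pairs), mean)
```

The point is that the function passed to apply_values only ever sees the values of one key, so the same `mean` works for grouped and ungrouped data.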

If you want to use mean on a global (ungrouped) PCollection, you can just use apply:

p.apply(mean)

or just call it directly:

mean(p)

Because these functions are so easy to implement, we don't provide them as built-in methods.

If you find these functions difficult to write, you can always fall back on transforms.make_tuple(pobject1, pobject2).
For example, you can use transforms.make_tuple to implement mean like this:

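def mean(p):
    # make_tuple yields one (sum, count) tuple; index into it, since
    # tuple-unpacking lambdas such as `lambda (s, c): ...` are Python 2 only
    return transforms.make_tuple(p.sum(), p.count()).map(lambda sc: sc[0] / sc[1])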

You can also write a function that returns both the sum and the mean, and use it in apply_values like this:

def sum_and_mean(p):
    return transforms.make_tuple(p.sum(), p.apply(mean))

p.group_by_key().apply_values(sum_and_mean)
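
The variance asked about in the question can be built from aggregates in the same way. The sketch below is plain Python, not Bigflow code. It computes the population variance as E[x^2] - (E[x])^2 from a count, a sum, and a sum of squares. In Bigflow these three aggregates would presumably come from p.count(), p.sum(), and a sum over the squared values, combined with make_tuple as above, but that mapping is an assumption and has not been checked against the library.

```python
def variance(values):
    # Population variance from three aggregates: E[x^2] - (E[x])^2.
    values = list(values)
    n = len(values)                         # count
    total = sum(values)                     # sum
    total_sq = sum(v * v for v in values)   # sum of squares
    mean_value = total / float(n)
    return total_sq / float(n) - mean_value * mean_value


# Made-up sample data for illustration.
sample_variance = variance([1, 2, 3, 4])
```

Note that this one-pass formula can lose precision when the values are large and their spread is small, so a two-pass computation (mean first, then the average squared deviation) may be preferable for such data.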

Answer 4 · 2017-12-31T14:56:36.000Z

I think there should still be a module that provides commonly useful functions like these.