baidu/bigflow

Request: add statistical functions for use after group_by?

linearhinos opened this issue · 4 comments

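After a group_by on the key, I would like a convenient way to compute the mean, compute the variance, or sort and then iterate over each group;
it would be great if built-in functions like these could be provided.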

I don't understand what you mean.

I mean: in addition to sum() and count(), could bigflow support mean(), variance(), and other popular statistical functions for PCollection?

acmol commented

Actually, you can use:

def mean(p):
    return p.sum() / p.count()
    # this is syntactic sugar for p.sum().map(lambda s, c: s / c, p.count())

This implements mean in one line.

You can then use it in apply_values, e.g.:

p.group_by_key()\
  .apply_values(mean)

Likewise, if you want to apply it to a global PCollection, you can just use apply:

p.apply(mean) 

or just call it directly:

mean(p)

Because these functions are so easy to implement, we don't provide them as built-in methods.

If you find it difficult to write these functions, you can always use transforms.make_tuple(pobject1, pobject2).
For example, you can use transforms.make_tuple to implement mean like this:

from bigflow import transforms

def mean(p):
    return transforms.make_tuple(p.sum(), p.count()).map(lambda (s, c): s/c)

You can also implement a function that returns both the sum and the mean, and use it in apply_values like this:

def sum_and_mean(p):
    return transforms.make_tuple(p.sum(), p.apply(mean))

p.group_by_key().apply_values(sum_and_mean)
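
Variance, which the original request also mentions, can be built from simple aggregates in the same way. The sketch below is plain Python, not Bigflow code: group_pairs and map_groups are hypothetical stand-ins for the group-then-apply pattern shown above, and the sample data is made up. It only illustrates the arithmetic (population variance as the mean of squares minus the square of the mean), so treat it as a starting point rather than a definitive implementation.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Hypothetical helper: collect (key, value) pairs into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def mean(values):
    # float() avoids integer division under Python 2
    return float(sum(values)) / len(values)


def variance(values):
    # population variance: E[x^2] - (E[x])^2
    m = mean(values)
    return mean([v * v for v in values]) - m * m


def map_groups(groups, fn):
    """Hypothetical helper: apply fn to the list of values of every key."""
    return {key: fn(values) for key, values in groups.items()}


# made-up sample data
pairs = [("a", 1), ("a", 3), ("b", 2), ("b", 2), ("b", 5)]
groups = group_pairs(pairs)
means = map_groups(groups, mean)
variances = map_groups(groups, variance)
```

Note that the E[x^2] - (E[x])^2 form can lose precision when values are large relative to their spread; a two-pass or Welford-style computation is more robust in that case.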

I think there should be a module that provides commonly used functions like these.