misoproject/dataset

groupBy method should allow more control over method option

adambom opened this issue · 2 comments

I want to be able provide a function for each of the columns I specify. Let me provide an example.

The following code doesn't work because it just does sum on the year column.

ds.where(function (row) {
    return row.year === 2010;
}).groupBy('race', ['population', 'year']).columns(['year', 'race', 'population']).toJSON();

I end up with a view that looks like this:

[ { _id: 39155,
    race: 'White',
    population: 77036790.6869098,
    year: 643200 },
  { _id: 39157,
    race: 'Black',
    population: 11941573.990469573,
    year: 643200 },
  { _id: 39159,
    race: 'Asian',
    population: 5038865.1221144805,
    year: 643200 },
  { _id: 39161,
    race: 'Hispanic',
    population: 13029068.243787775,
    year: 643200 },
  { _id: 39163,
    race: 'Other',
    population: 6686024.6175516695,
    year: 643200 } ]

Notice how the value for year is not 2010, which is what I want.

I'd like to propose an update to the method option in groupBy that allows me to pass an object with key being the column I'm aggregating by, and value being a function that determines the aggregation method. If none is specified, you get sum. I could rewrite my example above like so:

ds.where(function (row) {
    return row.year === 2010;
}).groupBy('race', ['population', 'year'], {
    method: {
        year: _.identity
        population: _.sum // optional, but showing here to be explicit
    }
})
.columns(['year', 'race', 'population']).toJSON();

This would return:

[ { _id: 39155,
    race: 'White',
    population: 77036790.6869098,
    year: 2010 },
  { _id: 39157,
    race: 'Black',
    population: 11941573.990469573,
    year: 2010 },
  { _id: 39159,
    race: 'Asian',
    population: 5038865.1221144805,
    year: 2010 },
  { _id: 39161,
    race: 'Hispanic',
    population: 13029068.243787775,
    year: 2010 },
  { _id: 39163,
    race: 'Other',
    population: 6686024.6175516695,
    year: 2010 } ]

Whatcha think? I'm open to other approaches too. What I'm trying to emulate is this SQL

SELECT year, race, sum(population) 
FROM mytable 
WHERE year=2010 
GROUP BY race;
iros commented

Could you give me a sample of your data? I'm not really sure why the year
is getting converted to something strange. Is it being added up?

-- Irene

On Thu, Jun 13, 2013 at 4:57 PM, Adam Savitzky notifications@github.comwrote:

I want to be able provide a function for each of the columns I specify.
Let me provide an example.

The following code doesn't work because it just does sum on the year
column.

ds.where(function (row) {
return row.year === 2010;}).groupBy('race', ['population', 'year']).columns(['year', 'race', 'population']).toJSON();

I end up with a view that looks like this:

[ { _id: 39155,
race: 'White',
population: 77036790.6869098,
year: 643200 },
{ _id: 39157,
race: 'Black',
population: 11941573.990469573,
year: 643200 },
{ _id: 39159,
race: 'Asian',
population: 5038865.1221144805,
year: 643200 },
{ _id: 39161,
race: 'Hispanic',
population: 13029068.243787775,
year: 643200 },
{ _id: 39163,
race: 'Other',
population: 6686024.6175516695,
year: 643200 } ]

Notice how the value for year is not 2010, which is what I want.

I'd like to propose an update to the method option in groupBy that allows
me to pass an object with key being the column I'm aggregating by, and
value being a function that determines the aggregation method. If none is
specified, you get sum. I could rewrite my example above like so:

ds.where(function (row) {
return row.year === 2010;}).groupBy('race', ['population', 'year'], {
method: {
year: _.identity
population: _.sum // optional, but showing here to be explicit
}}).columns(['year', 'race', 'population']).toJSON();

This would return:

[ { _id: 39155,
race: 'White',
population: 77036790.6869098,
year: 2010 },
{ _id: 39157,
race: 'Black',
population: 11941573.990469573,
year: 2010 },
{ _id: 39159,
race: 'Asian',
population: 5038865.1221144805,
year: 2010 },
{ _id: 39161,
race: 'Hispanic',
population: 13029068.243787775,
year: 2010 },
{ _id: 39163,
race: 'Other',
population: 6686024.6175516695,
year: 2010 } ]

Whatcha think? I'm open to other approaches too. What I'm trying to
emulate is this SQL

SELECT year, race, sum(population) FROM mytable WHERE year=2010 GROUP BY race;


Reply to this email directly or view it on GitHubhttps://github.com//issues/202
.

Yes, the year is being added up. It would be nice if there was a way to specify aggregators for all the columns, so this didin't happen.

The data looks like this:

year | race     | region | population
-------------------------------------
2010 | hispanic | south  | 100
2010 | hispanic | east   | 100
2010 | hispanic | west   | 100
2011 | hispanic | east   | 200
2011 | hispanic | south  | 200
2011 | hispanic | west   | 200
2012 | hispanic | west   | 300
2012 | hispanic | east   | 300
2012 | hispanic | south  | 300
2010 | white    | south  | 200
2010 | white    | east   | 200
2010 | white    | west   | 200
2011 | white    | east   | 300
2011 | white    | south  | 300
2011 | white    | west   | 300
2012 | white    | west   | 400
2012 | white    | east   | 400
2012 | white    | south  | 400
2010 | black    | south  | 150
2010 | black    | east   | 150
2010 | black    | west   | 150
2011 | black    | east   | 160
2011 | black    | south  | 160
2011 | black    | west   | 160
2012 | black    | west   | 170
2012 | black    | east   | 170
2012 | black    | south  | 170

What I'd like to get is data that looks like this:

year | race     | sum(population)
-----------------------------
2010 | hispanic | 300
2010 | white    | 600
2010 | black    | 450