/activerecord-collections

Create collections of records, represented by ActiveRecord::Relation query criteria


ActiveRecord::Collections

An ActiveRecord::Collection sits somewhere between a model (a class extending ActiveRecord::Base), its ActiveRecord::Relation and an enumerable set of records. A collection wraps and delegates to these objects, deciding where to send each method call and executing queries only when needed (and as infrequently as possible). This lazy behavior allows for some interesting features. You can build a query using all your standard scopes and the model's relation object without executing it, then serialize the query criteria for use in a background job (instead of plucking and passing record IDs, for example). You can also break a collection that contains many records into smaller batches (using limit/offset collections) and traverse them without querying a batch until you want to work with it.

The implementation is nothing fancy: there is heavy use of delegation but not much beyond that. Still, I believe the concept is very powerful. Beyond the benefits of batching, serialization and the other features, collections let you use a single object and interface to represent a set of records, and use that object both to query and to operate on those records.

Highlights

  • Makes life easier by eliminating common boilerplate code when querying and working with collections of records.
  • Smart about delegating methods to your model's relation object or a collection of records, with the ability to force delegation to one or the other inline.
  • Executes the fewest queries possible, as late as possible, so nothing is queried from your database until you're ready to work with the records (though you can always load them yourself).
  • Built-in batching and pagination make it easy to break large result sets into smaller batches for faster querying and to process those batches concurrently in background jobs.
  • Ability to serialize query criteria and re-build a collection from serialized JSON. This allows you to build your query for records as a collection and pass it to a background job without having to query for object IDs inline in an HTTP request and pass those as job arguments, for example, or to store a dynamically built query for repeat/later use.
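
The limit/offset batching idea can be sketched in plain Ruby. This is only an illustration of how a large result set might be described as windows without touching the database; `batch_windows` is a hypothetical helper and not part of this gem's API.

```ruby
# Hypothetical helper (not part of activerecord-collections): describes each
# batch as a limit/offset pair without loading any records.
def batch_windows(total_records, batch_size)
  (0...total_records).step(batch_size).map do |offset|
    { limit: batch_size, offset: offset }
  end
end

windows = batch_windows(2_500, 1_000)
windows.each { |w| puts "LIMIT #{w[:limit]} OFFSET #{w[:offset]}" }
# prints:
#   LIMIT 1000 OFFSET 0
#   LIMIT 1000 OFFSET 1000
#   LIMIT 1000 OFFSET 2000
```

Each window is just data, so it could be handed to a separate worker and only turned into a query when that worker is ready for it.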

Lowlights

  • Be careful with method overlap and delegation! The collection prefers the ActiveRecord::Relation when delegating method calls, so if you have a method (maybe a scope) on your relation with the same name as a method (maybe an attribute) on your model, you'll want to make sure you use #on_records or #on_relation accordingly.
  • Because of the way the ActiveRecord::Collection behaves, it does not include the Enumerable module directly, and many common enumerable methods have not yet been implemented (like select, reject, etc.). If you need to use one of these methods you should call them on your collection of records directly, by grabbing an array of records with #to_a, or you can use #each or #map depending on your needs.
  • This was prototyped in and abstracted from the Instacart Rails application, and specs have not yet been ported and filled out.

Basic Usage

Define a Collection

To define a collection, simply extend ActiveRecord::Collection. If your collection class name is the pluralized version of your model name, no additional configuration is necessary.

class Things < ActiveRecord::Collection
  # uses the singular class name, so in this case our collectable model is Thing
end

You can define the collectable model for your collection if you're using a custom class name.

class ThingCollection < ActiveRecord::Collection
  collectable Thing
end

Query a Collection

Once you've defined your collection class, you can start using it just like your model to query for collections of records:

Things.where(an_attribute: value).order(:other_attribute)

More information can be found throughout this documentation.

Act on Results

You can easily call methods against each of the records in your collection either by using the default dynamic delegation or forcing delegation with #on_records.

Things.where(attribute: value).sync_to_cache    # calls the Thing#sync_to_cache instance method on all the records in the collection
Things.where(attribute: value).other_attribute  # returns an array of other_attribute values (mapping the records if loaded, or plucking the attribute if not)

This becomes much more powerful when you take batching into account and consider the boilerplate code you save not having to manually iterate over each batch aggregating values or performing actions.

Delegation

The way collections do what they do is mostly through delegation. The majority of the class is just convenience methods for wrapping the objects it represents (model, relation, records) and sending your method calls to the right one. Many important methods (such as query chain methods like where, joins, etc.) have custom definitions on the collection that know exactly what to do, but in some cases your method calls will find themselves routed through method_missing, at which point the collection will do its best to send them to the right place.

Collections prefer the ActiveRecord::Relation when doing dynamic delegation, so if the relation responds to a method, the relation receives the call. If the relation doesn't respond to the method, the collection attempts to route the call to the individual records. If neither responds, it falls back to the default behavior (almost always raising a NoMethodError).

Keep this delegation order in mind when calling methods, and use #on_records or #on_relation where appropriate so you don't, for example, accidentally apply a scope to your query when you meant to retrieve an array of attributes from your records.
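
The delegation order can be sketched with toy classes. This is a simplified illustration that runs without Rails, not the gem's actual implementation; `FakeRelation`, `FakeRecord` and `SketchCollection` are hypothetical stand-ins.

```ruby
# Toy stand-ins so the sketch runs without Rails; none of these classes come
# from the gem.
class FakeRelation
  def available
    "scope applied"
  end
end

FakeRecord = Struct.new(:name, :available)

class SketchCollection
  def initialize(relation, records)
    @relation = relation
    @records = records
  end

  # 1. prefer the relation, 2. fall back to each record, 3. raise NoMethodError
  def method_missing(name, *args, &block)
    if @relation.respond_to?(name)
      @relation.public_send(name, *args, &block)
    elsif records_respond_to?(name)
      @records.map { |record| record.public_send(name, *args, &block) }
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    @relation.respond_to?(name) || records_respond_to?(name) || super
  end

  private

  def records_respond_to?(name)
    !@records.empty? && @records.all? { |record| record.respond_to?(name) }
  end
end

collection = SketchCollection.new(
  FakeRelation.new,
  [FakeRecord.new("a", true), FakeRecord.new("b", false)]
)

collection.available # => "scope applied" (the relation wins over the attribute)
collection.name      # => ["a", "b"] (only the records respond to name)
```

Note how `available` never reaches the records here, which is exactly the overlap the paragraph above warns about.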

#on_records

Temporarily routes all dynamic delegation to the records in the collection for inline method calls and blocks.

  collection.on_records.do_thing                # calls do_thing on each record in the collection
  attrs = collection.on_records.some_attribute  # returns the some_attribute value for each record in an array
  collection.on_records do
    do_thing
    # self context here is the collection, which in turn routes your call to each record in the collection,
    # making this essentially the same as the first example above
  end

#on_relation

Temporarily routes all dynamic delegation to the relation for inline method calls and blocks. This is used much more rarely than #on_records since the default delegation prefers the relation.

  # pretend you have an 'available' column/attribute AND an 'available' scope on your model
  collection.on_relation.available  # calls the available scope on the collection relation
  collection.available              # same as above, default delegation prefers the relation
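
One way forced delegation like this could work is sketched below, again with toy classes rather than the gem's real code. The assumption here is that the inline form returns a copy of the collection with a fixed delegation target and that the block form evaluates the block against that copy; `SketchCollection` and its `forced` helper are hypothetical.

```ruby
# Toy stand-ins so the sketch runs without Rails; nothing here is the gem's code.
class FakeRelation
  def available
    :scope
  end
end

FakeRecord = Struct.new(:available)

class SketchCollection
  def initialize(relation, records)
    @relation = relation
    @records = records
    @forced_target = nil # nil means "use the default order: relation first"
  end

  def on_records(&block)
    forced(:records, &block)
  end

  def on_relation(&block)
    forced(:relation, &block)
  end

  def method_missing(name, *args, &block)
    to_records = @forced_target == :records ||
                 (@forced_target.nil? && !@relation.respond_to?(name))
    if to_records
      @records.map { |record| record.public_send(name, *args, &block) }
    else
      @relation.public_send(name, *args, &block)
    end
  end

  def respond_to_missing?(name, include_private = false)
    @relation.respond_to?(name) ||
      @records.any? { |record| record.respond_to?(name) } ||
      super
  end

  protected

  attr_writer :forced_target

  private

  # Without a block: return a copy pinned to the target.
  # With a block: evaluate the block with that copy as self.
  def forced(target, &block)
    scoped = dup
    scoped.forced_target = target
    block ? scoped.instance_eval(&block) : scoped
  end
end

collection = SketchCollection.new(
  FakeRelation.new,
  [FakeRecord.new(true), FakeRecord.new(false)]
)

collection.available                # => :scope
collection.on_relation.available    # => :scope
collection.on_records.available     # => [true, false]
collection.on_records { available } # => [true, false]
```

Working on a copy keeps the forced target temporary: the original collection goes back to preferring the relation as soon as the inline call or block finishes.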

Querying

For the most part querying with an ActiveRecord::Collection is just like querying with an ActiveRecord::Relation, which is what your model uses when you call something like Model.where.

The or method is the only query chain method with a slightly different signature:

  MyModel.where(something).or.where(other_thing)  # ActiveRecord::Relation
  MyCollection.where(something).or(other_thing)   # ActiveRecord::Collection

Other than that, the rest of the query chain behaves exactly the same, and you can use joins, includes, order, limit, where, not and others, along with any scopes defined on the model to build your query criteria.

Serialization

One of the most powerful features of active record collections is the ability to serialize your query criteria. Collections can be converted to/from a hash that describes the query criteria used to build the active record relation, and that hash can be converted to JSON and stored or passed around however you'd like.

Say you have a set of models and collections like these:

class Serial < ActiveRecord::Base
  has_many :games
end

class Publisher < ActiveRecord::Base
  has_many :games
end

class Developer < ActiveRecord::Base
  has_many :games
end

class Game < ActiveRecord::Base
  belongs_to :developer
  belongs_to :publisher
  belongs_to :serial

  scope :by_developer_id, -> (developer_ids) { where(developer_id: developer_ids) }
  scope :by_publisher_id, -> (publisher_ids) { where(publisher_id: publisher_ids) }
  scope :by_serial_id, -> (serial_ids) { where(serial_id: serial_ids) }
end

class Games < ActiveRecord::Collection
  protected

  def initialize(*criteria)
    super(Game, *criteria)
  end
end

We have game series (Serial), individual titles/releases (Game), publishers and developers. A game is part of a series and, in this simple example, is always developed by one developer and published by one publisher.

Here's what the hash and JSON would look like when querying a game collection by publisher and series:

Games.by_publisher_id(1).by_serial_id(1).to_hash
# => {:select=>[], :distinct=>nil, :joins=>[], :references=>[], :includes=>[], :where=>["\"games\".\"publisher_id\" = $1", "\"games\".\"serial_id\" = $1"], :order=>[], :bind=>[{:name=>"publisher_id", :value=>1}, {:name=>"serial_id", :value=>1}]}
Games.by_publisher_id(1).by_serial_id(1).to_json
# => "{\"select\":[],\"distinct\":null,\"joins\":[],\"references\":[],\"includes\":[],\"where\":[\"\\\"games\\\".\\\"publisher_id\\\" = $1\",\"\\\"games\\\".\\\"serial_id\\\" = $1\"],\"order\":[],\"bind\":[{\"name\":\"publisher_id\",\"value\":1},{\"name\":\"serial_id\",\"value\":1}]}"
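
Because the criteria hash contains only arrays, strings, integers and nil, it survives a JSON round trip unchanged. The sketch below uses only Ruby's standard `json` library and a hand-copied version of the hash above, so it runs without the gem; it says nothing about how `from_json` is implemented internally.

```ruby
require "json"

# The criteria hash from the example above, copied by hand.
criteria = {
  select: [], distinct: nil, joins: [], references: [], includes: [],
  where: ['"games"."publisher_id" = $1', '"games"."serial_id" = $1'],
  order: [],
  bind: [{ name: "publisher_id", value: 1 }, { name: "serial_id", value: 1 }]
}

json = JSON.generate(criteria)                     # safe to store or pass as a job argument
restored = JSON.parse(json, symbolize_names: true) # symbol keys again, at every level

restored == criteria # => true
```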

Now suppose you want to perform a bulk update against a collection of game records. You expect it to touch large numbers of records at a time and to be triggered from a form or button in your web UI, so you decide to write a background job that performs the update and sends a notification when it's done.

You might do something like this:

class GamesController < ApplicationController
  def bulk_update
    UpdatePublisherSerialGamesJob.perform_later(publisher.id, serial.id)
    redirect_to :back
  end
end

class UpdatePublisherSerialGamesJob < ActiveJob::Base
  def perform(publisher_id, serial_id)
    Game.by_publisher_id(publisher_id).by_serial_id(serial_id).each do |game|
      # update each game
    end
  end
end

But what happens when you decide you want to update games that belong to a developer and serial rather than a publisher and serial? You can make the arguments for the job a bit more dynamic, but you'll need to edit this job every time you have a new set of criteria for which you want to apply bulk updates.

You might switch to a job that accepts game_ids instead of criteria arguments:

class GamesController < ApplicationController
  def bulk_update
    UpdateGamesJob.perform_later(Game.by_publisher_id(publisher.id).by_serial_id(serial.id).pluck(:id))
    redirect_to :back
  end
end

class UpdateGamesJob < ActiveJob::Base
  def perform(game_ids)
    Game.where(id: game_ids).each do |game|
      # update each game
    end
  end
end

But now you have to pluck IDs from the database in order to queue up your job. This is admittedly not that heavy, but if you plan on processing large numbers of records, we can do better by passing the serialized collection instead:

class GamesController < ApplicationController
  def bulk_update
    UpdateGamesJob.perform_later(Games.by_publisher_id(publisher.id).by_serial_id(serial.id).to_json) # this is instant, and does not query the database
    redirect_to :back
  end
end

class UpdateGamesJob < ActiveJob::Base
  def perform(game_collection)
    Games.from_json(game_collection).each do |game|
      # update each game
    end
  end
end

Now you can queue up a job to process millions of records in just a few milliseconds, and you can pass any game collection (or even a batch) based on whatever criteria you want to your job!

The one thing to keep in mind when serializing and passing a collection around is that the records matching your criteria may change between the time you serialize the collection and the time you use it. In some cases this is good: you catch newer records that wouldn't have been caught if you'd queried and passed IDs, or you want to re-execute a query and collect results over time. In some cases it can be bad: you might want to operate on a very specific set of rows, in which case you'd probably want to query by ID anyway. In most common uses it's likely something you won't need to think about (it behaves just like the first example job above that accepts criteria arguments).
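
The difference between passing criteria and passing IDs can be seen in a small in-memory sketch. The array below stands in for the games table and the lambda stands in for serialized criteria; all rows and values are made up for illustration.

```ruby
# In-memory stand-in for the games table; rows and values are made up.
games = [
  { id: 1, publisher_id: 1 },
  { id: 2, publisher_id: 1 },
  { id: 3, publisher_id: 2 }
]

# Plays the role of the serialized criteria: evaluated only when the job runs.
matches = ->(game) { game[:publisher_id] == 1 }

# Plays the role of plucking IDs when the job is enqueued.
snapshot_ids = games.select(&matches).map { |game| game[:id] }

# A new matching row appears between enqueueing and running the job.
games << { id: 4, publisher_id: 1 }

by_ids = games.select { |game| snapshot_ids.include?(game[:id]) }.map { |game| game[:id] }
by_criteria = games.select(&matches).map { |game| game[:id] }

by_ids      # => [1, 2]
by_criteria # => [1, 2, 4]
```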

Batching

TODO

Iterating and Manipulating

TODO