ravangen/graphql-rate-limit

[Question] Fixed size time window vs. Query cost/complexity validation

Closed this issue · 6 comments

I'm interested in the pros and cons of using a fixed-size time window vs. query cost/complexity validation for securing a GraphQL API against malicious queries.

Is using the fixed-size time window approach enough (does it provide no less (equal? more?) protection than query cost/complexity validation)? Or is it better to use both?

E.g. depth and amount limiting provide insufficient protection (less than query cost/complexity validation); see https://blog.apollographql.com/securing-your-graphql-api-from-malicious-queries-16130a324a6b. And cost/complexity validation alone is enough (if you use query cost/complexity validation, there is no need for depth and amount limiting).

Is the fixed-size time window approach easier to implement than query cost/complexity validation?

E.g. query cost/complexity validation requires defining how complex certain resolvers are via the performance tracking data exposed by Apollo Engine.

Hi @FluorescentHallucinogen,

Great question. I recommend using both query complexity analysis (like graphql-cost-analysis) and rate limiting (defense in depth). They overlap in their general purpose of protecting against malicious actors.

Query complexity tools focus on static analysis, applied as a validation rule before any resolvers are executed. The focus is on a single request's operation, deciding in isolation whether the server should attempt to fulfill the request. There is no partial result; it is allow or reject. Cost analysis won't care whether you receive one request per second or a thousand. A malicious actor could craft an operation that remains under a cost limit, but send a large number of requests.
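To make "validation rule, allow or reject" concrete, here is a minimal hand-rolled sketch (not this package, and far simpler than graphql-cost-analysis; the rule name and threshold are illustrative). It counts field selections in the document and rejects the whole operation before any resolver runs.

```typescript
import {
  ASTVisitor,
  GraphQLError,
  ValidationContext,
  buildSchema,
  parse,
  specifiedRules,
  validate,
} from 'graphql';

// Illustrative threshold; real cost tools use per-field costs and multipliers.
const MAX_FIELDS = 100;

// Static analysis only: count every field selection in the document and reject
// the whole operation if the total exceeds MAX_FIELDS. No resolver ever runs.
function maxFieldCountRule(context: ValidationContext): ASTVisitor {
  let fieldCount = 0;
  return {
    Field() {
      fieldCount += 1;
    },
    Document: {
      leave() {
        if (fieldCount > MAX_FIELDS) {
          context.reportError(
            new GraphQLError(
              `Operation selects ${fieldCount} fields, exceeding the limit of ${MAX_FIELDS}.`,
            ),
          );
        }
      },
    },
  };
}

// Usage: run alongside the spec's built-in rules, before execution begins.
const schema = buildSchema(`type Query { hello: String }`);
const errors = validate(schema, parse(`{ hello }`), [...specifiedRules, maxFieldCountRule]);
```

Real cost analysis is more sophisticated, but the shape is the same: pure pre-execution validation that either allows or rejects the request.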

I personally recommend using cost analysis over depth limiting to achieve a similar outcome. Depth limiting relies on an arbitrary depth cutoff, which can be a challenge across different types that take different approaches to data organization. Cost analysis doesn't care about depth, and multipliers can help in this regard.

GraphQL rate limiting is about the frequency and contents of operations. Rate limiting is done at execution time for each field/type being limited, which adds a bit of overhead compared to cost analysis, but the upside is that limits are enforced across all requests within the time window. Rate limiting can also be used to provide amount limiting. Once a limit is reached, an error is returned for that field, but the entire operation does not fail (so the client can receive a partial success response).
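For contrast, here is a rough sketch of what schema-level rate limiting looks like. The directive name and arguments below are illustrative, so check this package's README for the exact definition.

```typescript
// Illustrative SDL only; the real directive definition ships with the package.
const typeDefs = /* GraphQL */ `
  directive @rateLimit(limit: Int, duration: Int) on OBJECT | FIELD_DEFINITION

  type Query {
    # At most 10 resolutions of this field per limiter key per 60-second window.
    books: [Book!] @rateLimit(limit: 10, duration: 60)

    # Not limited: still resolves even when "books" is rejected in the same operation.
    me: User
  }

  type Book {
    title: String
  }

  type User {
    name: String
  }
`;
```

When the limit is hit, only books returns an error; me in the same operation still resolves, which is the partial success mentioned above.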

I believe the time to implement is comparable between cost analysis and rate limiting, but I am biased as I have already used both.

As for the article Securing Your GraphQL API from Malicious Queries, it was written before open source GraphQL rate limiting solutions were published. My understanding is that Spectrum now also uses rate limiting as part of their approach to defensive measures.

Hope that helps, happy to clarify specific aspects further.

Did I understand correctly that your package does the following 2 things?

  1. limits the execution time of operations
  2. limits the number of operations

The query cost/complexity packages do only 1 thing:

  • limit the maximum allowed cost/complexity of operation

But they don't limit the number of operations which have cost/complexity below the threshold (there may be a lot of them).

From a client development point of view one of the worst API experiences is a slow API. We don't want queries to take too long. (See https://medium.com/@ossareh/have-you-considered-time-limiting-the-execution-7dcd14037653.)

It's assumed that more complex queries require more time to execute. (Is this assumption true? What do we actually mean by cost/complexity? What is the unit of measurement of cost/complexity? Is it an understandable physical quantity/property (or their multiplication/division) or something abstract?)

When we limit the complexity, do we indirectly limit the execution time? But limiting the cost/complexity requires some additional work to define how complex certain resolvers are via the performance tracking data (e.g. exposed by Apollo Engine).

BTW, does the execution time limitation require this kind of additional work using the performance tracking data? How do we find adequate thresholds for execution time limitation?

Why not limit the execution time directly? When we limit the execution time, we automatically limit the cost/complexity. If a query is very complex, it takes a lot of time to execute, exceeds the threshold value, and gets rejected. Right?

Does this mean that the cost/complexity validation can be replaced with a more understandable execution time limitation? (Time is an intuitively understandable physical quantity/property, measured in (milli)seconds.)

The only thing missing is a limitation on the number of operations (to avoid an attack with a huge number of queries whose execution time is below the threshold or whose cost/complexity is below the threshold).

In other words, do the cost/complexity validation + number of operations limitation provide the same level of protection as the execution time limitation + number of operations limitation?

I'm worried that if I use both kinds of packages (for query cost/complexity analysis/validation and for query rate limiting) at once, I will end up with cost/complexity validation + execution time limitation + number of operations limitation (and using the first two at the same time is over-validation/overlap).

The imperfection of the query rate limiting approach, as I see it, is that it's an a posteriori limitation, not an a priori validation (applied prior to execution).

Does this mean that a malicious actor could develop a strategy in which they get an error instead of data, but still consume server resources?

Is the query rate limiting approach suitable for a high-load and/or public GraphQL API (with a huge number of users, where some queries should work without authentication)? In this case, how do you limit a specific user without limiting other users?

How does your package work with (nested) fragments, union types, and interfaces?

I will reply to the shorter comment first, but will follow up shortly for the other comment too.

Is the query rate limiting approach suitable for a high-load and/or public GraphQL API (with a huge number of users, where some queries should work without authentication)?

Rate limiting, I believe, is suitable regardless of the expected load. It is a defensive measure to protect your server resources from mistakes and abuse. The reasons to use or not use rate limiting are similar to those for any other API.

I originally created this package to be used for an open GraphQL API, by which I mean a public API that external (third party) developers can use to view and modify data. Same idea as GitHub v4 and Shopify's Admin GraphQL APIs. We plan to start using this package in the near future to help provide some boundaries around usage.

A request to a GraphQL server is generally either authenticated or not, using something like the Authorization header from the HTTP request (instead of passing the token as a resolver argument). Resolvers may do authorization checks, but that is independent of rate limiting. The order of operations of rate limiting vs. authorization depends on how you implement authorization. If authorization is done in the resolver, then rate limiting would be applied first.

In this case, how do you limit a specific user without limiting other users?

See the Context example. Essentially your GraphQL server will build a context early in the GraphQL request lifecycle (which could include things like an IP address or userId). Then the rate limiter can use the information from context to build a key to represent the field being resolved (potentially unique per user).
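A minimal sketch of that idea (the function name and context shape below are assumptions for illustration, not the package's exact API): derive the limiter key from the request context so each caller gets its own quota.

```typescript
// Hypothetical context shape, populated early in the request lifecycle.
interface RequestContext {
  ip: string;
  userId?: string; // absent for unauthenticated requests
}

// Scope the limit to the caller plus the field being resolved, so one user's
// traffic never counts against another user's quota.
function keyForField(context: RequestContext, parentType: string, fieldName: string): string {
  const caller = context.userId ? `user:${context.userId}` : `ip:${context.ip}`;
  return `${caller}:${parentType}.${fieldName}`;
}

// e.g. keyForField({ ip: '203.0.113.7' }, 'Query', 'books') === 'ip:203.0.113.7:Query.books'
```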

How does your package work with (nested) fragments, union types, and interfaces?

The directive can currently be applied only to OBJECT and FIELD_DEFINITION. See the GraphQL spec for a list of all valid directive locations.

I don't believe it makes sense to annotate this directive on a query fragment, as fragments are written by clients. The server will execute the necessary resolvers matching what the fragment is asking for. I haven't used union types heavily before, so I left them out. I believe the impact of this is that each object type in the union would need to have the directive applied individually. I am not sure what the expected behaviour of an interface would be.
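For example, a hypothetical schema where each union member carries its own annotation (directive arguments are illustrative):

```typescript
const typeDefs = /* GraphQL */ `
  union SearchResult = Book | Author

  # The directive cannot be placed on the union itself, so each member
  # object type is annotated individually.
  type Book @rateLimit(limit: 30, duration: 60) {
    title: String
  }

  type Author @rateLimit(limit: 30, duration: 60) {
    name: String
  }

  type Query {
    search(term: String!): [SearchResult!]
  }
`;
```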

Did I understand correctly that your package does the following 2 things?

  1. limits the execution time of operations
  2. limits the number of operations

The query cost/complexity packages do only 1 thing:

  • limit the maximum allowed cost/complexity of operation

Rate limiting does not take into account the expected average, p95, p99, etc. response time of a resolver. The fixed time window refers to the number of operations permitted per period, regardless of whether they are done in a single request or spread over hundreds of requests. It is only limiting the number of operations/fields.
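As an illustration of the fixed window itself (not this package's implementation; a production setup would typically use a shared store such as Redis rather than in-process memory):

```typescript
const WINDOW_MS = 60_000; // 60-second window (illustrative)
const LIMIT = 10;         // 10 operations per key per window (illustrative)

const windows = new Map<string, { windowStart: number; count: number }>();

// Returns true if the call identified by `key` is still within its quota for
// the current window, whether the calls arrive in one request or many.
function consume(key: string): boolean {
  const now = Date.now();
  const entry = windows.get(key);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    windows.set(key, { windowStart: now, count: 1 });
    return true;
  }
  entry.count += 1;
  return entry.count <= LIMIT;
}
```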

But they don't limit the number of operations which have cost/complexity below the threshold (there may be a lot of them).

Rate limiting does limit the number of occurrences, both within a request and across many requests within a time period, putting an upper bound on the count. Query analysis limits by a definition of cost/complexity, putting an upper bound on the intensity permitted in a single request.

From a client development point of view one of the worst API experiences is a slow API. We don't want queries to take too long. (See https://medium.com/@ossareh/have-you-considered-time-limiting-the-execution-7dcd14037653.)

One stance I would posit with GraphQL is that the API timing experience is not just dependent on the server, but also on what the client is asking of the server. Certainly the majority of the onus is on the server to be fast and to prevent less-than-ideal operations, but the client does get to specify what it wants. As with everything, there are use-case-dependent tradeoffs, such as fast and chatty vs. slow and data-intense APIs... is the client a user experience, or the server of an external partner?

It's assumed that more complex queries require more time to execute. (Is this assumption true? What do we actually mean by cost/complexity? What is the unit of measurement of cost/complexity? Is it an understandable physical quantity/property (or their multiplication/division) or something abstract?)

When we limit the complexity, do we indirectly limit the execution time? But limiting the cost/complexity requires some additional work to define how complex certain resolvers are via the performance tracking data (e.g. exposed by Apollo Engine).

For projects I have been involved with, server implementation details shaped how we defined the cost/complexity. Is there a cache involved, are we returning constants, is data stored in an external vendor's system (like a CMS with its own constraints), is the data paginated, is dynamic data aggregation required, etc. I will leave it between you and a cost analysis implementation to determine how best to apply it.
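As a made-up illustration of what those decisions can end up looking like (field names and numbers below are hypothetical):

```typescript
// Cheap cached/constant fields get low costs; paginated or externally
// fetched fields get higher costs (with list multipliers on top of that).
const fieldCosts: Record<string, number> = {
  'Query.settings': 1,  // constants served from memory
  'Query.me': 2,        // single cached row
  'User.orders': 5,     // paginated database query, multiplied by page size
  'Query.articles': 10, // call out to an external CMS
};
```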

BTW, does the execution time limitation require this kind of additional work using the performance tracking data? How do we find adequate thresholds for execution time limitation?

Why not limit the execution time directly? When we limit the execution time, we automatically limit the cost/complexity. If a query is very complex, it takes a lot of time to execute, exceeds the threshold value, and gets rejected. Right?

Neither cost analysis nor rate limiting focuses primarily on using execution timing information to inform its decisions. Engine certainly has aggregate execution information, and does expose it to developers through tools like Apollo GraphQL's performance insights. Hopefully some day Apollo will expose their internal endpoints to enable a validation rule to use the timing information. In the meantime, you have to manually convert your data insights (perhaps from Engine) into the configuration of costs and limits. Limiting by timing information alone does not protect your server from receiving and trying to execute an overload of requests.

The only thing missing is a limitation on the number of operations (to avoid an attack with a huge number of queries whose execution time is below the threshold or whose cost/complexity is below the threshold).

This is what this package tries to focus on exclusively: the number of operations and their composition.

In other words, do the cost/complexity validation + number of operations limitation provide the same level of protection as the execution time limitation + number of operations limitation?

I'm worried that if I use both kinds of packages (for query cost/complexity analysis/validation and for query rate limiting) at once, I will end up with cost/complexity validation + execution time limitation + number of operations limitation (and using the first two at the same time is over-validation/overlap).

I am not aware of "execution time limitation" implementations at this time. Were there any you were looking at?

The imperfection of the query rate limiting approach, as I see it, is that it's an a posteriori limitation, not an a priori validation (applied prior to execution).

Does this mean that a malicious actor could develop a strategy in which they get an error instead of data, but still consume server resources?

This package's implementation of rate limiting limits prior to execution of each field resolver. This happens during the execution phase of a GraphQL request, so GraphQL server resources will be consumed to process the request, but it starts returning an error or object once a limit is reached. This protects the resources sitting behind the API server (dependent on where the limits are annotated relative to how your resolvers are structured). GraphQL responses are generally not all-or-nothing; this allows partial success responses to clients, even if the rate limiting prevents data for every requested field from being returned. This will allow a malicious actor, just like an ordinary actor, to consume up to the permitted quantity. The package's design doc speaks to three possible means to apply rate limiting logic.
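A sketch of what such a partial success response could look like (field names and the error message are illustrative):

```typescript
// `books` hit its limit and comes back null with an entry in `errors`,
// while `me` in the same operation still resolved normally.
const exampleResponse = {
  data: {
    me: { name: 'Ada' },
    books: null,
  },
  errors: [
    {
      message: 'Too many requests, please try again shortly.',
      path: ['books'],
    },
  ],
};
```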

@FluorescentHallucinogen closing this issue due to inactivity. Hopefully the information provided helps give you some ideas for the next steps on protecting your server.