apache/superset

Server side pagination for charts (tables)

Closed this issue Β· 23 comments

Is your feature request related to a problem? Please describe.

As a analyst I need to pull in the entire resultset to do my work. There is no way to either catagorize or aggregate information before hand. If I pull the entire result set (25M rows) I crash the browser or if I enforce a limit then the subsequent filters don't work anymore.

It balloons memory usage on both the server, caches, and web client. This makes it close to impossible to use Superset for this use case.

Describe the solution you'd like

Pagination should happen on the server side rather than on the client side. This means I will pull in limited set of data to the client and when it is out of data it will reach out to the server to request the remaining data.

Describe alternatives you've considered

SQL lab does not cut this as the end users are not able to integrate this with their work flow.

Additional context

This was previously mentioned (paraphrased) in #1433 , #4037 without resolution.

@mistercrunch @kristw @john-bodley

Trying to understand your use case better, are you trying to download that 25M rows to your laptop ultimately to do notebooky-type things? Comb through every record? Slice and dice the result set?

Currently when operating in async mode, SQL Lab (on a celery worker) will dump the larger result set to a caching backend (we typically use S3) to allow for CSV exports. Two limits are applied, one to what the database server will return, one to what the browser will get.

On the user side, one approach could be to CREATE TABLE AS first using the IDE, and then running some LIMIT / OFFSET statements against the result set. A similar approach could be used by the system, assuming that:

  • we and/or the user has the rights to CTAS on that database (or maybe another database serving that purpose, but that's starting to get complex)
  • the database's LIMIT OFFSET probably isn't super efficient and hopefully ordering is deterministic, it's rarely not homogenous across database engine, like with Presto you'd have to do some tricks around row_number()

Alternatively maybe the web server could page when reading the CSV out of S3.

To elaborate:

We build an interactive dashboard for investigators. Those investigators are not able (knowledge wise) to create their own dashboards / queries for this particular investigation.

So, we have two tables one with 25M transactions and another with 6M unique account numbers. A users needs to be able to filter the account number table to find a specific account number. Selecting this should then also filter the transaction table which can be a lot of transactions (easily > 70K)

We have two options we have a filter box to select the account number this however seems to have a hard coded limit of much lower than the 6M we need. So this means missing account numbers when a user searches. Second option is to create another table which has a search box and can emit filter events. We can control the limit here, but 6M accounts will be sent to the browser as filtering happens clients side even if we use 'pagination'. If we set the limit to something the browser can handle the search will miss account numbers.

On your comments. I think you are missing the mark, cause you seem to be talking about exports. We literally just need to be able to filter a large table on the front end without blowing up the browser and doing efficiently as multiple users will use the same tables.

If needed we can have a video chat to show the issue. Fixing this would really help superset adoption (at least with us ;-) ) as it enables some really big dashboards.

I might be overlooking something, but adding support for pagination in the backend should be fairly simple: Adding an offset element to query_obj and then adding that to the select statement in the SQLAlchemy model should be enough. However, I can't speak for how big an impact this change would require in the frontend. But I do see the value of this proposition, as it would make it possible to default to very low limits (e.g. 50), and then lazy loading more if necessary, if the DB handles offsets efficiently.

I have been looking into this, and think I have it sorted on the backend side. However, I decided not to put in a PR yet, as the final backend design will most likely change slightly based on what works best in the frontend, so needs to be developed in parallel with the frontend (was thinking about starting with the table plugin as a POC https://github.com/apache-superset/superset-ui-plugins/tree/master/packages/superset-ui-legacy-plugin-chart-table ). So once the plugins repo opens up to community PRs I will be happy to work on this.

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue .pinned to prevent stale bot from closing the issue.

nytai commented

While adding pagination to the table chart (especially ungrouped ones) should definitely be done, I think the table chart's search functionality would need to hit the backend or be part of the query for this to solve the use case brought up. I'm fairly certain the table charts search is local (ie, it's only searching over the data held in the client).

Something this issue highlights is that the filter box runs one query to fetch it's data (with an upper limit) and just runs a local search over that data. A much better design would hit the backend while searching to fetch additional rows lost with the initial limit. This could mean a slow search experience, depending on the cardinality/database conditions, so a an option to enable/disable this behavior would probably make sense.

Hi All,
Has this been implemented in new version of superset?
Can we implement something like this for pagination?
https://datatables.net/examples/data_sources/server_side

I think the backend request is best handled by simply adding the offset parameter to the request, making it possible to paginate any type of query, whether they include groupings or not. I like the DataTables example provided by @anandhurajan , and I think a similar search-functionlity is best handled by either extending the filter box chart, or designing a new native component as discussed/proposed in #8452 .

Hi All,
Any update on this?

As suggest by @villebro I applied offset parameter to query_obj by adding a offset-limit widget on table-chart. We are able to render the limited/specific( i.e every time, we have to provide static-offset-value) records based on provided offset value(static-offset-limit), but I could not reach @anandhurajan scenario(https://datatables.net/examples/data_sources/server_side) where offset value is dynamically created.

Our scenario is to get dynamic offset value based on page_number and show_entries, for the same it should be called explore_json/any_api to get data_resource by running SQL_Query from back-end. Do you have any idea to manipulate/update on front-end to get dynamic offset-values( page-number*entries), just like as provided on client_side_pagination on table view.

Hi All, Any plans on this?

I'll probably be adding backend support for pagination to the new /api/v1/chart/data endpoint in the coming weeks as part of SIP-38 work. However, I currently can't speak for when support will be added to individual visualizations, as the current SIP-38 effort aims to address some fundamental problems in the current viz architecture. But hopefully these types of features can more easily be added by the community in the future, as the viz plugin development process becomes less technically challenging.

Hi All, Any update on this?

@villebro do you have any idea when we're adding it to the table?

As suggest by @villebro I applied offset parameter to query_obj by adding a offset-limit widget on table-chart. We are able to render the limited/specific( i.e every time, we have to provide static-offset-value) records based on provided offset value(static-offset-limit), but I could not reach @anandhurajan scenario(https://datatables.net/examples/data_sources/server_side) where offset value is dynamically created.

Our scenario is to get dynamic offset value based on page_number and show_entries, for the same it should be called explore_json/any_api to get data_resource by running SQL_Query from back-end. Do you have any idea to manipulate/update on front-end to get dynamic offset-values( page-number*entries), just like as provided on client_side_pagination on table view.

can you please share your approach? I'm trying to get this feature on 0.36. Thanks in advance

Some notes here:

  • if we're not caching the result set, this means every page switch means a database scan
  • pagination to be deterministic assumes ordering, meaning that in @bolkedebruin 's example of 25M rows with 100 row pages, skimming through 10 pages of 100 row each means scanning / sorting 25M row for each page
  • this seems like potentially a lot work for the database to serve little to the user

Ultimately we could have "advanced pagination" options that let the user choose between server and client-side pagination options.

I think that when people say "I want to browse through millions of rows" it's pointing to other problems. It very rarely makes sense for anyone to comb through millions of individual rows.

Hi, Any solution for this?

limit/offset is not an option for pagination: it is terribly inefficient as all the previous pages should be computed each time. Please rely on primary keys if available.

Having a server-side cache would allow to have a coherent result (avoiding changes when navigating through the pages) and pave the way to fix big CSV export (#10953).

@mistercrunch @frafra The typical way to solve this is to rely on a primary key or on a index and remember where you left off. In mariadb terms SELECT ... WHERE ... AND id < $left_off ORDER BY id DESC LIMIT 100. This allows the database to make use of the physical layout of the data on disk and is extremely efficient. There is probably no need even to cache the result in case of traditional rdbms (e.g. Postgres, Mysql; Hive/Presto you do want to cache due to query lead times)

So what could be possible is to enable server side pagination for tables that have a primary key or an index and disable it for tables that don’t.

Added this request to Superset roadmap, will get to it Q2-Q3.

FYI @bolkedebruin this is currently being implemented here

this was fully implemented lately:
apache-superset/superset-ui#985
Lets start closing issues ;)

Hi ,

I'm still facing the issue in server side pagination ... I'm not able to navigate to different pages ... I have attached the video recording ...

Screencast.2021-07-20.00_24_48.mp4

Uploading Screencast 2021-07-20 00_24_48.mp4…