vendure-ecommerce/vendure

Multi-Vendor setup extremely slow on loading

Swampy469 opened this issue · 15 comments

Describe the bug
We are experiencing performance issues on our production (MultiVendor setup) environment. The main slowdowns occur in:

  • After SuperAdmin login
  • Facets loading on the FacetsList page
  • Inside the apply-facet-selector when searching a facetValue (Product Detail Page)

We believe the cause of the slowness is due to having around 400 channels (one for each seller) in production, whereas our test environment has only 20 channels and is very fast.

Regarding the login process, the slowness seems to be caused by loading the list of all channels. As for the facets, we are almost certain the issue is related to some interaction with the channels that somehow burdens the server.

The Vendure instance is hosted on an AWS EC2 instance with a fairly powerful CPU (c5.2xlarge). However, we are seeing CPU usage spikes of 100% when these operations are performed. We have not noticed any slowdowns on the RDS Postgres (in general) or on Elasticsearch, where the products are loaded.

From your performance tests, we understand they were conducted with about 22 channels. Could you verify with a multi-vendor configuration and around 400 channels? We can provide all the details related to our data and environment if necessary.

To Reproduce
Steps to reproduce the behavior:

  1. Create a multi-vendor instance
  2. Create a new Facet and add around 190 facetValues
  3. Create approximately 400 channels, each with at least one product (see the seeding sketch after this list)
  4. Assign the Facets to the respective channels
  5. Log in as SuperAdmin and open the facets page, and try to create a product and select a FacetValue
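
To make step 3 less tedious, here is a rough seeding sketch against the Admin API. It is an assumption about the setup (a local Admin API URL, an existing bearer session token, and a zone with id '1' for tax/shipping defaults), and the exact CreateChannelInput fields can vary between Vendure versions, so check your schema before using it:

// Hypothetical seeding script -- illustration only. Assumes a local Admin API,
// a bearer session token, and that zone id '1' exists for tax/shipping defaults.
// Field names follow the v2 CreateChannelInput as documented; verify against your schema.
const ADMIN_API_URL = 'http://localhost:3000/admin-api';
const AUTH_TOKEN = process.env.VENDURE_AUTH_TOKEN ?? '';

const CREATE_CHANNEL = /* GraphQL */ `
  mutation CreateChannel($input: CreateChannelInput!) {
    createChannel(input: $input) {
      ... on Channel { id code }
      ... on ErrorResult { errorCode message }
    }
  }
`;

async function seedChannels(count = 400): Promise<void> {
    for (let i = 1; i <= count; i++) {
        const res = await fetch(ADMIN_API_URL, {
            method: 'POST',
            headers: {
                'content-type': 'application/json',
                authorization: `Bearer ${AUTH_TOKEN}`,
            },
            body: JSON.stringify({
                query: CREATE_CHANNEL,
                variables: {
                    input: {
                        code: `seller-${i}`,
                        token: `seller-${i}-token`,
                        defaultLanguageCode: 'en',
                        pricesIncludeTax: true,
                        currencyCode: 'USD',
                        defaultTaxZoneId: '1',
                        defaultShippingZoneId: '1',
                    },
                },
            }),
        });
        const { data, errors } = await res.json();
        if (errors) throw new Error(JSON.stringify(errors));
        console.log('created channel', data.createChannel);
    }
}

seedChannels().catch(err => console.error(err));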

Expected behavior
The login process should be very fast.
Loading the facets should be quick within the facets page.
Loading the facets should be fast in the facet selection modal on the product page.

Environment (please complete the following information):

  • @vendure/core version: 2.2.4
  • Nodejs version: 20.12.2
  • Database (mysql/postgres etc): postgres

Additional context
The affected slow queries are:

  • activeAdministrator ~ 11 seconds to load
  • facets ~ 7 seconds
  • facetValues ~ 12 seconds

We think the issue is inside the ListQueryBuilder and the way it handles the relationships between Vendure entities, in particular the build method.
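
For context on where that suspicion points, here is a rough sketch of how a Vendure service typically uses ListQueryBuilder.build to produce a paginated, channel-aware list query. This is an illustrative example based on the documented public API, not the exact code path involved:

// Hypothetical service method, for illustration only: roughly how a facet list
// query gets built. The suspicion above is that the joins added here for
// channel-aware entities grow expensive as the channel count grows.
import { Injectable } from '@nestjs/common';
import { Facet, ListQueryBuilder, ListQueryOptions, RequestContext } from '@vendure/core';

@Injectable()
export class FacetListExample {
    constructor(private listQueryBuilder: ListQueryBuilder) {}

    findAll(ctx: RequestContext, options?: ListQueryOptions<Facet>) {
        return this.listQueryBuilder
            .build(Facet, options, {
                relations: ['values'],
                ctx,
                channelId: ctx.channelId, // joins the channel relation; one suspect for the slowdown
            })
            .getManyAndCount()
            .then(([items, totalItems]) => ({ items, totalItems }));
    }
}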

Hi, thanks for the detailed report. I think there are 2 separate issues here:

  1. Currently the way we model channels & permissions means that when the number of channels is very high, it results in very verbose data structures to enumerate all the permissions across each channel (see the sketch after this list for roughly what that data looks like). The fundamental solution for this is to re-think how we model the roles, permissions & channel relationship. This is discussed here:
  2. The performance of the facet query with many channels is probably some un-optimized query. This needs further investigation when I (or someone else) gets the chance.
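
To illustrate what "verbose data structures" means in practice, here is a simplified sketch (stand-in types, not the exact Vendure ones) of the per-channel permissions payload assembled for the current user. With ~400 channels, this array has ~400 entries, each carrying the full permissions list:

// Simplified illustration only -- the real Vendure types differ in detail.
// The point is that permissions are enumerated per channel, so the payload and
// the work to compute it grow linearly with the number of channels.
type Permission = string; // e.g. 'ReadCatalog', 'UpdateOrder', ...

interface UserChannelPermissions {
    id: string | number;   // channel id
    token: string;         // channel token
    code: string;          // channel code
    permissions: Permission[];
}

// For a SuperAdmin on a marketplace with ~400 seller channels, something like
// this gets assembled on login / on activeAdministrator queries:
const channelPermissions: UserChannelPermissions[] = Array.from({ length: 400 }, (_, i) => ({
    id: i + 1,
    token: `seller-${i + 1}-token`,
    code: `seller-${i + 1}`,
    permissions: ['ReadCatalog', 'UpdateCatalog', 'ReadOrder' /* ...dozens more */],
}));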

Hi @Swampy469, is the slowness happening when you log in as an admin user who has access to all facets and channels, or for all users, including users with restricted channels and permissions?

Hi @snapshotleisure,

The slowness occurs in two different situations:

  • During the login process of a SuperAdmin user (as you mentioned).
  • When accessing the Facets section and retrieving facets inside the product/variant section.

The latter issue also affects all users with restricted permissions.

Hi @snapshotleisure, I'm curious: did you notice this slowness after migrating to Vendure 2.2.4, or did it develop over time? In that case, what was the latest version where it was working fine for you?

I'm currently troubleshooting why all my requests hang or take an abnormally long time (or never respond) just after upgrading Vendure from version 2.1.8 to 2.2.7. I'm using Postgres on Google cloud SQL.

I don't have much more information. I'm using around 60 channels and 1,000 products. I cannot reproduce this behavior in development, where I have just a few channels and products.

I don't have any error logs or anything for now, so it's quite hard to understand what is happening.

Hi @gkielwasser, unfortunately I haven't encountered the problem yet; I haven't fully used Vendure outside of a POC. However, it was @Swampy469 who was having the problem, so he might be able to share some insight into what he was experiencing, sorry!

Hi @gkielwasser, In the early days of using Vendure (version 2.1.4) in our production environment, everything ran smoothly. Even with over 300 channels, the requests responded rapidly as usual. However, we started noticing some slowness after updating to version 2.2.0. Initially, it was just the login requests (only for superadmin user) that experienced delays, taking about 5 to 10 seconds to respond. But since updating to version 2.2.4, overall performance has noticeably declined, as we've reported in this issue.


@snapshotleisure Indeed, I quoted the wrong issue owner, sorry for that.

@Swampy469 In my case I have 2 cloud run revisions on the production environment, one is on 2.1.8, one is on 2.2.7.

The slowness issue is reproducible. As soon as I swap to 2.2.7, all requests hang. Going back to 2.1.8, request times return to "normal", but not immediately. I will investigate between 2.1.4 and 2.2.4 to determine since which version this has been happening.

I managed to reproduce it in my development env after migrating my SQLite dev database to Postgres. It will be far easier to solve.

The issue starts happening specifically in Vendure 2.2.0. It does not happen in Vendure 2.1.9, and it does happen in 2.2.7.

Environment:
Node: v18.20.2
Database: Postgres
Vendure: 2.2.0

Number of products: 500
Number of channels: 80

In my local env the GetActiveAdministrator request now takes more than 8s to resolve (less than 500ms before).
On production, it never resolves and seems to block any other requests.

@michaelbromley Do you know what introduced this behavior in 2.2.0?

@gkielwasser good to hear that you can reproduce this locally. That's a major step towards isolating the cause so we can get it fixed.

Vendure v2.2.0 made a significant update to the underlying TypeORM version, so there was quite a lot of work done on the data layer. In most cases that we measured this work actually sped things up, sometimes significantly. But it looks like there are cases where the opposite is true.

With GetActiveAdministrator, I suspect it is the roles, which enumerate all channels, that might be slow with 80 channels. But it's strange that it was fine with the earlier version, since the amount of data being queried has not changed.

I'll try to reproduce this myself.

Analyzing the issue with @Swampy469, we noticed CPU usage spikes on the server while the database did not seem to be impacted. This leads me to think that the server might be heavily parsing the data returned by the queries. Could this be the cause?

@michaelbromley Some more insights.

In a development environment with a fresh Vendure 2.2.7 project, initial data populated, and using Postgres, I cannot reproduce any request hang issues.

From what I have observed on my side, I would say the issue is linked to Postgres. I was not able to reproduce the issue with the exact same data using SQLite. However, once I migrated the data from SQLite to Postgres using pgloader, I was able to reproduce the issue, though I'm not sure of the integrity of my new Postgres db.

I have the ProductListQuery throwing an 'error.no-default-tax-zone-set' error, even though my default channel has its default tax zone id set. I guess it is related to something that went wrong during my migration steps, like a wrongly set foreign key.

It would be interesting to know, @Swampy469, if you can reproduce the issue in your development environment. Also, are you using Postgres in your development setup?

Currently, I'm not convinced that the issue is related to the number of channels or products. It could also be project-specific. I will try to streamline my database to validate this, particularly by cleaning up channels.

I have caught the function that is, on its own, taking all that time:

// Original source signature:
// private async activeUserCanReadRole(ctx: RequestContext, role: Role): Promise<boolean>

// Instrumented compiled output, with timing added:
async activeUserCanReadRole(ctx, role) {
    const startTime = performance.now();
    const permissionsRequired = (0, get_user_channels_permissions_1.getChannelPermissions)([role]);
    // One awaited permission check per channel of the role -- this loop is the hot path
    for (const channelPermissions of permissionsRequired) {
        const startTimeOnChannel = performance.now();
        const activeUserHasRequiredPermissions = await this.userHasAllPermissionsOnChannel(
            ctx,
            channelPermissions.id,
            channelPermissions.permissions,
        );
        const endTimeOnChannel = performance.now();
        console.log(`activeUserCanReadRole userHasAllPermissionsOnChannel took ${endTimeOnChannel - startTimeOnChannel} milliseconds`);
        if (!activeUserHasRequiredPermissions) {
            return false;
        }
    }
    const endTime = performance.now();
    console.log(`activeUserCanReadRole took ${endTime - startTime} milliseconds`);
    return true;
}

In my case it iterates over 73 channel permission entries.
Each userHasAllPermissionsOnChannel call takes around 100ms.
The total execution time is about 7,500ms (roughly 73 × ~100ms).

Great detective work. Yes, I can see that code path is absolutely un-optimized. I'm certain that this execution time can be drastically reduced.

In particular this part:

const user = await this.connection.getEntityOrThrow(ctx, User, ctx.activeUserId, {
    relations: ['roles', 'roles.channels'],
});
const userChannels = getUserChannelsPermissions(user);

is being executed for every channel, but only needs to be executed once and then cached.
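
The actual patch isn't shown here, but as a rough sketch of the pattern being described (hoist the user lookup out of the per-channel loop and reuse the computed channel permissions), with simplified stand-in types rather than the real Vendure entities:

// Simplified stand-in types -- not the real Vendure entities.
type ID = string | number;
interface ChannelPermissions { id: ID; permissions: string[]; }

// Before: the user (with roles & channels) was re-fetched inside the
// per-channel permission check, once for every channel of the role.
// After: fetch once, index by channel id, then each per-channel check
// becomes an in-memory set lookup.
async function activeUserCanReadRole(
    requiredPerChannel: ChannelPermissions[],
    loadUserChannelPermissions: () => Promise<ChannelPermissions[]>,
): Promise<boolean> {
    // Single DB round-trip instead of one per channel.
    const userChannels = await loadUserChannelPermissions();
    const byChannelId = new Map(userChannels.map(c => [c.id, new Set(c.permissions)]));

    for (const required of requiredPerChannel) {
        const userPermissions = byChannelId.get(required.id);
        if (!userPermissions || !required.permissions.every(p => userPermissions.has(p))) {
            return false;
        }
    }
    return true;
}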

Some of my initial assumptions were a bit biased. While debugging the slowness, mainly related to the GetActiveAdministrator function, I discovered that my production package.json was enforcing typeorm 0.3.13 with a resolution rule, which I have now removed. I also updated pg to 8.12.0.

Now, in production with Vendure 2.2.7, my requests no longer hang like before. However, I still have this famous request that takes around 1 second. I have 53 channels, so it averages less than 20ms per channel. While this is not critical in my case, it could become problematic with a higher channel count.

I'm still pleased to have completely removed SQLite and now have a mirrored Postgres database type on dev & prod. This has been on my roadmap for a long time, and removing SQLite has eliminated a lot of unnecessary extra work and complexity.

Good news on this:

I ran some benchmarks locally to measure the performance optimization mentioned above.

Using the following query against the Admin API (the exact same query run by the Admin UI app):

query GetActiveAdministrator {
  activeAdministrator {
    ...Administrator
    __typename
  }
}

fragment Administrator on Administrator {
  id
  createdAt
  updatedAt
  firstName
  lastName
  emailAddress
  user {
    id
    identifier
    lastLogin
    roles {
      ...Role
      __typename
    }
    __typename
  }
  __typename
}

fragment Role on Role {
  id
  createdAt
  updatedAt
  code
  description
  permissions
  channels {
    id
    code
    token
    __typename
  }
  __typename
}
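
For reference, a minimal sketch of how one could measure the average response time of such a query against a local Admin API. This is an assumption about the setup (localhost URL, Node 18+ with global fetch, and an existing bearer session token, which Vendure returns in the vendure-auth-token response header when using the bearer-token session method), not necessarily how the numbers below were produced; the query is also shortened here rather than repeating the full fragments.

// Minimal benchmark sketch (Node 18+, global fetch). Assumes a locally running
// Admin API and an existing session token -- adjust URL/auth to your setup.
const ADMIN_API_URL = 'http://localhost:3000/admin-api';
const AUTH_TOKEN = process.env.VENDURE_AUTH_TOKEN ?? ''; // obtained after logging in

const query = /* GraphQL */ `
  query GetActiveAdministrator {
    activeAdministrator {
      id
      user { id roles { id permissions channels { id code token } } }
    }
  }
`;

async function measure(runs = 20): Promise<void> {
    const timings: number[] = [];
    for (let i = 0; i < runs; i++) {
        const start = performance.now();
        const res = await fetch(ADMIN_API_URL, {
            method: 'POST',
            headers: {
                'content-type': 'application/json',
                authorization: `Bearer ${AUTH_TOKEN}`,
            },
            body: JSON.stringify({ query }),
        });
        await res.json();
        timings.push(performance.now() - start);
    }
    const average = timings.reduce((sum, t) => sum + t, 0) / timings.length;
    console.log(`average response time over ${runs} runs: ${average.toFixed(0)}ms`);
}

measure().catch(err => console.error(err));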

Baseline

5 channels, unoptimized

average response time: 85ms

Many channels

505 channels, unoptimized

average response time: 2,905ms

Optimized

505 channels, optimized per the commit above

average response time: 106ms

So that's a speedup of around 27x (2,905ms → 106ms) for ~500 channels.