drupalwxt/helm-drupal

Using an Azure MySQL DB is twice as slow as an AKS hosted MySQL DB

Closed this issue · 32 comments

Good afternoon Will and Zach,

This is not an issue with the helm chart per say... but rather a question regarding the recommendation to use a CSP PaaS DB vs a K8s POD based DB for production. I just tried migrating our site to an Azure MySQL Flexible instance and I am finding that the site is less responsive... about half as responsive. The Azure MySQL PaaS instance is using a Private Endpoint on the same vnet as the AKS cluster running the Drupal solution. The MySQL PaaS instance is also super beefy... like as big as the node pool running the whole Drupal solution ;-) and yet it still perform poorly (not that the POD based DB was fast either).

Have you experienced the same at Stats Canada? Do you have any idea why this would be the case?

Regards,

Bernard

What is even stranger is that using sysbench I can confirm that the Azure MySQL is way faster than the AKS POD MySQL instance... but yet serving pages is twice as slow when using Azure MySQL...

POD MySQL:

root@troubleshooting-shell:~# sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=drupal-wxt-prod-mysql --mysql-port=3306 --mysql-user=wxt --mysql-password='dgdsfgfdsdfgsd9G' --mysql-db=wxt --db-driver=mysql --tables=3 --table-size=10000 --report-interval=10 --threads=128 --time=60 run 2>&1
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 128
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 128 tps: 36.17 qps: 870.10 (r/w/o: 630.92/154.36/84.82) lat (ms,95%): 5312.73 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 128 tps: 43.40 qps: 879.95 (r/w/o: 619.73/173.61/86.60) lat (ms,95%): 4943.53 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 128 tps: 44.80 qps: 872.40 (r/w/o: 605.60/176.70/90.10) lat (ms,95%): 4943.53 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 128 tps: 44.70 qps: 896.60 (r/w/o: 630.40/177.00/89.20) lat (ms,95%): 5217.92 err/s: 0.10 reconn/s: 0.00
[ 50s ] thds: 128 tps: 42.60 qps: 860.10 (r/w/o: 602.20/172.80/85.10) lat (ms,95%): 5312.73 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 128 tps: 43.40 qps: 863.50 (r/w/o: 601.70/175.00/86.80) lat (ms,95%): 5409.26 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            37492
        write:                           10708
        other:                           5354
        total:                           53554
    transactions:                        2676   (43.09 per sec.)
    queries:                             53554  (862.41 per sec.)
    ignored errors:                      2      (0.03 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          62.0964s
    total number of events:              2676

Latency (ms):
         min:                                  202.49
         avg:                                 2926.82
         max:                                 9196.74
         95th percentile:                     5217.92
         sum:                              7832167.98

Threads fairness:
    events (avg/stddev):           20.9062/1.71
    execution time (avg/stddev):   61.1888/0.63

Azure MySQL:

root@troubleshooting-shell:~# sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=myssc.mysql.database.azure.com --mysql-port=3306 --mysql-user=azureadmin --mysql-password='dfgdfgdsffdgdg' --mysql-db=wxt --db-driver=mysql --tables=3 --table-size=10000 --report-interval=10 --threads=128 --time=60 run 2>&1
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 128
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 128 tps: 240.64 qps: 5108.67 (r/w/o: 3623.26/985.54/499.87) lat (ms,95%): 1708.63 err/s: 5.90 reconn/s: 0.00
[ 20s ] thds: 128 tps: 265.91 qps: 5424.82 (r/w/o: 3812.09/1074.22/538.51) lat (ms,95%): 1561.52 err/s: 6.60 reconn/s: 0.00
[ 30s ] thds: 128 tps: 235.31 qps: 4815.79 (r/w/o: 3391.73/947.14/476.92) lat (ms,95%): 1708.63 err/s: 6.30 reconn/s: 0.00
[ 40s ] thds: 128 tps: 201.70 qps: 4170.93 (r/w/o: 2940.32/818.91/411.70) lat (ms,95%): 2120.76 err/s: 8.30 reconn/s: 0.00
[ 50s ] thds: 128 tps: 273.79 qps: 5605.46 (r/w/o: 3943.40/1106.47/555.59) lat (ms,95%): 1479.41 err/s: 8.00 reconn/s: 0.00
[ 60s ] thds: 128 tps: 246.30 qps: 5032.82 (r/w/o: 3536.41/997.60/498.80) lat (ms,95%): 1648.20 err/s: 6.20 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            212520
        write:                           59672
        other:                           29945
        total:                           302137
    transactions:                        14765  (241.76 per sec.)
    queries:                             302137 (4947.10 per sec.)
    ignored errors:                      415    (6.80 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          61.0717s
    total number of events:              14765

Latency (ms):
         min:                                   41.12
         avg:                                  525.00
         max:                                 5863.51
         95th percentile:                     1678.14
         sum:                              7751658.26

Threads fairness:
    events (avg/stddev):           115.3516/11.06
    execution time (avg/stddev):   60.5598/0.33
sylus commented

@bernardmaltais we actually solved this, and went from pretty horrible performance to it being roughly 3-5x faster.

You will need to use:

https://docs.microsoft.com/en-us/azure/mysql/howto-redirection

See how we did it:

Docker

drupalwxt/docker-scaffold@103f2b6

Helm Chart

  # PHP configuration
  php:
    ini:
      mysqlnd_azure.enableRedirect: on

  extraSettings: |-
    $databases['default']['default']['pdo'][PDO::MYSQL_ATTR_SSL_CA] = '/etc/ssl/mysql/BaltimoreCyberTrustRoot.crt.pem';

The problem is the Azure Gateway adds massive latency to requests and makes Azure MySQL pretty unusable, but with the redirect module in place you talk directly to the backend bypassing this and you will notice via Azure Metrics your actually getting high IOPS again instead of it being below 10% most of the time.

The recalls site uses this and also Azure CDN and is pretty darn fast:

https://recalls-rappels.canada.ca/

sylus commented

@bernardmaltais although i'm only using the regular Azure MySQL PaaS service with 512GB storage size for high IOPS. It works pretty well. I haven't tested the redirection module with the flexible service though.

https://github.com/canada-ca-terraform-modules/terraform-azurerm-mysql

@sylus Thank you for the pointer... I guess it is time to update our scaffold stuff ;-)

The flexible service is more "flexible" and actually less expansive than the basic offering... and can be made HA.

sylus commented

@bernardmaltais our problem at least when i last checked is the flexible service is missing some key security features such as key vault integration for CMK, threat_detection_policy, and managed identities.

We have started working on the terraform spec though and starting with the PostgreSQL one:

https://github.com/canada-ca-terraform-modules/terraform-azurerm-flex-postgresql

sylus commented

@bernardmaltais but yeah I'd be interested if the redirect module works with the flexible service so please let me know!

I was thinking since is using private zones there wouldn't have been a performance hit but last I tried the flexible service performance was pretty bad and very similar to the regular PaaS MySQL without the redirection, so I am pretty certain somehow it suffers from the same problem!

@sylus I know what I will work on this week-end.

sylus commented

I would check if the flexible service supports it in the regular Managed PaaS there is a configuration parameter. I wonder if it is there for the flex service?

On the server side, redirection also needs to be enabled. You can enable redirection by updating the redirect_enabled server parameter using the Azure portal or Azure CLI. You can also use "show global variables like 'redirect_enabled'" to check the current redirection enable status on server side.

I am not attached to the flexible version as I just started testing today ;-) I will rebuild with the basic version and make it working 1st. Then I can try switching to flexible and see if this also work... baby steps. Man, this is turning in much more work than I expected... I will report back.

@sylus Just saw that redirection does not support Private Link... arg! Not so pleased about that. Azure is a bit of a bag of hurt when it comes to AKS and PaaS DB access!

sylus commented

Yeah though not AKS related since its really their database implementations that are the core problem. Azure Web Apps suffers from the same problems though now they have an option for the redirect issue as well.

https://stackoverflow.com/questions/63970567/php-on-azure-app-service-slow-performance-when-connected-to-azure-database-for-m

Interesting while PostgreSQL and MySQL suffer from this problem strangely MSSQL doesn't as they do something underneath the scenes for some reason.

Just a reminder don't forget to set your storage size to 512GB to get decent IOPS which is what we do in conjunction with the redirect setting. Then just do a reserved instance for a year or 3 if u can and will save 60%

@sylus I have been able to deploy and enable mysqlnd_azure... but I am not seeing any performance difference with my previous Flexible Private Endpoint DB... Maybe the responsiveness issue is not related to an Azure Gateway latency afterall... But it was worth trying:

bash-5.0$ php test.php
mysqlnd_azure.enableRedirect: 1
b1e662624122.tr2764.canadacentral1-a.worker.database.windows.net via TCP/IP
mysqli_result Object
(
    [current_field] => 0
    [field_count] => 1
    [lengths] =>
    [num_rows] => 294
    [type] => 0
)

The strange thing is that running sysbench show faster throughput and smaller latency when using the Azure MySQL DB vs POD... but real world site performance it at the opposite I have no idea what is going on.

sylus commented

So you did the following?

a) Enabled the mysqlnd redirection on php side
b) Enabled the redirection server side in the mysql portal in the database configuration parameters
c) You are using tls 1.2 and the baltimore cert
d) sqlproxy is disabled

The performance difference was night and day for me, a fresh drupal install went from about 20-25 mins down to about 4-6 mins. If you do a fresh install how long does it take? Also while it is doing the install and you go to the database and select metrics and then IOPS do you see it peaking above say 60% and not always below 10%?

sylus commented

I also did some minor tweaks MySQL wise but I don't think any of these settings were particularly important:

https://github.com/canada-ca-terraform-modules/terraform-azurerm-mysql/blob/master/main.tf#L85...L143
https://github.com/canada-ca-terraform-modules/terraform-azurerm-mysql/blob/master/variables.tf#L233...L271

And finally the storage size of your db is set to 512?

a) Enabled the mysqlnd redirection on php side : yes
b) Enabled the redirection server side in the mysql portal in the database configuration parameters : yes
c) You are using tls 1.2 and the baltimore cert : yes
d) sqlproxy is disabled : (I never enabled sqlproxy... so unless it is on by default I should not have to do that)

Doing a DB restore (fairly small DB that take 10 seconds on my dev machine to restore) take about 5 minutes on POD. I did notice that the Azure MySQL was giving my equivalent page loads as the POD based DB... so maybe the direct connect made a slight difference in leveling things out... but DB performance is still bad compared to a local dev laptop.

I am not expecting performance to be equivalent... but I was hoping to at least reduce the small restore to take only 1 minutes instead of 5 ;-(

I could share a copy of the DB if you are willing to try to restore on your AKS cluster vs on mine.

I started with a storage size of 256 and that gave me equivalent page load time as the POD. I increased the size to 512 and for some reasons page load time went up by one seconds after...

image

sylus commented

Yeah i'd expect a hit of around 20% in my experience then local and the MySQL db will def not be faster then an in cluster db as well due to networking and other considerations so that will never match unfortunately. I'd take a look at the mysql configuration I did above and set similar values which should help you a bit.

If you want near performance as your local sadly it will be flex server since technically gives you your own VM and SSD but the gateway problem will be your limiter.

I am using Gen5_v4 at 512GB storage and MySQL v8.0

OK, I will try making the suggested changes and then try upgrading the DB to match yours... maybe this will make a difference.

sylus commented

Yeah so:

Gen5_v4
512 GB for 1532 IOPS
mysql_version 8.0
innodb_buffer_pool_size 16106127360
max_allowed_packet 536870912
table_definition_cache 5000
table_open_cache 5000

I obviously use redis and varnish as well to save both logged in and anonymous users performance respectively.

I do use varnish... but not redis... maybe the fact that I am not using redis is putting too much pressure on the DB for caching... and is not helping... Still... does not explain why it would take 5 minutes to restore a small db...

sylus commented

Yeah but redis is mainly for logged in users and if you have lots of content / views and make user of lots of entity references etc.

Agreed 5 mins for a small db seems a bit long, I don't fully remember for the recalls site but it is about 15 gb uncompressed and I think takes about 15 mins to import, but I am doing a restore to dev tomorrow so can get a more accurate time as pretty sure it is faster then that and report back.

Basically I just log into the container and do a:

drush sql-cli < db.sql

I can confirm that a containerized db or even a VM with mysql installed was faster, but that makes sense for Azure SQL since a 512GB instance will only be 1536 IOPS and a containerized and/or VM instance on SSD would be factors higher. This however wouldn't be the case with flexibile server which should be much faster but that darn gateway :P

In the end due to security concerns, and maintenance of db being offloaded went with Azure SQL and coupled with redis + Azure CDN etc found the performance to be pretty good once caches got built up but it wont ever be as fast as local sadly.

Though can see the pages are pretty snappy in the recalls site:

recalls-rappels.canada.ca

sylus commented

@bernardmaltais just one last tip just in case, copy your db.sql.gz to the /tmp folder in container and ensure not running the restore when its in the azure file mount or the restore will take forever.

cd /tmp
drush sql-drop
zcat db.sql.gz | drush sql-cli

Also can check this to make some adjustments to improve the import then can set back:

https://docs.microsoft.com/en-us/azure/mysql/concepts-migrate-dump-restore#preparing-the-target-azure-database-for-mysql-server-for-fast-data-loads

@sylus Interesting point. I do restore from the backup file located in the Azure File volume... I will try to use a custom restore script to copy the file 1st to temp and see if this make a difference.

So far my testing has shown that running MySQL in a POD is still the fastest... Here are the time to do a restore on 3 different MySQL:

POD: 4m49s
Azure MySQL direct connect: 5m23s
Azure MySQL flexible with private endpoint: 7m1s

Bernard

sylus commented

Hey @bernardmaltais so I was wrong about flex it doesn’t need redirect so given that and it’s properly colocated it should absolutely not be slower then the single server.

As I to eventually want to move to flex server for the advantages and it being pure Linux I’d love to know what is going on as well.

Planning to take a look myself but please to let me know if found anything. I can even engage with azure support.

@sylus I will have a look at the blog post on flexible server. Not sure why it is the slowest for me... possibly a configuration issue... but you can most certainly feel the difference when accessing the site as pages load twice as slow. I reached out to MS about that and I have not gotten an answer. Not a support ticket but reaching to their AKS guy...

@sylus I have deployed redis from the chart. How can I tell if it is active? I just used the defaults from the chart values.yaml file... one of the line was commented out so I left it not knowing what it would do. Any recommendations? Your recall site is super fast and I am trying to get close to that.

Also, have you ever written a tuning guide for your site? Asking for a friend (literally) ;-)

sylus commented

Hi @bernardmaltais for redis yeah all you should need:

Here is what mine looks like:

## Redis
## Ref: https://github.com/bitnami/charts/tree/master/bitnami/redis
redis:
  enabled: true
  password: ${var.redis_password_rsams}
  image:
    registry: docker.io
    pullSecrets:
      - acr-registry
  clientInterface: PhpRedis

You should then see a redis master and 2 replicas come up. The helm chart should enable redis for you but in case doesn't just run a drush en redis. Then you should see redis enabled and connection in the status report. With redis enabled the following gets enabled in settings.php:

Probably will need to clear your cache with a drush cr. Just a note that our database was huge so we needed to bump our redis size to 16GB.

Thanks. I was able to deploy and according to JMeter is reduced logged in user page load by about 20%. I saw that APCu is also recommended to be deployed on the drupal instance... but from what I could read it look like it is better fir for single node systems and that Redis essentially does the same thing. Have you done any testing between APCu and redis?

sylus commented

As i understand it there is no need for APC (or any of the similar sort of bytecode caching extensions like XCache) as of PHP 5.5 and later. The PHP developers directly integrated what they call OPCache which we are using. For the userdata portion of ACPu yeah redis will supercede that.

sylus commented

Any word on Flex Server @bernardmaltais also did you enable proxysql im the helm chart? As they say should do that to.

@sylus I have not. I have been testing with the redirection. Once the log4j wave has passed I should have more time to look into that again. But for now the speed of the redirection is acceptable ;-)

sylus commented

Just wanted to chime back that I made the switch to Flex Server using the Business Critical tier and performance seems to be pretty good!