WorkloadTools issue with Commvault Backups
Closed this issue · 24 comments
I have a listener writing to a WorkloadFileWriterConsumer which starts recording just fine, but as soon as our Commvault backup process kicks in and starts its incremental backup of the database no more events are being recorded.
I can see the logs just stops writing, I don't see any relevant events in the event viewer.
As you can see below it is recording events, then just stops, I can see in the sqlite database where it records the Commvault queries and then nothing is recorded from that point on other then empty 4 and 6 events.
2023-11-23 11:58:10.8112 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (1 rows)
2023-11-23 11:58:10.8112 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 278 rows aggregated
2023-11-23 11:58:10.8112 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 54 rows written
2023-11-23 11:58:10.8268 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 11:58:10.8268 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
2023-11-23 11:59:10.8267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (2 rows)
2023-11-23 11:59:10.8267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 1442 rows aggregated
2023-11-23 11:59:10.8267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 92 rows written
2023-11-23 11:59:10.8424 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 11:59:10.8580 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
2023-11-23 12:00:10.8869 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (2 rows)
2023-11-23 12:00:10.8869 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 97 rows aggregated
2023-11-23 12:00:10.8869 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 35 rows written
2023-11-23 12:00:10.9025 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 12:00:10.9181 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
2023-11-23 12:01:10.9121 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows)
2023-11-23 12:01:10.9121 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated
2023-11-23 12:01:10.9121 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written
2023-11-23 12:01:10.9277 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 12:01:10.9277 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows)
2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated
2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written
2023-11-23 12:02:10.9524 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 12:02:10.9681 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows)
2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated
2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written
2023-11-23 12:03:11.0010 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 12:03:11.0167 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
2023-11-23 12:04:11.0267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Summary info written (0 rows)
2023-11-23 12:04:11.0267 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows aggregated
2023-11-23 12:04:11.0423 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : 0 rows written
2023-11-23 12:04:11.0423 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Performance counters written
2023-11-23 12:04:11.0423 - Info - WorkloadTools.Consumer.Analysis.WorkloadAnalyzer : Wait stats written
Would you mind testing this with version 1.7.0 that I just released?
Thanks for the reply, unfortunately this did not help, same issue.
Bummer. Do events show up in the source XEvents session? If you watch live data, do events appear?
Sorry I am not sure what you mean. I am not a DBA by any means I am code jokey who is trying to work out what impact changing the compatibility mode of our SQL database from 2008 to something newer. So while I can right SQL statements my knowledge of underlying SQL is limited.
Can you explain in a bit more detail where you would like me to look.
Sorry
No worries, I ran into the same bug. I'm sorry about the issue, I will try to solve it ASAP.
It looks like we have a bug here, but I am unable to fix it. It has nothing to do with CommVault, but it has to do with something in the code that stops the collection of events from the stream of the XEvents session.
I verified that it's not happening in 1.6.2, so it's something introduced from 1.6.5.
@a-teece are you able to reproduce it? It's mostly your code, but I have a hard time understanding what change exactly triggered the problem.
It looks like StreamXEventDataReader.ReadEvents blocks when it performs eventsEnumerator.MoveNext(), but I really can't tell why. Help!
@spaghettidba I've been through all the changes in StreamXEventDataReader itself and nothing has functionally changed - lots of changing variables to be inferred and use var, lots of braces on separate lines, etc. All just code consistency without changing the logic.
Which means it's something above StreamXEventDataReader in the stack. Still looking....
Yes, I came to the same conclusion. It must be something else, but I really can't tell what. :(
BTW, are you able to reproduce this behavior? For me it stops producing results in the analysis database after 1 or 2 minutes in average, in general before the 5 minutes mark.
@spaghettidba Going to try now. Most of my work was on replaying the working because the capture was done on a much older build. I'll go through the changes that are common to capture/reply and see if something that was changed to solve replay issues had an unintended consequence on the capture.
@spaghettidba I can't replicate, however my fork doesn't have your latest changes, it only has the last ones I sent you. So I'm going to update it and try again which would help narrow down the change.
@spaghettidba I can't replicate at all when running from Visual Studio, latest committed code, against SQL 2017.
I see...
It's weird. For me it's consistent: 1.7.0 stops recording at some point. Same for 1.6.5
If I try with 1.6.2, it works flawlessly. I'll see if I can find the problem.
Thanks for your help!
@spaghettidba Just trying now on an installed 1.7.0
@spaghettidba Running for 10 min on a deployed 1.7.0, SQL 2017.
Final try is i'm letting my test DB go quiet for a little bit. See if 0 events for awhile causes it.
@spaghettidba I'm still going after an hour and at numerous points I shutdown my workload to give long pauses of 0 activity.
What SQL version can you replicate on; I'm using 2017. What config file; I'm using the default AnalysisSample.json with different server names, db names and passwords
I'm using 2022 both for replay target and analysis target. I'm using a very simple analysis.json, same as you are (sample with changed server and database).
I'm running a full replay with 1.6.2 to see if I'm losing events, after that I'll try another run with 1.7.0 to reproduce the incorrect behavior and compare results.
@spaghettidba I need to push my dev machine to SQL 2022 anyway so will try and find to check with that too
Thank you very much! I can confirm that 1.6.2 runs perfectly. Now I'll try again 1.7.0 and will post the results.
@a-teece confimed, 1.7.0 stops collecting Extended Events data after one or two minutes.
One thing to try to reproduce the behavior might be this: I have several timeouts in this replay, so things that run slower on 2022 compared to 2019 (where the production workload was recorded). Also the operation mentioned by @SimonBillinghurst is a backup, so likely runs longer than a collection window of 1 minute. Could it be that long running processes break the XE collection? A simple WAITFOR DELAY '00:10:00' in the workload might help troubleshooting.
Found it! Super sneaky default value assignment introduced in 1.6.5 to remove \0 chars at the end of strings:
StreamXEventDataReader.cs:269
Current code:
// check whether last char is a null char (\0)
// because this breaks writing this string to the sqlite database
// which considers it as a BLOB
if(result is string stringValue)
{
while (stringValue.EndsWith("\0")) // <--- see? checks stringValue but assigns result in the loop
{
result = stringValue.Remove(stringValue.Length - 1);
}
}
return result;
New code:
// check whether last char is a null char (\0)
// because this breaks writing this string to the sqlite database
// which considers it as a BLOB
if(result is string stringValue)
{
while (stringValue.EndsWith("\0")) // <--- see? checks stringValue but assigns result in the loop
{
stringValue = stringValue.Remove(stringValue.Length - 1);
}
result = stringValue;
}
return result;
I'm releasing 1.7.1 soon
@spaghettidba Really sorry this was in my change. Not sure why I cannot reproduce on my dataset. Assuming that none of my workload includes a value that ends with \0
No need to be sorry! You introduced great changes to the code and I'm immensely grateful for your contributions!
Moreover, I think it was the code analyzer to propose that change, you simply accepted it.
I'm not sure where these \0 chars come from, but it seems like some applications end their commands with that control sequence and it breaks some other parts of the code later down the line in WorkloadTools, so I simply get rid of them.
Downloaded the latest version and now seems to be working.
I really appreciate the effort guys, thanks very much.