Random errors and 403 when clearing authorizations with concurrent requests
bfreuden opened this issue · 20 comments
Questions
Related question: https://groups.google.com/g/vertx/c/vTCLEQsv1N0
Version
4.2.5
Context
I have a Vert.x 4 backend managing authorizations that vary during the lifetime of users (they can have new permissions, new roles, then those roles are revoked... stuff like that).
Because my backend is likely to revoke permissions or roles, when it is the case, I need to clear user permissions and fetch them again from the database. So I'm doing something like this:
user = routingContext.user()
user.authorizations().clear(mongoAuthorization.getId())
mongoAuthorization.getAuthorizations(user)
The problem is: when my UI is issuing parallel requests, sometimes the user has no permission at all, sometimes I get ConcurrentModificationExceptions.
Do you have a reproducer?
Here it is (no Mongo inside, just plain PropertyFileAuthentication):
https://github.com/bfreuden/vertx-reproducers/tree/master/concurrent-authz-issue
Steps to reproduce
Simply run the unique unit test of the reproducer project multiple times until you get an error (happens most of the time).
The issue here is that you are iterating over the Collection while it is being cleared by other thread. The Collection in background should be made thread safe so this can not happen.
@pendula95 thank you so much for your comment!
That's not the only issue unfortunately.
You can comment out the iteration over the authorizations (line 115 to 123) if you wish...:
// try to match a permission: this can throw an error sometimes (see out.txt stack traces)
if (PermissionBasedAuthorization.create("do:stuff").match(user)) {
// Authorizations authorizations = user.authorizations();
// // this iteration is also likely to raise ConcurrentModificationException on authorizations
// for (String providerId : authorizations.getProviderIds()) {
// Set<Authorization> authorizationSet = authorizations.get(providerId);
// StringBuilder builder = new StringBuilder();
// for (Authorization authorization : authorizationSet) {
// builder.append(authorization.toString());
// }
// }
replySuccess(routingContext, "<h1>protected page</h1>");
} else {
replyNotAuthorized(routingContext, "<h1>not authorized</h1>");
}
... but you still get 403 errors and and ConcurrentModificationException like this...
<h1>protected page</h1>
<h1>java.util.ConcurrentModificationException
at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1493)
at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1516)
at io.vertx.ext.auth.authorization.impl.PermissionBasedAuthorizationImpl.match(PermissionBasedAuthorizationImpl.java:59)
at io.vertx.ext.auth.authorization.Authorization.match(Authorization.java:67)
at org.bfreuden.WebServer.lambda$protectedPageHandler$3(WebServer.java:114)
... on line 114:
if (PermissionBasedAuthorization.create("do:stuff").match(user))
Honestly I even added a "big static synchronized" around the iteration... it doesn't change anything.
Without the iteration over authorizations, the source code of the /protected endpoints really boils down to this:
if (some condition) {
user.authorizations().clear();
}
// ...
if (user.authorizations().getProviderIds().isEmpty()) {
authz.getAuthorizations(user)
}
// ...
if (PermissionBasedAuthorization.create("do:stuff").match(user)) {
// 200 reply
} else {
// 403 reply
}
That shouldn't be a problem to do so.
My first guess was that I was not using Vert.x correctly. Really. But I can't find out where I'm wrong.
Yes but in implementation there are iterations over the Collections that are cleared. https://github.com/vert-x3/vertx-auth/blob/55eba3903979772f0c6216154c69a51c30138f94/vertx-auth-common/src/main/java/io/vertx/ext/auth/authorization/impl/PermissionBasedAuthorizationImpl.java#L59
This Collections in Authorization implementations should be made thread safe. DNK if @pmlopes wants to do this himself but I can jump in do it. I would introduce synchronization mechanism. I am not sure if it is the right approach as this could be blocking?
@pendula95 , please go ahead, ATM I'm focused on Json schema and updating openapi to the latest spec release
Synchronization can be troublesome because 2 requests on 2 events loops may create a dead lock, however, it's odd that an object is being shared, we may need to figure out why or maybe use CoW structures to avoid conflicts
@pendula95 I don't get how we can solve that "magically" by adding some synchronization in vertx-auth. Doing so would require to change Vert.x documentation. Users of vertx-auth would be required to change their integration code: put tons of "synchronized" keywords everywhere.
Consider the source code of my reproducer, there are various accesses to the authorizations set that can't be made "atomic" unless you put a big synchronized around it like this:
synchronized(authorizations) {
Set<Authorization> authorizationSet = authorizations.get(providerId);
StringBuilder builder = new StringBuilder();
for (Authorization authorization : authorizationSet) {
builder.append(authorization.toString());
}
}
or simply (just to guarantee that no other thread messed with authorizations between the "if" and the "getAuthorizations"):
synchronized(user.authorizations()) {
if (user.authorizations().getProviderIds().isEmpty()) {
authz.getAuthorizations(user)
}
}
The problem is indeed eventloop threads sharing the same UserImpl object.
It is actually actually not true by the way!
I removed that from the reproducer code for the sake of readability but, at some point, I had a static set collecting all instances of UserImpl during the test. Sometimes there were 2 instances, sometimes there were 3 instances.
Random again.
Very strange.
Looking at the session code, the only reason I can think why we're seeing duplicate references to user objects, is comming from the class UserHolder
which is hold on the session, so multiple request in the same session, may see the same holder.
This was an "optimization" to reduce the amount of serialization going on the session, but clearly is defeated with the delays introduced in this test.
I might be entirely wrong here, but probably, ensuring that the holder is unique would address the issue.
@pmlopes This bug is nasty. I agree the reproducer is pushing Vert.x very hard with the delays but, believe me, in a real world scenario I get multiple errors per day (especially right after login).
I would like to be able to fix it but session and authn/authz handlers are quite obscure to me. Especially the session one (I guess it is far from trivial since it is supposed to work in clustered mode as well).
TBH, when you have a web server verticle deployed multiple times, I never know what I am supposed to do:
- session handler: one instance of per web server verticle instance?
- auth handler: one instance of per verticle instance?
- session store: one instance of per verticle instance? or a singleton shared with all verticles? Or does it depend on the type of session store?
- authentication provider: one instance of per verticle instance? or a singleton shared with all verticles? Or does it depend on the type of provider?
- authorization provider: one instance of per verticle instance? or a singleton shared with all verticles? Or does it depend on the type of provider?
Could you please tell me how many instances of each "thing" I need to have?
With Vert.x I am generally inclined to avoid singletons and put an instance of each per verticle (and rely on the vert.x magic) but... I don't know really.
Depending on your answer I might be able to figure out a solution.
I'm realizing this bug is probably a vertx-web bug (not a vertx-auth bug I mean)?
@bfreuden I'm running your reproducer. With 4.2.5 indeed the test fails for me too, yet with the latest 4.3.0-SNAPSHOT (it always passes) so I'm thinking that the improvements we're already doing in 4.3 addresses the problem.
The issue I'm facing is that it's not trivial to pinpoint the solution and just cherry pick a commit or two. I'd like to ask you to try to test against the latest snapshot builds and see if it also works for you too.
In the meantime I'm looking at the code and see how we can be more thread safe, but I'd rather avoid adding synchronized blocks everywhere as this can become a bottleneck if not properly done.
Regarding the question about number of instances, the rule of thumb is that if the handlers are asynchronous (and all the ones you listed are), usually a single instance gives you the best performance, if blocking code is involved then it's a matter of trial and error to verify which number of instances gives the best results for your given application.
The small change above also ensures that even with blocking handlers, no ConcurrentModificationException
s occur
@pmlopes Thank you very much for your answers. I'll run my reproducer with the latest 4.3.0-SNAPSHOT (probably this evening) and I will let you know.
I've tried using the modified version of AuthorizationsImpl.java (with ConcurrentHashMap) in my "vert.x 4.2.5"-based reproducer (I've put the java file directly in the project) and, unfortunately, I still get ConcurrentModificationException and a new NullPointerException.
for (String providerId : authorizations.getProviderIds()) {
Set<Authorization> authorizationSet = authorizations.get(providerId);
StringBuilder builder = new StringBuilder();
for (Authorization authorization : authorizationSet) { // ConcurrentModificationException here
java.util.ConcurrentModificationException
at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1493)
at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1516)
at org.bfreuden.WebServer.lambda$protectedPageHandler$3(WebServer.java:120)
if (PermissionBasedAuthorization.create("do:stuff").match(user)) { // NullPointerException here
java.lang.NullPointerException
at io.vertx.ext.auth.authorization.impl.PermissionBasedAuthorizationImpl.match(PermissionBasedAuthorizationImpl.java:60)
at io.vertx.ext.auth.authorization.Authorization.match(Authorization.java:67)
at org.bfreuden.WebServer.lambda$protectedPageHandler$3(WebServer.java:114)
Hi @pmlopes,
I still have ConcurrentModificationException errors that seem to be gone with this modification:
public class AuthorizationsImpl implements Authorizations {
@Override
public Authorizations add(String providerId, Set<Authorization> authorizations) {
Objects.requireNonNull(providerId);
Objects.requireNonNull(authorizations);
ConcurrentHashMap.KeySetView<Authorization, Boolean> concurrentAuthorizations = ConcurrentHashMap.newKeySet();
authorizations.forEach(v -> concurrentAuthorizations.add(v));
getOrCreateAuthorizations(providerId).addAll(concurrentAuthorizations);
return this;
}
private Set<Authorization> getOrCreateAuthorizations(String providerId) {
return authorizations.computeIfAbsent(providerId, k -> {
ConcurrentHashMap.KeySetView<Authorization, Boolean> concurrentSet = ConcurrentHashMap.newKeySet();
return concurrentSet;
});
}
}
With your latests updates and that one, I can't seem to catch exceptions any more.
However I'm still getting 403 errors where I should have 200. I guess there is nothing we can do easily to prevent that: when a thread is clearing a collection (of permissions) before adding new ones, and when another threads is reading the collection between the clear and the add of the first thread, this is expected to happen.
I'm about to compile the vert.x 4.3 branch (because 4.3.0-SNAPSHOT jars are not accessible anywhere right?) to give a try.
I'll keep you posted.
Thanks again.
I'm realizing only the getOrCreateAuthorizations
has to be modified. Sorry (panic mode...)
To prevent the 403, we need some behavior change. We should make the API return immutable objects. This way multiple views will be independent at the cost of more garbage being produced.
I think this approach would be better and we could use simpler types and avoid synchronization.
I've given vert.x 4.3.0-SNAPSHOT a try (fresh clones of vert.x vertx-web and vertx-auth, master branch) and I do get ConcurrentModificationException as well, in addition to 403 errors:
java.util.ConcurrentModificationException
at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1493)
at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1516)
at org.bfreuden.WebServer.lambda$protectedPageHandler$3(WebServer.java:120)
at io.vertx.core.impl.future.FutureImpl$3.onSuccess(FutureImpl.java:141)
at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60)
at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211)
at io.vertx.core.impl.future.PromiseImpl.tryComplete(PromiseImpl.java:23)
at io.vertx.core.Promise.tryComplete(Promise.java:121)
at io.vertx.core.Promise.complete(Promise.java:77)
at org.bfreuden.WebServer.lambda$protectedPageHandler$1(WebServer.java:94)
at io.vertx.core.impl.future.FutureImpl$1.onSuccess(FutureImpl.java:91)
at io.vertx.core.impl.future.FutureImpl$ListenerArray.onSuccess(FutureImpl.java:262)
at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:60)
at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:211)
at io.vertx.core.impl.future.PromiseImpl.tryComplete(PromiseImpl.java:23)
at io.vertx.core.impl.future.PromiseImpl.onSuccess(PromiseImpl.java:49)
at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:41)
at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:23)
at io.vertx.ext.auth.properties.impl.PropertyFileAuthenticationImpl.lambda$getAuthorizations$8(PropertyFileAuthenticationImpl.java:213)
at io.vertx.ext.auth.properties.impl.PropertyFileAuthenticationImpl.getUser(PropertyFileAuthenticationImpl.java:161)
at io.vertx.ext.auth.properties.impl.PropertyFileAuthenticationImpl.getAuthorizations(PropertyFileAuthenticationImpl.java:203)
at org.bfreuden.DelayingAuthorization.lambda$getAuthorizations$0(DelayingAuthorization.java:31)
The problem is still here in 4.4.0.
It occurs from time to time with manual interactions (understand: not with my junit test hammering the web service): sometimes silently (an over-filtered - filtered by authz - or empty web service response), or sometimes with an plain 403 error.
I would be very happy if it was fixed :)
@vietj @pmlopes I have modified my reproducer to use an async lock every time the code is dealing with authorizations:
- clear authorizations
- write authorizations
- read authorizations
The async lock is added to the attributes of the User objects by an AuthenticationProvider
decorator called UserWithLockAuthenticationProvider
.
The WebServer
wil acquire the lock depending on the config of the verticle:
AsyncLock lock = (AsyncLock)user.attributes().getValue(UserWithLockAuthenticationProvider.LOCK_ATT_NAME);
Future<AsyncLock.LockToken> userLockAcquired;
if (lock == null) {
userLockAcquired = Future.succeededFuture(nopLock); // NOP dynamic proxy
} else {
userLockAcquired = Future.fromCompletionStage(lock.acquireLock());
}
You can activate the locking behavior by uncommenting that line in AppTest
:
@BeforeClass
public static void startServer() throws InterruptedException {
App.main(new String[0]);
// App.main(new String[] {"useAsyncLock"});
}
With such a lock, the errors disappear.
That's pretty "hardcore" though... I'm not sure if it should or how it could make its way into vert.x.
As is, it will spread a lot of boilerplate in user code.
maybe (?) by replacing this User
method:
public Authorizations authorizations();
with something like:
public static class AuthorizationsHandler<T> {
Future<T> handle(Authorizations authorizations);
}
public <T> Future<T> withAuthorizations(AuthorizationsHandler<T> handler);
Maybe that withAuthorizations
method could internally deal with the lock?