One of our objects has a cache (com.atlassian.cache.Cache) that is initialized when the object is created.
When the object is created on one node, it sends an OBJ_CREATED event to the other nodes to let them know and to create the same object on them as well.
Sometimes the object is changed, and this triggers cache invalidation, which is done with a "cache.removeAll()" call.
The issue is that when an object update is performed just after the object creation, we get the following errors in the logs:
====================================================
2019-01-30 12:53:57,961 localq-reader-14 WARN anonymous 636x48x1 1k8m4mi 92.255.190.27,172.31.59.203 /secure/SetupLicense.jspa [c.a.j.c.distribution.localq.LocalQCacheOpReader] Runtime exception: UnrecoverableFailure occurred when processing: LocalQCacheOp{cacheName='com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache', action=REMOVE_ALL, key=null, value=null, creationTimeInMillis=1548852837937} from cache replication queue: [queueId=queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264, queuePath=/var/atlassian/application-data/jira/localq/queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264], failuresCount: 1/2, error: java.rmi.NotBoundException: com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache
2019-01-30 12:53:57,973 localq-reader-14 WARN anonymous 636x48x1 1k8m4mi 92.255.190.27,172.31.59.203 /secure/SetupLicense.jspa [c.a.j.c.distribution.localq.LocalQCacheOpReader] Abandoning sending because cache does not exist on destination node: LocalQCacheOp{cacheName='com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache', action=REMOVE_ALL, key=null, value=null, creationTimeInMillis=1548852837937} from cache replication queue: [queueId=queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264, queuePath=/var/atlassian/application-data/jira/localq/queue_i06d7b12498aecfbe4_4_766da1cf38b8faaaa3868bd585dd2264], failuresCount: 2/2. Removing from queue. Error: com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache
java.rmi.NotBoundException: com.bigbrassband.jira.git.services.gitmanager.SingleGitManager_5.cache
at sun.rmi.registry.RegistryImpl.lookup(RegistryImpl.java:227)
at sun.rmi.registry.RegistryImpl_Skel.dispatch(RegistryImpl_Skel.java:115)
at sun.rmi.server.UnicastServerRef.oldDispatch(UnicastServerRef.java:468)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:300)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
at sun.rmi.transport.StreamRemoteCall.exceptionReceivedFromServer(StreamRemoteCall.java:283)
at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:260)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:375)
at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:119)
at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.lookupRemoteCachePeer(BasicRMICachePeerProvider.java:64)
at com.atlassian.jira.cluster.distribution.localq.rmi.BasicRMICachePeerProvider.create(BasicRMICachePeerProvider.java:39)
at com.atlassian.jira.cluster.distribution.localq.rmi.CachingRMICachePeerManager.getCachePeerFor(CachingRMICachePeerManager.java:58)
at com.atlassian.jira.cluster.distribution.localq.rmi.CachingRMICachePeerManager.withCachePeer(CachingRMICachePeerManager.java:91)
at com.atlassian.jira.cluster.distribution.localq.rmi.LocalQCacheOpRMISender.send(LocalQCacheOpRMISender.java:63)
at com.atlassian.jira.cluster.distribution.localq.LocalQCacheOpReader.run(LocalQCacheOpReader.java:84)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
====================================================
How can we avoid these errors? Am I right that cache syncing is stopped in that case (see the "Abandoning sending" message)?
What are we doing wrong? How do we properly invalidate the cache?
Thank you.
I am concerned that your stack trace errors here reference the page
/secure/SetupLicense.jspa
To me, this indicates that this particular node might not yet have an appropriate Jira Software Data Center and/or Jira Service Desk Data Center license applied to it. That is very strange, because if you're running multiple nodes in a Data Center cluster, each of them is expected to use the same database settings, which contain the license.
Can you try navigating directly to that Jira node and going to /secure/SetupLicense.jspa in a web browser to see if there are any errors/prompts, etc.?
Also what version of Jira Data Center is this?
Andrew, thank you for the answer.
It was a test instance of Jira Software Data Center on AWS, created according to https://www.atlassian.com/enterprise/data-center/aws and deployed to our existing VPC.
I used the "10 user Jira Software Data Center license, expires in 3 hours" license taken from https://developer.atlassian.com/platform/marketplace/timebomb-licenses-for-testing-server-apps/ to test our plugin.
The license had not yet expired during the testing.
I'll set up the test environment again and try the action you recommended.
Thank you for the help.
Andrew, I navigated to the /secure/SetupLicense.jspa URL on the node. Here is the result: https://monosnap.com/file/xw7SkRmsCxeg9LaUCKcCz06sSzZcZW
I've investigated the issue a bit more. According to the logs and our code logic, it seems the issue is the following.
Our code performs 2 sequential operations:
It seems that the error occurs when the interval between the operations is too short. I tested that by adding an explicit 10-second delay between these operations in our code; the errors completely disappeared.
But such a solution is not acceptable on a production system.
My concerns/questions are:
Thank you for your help.
A java.rmi.NotBoundException means that the other node is up but this cache is not available on it; this can be a temporary state or a permanent one (something is wrong); let's cover both below.
Jira assumes remote caches are (and should be) created when the plugin starts.
So usually the code will look something like this:
@Component
public class MyComponent {

    private final Cache<Serializable, Serializable> myRemoteCache;

    public MyComponent(@ComponentImport final CacheManager cacheManager) {
        this.myRemoteCache = cacheManager.getCache(
                "my.remote.cache.name",
                null,
                new CacheSettingsBuilder()
                        .replicateViaInvalidation()
                        .build());
    }
}
Let's take a 2-node cluster with a cache replication event on Node1, and see when this exception could occur.
First I would like to understand which of the cases above you are seeing.
You are saying that you are doing the following:
Please also have a look at this page to see the description of asynchronous cache replication in Jira: https://confluence.atlassian.com/enterprise/jira-data-center-cache-replication-954262828.html
Replying to your questions:
- Does the "Abandoning sending" message in the errors mentioned in my question mean that cache replication is not performed for that cache?
No, this means nodeX abandoned sending this particular message to nodeY. NodeX abandons sending a message to nodeY only when it assumes the plugin has already started. It will still try to deliver subsequent replication messages to this cache.
- If yes, is replication disabled permanently for that particular cache, or just for the specified operation? I.e., will subsequent cache changes be replicated?
Yes, replication is disabled just for that specific operation; subsequent cache changes will be replicated.
- Are there any ways to avoid these warnings without explicit delays? I'm afraid that such errors in customer logs will cause useless stress to our and Atlassian support.
Explicit delays are definitely not needed; more details in the answer above.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Maciej, thank you a lot for the detailed description of the Jira cache behavior.
It seems that our case is 4.2 in your list above.
As I wrote in the initial question, the cache is bound to our object, which is created at runtime. When the object is created, it sends our OBJ_CREATED event to the other nodes to let them know and to create the same object as well (and so the objects on the other nodes create their own caches). Thus the detailed list of operations for a 2-node cluster is:
The error is on step 4. A delay after the step 2 eliminates the error.
Thus it seems that our issue is in step 3, which may occur after step 4 due to the asynchronous nature of events and cache replication. Are the events and cache replications sent through the same queue? Our assumption was that an event and subsequent cache replications are received on another node in the same order as they are generated on the transmitting node. Is that a wrong assumption?
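This ordering hypothesis can be illustrated with a minimal, self-contained sketch (all names are hypothetical, and a plain map stands in for the receiving node's cache registry): if the REMOVE_ALL replication message arrives on node 2 before the OBJ_CREATED event has been handled there, the cache name is simply not registered yet.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of the race: node 1 creates a per-object cache and
// immediately replicates a REMOVE_ALL, but node 2 has not yet processed the
// OBJ_CREATED event, so the cache name is unknown there. In the real system
// the failed lookup surfaces as the java.rmi.NotBoundException in the logs.
public class ReplicationRaceSketch {
    // Node 2's registry of caches: cache name -> cache contents.
    static final Map<String, Map<String, String>> node2Caches = new HashMap<>();

    // What node 2 does when a REMOVE_ALL replication message arrives.
    static String replicateRemoveAll(String cacheName) {
        Map<String, String> cache = node2Caches.get(cacheName);
        if (cache == null) {
            return "NotBound: " + cacheName; // cache not created yet on node 2
        }
        cache.clear();
        return "OK";
    }

    public static void main(String[] args) {
        // REMOVE_ALL arrives before OBJ_CREATED has been processed:
        System.out.println(replicateRemoveAll("SingleGitManager_5.cache"));
        // After node 2 handles OBJ_CREATED and creates the cache:
        node2Caches.put("SingleGitManager_5.cache", new HashMap<>());
        System.out.println(replicateRemoveAll("SingleGitManager_5.cache"));
    }
}
```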
If you are doing those 2 operations in sequence then, as I understand it, you are using caches replicated via copy.
Our cache creation code:
cacheFactory.getCache(<Unique cache name generation>,
        new DataLoader(),
        new CacheSettingsBuilder()
                .expireAfterAccess(1, TimeUnit.DAYS)
                .flushable()
                .maxEntries(cacheSize)
                .build());
CacheSettingsBuilder by default creates a cache with replicateViaInvalidation, so we do not use replication via copy.
Summarizing your answer, we should:
Is the above summary correct?
- Node 1 creates an object with the cache
- Node 1 sends an OBJ_CREATED event
- Node 2 receives the OBJ_CREATED event and creates the corresponding object with the cache
- Node 1 invalidates the cache with a removeAll() call
There is one missing step. How do you guarantee that step 3 is done before step 4?
Those operations are asynchronous. In step 2 you send the OBJ_CREATED event (via ClusterMessagingService?). If you wanted to be sure in step 4 that step 3 was done, you would need to send an OBJ_CREATED_ACK from node 2 to node 1.
Then take into account having more nodes: you would need to track the number of nodes. On top of that, add handling of misbehaving nodes (long GC, networking problems, etc.). With all those non-trivial mechanisms in place you would basically re-introduce synchronous operations across all nodes in the cluster, which simply can't work on a production cluster.
Our assumption was that an event and subsequent cache replications are received on another node in the same order as they are generated on the transmitting node. Is that a wrong assumption?
Cache replications from a single thread are guaranteed to be received on another node in the same order, but cluster events (via ClusterMessagingService?) are sent via another channel (the database), so there is no ordering guarantee between cluster messages and cache replication messages.
- Either use a single cache created at plugin startup (instead of a cache bound to a particular object).
- Or use some syncing mechanism (like cluster locks) to make sure the cache has been created on all nodes before any changes to it.
I really discourage option 2. Synchronous operations across the cluster will eventually fail in a way that affects the whole cluster (High Availability is gone).
Definitely go with option 1.
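One possible shape for option 1, sketched under assumptions (all names are hypothetical, and a ConcurrentHashMap stands in for com.atlassian.cache.Cache so the sketch is self-contained): keep one long-lived cache created at plugin start and key it by a composite of object id and entry key. The cache itself then always exists on every node, and "invalidating an object" just removes that object's entries.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of option 1: a single long-lived cache keyed by (objectId, key)
// instead of one dynamically created cache per runtime object.
public class SingleCacheSketch {

    // Composite key: which object, plus the per-object entry key.
    static final class CacheKey {
        final String objectId;
        final String key;
        CacheKey(String objectId, String key) { this.objectId = objectId; this.key = key; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof CacheKey)) return false;
            CacheKey other = (CacheKey) o;
            return objectId.equals(other.objectId) && key.equals(other.key);
        }
        @Override public int hashCode() { return Objects.hash(objectId, key); }
    }

    // Stand-in for the single Cache obtained from CacheManager at plugin start.
    private final Map<CacheKey, String> cache = new ConcurrentHashMap<>();

    public void put(String objectId, String key, String value) {
        cache.put(new CacheKey(objectId, key), value);
    }

    public String get(String objectId, String key) {
        return cache.get(new CacheKey(objectId, key));
    }

    // Invalidate only one object's entries instead of cache.removeAll():
    // the cache always exists, so replication never targets a
    // not-yet-created cache on another node.
    public void invalidateObject(String objectId) {
        cache.keySet().removeIf(k -> k.objectId.equals(objectId));
    }
}
```

With the real Atlassian Cache API there is no bulk removeIf; the per-object removal would need to iterate, e.g. over getKeys() and call remove(key) for the matching composite keys. The map here is only a stand-in to show the keying idea.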
I don't really know your use case, but I am curious why you would want to run cache.removeAll() on a cache replicated via invalidation. Is this a cache of values from a resource shared between the nodes (a database, a shared folder/file)?
via ClusterMessagingService?
Yes, of course.
I don't really know your use case, but I am curious why you would want to run cache.removeAll() on a cache replicated via invalidation. Is this a cache of values from a resource shared between the nodes (a database, a shared folder/file)?
Yes, this cache contains values from a shared resource. When the resource parameters are changed, we call cache.removeAll() to be sure there are no stale values in the cache.
We caught the NotBoundException error when the resource parameters were changed just after the resource was created.
We could try to avoid such cases at creation time, but that would not guarantee the error never comes back. For example, a user might update the resource parameters very quickly after creation while delivery of the creation event is delayed for some reason.
That is why I asked here, to find the most appropriate way to use the cache.
Thank you a lot for helping me understand this case!