J2EE Applications and Clusters
June 19, 2008
Author: Fazal Gupta
Issues in Porting Existing Applications to Clustered Environments
The ideas below are based on the practical experience of porting an existing application to a clustered environment. The workarounds and solutions described here should not be taken as “best practice guidelines” but as possible quick fixes that may be needed to keep the application running in a stable fashion without serious issues, at the cost of some trade-off in scalability.
Environment Setup of the Cluster
The cluster discussed here is not an application server cluster, in which the app servers themselves are started in clustered mode. Instead, multiple independent app server instances were started on different machines, all connecting to the same DB, and a load balancer was placed in front of them to receive requests and forward each one to one of the app server instances.
Session stickiness was preferred over session replication in this case, based on the application’s requirements. One user’s interaction with the system is therefore always handled by a single node until the user logs out. This simplifies a lot of clustering issues and is easier to implement. The drawback of session stickiness is that load can be heavily skewed across nodes: one server may end up serving many more users, and carrying much more load, than the others. In some cases the real benefits of clustering may therefore not materialize. Still, depending on the domain in which the application is used, session stickiness can be a reasonable way to implement a clustering solution.
Problem of Pocket Caches
The idea of maintaining some kind of cache in Java code is to increase performance. Unfortunately, such pocket caches in Java classes become a pain in a clustered environment. The issue is keeping these caches in sync: when a change is made on one node, all the other nodes need to update themselves to the latest value. Typically these caches are built from information in the DB itself, so as to minimize DB hits for better performance.
Possible Solutions
1. Terracotta is one of the obvious solutions to this problem. I will not repeat information about Terracotta here; interested readers can refer to the Terracotta site to learn more. Unfortunately we could not use Terracotta in our case, because it does not yet integrate support for Oracle App Server, and Oracle App Server users would need to weave the OC4J classloader hierarchy into Terracotta to make it work. I think Terracotta provides hooks for this, but we did not have enough time to research the solution and test the changes. Anyone who wants to try this can look further on the Terracotta Forum.
2. There are third-party caches such as Ehcache available which one can use to make existing caches cluster aware. The only issue is that the amount of code change required in an existing application is quite significant, and it would normally have to be followed by a proper release of the product after a good amount of testing. Hence this again does not qualify as a quick fix.
3. Use the common database: This is a quick fix, but it has a major performance drawback. The idea is to use a date column of a table related to the cached entity, and to use the value of that column to determine whether the cache is stale. The drawback is obvious: to make sure the user always sees the correct value, every cache lookup needs a DB hit to check the date value. One can be smarter and try some small optimizations. One idea is to keep a heuristic time gap between checks against the DB, so that cache lookups within that interval involve no DB hit at all. The interval can be chosen based on factors such as the type of application, how long stale data is acceptable, and the probable interval within which a change to the cache is expected. Another optimization, if that latency cannot be tolerated, is to make the DB hit only once per request. This applies only to caches that are looked up multiple times within one request scope: one can set a request-scoped parameter to make sure the DB hit happens only once for that request.
4. Use some kind of messaging between the nodes: Since the DB itself can become a contended resource in a clustered environment, it is sometimes a good idea to relieve it of the responsibility of syncing pocket caches, and to use an alternative mechanism instead. One crude way is to send a simple HTTP request between the nodes whenever the cache is updated on one of them. Apart from relieving the DB of the sync work, this approach also allows a diff to be applied to the cache, i.e. the message can carry the delta change, which the other nodes consume directly to update their caches. JMS can also be explored as an alternative messaging mechanism between the nodes. The biggest drawback of this approach is managing the communication between nodes. If the number of nodes is large, such cross talk may itself hog a lot of traffic within the cluster and lower overall performance. Hence this solution needs a proper design, so that the cluster can bail out of such scenarios gracefully while still achieving the sync.
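As an illustration of option 3 with the heuristic time gap, here is a minimal sketch. The class name, the constructor parameters and the idea of passing the DB lookups in as functions are all assumptions made for this example; in a real application the two suppliers would wrap the actual queries (one reading the date column, one reloading the cached data).

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical pocket cache that consults a "last modified" date column
// at most once per check interval (option 3 with the heuristic time gap).
class TimestampCheckedCache<V> {
    private final Supplier<V> loader;            // full reload of the cached data from the DB
    private final LongSupplier lastModifiedInDb; // cheap query on the date column (epoch millis)
    private final LongSupplier clock;            // current time in millis, injectable for tests
    private final long checkIntervalMillis;      // how long stale data is acceptable

    private V value;
    private boolean loaded;
    private long loadedVersion;
    private long lastCheckedAt;

    TimestampCheckedCache(Supplier<V> loader, LongSupplier lastModifiedInDb,
                          LongSupplier clock, long checkIntervalMillis) {
        this.loader = loader;
        this.lastModifiedInDb = lastModifiedInDb;
        this.clock = clock;
        this.checkIntervalMillis = checkIntervalMillis;
    }

    synchronized V get() {
        long now = clock.getAsLong();
        if (!loaded || now - lastCheckedAt >= checkIntervalMillis) {
            // Read the timestamp before reloading, so that a change made
            // during the reload is picked up by the next check.
            long dbVersion = lastModifiedInDb.getAsLong();
            if (!loaded || dbVersion != loadedVersion) {
                value = loader.get();
                loadedVersion = dbVersion;
                loaded = true;
            }
            lastCheckedAt = now;
        }
        // Within the interval no DB hit is made at all.
        return value;
    }
}
```

With a check interval of, say, 1000 ms, a change made on another node becomes visible on this node after at most one second, and lookups inside that window never touch the DB. The once-per-request variant mentioned above would replace the clock comparison with a request-scoped flag.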
Recheck Code for Possible Concurrency Issues
Clustering is meant to solve the issue of scalability by allowing more users to access the application concurrently. Hence one can expect concurrency issues to become much more visible in a clustered application. Also, some pieces of code work fine when all users connect to the same application server, but with multiple app server instances in place such methods need to be checked again to make sure the business logic still works.
Revisit Synchronized methods
Consider the following piece of code:
synchronized void a() {
    // 1. Run a SELECT query against the DB
    // 2. Process the returned value
    // 3. Write the new value of the record back to the DB
}
Take the case of the above method. Since the method is synchronized, one can be sure that no two threads on this server execute it at the same time (on the same object), so reading and writing a given record in the DB is done by only one thread at a time. But as soon as the application is ported to a cluster, the method lands in trouble. The issue is obvious: two threads on two different nodes can be in the method at the same time, and they can damage data integrity if both access the same row. The solution is to execute the above piece of code within a transaction block and to set the right isolation level on the transaction. By default the driver normally returns TRANSACTION_READ_COMMITTED as the isolation level, which is good enough for most cases. But the case mentioned above may need a stricter level of isolation such as TRANSACTION_REPEATABLE_READ.
The choice of isolation level depends on the scenario in which the method is used. Another way to resolve the problem is to put the above code within a transaction scope and issue a SELECT FOR UPDATE query to the DB. This acquires a row-level lock that is held until a commit or rollback happens on the connection; any other connection either waits before accessing the row or gets an SQLException, depending on the lock timeout parameters (either the WAIT or NOWAIT option at the SQL level, or DB settings such as locktimeout). I am not an expert in DB matters, but these are some of the options that have worked well for us. Interested readers can look up these concepts in detail on the net.
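The SELECT FOR UPDATE alternative could look roughly like the following plain JDBC sketch. The table page_counters, its columns id and hits, and the class and method names are hypothetical and exist only for this illustration; the exact FOR UPDATE syntax and its WAIT/NOWAIT options vary between databases.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class CounterDao {
    // Hypothetical table and column names, used only for illustration.
    private static final String LOCK_SQL =
        "SELECT hits FROM page_counters WHERE id = ? FOR UPDATE";
    private static final String UPDATE_SQL =
        "UPDATE page_counters SET hits = ? WHERE id = ?";

    // Read-process-write on one row inside a single transaction. The row lock
    // taken by SELECT ... FOR UPDATE is held until commit or rollback, so a
    // second node running the same method on the same row has to wait.
    static long incrementHits(Connection conn, long id) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // start an explicit transaction
        try {
            long current;
            try (PreparedStatement select = conn.prepareStatement(LOCK_SQL)) {
                select.setLong(1, id);
                try (ResultSet rs = select.executeQuery()) {
                    if (!rs.next()) {
                        throw new SQLException("No row with id " + id);
                    }
                    current = rs.getLong(1);
                }
            }
            long updated = current + 1; // "process the returned value"
            try (PreparedStatement update = conn.prepareStatement(UPDATE_SQL)) {
                update.setLong(1, updated);
                update.setLong(2, id);
                update.executeUpdate();
            }
            conn.commit(); // releases the row lock
            return updated;
        } catch (SQLException | RuntimeException e) {
            conn.rollback(); // also releases the row lock
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

For the isolation-level route, the equivalent step would be a call to conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ) before the transaction starts. Not every database supports every level, so it is worth checking DatabaseMetaData.supportsTransactionIsolationLevel first.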
Integrate time sync
When running nodes in clustered mode, it is very important that the system times of all the nodes are in sync. Though everyone seems to know the advantages of doing so, it is surprising how many systems have no automated time sync facility. The absence of time sync can cause weird issues, ranging from the clustered cache behaving in unpredictable ways to frameworks like Quartz attempting false cluster recoveries.
When in Doubt, Use a Thread-Safe Collection
Though not related to clusters as such, this is a typical concurrency issue which many people underestimate or ignore. The use of thread-safe collections is of utmost importance for consistent behavior in a concurrent application. The reason for mentioning it here is that we mostly go for a clustered solution only when concurrent traffic increases, and more concurrent traffic means more chances of exposing unsafe collection issues. Thus for any collection that is expected to be modified by different threads at the same time (even if they do not seem to touch the same record or value), it is better to choose a thread-safe collection, as concurrency issues, especially in collections, are really hard to debug and can show up very inconsistently. In one of our future articles we will cover more about the usage of thread-safe collections and the possible concurrency issues related to them.
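As a small illustration of the point, the sketch below counts page hits from several threads using a ConcurrentHashMap. The class and method names are made up for this example. With a plain HashMap in its place, concurrent updates could be lost or the map could be corrupted, and the failure would typically show up only intermittently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical hit counter shared by many request threads.
class HitCounter {
    // ConcurrentHashMap makes each merge() call atomic per key.
    private final Map<String, Integer> hits = new ConcurrentHashMap<>();

    void record(String page) {
        hits.merge(page, 1, Integer::sum); // atomic read-modify-write
    }

    int hitsFor(String page) {
        return hits.getOrDefault(page, 0);
    }

    // Hammers one key from several threads and returns the final count.
    static int runDemo(int threads, int incrementsPerThread) {
        HitCounter counter = new HitCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.record("home");
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("demo threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return counter.hitsFor("home");
    }
}
```

Calling HitCounter.runDemo(8, 10000) should always return 80000, because every increment is applied atomically. Note that a thread-safe collection only protects individual operations: a separate get followed by a put is still a race, which is why the sketch uses merge.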
I hope this article helps some developers in the future to resolve a few small but painful issues in running their applications on a cluster.
Entry Filed under: performance. Tags: Concurrency, Java.
4 Comments
1.
Darryl Stoflet | June 22, 2008 at 5:20 am
Interesting writeup. On the ‘stickiness’ issue: most LBs support a ‘least conn’ algorithm that will help to avoid skewing of the load (to some degree, though it’s obviously far from perfect since it’s a brainless least conn, that is, it does not differentiate between load types).
I am curious though why you did not discuss session replication in more detail. Every meaningful appserver offers session replication which ameliorates two problems here, one the skewing due to session sticky, and two, a user does not lose their session if the app server they were on goes down.
As far as caching goes, I assume from your example that you are primarily talking about caching DB data. Ehcache has a very nice distributed, replicated, syncing cache. Typically most DB access is done via an ORM or JPA. Hibernate, my favorite ORM/JPA, allows Ehcache to be plugged in transparently, and enabling the distributed cache is transparent at the code level. So I’d be interested in understanding why you stated that the usage of Ehcache would require too much rework.
2. mrozewski.pl » Blog… | July 8, 2008 at 11:24 pm
[...] Java. They also give their own proposed solutions. I was particularly interested in the article J2EE Applications and Clusters. We have been wrestling with exactly these kinds of problems at work lately. In the first phase the application was [...]
3.
Scott | August 6, 2008 at 6:41 am
I found the same as Darryl regarding Hibernate and Ehcache. We used Memcached with Ehcache, and quickly had our Hibernate objects in a distributed cache. We’ve had this running in production for a few months now with no issues.
We noticed a very nice performance increase as well.
The short version of the story is, I love Memcached
4.
Sergio Stateri Jr | August 22, 2008 at 4:35 pm
Hi Scott,
How did you make the Hibernate cache use Memcached? Did you implement a Hibernate cache provider?
Thanks in advance.