Java EE Support Patterns

7.30.2011

Weblogic authentication denied – WLS 10

Problem description

The Weblogic admin server and / or managed server(s) are unable to start properly and are throwing an authentication denied error message.

Weblogic errors observed

Error #1

<Jul 30, 2011 3:13:18 AM PST> <Critical> <WebLogicServer> <BEA-000386> <Server subsystem failed.

Reason: weblogic.security.SecurityInitializationException: Authentication denied: Boot identity not valid;
The user name and/or password from the boot identity file (boot.properties) is not valid.
The boot identity may have been changed since the boot identity file was created. Please edit and update
the boot identity file with the proper values of username and password. The first time the updated boot identity file
is used to start the server, these new values are encrypted.

weblogic.security.SecurityInitializationException: Authentication denied: Boot identity not valid;
The user name and/or password from the boot identity file (boot.properties) is not valid.
The boot identity may have been changed since the boot identity file was created. Please edit and update the boot identity file
with the proper values of username and password. The first time the updated boot identity file is used to start the server,
these new values are encrypted.

at weblogic.security.service.CommonSecurityServiceManagerDelegateImpl.doBootAuthorization(Unknown Source)
at weblogic.security.service.CommonSecurityServiceManagerDelegateImpl.initialize(Unknown Source)
at weblogic.security.service.SecurityServiceManager.initialize(Unknown Source)
at weblogic.security.SecurityService.start(SecurityService.java:141)
at weblogic.t3.srvr.SubsystemRequest.run(SubsystemRequest.java:64)
Truncated. see log file for complete stacktrace

Error #2

<Jul 30, 2011 5:11:55 AM PST> <Critical> <Security> <BEA-090403> <Authentication for user <user> denied>
<Jul 30, 2011 5:11:55 AM PST> <Critical> <WebLogicServer> <BEA-000386> <Server subsystem failed.
Reason: weblogic.security.SecurityInitializationException: Authentication for user <user> denied
weblogic.security.SecurityInitializationException: Authentication for user <user> denied

at weblogic.security.service.CommonSecurityServiceManagerDelegateImpl.doBootAuthorization(Unknown Source)
at weblogic.security.service.CommonSecurityServiceManagerDelegateImpl.initialize(Unknown Source)
at weblogic.security.service.SecurityServiceManager.initialize(Unknown Source)
at weblogic.security.SecurityService.start(SecurityService.java:141)
at weblogic.t3.srvr.SubsystemRequest.run(SubsystemRequest.java:64)
Truncated. see log file for complete stacktrace

Possible root causes and solutions


Root cause #1

The Weblogic boot.properties file is corrupted or contains invalid principal and credentials

Solution >> boot.properties reset

·         Back up and clear the cache and data directories under <WL Domain>/servers/<Admin & Managed server>
·         Recreate boot.properties (put back your plain text username and password) under the <WL Domain>/servers/<Admin & Managed server>/security directory and restart the affected server(s); see the sample boot.properties below
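
For reference, a freshly recreated boot.properties file only needs the two plain text entries below (shown here with placeholder values); Weblogic will encrypt both entries on the next server start:

        username=<your-admin-user-name>
        password=<your-admin-user-pwd>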

Root cause #2

The Weblogic boot.properties file is valid but the security realm is corrupted or in an invalid state

Solution >> Weblogic Admin username and password reset

·         Back up your Weblogic server domain
·         Rename or delete <WL Domain>/security/DefaultAuthenticatorInit.ldift
·         Run the following Java command (see the command sequence sketch below):
        java weblogic.security.utils.AdminAccount <new-admin-user-name> <new-admin-user-pwd> <WL Domain>/security
·         Delete the contents of the boot.properties file under <WL Domain>/servers/<AdminServer>/security
·         Add the following contents inside boot.properties:
        username=<new-admin-user-name>
        password=<new-admin-user-pwd>
·         Back up and delete the folder: <WL Domain>/servers/<AdminServer>/data/ldap
·         Restart your Weblogic server
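
For illustration only, find below a sketch of how the above steps could be executed from a UNIX shell. It assumes a default WLS 10.3 layout where setWLSEnv.sh (which adds weblogic.jar to the classpath) is located under <WL HOME>/wlserver_10.3/server/bin, so please adjust paths and file names to your own environment:

        cd <WL Domain>/security
        mv DefaultAuthenticatorInit.ldift DefaultAuthenticatorInit.ldift.old
        . <WL HOME>/wlserver_10.3/server/bin/setWLSEnv.sh
        java weblogic.security.utils.AdminAccount <new-admin-user-name> <new-admin-user-pwd> .

The last argument of AdminAccount is the domain security directory; "." is used here because the commands are run from within <WL Domain>/security.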

Free Thread Dump analyzer

This post is to inform you that I currently offer free Thread Dump analysis via email or real-time case study from my Blog. I have 10 years of experience with Java EE and Thread Dump analysis across different JVM vendors (Sun, IBM, JRockit etc.).

A JVM Thread Dump provides crucial information on your Java EE server (Weblogic, JBoss, WAS etc.) and application Thread processing and health. Thread dump analysis is one of the most important skills to acquire for any Java EE production support person.

Why is my consulting service free? My goal is simply to create a solid knowledge base of Thread Dump problem patterns on my Blog and share it with Java EE production support individuals.

Thread Dump analysis – a simple 3-step process

1)       Email me @phcharbonneau with your Thread Dump snapshot data (captured during your incident). Please also provide the specifications of your environment (Java EE server vendor & version, Java VM vendor and version etc.) and a high-level description of your problem
2)       Wait for my initial reply. I will do my best to provide you with a quick answer in less than 24 hours along with possible root causes
3)       With your approval, I may create a Thread Dump problem pattern case study on my Blog to share your Java EE problem and its resolution with my other readers, with no detail on your application name, code or client information

Thank you, and I look forward to hearing about your problems!

Regards,
P-H

7.26.2011

Weblogic PermGen space

This article will provide you with some background on the Java HotSpot PermGen space and a 4-step process on how you can monitor and tune the PermGen space for your Weblogic Java EE server.

Java HotSpot PermGen overview

The HotSpot PermGen (Permanent Generation) is a memory space specific to the Sun Java HotSpot VM; it is also applicable to any other Java EE server running on the HotSpot VM, such as Red Hat JBoss or IBM WAS. It is a special memory space for Class descriptors and other internal JVM objects.

Please refer to my other Blog article on the Java HotSpot VM memory space breakdown for a quick visual overview of all HotSpot memory spaces.

If you are facing any PermGen related problem such as OutOfMemoryError, please refer to my Blog article OutOfMemoryError PermGen patterns for description and troubleshooting of the most common Weblogic PermGen issues.

PermGen tuning step #1 – Identify your JVM vendor

First, please make sure that the PermGen space is actually applicable to your Java EE environment. JVMs such as the IBM SDK and Oracle JRockit do not have an actual PermGen space. For those VMs, that memory is allocated either to the Java Heap or the native C Heap.

** PermGen space is only applicable to the Sun Java HotSpot VM 1.4, 1.5 & 1.6. There are requests for enhancement for JDK 1.7 that may eliminate the PermGen space and use native C memory instead for the VM's representation of the Java program (Class descriptors / code). **

PermGen tuning step #2 – Review your current PermGen space settings

The minimum and maximum PermGen space configuration is achieved via the following parameters:

## Example: PermGen space with a minimum size of 256 MB
-XX:PermSize=256m

## Example: PermGen space with a maximum size of 512 MB
-XX:MaxPermSize=512m

The default Weblogic PermGen space settings are normally found in the commEnv.sh / commEnv.cmd scripts located under:

<WL11g HOME>/wlserver_10.3/common/bin/
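
For illustration, the relevant PermGen entries inside commEnv.sh typically look like the lines below; the exact variable names and default values can differ between WLS 10.3 versions and platforms, so please verify against your own script:

        MEM_PERM_SIZE="-XX:PermSize=48m"
        MEM_MAX_PERM_SIZE="-XX:MaxPermSize=128m"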




When using the Weblogic Nodemanager to start and stop your managed servers (used by most Weblogic production systems), the PermGen settings can instead be updated via the Weblogic console under Home / Managed server 1...n / Server Start / Arguments field.
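
For example, the Arguments field could contain something like the line below (the heap values are placeholders; only the PermGen flags are relevant to this article):

        -Xms2048m -Xmx2048m -XX:PermSize=256m -XX:MaxPermSize=512m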


PermGen tuning step #3 – Review your current PermGen runtime footprint

PermGen space capacity and footprint can be analyzed by turning on verbose GC and monitoring the Full GC collections on a daily basis. This will allow you to determine your current PermGen footprint and identify any potential issue such as PermGen memory leak.

Verbose GC for the HotSpot VM can be enabled by simply adding the parameters below to your start-up VM parameters:

## Example: Enable verbose GC and print the output to a custom log
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/apps/domains/app/gc/gc_s1.log


You can also use the free open source tool GC Viewer to graphically review your Java Heap memory footprint from the GC output data.
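
As an illustration, assuming the HotSpot parallel collector, each Full GC entry of the verbose GC log contains a PSPermGen section similar to the fictitious line below; this is the value to trend on a daily basis:

325.648: [Full GC [PSYoungGen: 15360K->0K(458752K)] [PSOldGen: 457840K->165876K(1048576K)] 473200K->165876K(1507328K) [PSPermGen: 253952K->253952K(262144K)], 2.1234567 secs]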

A bigger PermGen space requirement is to be expected for applications using a significant amount of code and class-loader trees, e.g. many EAR files and / or applications using frameworks that depend heavily on dynamic class-loading / the Reflection API.

PermGen tuning step #4 – Tune your PermGen space, if required

Increasing the maximum capacity of the PermGen space is often required, but you first need to ensure that you are not facing a memory leak such as a class-loader leak. Please review your verbose GC output very closely and ensure that the PermGen footprint is not growing constantly on a daily basis. You may need several days to confirm a PermGen memory leak since applications such as Weblogic Portal using dynamic JSP reload may increase the PermGen utilization over time (until all your JSPs are compiled and loaded into their class-loaders).

In order to increase your PermGen space to, let's say, 512 MB, simply edit the Weblogic commEnv.sh or commEnv.cmd script and change the default from 128m to 512m, or add / modify the MaxPermSize value to 512m (-XX:MaxPermSize=512m) via the Server Start arguments (as per step #2) if using Nodemanager.

** A restart of the Weblogic managed server(s) will be required in order to activate your new PermGen space settings. **

7.24.2011

ORA-01115 IO error reading solution

This case study describes the complete root cause analysis and resolution of an Oracle 10gR2 database problem (ORA-01115) affecting a Weblogic Integration 9.2 Java EE application.

Environment specifications

-          Java EE server: Oracle Weblogic Integration 9.2 MP2
-          Database server: Oracle 10gR2
-          Middleware OS: Sun Solaris 5.10
-          Database server OS: IBM AIX 5.3 TL5
-          JDK: Sun Java HotSpot(TM) Server VM Version 1.5.0_10-b03
-          Platform type: BPM

Problem overview

-      Problem type: java.io.IOException: [BEA][Oracle JDBC Driver][Oracle]ORA-01115: IO error reading block from file 30 (block # 255462)

An IO Exception error was observed in the Weblogic Integration server logs. This Exception was thrown by the remote Oracle database when the application attempted to insert data into WLI internal tables.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·       What is the client impact? HIGH (full outage of our application)
·       Recent change of the affected platform? No
·       Any recent traffic increase to the affected platform? No
·       How long has this problem been observed? This error suddenly appeared for the first time early in the morning
·       What is the health of the Oracle database server? Our Oracle DBA did confirm a problem regarding access to one of our Oracle data files (/oracle/db/oracle_data01.dbf)
·       Did a restart of the Weblogic Integration server resolve the problem? No

·         Conclusion #1: The problem seems to be isolated to the Oracle database server side and related to a possible IO problem accessing one of the Oracle internal data files.

Weblogic log file analysis: ORA-01115!

A first analysis of the problem was done by reviewing the Weblogic Integration managed server log errors.

java.io.IOException: [BEA][Oracle JDBC Driver][Oracle]ORA-01115: IO error reading block from file 30 (block # 255462)
ORA-01110: data file 30: '/oracle/db/oracle_data01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
IBM AIX RISC System/6000 Error: 110: Media surface error
Additional information: 7
Additional information: 255461
Additional information: -1

        at weblogic.jdbc.base.BaseBlobOutputStream.write(Unknown Source)
        at weblogic.jdbc.base.BaseBlobOutputStream.write(Unknown Source)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean$OracleTableAccess.doStoreByInsert(BMPContainerBean.java:860)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean.doInsert(BMPContainerBean.java:1835)
        at com.bea.wli.knex.runtime.core.bean.BMPContainerBean.ejbStore(BMPContainerBean.java:1787)
        at com.bea.wli.bpm.runtime.ProcessContainerBean.ejbStore(ProcessContainerBean.java:204)
        at com.bea.wli.bpm.runtime.ProcessContainerBean.ejbStore(ProcessContainerBean.java:191)
        at com.bea.wlwgen.PersistentContainer_sq78z1_Impl.ejbStore(PersistentContainer_sq78z1_Impl.java:191)
        at weblogic.ejb.container.manager.ExclusiveEntityManager.beforeCompletion(ExclusiveEntityManager.java:609)
        at weblogic.ejb.container.manager.ExclusiveEntityManager.beforeCompletion(ExclusiveEntityManager.java:580)
        at weblogic.ejb.container.internal.TxManager$TxListener.beforeCompletion(TxManager.java:1028)
        at weblogic.transaction.internal.ServerSCInfo$CallBeforeCompletionsAction.run(ServerSCInfo.java:1529)
        at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:363)
        at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:147)
        at weblogic.transaction.internal.ServerSCInfo.doBeforeCompletion(ServerSCInfo.java:1161)
        at weblogic.transaction.internal.ServerSCInfo.callBeforeCompletions(ServerSCInfo.java:1143)
        at weblogic.transaction.internal.ServerSCInfo.startPrePrepareAndChain(ServerSCInfo.java:115)
        at weblogic.transaction.internal.ServerTransactionImpl.localPrePrepareAndChain(ServerTransactionImpl.java:1288)
        at weblogic.transaction.internal.ServerTransactionImpl.globalPrePrepare(ServerTransactionImpl.java:2100)
        at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:259)
        at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:227)
        at weblogic.ejb.container.internal.MDListener.execute(MDListener.java:463)
        at weblogic.ejb.container.internal.MDListener.transactionalOnMessage(MDListener.java:335)
        at weblogic.ejb.container.internal.MDListener.onMessage(MDListener.java:291)
        at weblogic.jms.client.JMSSession.onMessage(JMSSession.java:4072)
        at weblogic.jms.client.JMSSession.execute(JMSSession.java:3962)
        at weblogic.jms.client.JMSSession$UseForRunnable.run(JMSSession.java:4490)
        at weblogic.work.ServerWorkManagerImpl$WorkAdapterImpl.run(ServerWorkManagerImpl.java:518)
        at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
        at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)

As you can see, the ORA-01115 error was thrown (along with ORA-01110, ORA-27091 and ORA-27072) during the insert operation of WLI data. An AIX-related error was also thrown (Media surface error).

-          Conclusion #2: The problem seems to be related to physical inaccessibility (OS and / or disk) of this critical Oracle data file. A logical file corruption problem was also kept in mind as a possible root cause.

OS problem replication: The media surface is damaged!

After the initial log review and problem isolation, an AIX system administrator was engaged to further troubleshoot this problem. A simple copy of the affected Oracle data file was attempted:

:> cp /oracle/db/oracle_data01.dbf .
cp: /oracle/db/oracle_data01.dbf: The media surface is damaged.


This simple problem replication step did confirm that the AIX OS also had lost control of the file.
The following AIX command was also executed in order to review the AIX OS error report:

errpt -a
…………………………………………………………………………………..
Description
DISK OPERATION ERROR

Probable Causes
MEDIA

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES
…………………………………………………………………………………..

-          Conclusion #3: At this point, it was determined that the problem was very likely related to a physical problem with the SAN / RAID cluster. A technician was dispatched to perform further analysis of our RAID (Redundant Array of Independent Disks) environment.

Root cause: RAID disk failure!

A health check of the RAID environment confirmed that one of the 7 available disks had failed earlier in the morning, just a few minutes before the first occurrence of the ORA-01115 error in the Weblogic logs.

Unfortunately, the disk failure triggered corruption of the RAID parity, affecting one of our Oracle data files. New files could still be written to the RAID (6 disks left), but some existing files were not accessible due to the parity corruption.

A disk failure has a negative impact on parity. A disk can fail for various reasons, such as overuse or excess variation in the power supply. Failing to replace a failed disk immediately can adversely affect system performance and may even damage the disks used for backup. RAID parity provides protection against drive failures, but once a parity error is encountered, bad data may be written to the drive.

In our case, the technician was able to resolve the parity corruption without having to restore any data from the backup tapes. Resolution of the parity corruption automatically allowed Oracle to regain control of the affected data file, without any Oracle listener or database restart.

A complete Weblogic restart was performed as the final resolution step in order to re-initialize the affected WLI JDBC Connection pools.

Conclusion and recommendations

-          When facing ORA-01115 with Oracle, please ensure you perform a fast and complete root cause analysis
-          Avoid taking unnecessary and inefficient resolution steps such as an early Weblogic or Oracle database restart

5.09.2011

SocketInputStream hang defaultRowPrefetch tuning

This case study will provide you with background on this type of problem and demonstrate how you can improve the performance of some of your SQL queries by tuning the Oracle JDBC driver defaultRowPrefetch value.

Environment specifications (case study)

·         Java EE server: Oracle Weblogic Server 10.3.3
·         Hardware: HP DL360 – 2 x Intel Xeon Processor x5670 (6 cores) @2.93 GHz & 72 GB RAM
·         OS: 64-bit Red Hat Enterprise Linux (RHEL) version 5.5
·         JDK: Sun 64-bit JDK/JRE 1.6.0_11+
·         RDBMS: Oracle 10gR2
·         Database Persistence API: Oracle Toplink version 11.1.1.3.0
·          Platform type: Middle tier platform


Monitoring and troubleshooting tools

-          UNIX / Linux traceroute command
-          Oracle AWR Report (to access health of the database and SQL response time)

Problem overview

-          Problem type: Oracle JDBC driver Socket read hang and slower SQL response time from one regional location of our Weblogic application server

Higher response time was observed from one of our Weblogic application server locations. Our environment is built as per below:

-          The application is deployed across 2 different regional locations (region #1 & region #2) with one Weblogic cluster deployed in each region
-          The primary Oracle database server is deployed to the region #1
-          The cold standby Oracle database (used for fail-over only) is deployed to the region #2
-          The Weblogic cluster #1 is deployed to the region #1
-          The Weblogic cluster #2 is deployed to the region #2

Major slowdown was observed for SQL transactions executed from region #2 (cross regional communication with the database).

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·         What is the client impact? MEDIUM (higher response time observed in an intermittent manner; depending on which region the user is hitting)
·         Is slowdown affecting all transactions? No, the slowdown was isolated to transactions executed from Weblogic cluster #2, which involves cross-regional communication with our Oracle database
·         Did DBA investigate the health of the database and SQL performance? Yes, no problem was found with the database and no slow running SQL was found from the Oracle AWR report

-          Conclusion #1: The problem appears to be isolated to transactions executed from Weblogic cluster #2, which communicates with the primary Oracle database server located in region #1. The decision was then taken to assess any network latency problem between region #1 and region #2


Network latency analysis

Network latency between the Weblogic clusters and the Oracle database server was measured at a high level via the traceroute command.

Latency between Weblogic cluster #1 (region #1) and database server (region #1)

bash-2.05$ traceroute <server hostname>

traceroute: Warning: Multiple interfaces found; using xyz.xyz.xyz.xyz @ eri0
traceroute to <server hostname> (xyz.xyz.xyz.xyz), 30 hops max, 40 byte packets
1  <server hostname> (xyz.xyz.xyz.xyz)  0.228 ms  0.151 ms  0.335 ms



** Latency of ~0.3 ms with no HOP in between **

Latency between Weblogic cluster #2 (region #2) and database server (region #1)
$ traceroute <db server hostname>
traceroute: Warning: Multiple interfaces found; using xyz.xyz.xyz.xyz @ eri0
traceroute to <db server hostname> (xyz.xyz.xyz.xyz), 30 hops max, 40 byte packets
1  <router hostname> (xyz.xyz.xyz.xyz)  1.854 ms  1.786 ms  1.460 ms
2  <router hostname> (xyz.xyz.xyz.xyz)  0.298 ms  0.284 ms  0.322 ms
3  <router hostname> (xyz.xyz.xyz.xyz)  0.648 ms  0.389 ms  0.414 ms
4  <router hostname> (xyz.xyz.xyz.xyz)  0.294 ms  0.562 ms  0.330 ms
5  <router hostname> (xyz.xyz.xyz.xyz)  0.695 ms  0.837 ms  0.695 ms
6  <router hostname> (xyz.xyz.xyz.xyz)  10.180 ms  10.173 ms  10.039 ms
7  <router hostname> (xyz.xyz.xyz.xyz)  20.091 ms  19.767 ms  20.157 ms
8  <router hostname> (xyz.xyz.xyz.xyz)  9.732 ms  9.908 ms  9.464 ms
9  <db server hostname> xyz.xyz.xyz.xyz  9.814 ms  10.238 ms  9.765 ms


** Latency of ~50 ms with 8 HOPS (router, switch etc.) in between **


Graphical view of the network latency between region #1 and region #2

-          Conclusion #2: A higher network latency was confirmed between Weblogic cluster #2 and the database server (region #1). The next question was to determine why such latency was having so much impact on the SQL response time


Thread Dump analysis

A few Thread Dump snapshots were taken from Weblogic cluster #2, which revealed the following bottleneck:

"ExecuteThread: '61' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x00970d58 nid=0x51 runnable [a8b7e000..a8b7fc28]
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at oracle.net.ns.Packet.receive(Unknown Source)
at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.net.ns.NetInputStream.read(Unknown Source)
at oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:718)
at oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:690)
at oracle.jdbc.ttc7.Oall7.receive(Oall7.java:373)
at oracle.jdbc.ttc7.TTC7Protocol.doOall7(TTC7Protocol.java:1405)
at oracle.jdbc.ttc7.TTC7Protocol.fetch(TTC7Protocol.java:889)
- locked <0xc7660c90> (a oracle.jdbc.ttc7.TTC7Protocol)
at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:242)
- locked <0xbaaa51d0> (a oracle.jdbc.driver.OracleResultSetImpl)
at weblogic.jdbc.wrapper.ResultSet_oracle_jdbc_driver_OracleResultSetImpl.next(Unknown Source)
at oracle.toplink.internal.databaseaccess.DatabaseAccessor.executeCall(Unknown Source)
at oracle.toplink.threetier.ServerSession.executeCall(Unknown Source)
at oracle.toplink.internal.queryframework.CallQueryMechanism.executeCall(Unknown Source)
at oracle.toplink.internal.queryframework.CallQueryMechanism.executeCall(Unknown Source)
at oracle.toplink.internal.queryframework.CallQueryMechanism.executeSelectCall(Unknown Source)
at oracle.toplink.internal.queryframework.CallQueryMechanism.selectAllRows(Unknown Source)
at oracle.toplink.internal.queryframework.ExpressionQueryMechanism.selectAllRowsFromTable(Unknown Source)
at oracle.toplink.internal.queryframework.ExpressionQueryMechanism.selectAllRows(Unknown Source)
at oracle.toplink.queryframework.ReadAllQuery.execute(Unknown Source)
…………………………………………………………………………………………………………………………………………………………………………………………………

As you can see in the above Thread Dump data snippet, most of the time was spent by our application / Toplink trying to fetch the data while consuming / iterating through the ResultSet object.

The Oracle JDBC Driver is configured with a default row prefetch of 10 (if more than 10 rows exist in the result set of an SQL query, the driver will iteratively return 10 rows at a time up to the total of the result set; the fetch size is the number of rows the driver keeps in its cache during a transaction).
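
To illustrate the mechanism with plain JDBC (our application actually goes through Toplink, so this is only a sketch using a hypothetical table name), each block of <fetch size> rows consumed from the ResultSet costs one network round trip to Oracle; Statement.setFetchSize() is the standard JDBC way to override the driver default of 10:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RowPrefetchExample {

    // Reads all rows of a (hypothetical) table; the Oracle driver fetches
    // <fetchSize> rows per network round trip while rs.next() iterates.
    public static int readAllRows(Connection connection, int fetchSize) throws SQLException {
        int rowCount = 0;
        PreparedStatement stmt = connection.prepareStatement("SELECT * FROM WLI_SAMPLE_TABLE");
        try {
            stmt.setFetchSize(fetchSize); // e.g. 500 instead of the default 10
            ResultSet rs = stmt.executeQuery();
            try {
                while (rs.next()) {
                    rowCount++; // SocketInputStream read time accumulates here, once per fetch batch
                }
            } finally {
                rs.close();
            }
        } finally {
            stmt.close();
        }
        return rowCount;
    }
}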

It was found that many of our SQL queries were returning about 100 to 2000 data rows per execution, so we then approximately calculated the network latency impact based on the default JDBC driver fetch settings:

·       Formula:  Query Response Time = Query Execution Time + (network latency * (#rows returned / defaultRowPrefetch))
·       The SQL used as for our problem reproduction was returning 1500 data rows with an Oracle SQL execution of only 100ms:
·       Total Query response time (WL cluster #2) = 100ms + (50ms * (1500 / 10)) = 7600 ms
·       Total Query response time (WL cluster #1) = 100ms + (0.3ms * (1500 / 10)) = 145 ms

As you can see in the above results, the difference in response time between the two regional Weblogic clusters was drastic and completely dependent on the network latency. A combination of the higher network latency from region #2 to region #1 and the low default fetch value was identified as the root cause of the problem.


Solution

Since network latency could not be improved in our situation, the next logical solution was to attempt to increase the default Oracle JDBC fetch size to a higher value (500 in our case) in order to reduce the number of round trips to the database to a minimum. This was achieved by adding a Property in the Weblogic Connection Pool settings:
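
For reference, this boils down to one extra name=value pair in the Properties attribute of the data source Connection Pool configuration (shown below with a placeholder database user; your own property list will differ):

        user=<db user>
        defaultRowPrefetch=500

The defaultRowPrefetch property is read by the Oracle driver when the physical connections are created, so the new value typically only takes effect once the data source (connection pool) is restarted or redeployed.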

·         Find below the SQL response time after increasing defaultRowPrefetch to 500:
·         Total Query response time (WL cluster #2) = 100ms + (50ms * (1500 / 500)) = 250 ms
·         Total Query response time (WL cluster #1) = 100ms + (0.3ms * (1500 / 500)) = 100 ms

As you can see, the above results were quite conclusive and we were able to reduce the impact of the Network latency to a minimum level by increasing the Oracle JDBC driver defaultRowPrefetch.

Recommendations

I hope this article has helped you understand the analysis approach and how this simple tuning can have such a positive impact on your SQL response time. Find below my general recommendations regarding defaultRowPrefetch tuning:

-          When facing an abnormal slowdown of your SQL response time, please review your network latency and the number of rows returned by your application. You may benefit from increasing the default Oracle JDBC driver defaultRowPrefetch to a higher value if your SQL queries are returning 50+ rows
-          Be careful and do not increase the defaultRowPrefetch too high. Please keep in mind that such data is stored in the JDBC driver memory buffer, so always do proper analysis and capacity planning, and assess any negative impact on your Java memory following any defaultRowPrefetch increase