Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.
Wednesday, March 18, 2009
More fun with Oracle11g 11.1.0.7
Right out of the gate, when I upgraded my 11.1.0.6 environment to 11.1.0.7, I had problems. If you use the automatic memory management (AMM) features via MEMORY_MAX_TARGET and MEMORY_TARGET on 64-bit Red Hat Linux, you are limited to 3 gigabytes of total SGA. That’s right folks, less than you can get on 32-bit! This was easily mitigated by going back to full manual memory management. So much for the much-touted AMM improvements in 11g.
My setup for testing is a 4-node cluster using Dell servers with 8 CPUs each and 16 gigabytes of memory per server. I am using Infiniband interconnects on the 4-server cluster. Initially I would get ORA-00600 errors and server reboots if I loaded down the system doing a single-user 300 GB TPC-H. For the most part I was able to fix that by relinking the Oracle kernel to use the RDS protocol with Infiniband. However, on some large queries using cross-instance parallel query and large amounts of partitioning, I still cannot complete a full 22-query TPC-H without at least one of the queries throwing an ORA-00600.
The next test was TPC-C using large numbers of users and cross-instance parallel query. Following the example of some recent HP TPC-H runs, I attempted to use clustered tables. As long as I used single-table clusters everything seemed to go all right, but as soon as I attempted a dual-table hash cluster things went pear-shaped and I had to revert to normal tables. I could load one of the tables in the cluster, but the attempt to load the second table in the 11g cluster would error out. It seems this is a known Oracle bug, but it hasn’t been made publicly available in Metalink.
The next bit of testing involved creating a large table (60,000,000 2K rows), then duplicating the table, doing a select that forced temp usage, doing a self-join with large temp usage, and creating a large index. Since I had 8 CPUs per server, I started with a DOP of 8 and increased it by 8 as I added in each node, up to a maximum of 32. Sometimes this would work; other times parallel query servers would die horrible deaths, causing the transaction to fail. Now, for a 16 gig server with 8 CPUs I would expect to be able to get well above a DOP of 8 on a single table. In fact, during the load process the last several iterations use a single-instance DOP of 64 with no issues. It seems that as soon as I add more than 1 or 2 additional nodes, things start to get wonky (a scientific term meaning “wonky”): parallel query slaves commit suicide at an alarming rate and sometimes entire servers reboot. I reduced the DOP to 6 and still see the failures on most runs above a server count of 2.
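For a sense of scale, the numbers in that test can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes "2K" means 2,048 bytes per row and ignores block overhead, and the function names are made up for illustration.

```python
# Rough sizing for the parallel query test described above.
# Assumptions (not from the original test logs): "2K" = 2048 bytes per row,
# no block or row overhead; function names are hypothetical.

ROWS = 60_000_000
ROW_BYTES = 2 * 1024
CPUS_PER_NODE = 8
MAX_NODES = 4


def table_size_gib(rows=ROWS, row_bytes=ROW_BYTES):
    """Raw data volume of the test table in GiB."""
    return rows * row_bytes / 1024 ** 3


def dop_schedule(cpus_per_node=CPUS_PER_NODE, max_nodes=MAX_NODES):
    """DOP used at each node count: one parallel slave per CPU in play."""
    return [cpus_per_node * nodes for nodes in range(1, max_nodes + 1)]


if __name__ == "__main__":
    print(f"raw table size: {table_size_gib():.1f} GiB")
    print("DOP at 1..4 nodes:", dop_schedule())
```

Under those assumptions the table holds a little over 114 GiB of raw row data, and the DOP steps are 8, 16, 24 and 32, which is hardly an aggressive setting for four 8-CPU servers.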
Of course, you must realize that, from what I gathered talking with support, most of the internal testing and support work is done on 32-bit machines. Many times support told me that they can’t test large memories or large databases because of this. Come on Oracle, take some of the interest on your billions in profits and upgrade your support servers! Buy some disk drives! Geesh!
Of course, most of my work is done on gear that supports over 100,000 IOPS (actually more than this by several factors), and many beta testers may not have the 500 disk drives needed to reach 100,000 IOPS, or access to SSD technology that can generate this type of IO load. Maybe these Oracle11g stability issues only show up at high IO rates. However, with many manufacturers (Sun, HP, EMC) now offering SSD technology to go with their standard disk systems, and with TMS, Violin, FusionIO and several other SSD vendors offering systems that easily top 100,000 IOPS, Oracle had better get with it and start doing significant testing in high-IOPS situations, or at least make sure to include the partners who have access to this technology in their betas.
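The 500-drive figure implies roughly 200 IOPS per spindle. A minimal Python sketch of that arithmetic follows; the helper names are invented for illustration, and real per-drive IOPS varies with drive type and workload.

```python
import math

# Spindle-count arithmetic behind the "500 disk drives for 100,000 IOPS" remark.
# Helper names are hypothetical; per-drive IOPS is a rough planning figure only.


def iops_per_drive(target_iops, drives):
    """Average IOPS each drive must deliver to reach the target."""
    return target_iops / drives


def drives_needed(target_iops, per_drive_iops):
    """Smallest whole number of drives that reaches the target IOPS."""
    return math.ceil(target_iops / per_drive_iops)


if __name__ == "__main__":
    print(iops_per_drive(100_000, 500))   # IOPS each of 500 drives must sustain
    print(drives_needed(100_000, 200))    # drives needed at 200 IOPS apiece
```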
I am not sure what happened with the beta testing of this (11.1.0.7) release. Had I been included, I would have quickly shown them these bugs, as I believe in testing real-world database situations. I don’t know whether to look forward to 11gR2 with anticipation or dread (I was not invited into that beta either). If the beta programs are as detailed as they were for 11gR1, we might be in for a rough ride.