Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Wednesday, March 18, 2009

More fun with Oracle11g 11.1.0.7

I have been working with the 11.1.0.7 release of Oracle11g for a couple of months now. I have to say I have not been impressed with it. As released there are several major bugs that really limit 11.1.0.7 usability and stability.

Right out of the gate when I upgraded my 11.1.0.2 environment to 11.1.0.7 I had problems. If you utilize the automated memory management features using MEMORY_MAX_TARGET and MEMORY_TARGET on 64 bit RedHat Linux, you are limited to 3 gigabytes of total SGA; that's right folks, less than you can get on 32 bit! This was easily mitigated by going back to full manual memory management. So much for the much touted AMM improvements in 11g.
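For anyone hitting the same wall, the workaround amounts to zeroing out the AMM parameters and sizing the pools by hand. A sketch of what that looks like (the sizes here are illustrative, not my exact settings):

```sql
-- Turn off Automatic Memory Management
ALTER SYSTEM SET memory_target=0 SCOPE=SPFILE;
ALTER SYSTEM SET memory_max_target=0 SCOPE=SPFILE;

-- Size the SGA and PGA manually instead (values illustrative)
ALTER SYSTEM SET sga_max_size=12G SCOPE=SPFILE;
ALTER SYSTEM SET sga_target=12G SCOPE=SPFILE;
ALTER SYSTEM SET pga_aggregate_target=3G SCOPE=SPFILE;
```

The instance has to be bounced for the SPFILE changes to take effect, and you can go further and set shared_pool_size, db_cache_size and friends explicitly if you want full manual control.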

My setup for testing is a 4 node cluster using Dell servers with 8 CPUs and 16 gigabytes of memory each, with Infiniband interconnects between the 4 servers. Initially I would get ORA-00600 errors and server reboots when I loaded down the system with a single user 300 GB TPC-H run. For the most part I was able to fix that by relinking the Oracle kernel to use the RDS protocol over Infiniband; however, on some large queries using cross instance parallel query and large amounts of partitioning I still cannot complete a full 22 query TPC-H run without at least one of the queries throwing an ORA-00600.
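For reference, the RDS relink is a quick operation once the OFED/RDS libraries are in place; the usual procedure on Linux looks something like this (run as the oracle user, with all instances down):

```shell
# Relink the Oracle binary to use RDS over Infiniband for the
# interconnect instead of UDP (paths assume a standard install)
cd $ORACLE_HOME/rdbms/lib
make -f ins_rdbms.mk ipc_rds ioracle
```

Check with Oracle support for the exact steps for your platform and patch level before trying this on anything you care about.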

The next test was TPC-C using large numbers of users and cross instance parallel query. Following the example of some recent HP TPC-H runs I attempted to use clustered tables. As long as I used single table clusters everything seemed to go alright, but as soon as I attempted a dual table hash cluster things went pear-shaped and I had to revert to normal tables. This is a known Oracle bug, it seems, but it hasn't been made publicly available on Metalink. I could load the first table in the cluster, but attempting to load the second table into the 11g cluster would error out.
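To be clear about what failed, the setup was the garden variety two-table hash cluster; a sketch of the DDL (names and sizes are illustrative, not my benchmark schema):

```sql
-- Hash cluster keyed on the shared column
CREATE CLUSTER orders_cluster (order_id NUMBER)
  SIZE 512 HASHKEYS 1000000;

-- First table in the cluster: creating and loading this worked
CREATE TABLE orders (
  order_id   NUMBER,
  order_date DATE
) CLUSTER orders_cluster (order_id);

-- Second table in the same cluster: this is where 11.1.0.7
-- fell over during the load
CREATE TABLE order_lines (
  order_id NUMBER,
  line_no  NUMBER,
  amount   NUMBER
) CLUSTER orders_cluster (order_id);
```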

The next bit of testing involved creating a large table (60,000,000 2K rows), then duplicating the table, doing a select that forced temp usage, doing a self-join with large temp usage, and creating a large index. Since I had 8 CPUs I initially started with a DOP of 8, increasing it by 8 as I added each node, up to a maximum of 32. Sometimes this would work; other times parallel query servers would die horrible deaths, causing the transaction to fail. Now, for a 16 gig server with 8 CPUs I would expect to be able to get well above a DOP of 8 on a single table. In fact, during the load process the last several iterations use a single instance DOP of 64 with no issues. It seems as soon as I add more than 1 or 2 additional nodes, things start to get wonky (a scientific term meaning "wonky"), parallel query slaves commit suicide at an alarming rate, and sometimes entire servers reboot. I reduced the DOP to 6 and still see failures on most runs with a server count above 2.
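The kind of statements involved are nothing exotic; a sketch of the pattern (table and index names are illustrative):

```sql
-- Set a parallel degree on the table so queries pick it up,
-- spreading slaves across instances (degree/instances illustrative)
ALTER TABLE big_tab PARALLEL (DEGREE 8 INSTANCES 4);

-- Self-join forcing heavy temp usage
SELECT /*+ PARALLEL(a, 8) PARALLEL(b, 8) */ COUNT(*)
  FROM big_tab a, big_tab b
 WHERE a.id = b.id;

-- Large index build at the same DOP
CREATE INDEX big_tab_idx ON big_tab (id) PARALLEL 8;
```

Nothing here should faze a 4 node cluster with 32 CPUs and 64 gigabytes of memory between them, which is rather the point.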

Of course, you must realize from talking with support that most of the internal testing and support work is done on 32 bit machines. Support told me many times that they can't test large memory configurations or large databases because of this. Come on Oracle, take some of the interest on your billions in profits and upgrade your support servers! Buy some disk drives! Geesh!

Of course, most of my work is done on gear that supports over 100,000 IOPS (actually several times more than this), and many beta testers may not have the 500 disk drives needed for 100,000 IOPS or access to SSD technology to generate this type of IO load. Maybe these Oracle11g stability issues only show up at high IO rates. However, with many manufacturers now offering SSD technology (Sun, HP, EMC) to go with their standard disk systems, and of course TMS, Violin, FusionIO and several other SSD vendors offering SSD systems that easily top 100,000 IOPS, Oracle had better get with it and start doing significant testing in high IOPS situations, or at least make sure to include the partners who have access to this technology in their betas.

I am not sure what happened with the beta testing of this (11.1.0.7) release; had I been included I would have quickly shown them these bugs, as I believe in testing real world database situations. I don't know whether to look toward 11gR2 with anticipation or dread (I was not invited into that beta either); if the beta programs are no more thorough than they were for 11gR1, we might be in for a rough ride.

10 comments:

Noons said...

Looks like Oracle is back to the same-old-same-old nonsense...

I've had them knocking on my door asking us to upgrade to 11g, for months now.

The usual arguments: FUD that if we don't, we'll "fall behind", "everyone else has done it", etcetc.

Of course: we only NOW managed to get a 10gr2 that is relatively stable and does produce ALL the results that our SQL asks for...

Like: I've got time to chase up their bugs and do the beta testing their testers should have done in the first place?

Sure. Right...

Hang in there Mike, and keep letting us know what the real story is.

business starter said...

Quick question. Mike, what's your take on Oracle disk load capability? I think you have talked about it in the past.

Thanks

Bruce

Mike said...

Not sure what you are asking as far as disk load, can you be more specific?

Larry's Beard said...

Hi Mike - Please let us know when you get fixes for these issues.

Mike said...

Larry,

No promised fixes until release 11.2, which has been promised since last summer...should see something soon (I hope!). Unfortunately I have no more pull with Oracle than the next guy (or gal!)

Mike

Kiran said...

Mike, I am using 64 bit OEL and MEMORY_TARGET does allow me upwards of 3G. I tried 10G and it all worked fine. Perhaps you have run into some RedHat issue with /dev/shm. Did you configure shm properly? What was the error you got?
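If /dev/shm really is the culprit, resizing the tmpfs is quick (size illustrative; it needs to be at least as large as MEMORY_MAX_TARGET):

```shell
# Remount /dev/shm with enough room for the full memory target
mount -o remount,size=12g /dev/shm

# And make it persistent across reboots with an /etc/fstab entry like:
# tmpfs  /dev/shm  tmpfs  size=12g  0 0
```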

Oliver said...

On "Since I had 8 CPUs I initially started with a DOP of 8 increasing it by 8 as I added in each node up to a maximum of 32. Sometimes this would work, other times parallel query servers would die horrible deaths causing the transaction to fail."

Hi Mike, did you try setting the DOP to default and just let Oracle downgrade (DOPs) for queries as need be?

Mike said...

Yep, same issue, it would start the servers fine then one or more would die.

Oliver said...

Just curious, what's your

parallel_max_servers, parallel_min_servers,
parallel_threads_per_cpu

set to?

Oliver said...

We used to have that problem until we increased parallel_min_servers to a non-zero value.