Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Wednesday, January 07, 2009

Database Entomology in Oracle11g Land

For the past several months I have been working with Oracle11g, 11.1.0.6 to be exact, doing TPC-H runs, tuning exercises and putting it through its paces. My platform is a 4-node, 32 CPU 64 bit Dell cluster connected to both a solid state disk (SSD) storage array set and standard 15K hard drive (HD) JBOD arrays. On this test rack I created a 300 gigabyte TPC-H test database, well, actually 600 gigabytes with dual identical TPC-H setups in the same database one on SSD and the other on HD.

As a pre-test I had created a small TPC-H (not more than 30 gigabytes) and on a single-server rig I could run the complete 22 query TPC-H set of queries in parallel query. Of course I wasn’t using high levels of partitioning and parallel query simultaneously with the small data set.
I knew it would be interesting when I got to query 9 on the 300 gigabyte dataset and on query 9 I got an ORA-00600:

ORA-00600: internal error code, arguments [kxfrGraDistNum3],[65535],[4]

When I ran an 8 stream randomized order query run I also periodically received:

ORA-12801: error signaled in parallel query server P009, instance dpe-d70:TMSTPCH3 (3)
ORA-00600: internal error code, arguments: [qks3tStrStats4], [], [], [], [], [], [], []

On query 18 once in a while but not every run. Due to not being a full customer (only having a partner CSI number) I was unable to report these as possible bugs. I did completely check out OTN and Metalink as well as Google and no one else seems to be having these issues, of course how many folk are running Oracle11g 11.1.0.6 or 11.1.0.7 with RAC, cross instance parallel query and heavy partitioning and sub-partitioning?

I had a quick look at the Oracle11g 11.1.0.7 release notes and saw a load of bug fixes and hoped mine were covered, even though a text search didn’t show up the ORA-00600 arguments I received. So I bit the Oracle bullet and performed an upgrade.

Well, actually 2 sets of upgrades. First I upgraded my home, 32-bit (this is important later) servers and other than the usual documentation gotchas and needing to set the database as exclusive, start it, stop it and then reset it as a cluster before the dbua program would run properly, I was successful and now have an Oracle11g 11.1.0.7 instance running on my home RAC setup. A quick run against the 30 GB database showing no really stellar improvements against my test setup using JBOD arrays for TPC-H and it successfully ran Query 9 against the non-partitioned, smaller 30 gigabyte data set.

Feeling pleased with the success I immediately set out the next morning to update my large test environment, my 64 bit cluster. The CRS update went smoothly, and other than some space issues (you may want to add a datafile to your SYSTEM tablespace) and some package problems (for some reason DBMS_SQLTUNE and DBMS_ADVISOR where missing) the database upgrade went fine, right up to the point of starting the instances under 11.1.0.7. It seems there is just a small bug with the use of the new MEMORY_TARGET parameter and release 11.1.0.7…you can’t go above 3 gigabytes! This is why I said that the upgrade on 32 bits was important to remember, in 32 bit systems you will rarely get above a 3 gigabyte SGA size once you allow for user logins, process space and operating system memory needs. However, one of the major reasons for going to 64 bit is to have SGA sizes in Oracle greater than 4 gigabytes. Now, if you go back to using the SGA_MAX_SIZE and SGA_TARGET or the full manual specifications such as SHARED_POOL_SIZE and DB_CACHE_SIZE you can get above the 3 gigabyte setting.

Another annoying thing with 11g and the MEMORY_MAX_SIZE setting is that you cannot exceed MEMORY_MAX_SIZE with the sum of your SGA settings plus PGA_AGGREGATE_TARGET. Now for those of you with small sort sizes this isn’t really a problem and you probably won’t have any issues. However, with a TPC-H you need a large sort size so you need a large PGA_AGGREGATE_TARGET but, you quickly get into trouble with the limits on MEMORY_MAX_SIZE and large PGA_AGGREGATE_TARGETS. With my 16 gigabytes of memory per server I was only able to allocate about 8 gigabytes to Oracle (actually about 7.5 in 11.1.0.6 and 3 in 11.1.0.7) anything larger and I would get errors. So needless to say, I turned off the total memory management and did it the old fashioned way.

Finally, I had my instances up, about a 7 gigabyte SGA with 5 gigabytes of DB_CACHE_SIZE and 5.5 gigabytes of PGA_AGGREGATE_TARGET. Oh, did I mention, your SHARED_POOL_SIZE must be at least 600 megabytes for Oracle11g 11.1.0.7? If it isn’t you will get 4030 errors on startup, I ended up with 750 megabytes worth.

So after 8 hours of upgrade time for my main set of instances I was finally ready to run a TPC-H. Guess what, with the upgrade in place, more cache and bigger sort areas, I seem to be getting worse performance than with the sub-optimal query resolving, bug-ridden 11.1.0.6 version. Looks like they fed the bugs instead of killed them. Oh well, back to the tuning bench.

8 comments:

skymaster said...

Hi Mike,

11g is a LOT of fun and I am finding out that half of the new features do not work as intended with many side effects. I recently found this out by teaching a class recently! Hope you are doing well in Houston and was great to meet a legend like you at OpenWorld last year.

Regards,
Ben Prusinski
http://oracle-magician.blogspot.com

Mike said...

Ben,

YOu have an odd idea of fun! It is interesting however!With the problems I am seeing it makes me wonder about some of the great results posted by Oracle and others regarding 11.1.0.7.

Mike

Noons said...

I think the jury is still waaaay out on 11g being production ready.
I've got a feeling that once again product management were given free reign to do whatever they felt was "cool".
The result is still another major release with lots of instability, new features that simply don't work well in a real life environment and major regression bugs.
I'm eagerly awaiting 11gr2: then maybe we'll get 11g into maintenance mode and serious bugs will finally start to be addressed.

Mike said...

Noons,

You are right on the money. I seem to remember Oracle promising that they wouldn't use the old method of using stepped releases (10.2.0.6 feeding 11.1.0.6 and 10.2.0.7 feeding 11.1.0.7, etc) but wih the apparent reintroduction of some bugs fixed in subsequent 10g releases in 11g releases looks like they are back to their old ways.

Mike

skymaster said...

Hi Mike,

Yeah, my idea of fun is a little different, I am a techie and geek at heart. Spending Friday night kernel tuning is my idea of fun especially with lots of good beer and pizza. Anyways I am beta testing 11gR2 right now and very interesting. Now if I could only get my hands on one of those new Exadata machines to play with that would be nirvana. And yes, 11g is not quite ready yet for production. I am dealing with some production issues with 11.1.0.7 especially after an upgrade from 10gR2 to 11g.

Cheers,
Ben

Mike said...

Yep, still having issues, you really have to blow the large pool way up and the shared pool to get parallel query and rac to work worth a hoot. Really rediculous...

Parvesh said...

Hi Mike,

An off topic question - do you still do independent consulting. If yes, then what would be the process to avail your services.

Thanks
Parvesh

Mike said...

Parvesh,

I don't do consulting now except for TMS clients. However, I know some excellent consultants, so drop me an email at mikerault@yahoo.com and I will put you in contact with them.