Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Friday, March 03, 2006

They AMM to Please...Sometimes

Just an interesting addition to this (3/16/2006): at another client site we saw this same behavior on a 10gR2 Enterprise release of RAC as well. - Mike

Ran into another undocumented feature in 10gR2 Standard Edition using RAC today. On RedHat 4.0 with 4-CPU Opteron servers (2 chips, 4 cores) and 6 gigabytes of memory in a 2-node RAC, the client kept getting ORA-07445 errors when their user load exceeded 60 users per node. At 100 users per node they were getting these errors, each with a core dump and a trace file, on each node about twice per minute. There didn't seem to be any operational errors associated with them, but they seriously affected I/O rates to the SAN and quickly filled up the UDUMP and BDUMP areas. Of course, when the BDUMP area fills up the database tends to choke.

The client is using AMM with SGA_TARGET and SGA_MAX_SIZE set and no hard settings for the cache or shared pool sizes. Initially we filed an ITar or SR or whatever they are calling them these days but didn’t get much response on it. So the client suffered until I could get on site and do some looking.
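For reference, an AMM setup of this shape looks roughly like the following. This is a sketch, not the client's actual configuration; the sizes are hypothetical:

```sql
-- Illustrative ASMM parameters (sizes are made up for the example).
-- SGA_MAX_SIZE is static and takes effect at the next restart.
ALTER SYSTEM SET sga_max_size = 2G SCOPE=SPFILE SID='*';
ALTER SYSTEM SET sga_target   = 2G SCOPE=SPFILE SID='*';
-- Note: no hard floor set for shared_pool_size or db_cache_size,
-- so Oracle is free to grow and shrink those components on its own.
```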

I looked at memory footprint, CPU footprint and user logins and compared them to the incident levels of the ORA-07445. There was a clear correlation between the number of users, memory usage and the errors. Remembering that resize operations are recorded, I then looked in the GV$SGA_RESIZE_OPS DPV and correlated the various memory operations to the incidences of the ORA-07445. The errors only seemed to occur when a shrink occurred in the shared pool: we saw the error on node 1, where a shrink had occurred, and none on node 2, where no shrink had happened yet.
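A query along these lines will show the shrink operations per instance (column names as I recall them from the 10g reference; check your own data dictionary):

```sql
-- Shared pool shrink operations across the RAC instances, newest first.
-- Correlate start_time against the ORA-07445 timestamps in the alert logs.
SELECT inst_id, component, oper_type, oper_mode,
       initial_size, final_size, status, start_time
  FROM gv$sga_resize_ops
 WHERE component = 'shared pool'
   AND oper_type = 'SHRINK'
 ORDER BY start_time DESC;
```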

Sure enough, hard setting SHARED_POOL_SIZE to a minimum value delayed the error: it didn't start occurring until the pool extended above the minimum and then shrank back to it, though not every time. By hard setting the shared pool to 250 megabytes we were able to boost the number of users to 80 before the error started occurring. A further increase of the shared pool to 300 megabytes seems to have corrected the issue so far, but we will have to monitor this as the number of user processes increases. Note that you need to look at the GV$SGA_RESIZE_OPS DPV to see what resize operations are occurring and the peak size reached to find the correct setting for your system.
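To find the peak the shared pool has actually grown to, something like this sketch should do (again assuming the standard GV$SGA_RESIZE_OPS columns):

```sql
-- Peak shared pool size reached per instance, in megabytes.
-- Use this as a starting point for the hard SHARED_POOL_SIZE floor.
SELECT inst_id, MAX(final_size)/1024/1024 AS peak_mb
  FROM gv$sga_resize_ops
 WHERE component = 'shared pool'
 GROUP BY inst_id;
```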

It appears there must be some internal list of hash values that is not cleaned up when the shared pool is shrunk. The kernel then expects to find a piece of code at a particular address, looks for it and doesn't find it, which generates the ORA-07445. Of course this is just speculation on my part.

So for you folks using 10gR2 Standard Edition with RAC (not sure if it happens with non-RAC or non-Standard installations), either avoid AMM or be sure to hard set SHARED_POOL_SIZE to a value that can service your expected number of users and their code and dictionary needs.
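The workaround boils down to one setting; 300 megabytes is what worked for this client, so size it from your own peak rather than copying the number:

```sql
-- Hard set a shared pool floor on all RAC instances.
-- With SGA_TARGET set, this acts as a minimum AMM will not shrink below.
-- SCOPE=BOTH assumes an spfile is in use.
ALTER SYSTEM SET shared_pool_size = 300M SCOPE=BOTH SID='*';
```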

1 comment:

Noons said...

Thanks Mike, useful info there.