Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.
Thursday, May 28, 2009
Preliminary TPC-C Results
I just wrapped up my series of TPC-C based tests involving the TMS RamSan SSDs and HDD technologies (SSD: Solid State Disk, HDD: Hard Disk Drive). For those not familiar with it, a TPC-C has 9 tables with related indices and uses 5 basic transactions to compute a number called a tpm-C, essentially the transactions per minute completed for a given size of database and user load. I used a scale factor of 1000, which means that the WAREHOUSE table of the TPC-C schema had 1000 entries; the rest of the database is based on multiples of this value. The TPC-C is a benchmark used for OLTP type systems.
There are two basic methods to do a TPC-C test, one with many clients (read thousands) and built-in latencies (usually on the order of up to a second per transaction) and the other with few clients and no latencies. With the first method, in order to achieve a tpm-C approaching 100,000 you would need close to 10,000 clients, roughly a 1 client per 10 tpm-C ratio. With the second method you can get to a tpm-C of over 100,000 with as few as 250 clients. Since I was limited as to the number of client boxes I had access to, I decided to use the second method and utilized a ramp-up from 5 to 500 clients in increments of 5, using the Benchmark Factory (BMF) tool from Quest to both build and repopulate the TPC-C database before each run. Kevin Dalton at Quest supplied me with some custom build scripts for BMF that allowed me to build any or all of the tables simultaneously; this helped immensely with the build and needed rebuilds. The scripts he provided should be available on Toadworld.
The test system consisted of a 4-node RAC cluster with each node having eight 2 GHz CPUs and 16 GB of memory, using an Infiniband cross connect and running Oracle11g, 188.8.131.52. The storage subsystems consisted of two RamSan400s, a RamSan500 and six 15-disk racks of 114GB 15K RPM disks. The RamSan400s were used for Redo logs and Undo and Temporary tablespaces. A TPC-C schema was built on the RamSan500 (which also held the SYSTEM, USERS and SYSAUX tablespaces for both tests). The RamSans were configured with multiple 4Gbps fibre channel lines, each in a multipathed configuration. A duplicate TPC-C schema was built on the HDD array, which was configured as two groups of 45 disk drives on different 2Gbps fibre channel links and then used as a single diskgroup in ASM with a failure group (essentially RAID10).
The RamSan based subsystem used twenty 4Gbps fibre channel links to access the 500,000+ IOPS that the system was capable of delivering. The HDD subsystem used two 2Gbps links to access the 27,000 IOPS the system was capable of delivering. The database was configured using 8K blocks, since in a TPC-C the major IO path is single block reads. Using 8K read/write sizes, the interfaces for the RamSan subsystem can handle 1,310,720 IOPS and the interfaces for the HDD subsystem can handle about 64,000. Now before you scream foul, remember that each disk drive can only handle a maximum (and this is being generous) of 300 IOPS, and 90*300=27,000, so the HDD interface is more than enough for the available IOPS.
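The interface figures above can be reproduced with a little arithmetic. This is only a sketch: it assumes binary units (1 Gbps treated as 1024 Mbps) and ignores fibre channel protocol overhead, which appears to be how the 1,310,720 figure was reached. The function names are mine, purely for illustration. With the same assumptions the HDD interface comes out at 65,536, which is the "about 64,000" quoted in the text.

```python
# Rough ceiling on 8K IOPS for a set of fibre channel links.
# Assumptions: binary units (1 Gbps = 1024 Mbps), no protocol overhead.

BLOCK_KB = 8  # database block size used in the tests


def interface_iops(links, gbps_per_link, block_kb=BLOCK_KB):
    """Maximum block transfers per second the links could carry."""
    mb_per_sec = gbps_per_link * 1024 // 8       # megabytes/sec per link
    iops_per_link = mb_per_sec * 1024 // block_kb
    return links * iops_per_link


def spindle_iops(disks, iops_per_disk):
    """Maximum IOPS the disk drives themselves can deliver."""
    return disks * iops_per_disk


ramsan_interface = interface_iops(links=20, gbps_per_link=4)
hdd_interface = interface_iops(links=2, gbps_per_link=2)
hdd_spindles = spindle_iops(disks=90, iops_per_disk=300)

print("RamSan interface ceiling:", ramsan_interface)
print("HDD interface ceiling:   ", hdd_interface)
print("HDD spindle ceiling:     ", hdd_spindles)
print("HDD limited by disks, not links:", hdd_spindles < hdd_interface)
```

The last line is the point of the "before you scream foul" remark: the HDD side runs out of spindle IOPS well before it runs out of link bandwidth.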
In the above configuration the impact of Redo log, Undo segment and temporary tablespace access is essentially eliminated and access times to the SYSTEM, SYSAUX and USERS tablespaces are minimized, so we are only looking at the access times for the TPC-C data and indices as the possible variables in the tests.
In the first set of tests the database was run with 1 to 4 servers with 9 GB of DB cache size. The HDD results peaked at 1051 TPS and 55 users. The RamSan results peaked at 3775 TPS and 245 users. Going from 4 servers down to 1, the HDD results fell off from 1051 TPS with 55 users to 549 TPS and 15 users. The SSD results fell from 3775 TPS and 245 users down to 1778 TPS and 15 users. However, the 1778 TPS seems to be a transitory spike in the SSD data, with the actual peak occurring at 1718 TPS and 40 users. From this data, given the choice, you would get better performance from a single 8-CPU, 16 GB memory server running against a RamSan configuration than you would from a 4-node RAC configuration running against an HDD system, by a factor of 1.63.
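For anyone who wants to check the ratios, here is a small sketch that works them out from the peak TPS numbers quoted above. The dictionary layout and names are just for illustration, and the single-server SSD entry uses the 1718 TPS sustained peak rather than the 1778 TPS spike.

```python
# Peak TPS from the first set of tests, keyed by (storage, number of servers).
peak_tps = {
    ("HDD", 4): 1051,
    ("SSD", 4): 3775,
    ("HDD", 1): 549,
    ("SSD", 1): 1718,  # sustained peak, not the 1778 TPS transitory spike
}


def speedup(numerator, denominator):
    """Ratio of two peak TPS results, rounded to two decimals."""
    return round(peak_tps[numerator] / peak_tps[denominator], 2)


single_ssd_vs_rac_hdd = speedup(("SSD", 1), ("HDD", 4))
rac_ssd_vs_rac_hdd = speedup(("SSD", 4), ("HDD", 4))
single_ssd_vs_single_hdd = speedup(("SSD", 1), ("HDD", 1))

print("1-node SSD vs 4-node HDD:", single_ssd_vs_rac_hdd)
print("4-node SSD vs 4-node HDD:", rac_ssd_vs_rac_hdd)
print("1-node SSD vs 1-node HDD:", single_ssd_vs_single_hdd)
```

The first ratio is the 1.63 quoted in the text; the other two are the like-for-like comparisons at 4 nodes and at 1 node.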
In the second test the effects of memory starvation on the RAC environment were tested. The DB cache size was ramped down from 9 GB to around 1 GB. The tests showed that the RamSan configuration handles the memory stress better by a performance factor ranging from 3 at the top end, at 9 GB total cache size, to a huge 7.5 at the low end, comparing a 1.05 GB cache on the SSD to a 4.5 GB cache on the HDD run. The HDD run was limited to 4.5 GB at the lower end by time constraints; however, since performance would only get worse as the cache was reduced further, more testing was felt to be redundant.
Of course the reason for the wide range between the upper and lower memory results, from a factor of 3 to 7.5 times better performance by the SSD, can be traced to the increase in physical IO that resulted from not being able to cache results and the subsequent increase in db file sequential reads.
The tests show that the SSD array handles a reduction in available memory much better than the HDD array. Even at a little over 1 GB of total cache area per node for a 4-node RAC environment, the SSD outperformed the HDD array at a 9 GB total cache size per node for a 4-node RAC using identical servers and database parameters. Unfortunately, due to bugs in the production release of Oracle11g, release 184.108.40.206, we were unable to test the automatic memory management feature of Oracle; the bug limits total SGA size to less than 3-4 gigabytes per server.
A third test involved attempting to defeat the cache fusion logic that assumes that getting data over the interconnect is faster than getting it from the storage subsystem. In our test system the transfer of blocks was taking 3 milliseconds while reading and writing was taking less than 1 millisecond, so using a ping (write to disk, read from disk) would theoretically perform better than using the interconnect to transfer the block. A look at the documentation seems to indicate that if you set the GC_FILES_TO_LOCKS parameter then cache fusion is defeated.
In the first run we set the GC_FILES_TO_LOCKS parameter to “1-13:0”, which was supposed to turn on fine grain locking (the “0” setting) for files 1 to 13 (all of our datafiles). Unfortunately this increased the number of cache transfers and caused a decrease in performance.
In the second test we researched the GC_FILES_TO_LOCKS parameter a bit more and found that we shouldn’t have set the parameter for the UNDO tablespaces and, even though the examples showed the “0” setting for setting fine grain locks, we decided to set it to a hard number of locks equal to or greater than the number of blocks in each datafile. The documentation also showed that generally speaking the SYSTEM and SYSAUX tablespaces don’t need as many locks so we decided to set the value to 1000 for those tablespaces. You don’t set the parameter for temporary tablespaces. This led to a new setting of “1-2:1000EACH:7-13:4128768EACH”. Tests with this new setting showed an increase in GC related waits.
The attempts to limit or eliminate cache fusion failed and resulted in poorer performance overall. While we reduced the gc buffer busy acquire waits and in some cases increased the db file sequential reads, the increase in the wait times for the GC related waits offset any gains that were made. All attempts to defeat cache fusion with the use of the gc_files_to_locks parameter were unsuccessful and resulted in poorer performance.
Once the full paper is available I will post a link to the TMS website.