Mike Ault's Blog: 2009

Tuesday, September 15, 2009

Exadata Part Deux

I just watched Larry Ellison’s webcast announcing the new Exadata-2, the Database Machine made for both data warehouse and OLTP systems. The new Exadata contains both DDR3 (about 450 GB) and Flash (around 5.6 TB) cache areas (not SSD) and up to 100 TB of SAS or 336 TB of SATA raw disk capacity in a full rack of 8 DB servers and 14 Exadata-2 cells. Of course formatting and ASM redundancy will eat at least half of that disk capacity.

Larry promises at least a 10X improvement in speed and response, with 1,000,000 IOPS per full rack.

All of this is wonderful news except for one thing, it locks you completely into Oracle technology. The Oracle database, Oracle Linux OS and Oracle SUN hardware. None of this new toy works without special Oracle software and licensing. You can’t run anything else on it but Oracle. It comes preconfigured with Oracle and OS. The cost for a full rack is 1.15 million plus around another million in license fees.

The interesting thing is that you could take the 8 DB servers with a normal amount of memory, not the massive 72 GB per server in the Database Machine, add 4 RamSan-620’s and 2 RamSan-420’s, a couple of FC switches and some HBAs, and get that 1,000,000 random IOPS from Flash plus an additional 1,200,000 random IOPS from DDR. You would have 5 TB of Flash, fully redundant, and 512 GB of DDR2, fully redundant, and still come in at less than 1.15 million and a lot less on license fees. Shoot, even add in some SATA or SAS drives and use the preferred read technology in 11g ASM or other global filesystems and voilà! The same configuration. Nice thing, you can run anything on it. Of course Oracle runs really sweet on this configuration but so will SQLServer, MySQL, BSD, or anything else. It can also run Windows, Linux, or with IBM servers, AIX as well as Solaris’ various flavors.

So you get the same bang, since Oracle11g software will run (with its compression technology, query optimization and partitioning) the only thing you don’t get is the added power requirements of the SAS or SATA drives in the 10 extra Cells needed to get what we get out of 12-U of space.

One other point Larry made was that to expand you just add more servers or more cells. Supposedly Oracle11gR2 RAC allows this anyway. Now we don’t have fancy software that automatically rebalances the whole storage array, but guess what, Oracle provides ASM for free and it does.

Ok, I will concede one point, the parallel query software at the cell level (10K USD per disk license fee in Exadata-1) will probably result in some queries (who knows, maybe all) running a wee bit faster than without it, but I would love to see a comparison between equal configurations just to see!

Wednesday, June 03, 2009

On Gun Control in the USA

Gun control is a hot button item with liberal democrats and conservative republicans. The desire of the most rabid of the gun control crowd is to remove all guns from everyone except the police and military (and some would take the guns from the police as well!) The conservatives want almost no limits on gun ownership. I believe the reality of the matter should be somewhere in between.

Coming from a long line of hunters (going back to the first Alt’s in America in 1738 or so) I own a rifle and a shotgun. The rifle is an old converted Mauser and the shotgun is a semi-automatic plugged to only allow 5 shells to be loaded at any one time. At this time I don’t own any handguns but have no negative feelings towards them. In fact, when finances allow I plan on purchasing a hand gun for home defense. Of course I will be sure both my wife and I take a class on its safe use from a local gun range or police station.

Now, to the meat of the matter, I am a card carrying member of the NRA, does that mean I stand for everything they do? No, of course not. I believe that Americans have the right to keep and bear arms, both for home defense and just in case politicians get too uppity and decide to try to take more power than they need to govern. And I believe that is the reason that the framers of the constitution put the right to arms clause in there, to remind politicians that they aren’t all powerful. Do I think the average citizen should own fully automatic weapons, flame throwers, grenade launchers or Abrams tanks? No, definitely not.

Currently there are some half baked proposals before the Congress and Senate proposing bans on semiautomatic weapons with some fairly liberal definitions of exactly what is a semiautomatic weapon. Unfortunately, that definition would ban not only a semi-automatic AK47 with an extended banana clip holding hundreds of rounds, but also my son-in-law’s 30-06 semiautomatic hunting rifle. Thus, it is a bad bill and should not be passed.

I am all for limits based on cyclic rate of fire and clip size, but not so loosely specified that it can produce a cascade effect onto sporting rifles. In fact, the very weapons that the bill is designed to restrict are already restricted by existing laws. Most of the crimes that the liberals are saying would be curtailed by this new ban were not committed with the semi-automatic rifles they are trying to ban! The one thing that the liberals who propose these bans forget is that criminals don’t follow the law, that is implicit in their being criminals. As countries such as England and Australia have found, once you make it impossible for a law abiding citizen to own guns, crime rates increase.

Why is it that liberals can’t read statistics? The states and cities with the most restrictive bans on guns have the highest crime rates, the ones with the least restrictive tend to have the lowest. When a criminal (who could care less that he is using an illegal hand gun) knows that a home owner or car driver probably won’t be armed because the law forbids it, it makes them an easy target for that criminal. Some states have already made it illegal to defend yourself, allowing criminals injured by homeowners to sue the homeowner for damages! Talk about legal insanity! More people were killed by bad Doctors last year than by guns in the USA, should we ban all Doctors? More children were killed by bicycles than by guns, should we ban bicycles?

Remember, it is already illegal to own or sell fully automatic weapons, grenades, rocket launchers and generally illegal for anyone to own anything other than a rifle or handgun that can be used for hunting or sport shooting. By law the guns must be registered. Most states require a permit and a safety course.

I agree that before a person can own a gun they should have to take a gun safety course, not be a felon and be at a responsible age. However, beyond restrictions on fully automatic weapons, ridiculous calibers (bullet sizes) and rocket launchers, if Granny wants a semiautomatic AK47 with a case of ammo, the more power to her!

If gun laws will make us safe then why are the states and cities with the most restrictive gun laws the most unsafe? If laws make us safe then we should have nothing to fear since it is illegal to commit a crime with a gun, it is illegal for felons to own guns and the most dangerous fully-automatic guns are already illegal. Face it, the reason most politicians want guns banned is they are afraid that if they really screw up we will hold them accountable for it, an unarmed population is much easier to control than an armed one.

Thursday, May 28, 2009

Preliminary TPC-C Results

(Note: These are preliminary, non-reviewed results)

I just wrapped up my series of TPC-C based tests involving the TMS RamSan SSDs and HDD technologies (SSD-Solid State Disk, HDD-Hard Disk Drive.) For those not familiar a TPC-C has 9 tables with related indices and uses 5 basic transactions to compute a number called a tpm-C, essentially a transactions per minute completed for a given size of database and user load. I used a scale factor of 1000, which means that in the WAREHOUSE table of the TPC-C schema I had 1000 entries, the rest of the database is based on multiples of this value. The TPC-C is a benchmark used for OLTP type systems.

There are two basic methods to do a TPC-C test, one with many clients (read thousands) and built in latencies (usually on the order of up to a second per transaction) and the other with few clients and no latencies. In the first methodology in order to achieve tpm-C approaching 100,000 you would need close to 10,000 clients, roughly a 1 client per 10 tpm-C ratio. In the second method you can get to a tpm-C of over 100,000 with as few as 250 clients. Since I was limited as to the number of client boxes I had access to I decided to use the second method and utilized a ramp-up from 5 to 500 clients in increments of 5 using the Benchmark Factory (BMF) tool from Quest to both build and repopulate the TPC-C database before each run. Kevin Dalton at Quest supplied me with some custom build scripts for BMF that allowed me to build any or all of the tables simultaneously; this helped immensely with the build and needed rebuilds. The scripts he provided should be available on Toadworld.

The test system consisted of a 4-node RAC cluster with each node having 8-2 GHz CPUs and 16 GB of memory using an Infiniband cross connect running Oracle11g, 11.1.0.7. The storage subsystems consisted of 2-RamSan400’s, a RamSan500 and 6-15 disk racks of 114GB 15K RPM disks. The RamSan400’s were used for Redo logs and Undo and Temporary tablespaces. A TPC-C schema was built on the RamSan500 (which also held the SYSTEM, USERS and SYSAUX tablespaces for both tests). The RamSans were configured with multiple 4Gbps fibre channel lines each in a multipathed configuration. A duplicate TPC-C schema was built on the HDD array which was configured as two groups of 45 disk drives on different 2Gbps fibre channel links and then used as a single diskgroup in ASM with a failure group (essentially RAID10.)

The RamSan based subsystem utilized 20-4Gbps fibre channel links to access the 500,000+ IOPS that the system was capable of delivering. The HDD subsystem utilized the 2-2Gbps links to utilize the 27,000 IOPS the system was capable of delivering. The database was configured using 8K blocks, in a TPC-C the major IO path is single block reads. Using 8K read/write sizes the interfaces for the RamSan subsystem can handle 1,310,720 IOPS, the subsystem for the HDD can handle 64,000. Now before you scream foul, remember that each disk drive can only handle a maximum (and this is being generous) of 300 IOPS each, 90*300=27,000, so the HDD interface is more than enough for the available IOPS.
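For anyone who wants to check the interface arithmetic above, here is a rough sketch in Python. It assumes 8 bits per byte, binary units (1 GB = 1024 x 1024 KB) and no fibre channel protocol overhead, which are my simplifications and not measured values. The function names are made up for illustration.

```python
KIB_PER_GIB = 1024 * 1024


def link_iops(links, gbps_per_link, block_kib):
    """IOPS ceiling of a set of fibre channel links for one block size.

    Assumes 8 bits per byte and ignores protocol overhead.
    """
    gib_per_sec = links * gbps_per_link / 8
    return gib_per_sec * KIB_PER_GIB / block_kib


def spindle_iops(disks, iops_per_disk):
    """IOPS ceiling imposed by the disk drives themselves."""
    return disks * iops_per_disk


ramsan_links = link_iops(links=20, gbps_per_link=4, block_kib=8)
hdd_links = link_iops(links=2, gbps_per_link=2, block_kib=8)
hdd_spindles = spindle_iops(disks=90, iops_per_disk=300)

print(f"RamSan interface ceiling: {ramsan_links:,.0f} IOPS")
print(f"HDD interface ceiling:    {hdd_links:,.0f} IOPS")
print(f"HDD spindle ceiling:      {hdd_spindles:,} IOPS")
```

With binary units the HDD links come out at 65,536 rather than the 64,000 quoted above (the 64,000 figure is what you get treating a megabyte as 1,000 KB). Either way the links are well above the 27,000 IOPS the 90 spindles can deliver, so the disks, not the interface, are the limit.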

In the above configuration the impact of Redo logs, Undo segments and temporary tablespace access are essentially eliminated and access times to the SYSTEM, SYSAUX and USERS tablespaces minimized, thus we are only looking at the access times for the TPC-C data and indices as the possible variables in the tests.

In the first set of tests the database was run with 1 to 4 servers with 9 GB of DB cache size. The HDD results peaked at 1051 TPS and 55 users. The RamSan results peaked at 3775 TPS and 245 users. The HDD results fell off from 1051 TPS with 55 users to 549 TPS and 15 users going from 4 down to 1 server. The SSD results fell from 3775 TPS and 245 users down to 1778 TPS and 15 users. However, the 1778 TPS seems to be a transitory spike in the SSD data with an actual peak occurring at 1718 TPS and 40 users. From this data, given the choice, you would get better performance from a single, 8-CPU, 16 GB memory server running against a RamSan configuration than you would with a 4-node RAC configuration running against a HDD system by a factor of 1.63.
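The 1.63 factor is just the ratio of the peak TPS figures quoted above. A quick sketch in Python, using only the numbers from this post (the dictionary layout and function name are mine):

```python
# Peak TPS from the first set of tests, keyed by (storage, RAC node count).
# The 1-node SSD entry uses the 1718 TPS peak, not the transitory 1778 spike.
peak_tps = {
    ("HDD", 4): 1051,
    ("HDD", 1): 549,
    ("SSD", 4): 3775,
    ("SSD", 1): 1718,
}


def speedup(config, baseline):
    """Ratio of peak TPS between two (storage, nodes) configurations."""
    return peak_tps[config] / peak_tps[baseline]


single_ssd_vs_4node_hdd = speedup(("SSD", 1), ("HDD", 4))
print(f"1-node SSD vs 4-node HDD: {single_ssd_vs_4node_hdd:.2f}x")
print(f"4-node SSD vs 4-node HDD: {speedup(('SSD', 4), ('HDD', 4)):.2f}x")
print(f"1-node SSD vs 1-node HDD: {speedup(('SSD', 1), ('HDD', 1)):.2f}x")
```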

In the second test the effects of memory starvation on the RAC environment were tested. The DB cache size was ramped down from 9 GB to around 1 GB. The tests showed that the RamSan configuration handles the memory stress better by a performance factor ranging from 3 at the top end at 9 GB total cache size to a huge 7.5 at the low end comparing a 1.05 GB cache on the SSD to a 4.5 GB cache on the HDD run. The HDD run was limited to 4.5 GB at the lower end by time constraints; however, as performance would only get worse as the cache was reduced more, further testing was felt to be redundant.

Of course the reason for the wide range between the upper and lower memory results, from a factor of 3 to 7.5 times better performance by the SSD, can be traced to the increase in physical IO that resulted from not being able to cache results and the subsequent increase in db file sequential reads.

The tests show that the SSD array handles reduction in available memory much better than the HD array. Even at a little over 1 GB of total cache area per node for a 4-node RAC environment, the SSD outperformed the HD array at a 9 GB total cache size per node for a 4-node RAC using identical servers and database parameters. Unfortunately due to bugs in the production release of Oracle11g, release 11.1.0.7, we were unable to test the automatic memory management feature of Oracle, the bug limits total SGA size to less than 3-4 gigabytes per server.

A third test involved attempting to defeat the cache fusion logic that assumes that getting data over the interconnect is faster than from the storage subsystem. In our test system the transfer of blocks was taking 3 milliseconds while reading and writing was taking less than 1 millisecond so using a ping (write to disk, read from disk) would theoretically be better performance than using the interconnect to transfer the block. A look at documentation seems to indicate that if you set the gc_files_to_locks parameter then cache fusion is defeated.

In the first run we set the GC_FILES_TO_LOCKS parameter to “1-13:0” which was supposed to turn on fine grain locking (the “0” setting) for files 1 to 13 (all of our datafiles.) Unfortunately this increased the number of cache transfers and caused a decrease in performance.

In the second test we researched the GC_FILES_TO_LOCKS parameter a bit more and found that we shouldn’t have set the parameter for the UNDO tablespaces and, even though the examples showed the “0” setting for setting fine grain locks, we decided to set it to a hard number of locks equal to or greater than the number of blocks in each datafile. The documentation also showed that generally speaking the SYSTEM and SYSAUX tablespaces don’t need as many locks so we decided to set the value to 1000 for those tablespaces. You don’t set the parameter for temporary tablespaces. This led to a new setting of “1-2:1000EACH:7-13:4128768EACH”. Tests with this new setting showed an increase in GC related waits.

The attempts to limit or eliminate cache fusion with the gc_files_to_locks parameter all failed and resulted in poorer performance overall. While we reduced the gc buffer busy acquire waits and in some cases increased the db file sequential reads, the increase in the wait times for the GC related waits offset any gains that were made.

Once the full paper is available I will post a link to the TMS website.

Wednesday, March 18, 2009

More fun with Oracle11g 11.1.0.7

I have been working with the 11.1.0.7 release of Oracle11g for a couple of months now. I have to say I have not been impressed with it. As released there are several major bugs that really limit 11.1.0.7 usability and stability.

Right out of the gate when I upgraded my 11.1.0.2 environment to 11.1.0.7 I had problems. If you utilize the automated memory management features using MEMORY_MAX_TARGET and MEMORY_TARGET on 64 bit RedHat Linux, you are limited to 3 gigabytes of total SGA, that’s right folks, less than you can get on 32 bit! This was easily mitigated by going back to full manual memory management. So much for the much touted AMM improvements in 11g.

My setup for testing is a 4 node cluster using Dell servers with 8 CPUs each and 16 gigabytes of memory per server. I am using Infiniband interconnects on the 4 server cluster. Initially I would get ORA-00600 errors and server reboots if we loaded down the system doing a single user 300 GB TPC-H. For the most part I was able to fix that by relinking the Oracle kernel to use the RDS protocol with Infiniband; however, on some large queries using cross instance parallel query and large amounts of partitioning I still cannot complete a full 22 query TPC-H without at least one of the queries throwing an ORA-00600.

The next test was TPC-C using large numbers of users and cross instance parallel query. Following the example of some recent HP TPC-H runs I attempted to use clustered tables. As long as I used single table clusters, everything seemed to go alright, but as soon as I attempted a dual table hash cluster things went pear-shaped and I had to revert to normal tables. This was a known Oracle bug it seems but hasn’t been made publicly available in Metalink. It seems I could load one of the tables in the cluster but then when attempting to load the second table in the 11g cluster it would error.

The next bit of testing involved creating a large table (60,000,000 2K rows) then duplicating the table, doing a select that forced temp usage, doing a self-join with large temp usage and creating a large index. Since I had 8 CPUs I initially started with a DOP of 8 increasing it by 8 as I added in each node up to a maximum of 32. Sometimes this would work, other times parallel query servers would die horrible deaths causing the transaction to fail. Now, for a 16 gig server with 8 CPUs I would expect to be able to get well above a DOP of 8 on a single table. In fact, during the load process the last several iterations use a single instance DOP of 64 with no issues. It seems as soon as I add more than 1 or 2 additional nodes, things start to get wonky (a scientific term meaning “wonky”), parallel query slaves commit suicide at an alarming rate and sometimes entire servers reboot. I reduced to 6 DOP and still see the failures on most runs above a server count of 2.

Of course you must realize from talking with support most of the internal testing and support is done on 32 bit machines. Many times support told me that they can’t test large memories or large databases because of this. Come on Oracle, take some of the interest on your billions in profits and upgrade your support servers! Buy some disk drives! Geesh!

Of course most of my work is done on gear that supports over 100,000 IOPS (actually more than this by several factors) and many beta testers may not have the 500 disk drives to allow for 100,000 IOPS or access to SSD technology to get this type of IO load. Maybe these Oracle11g stability issues only show up at high IO rates. However, with many manufacturers now offering SSD technology (SUN, HP, EMC) to go with their standard disk systems and of course, TMS, Violin, FusionIO and several other SSD vendors offering SSD systems that easily top 100,000 IOPS or more, Oracle better get with it and start doing significant testing in high IOPS situations, or at least, make sure and include the partners who have access to this technology in their betas.

I am not sure what happened with the beta testing of this (11.1.0.7) release. Had I been included I would have quickly shown them these bugs, as I believe in testing real world database situations. I don’t know whether to look forward to 11gR2 with anticipation or dread (I was not invited into that beta either); if the beta programs are as detailed as they were for 11gR1 we might be in for a rough ride.

Tuesday, March 10, 2009

Outsource This!

I just watched President Obama’s speech on the educational reforms that he proposes. As laudable as his proposed fixes are, I mean after all, who isn’t for tougher standards for graduates, more pay for good teachers and accountability at all levels, I feel perhaps he has missed the boat in a key area.

The New York Times recently published an article stating that the college graduate has seen a bigger decline in employment than just about any other sector. Don’t get me wrong, a college graduate still gets better pay and benefits than a non-college graduate, but just getting a degree will not guarantee you the American dream any longer.

The key area that President Obama has neglected is that American college graduates expect to receive wages in excess of $60K per year. Unfortunately foreign graduates who still reside overseas are usually pleased to receive the equivalent of a third of this amount or less, especially in the Asian countries and in Latin America. This means that by outsourcing a company can get 3 times the number of employees and hopefully three times the productivity. This isn’t always the case but that is the logic used to justify sending jobs overseas.

What many companies are finding however is that those budget employees come with their own issues. For example, a programmer who has never experienced a free market economy may not understand all the intricacies of accounting in such a system. Of course I probably needn’t mention the various cultural and language difficulties that are also experienced with outsourcing.

I am afraid that many of the stimulus jobs that seem to be offered from the President’s plans seem to be short term blue collar type jobs that won’t do much to help the college graduate with advanced degrees. If you work outside of the computer industry, for example, for two years as a heavy machinery operator on a construction job, you will find your computer science degree probably isn’t worth much anymore.

Soon the only way to get a job will be to move out of the USA and go to a country where the cost of living is in line with what the companies who outsource are willing to pay employees, but don’t look for benefits!

Monday, March 02, 2009

God and Technology

Many times I have been asked how as a technologist/scientist I can believe in God. It is as if these people believe you cannot be a thinking person and also believe in a higher power. Now, this is not to say I am one of the folks who think every word in the Bible (Old and New Testament) is the gospel truth (no pun intended.) Like I said, I am a thinking Christian, as such I question everything and look for confirmation of those things that can be confirmed. Of course with religion, as in all things in science and lay areas, eventually you reach a point where it is either believe or not believe and I chose to believe.

I prefer to think of myself as more a Jeffersonian Christian rather than a Paulian Christian. It amazes me that many Christians swallow hook-line-and-sinker every word penned by the one Apostle that never actually met Jesus face-to-face. Many scholars feel that Paul was sent on so many missions not because he was good at them but because the other Apostles really couldn’t stand him and hoped that he wouldn’t return. Many believers take Paul’s letters out of context and usually completely incorrectly as their meaning is understood by true bible scholars. In fact many articles of faith were added in after the fact to the regular gospels as can be proved by stylistic differences and from going back to the earliest known translations. The fact is that no matter how current your translation you are still starting from flawed beginnings. Many of the books that “didn’t make the cut” when the first bibles were compiled were destroyed as heresy thus removing them from possible future examination.

With any scientific field of study you reach a certain point and you can go no further, from that point on you have to accept things on theories and faith. Even with the less than whole cloth parts of the Bible’s New Testament removed, what is left is still an amazing history of a real man who lived, and died for his faith and his friends. Is Jesus the Son of God? Yes, but then we all are the sons and daughters of God. Did Jesus die for our sins? Yes. Was he raised from the dead? This is where there is some contention over what was added after the fact and what is whole cloth. But let’s examine this.

Two men: One knows he is the physical Son of God, he knows that no matter what evil painful things happen on Earth he has a place at the right hand of God. Second man, a man, with man’s frailties, and doubts. Now, both give up their Earthly lives for what they believe in, which one required greater faith? Would Jesus be less or more of an inspiration if he was a frail human or the anointed Son of God? Would you believe it was a vote by a group of flawed humans (the Nicene Council) that decided Jesus was a deity, and that it was actually a very close vote?

Unfortunately the only documents that provide “proof” of Jesus’ deity are in the Bible and using the Bible to prove the Bible is circular logic and therefore flawed. It is like using the Dianetics text to prove L. Ron Hubbard’s qualifications as a deity. Since most Islamic accounts of Jesus are actually taken from the Bible then related texts quoting them are not relevant. While Jesus is mentioned in some historical texts none go into great detail as to his birth, (a virgin was a young maiden, not someone who had never had sex) life (he existed and taught and was hated by Rome), death (he was probably crucified) or resurrection. These accounts of the resurrection were actually added after the original text in the gospels, it was felt that the Mithran belief, having a virgin birth, life and resurrection was a big spur to add these passages. (see: http://www.near-death.com/experiences/origen048.html) Among non-Christian historians, Pliny the Younger, Suetonius and Tacitus refer to Jesus, as does Josephus (Joseph ben Matthias).

So, do I believe Jesus is my savior? Yes, his teachings show the way to the father as he himself said “There is no way to the father but through me” meaning through his teachings we find the way. Do I believe in the resurrection? That is more complex to answer. Unfortunately the resurrection is one of the parts added after the original text, that makes it suspect in my eyes. The key question is “Would I believe without the resurrection?” The answer is yes, I would, so whether I believe in the resurrection or not is moot. You are free to believe as you wish and I would never dream of pushing my beliefs onto you, after all, we have free will.

Do I believe in the life everlasting? Yes. There is enough anecdotal evidence to show that something of us exists after death, that the spirit left gets rewarded or punished based on a set of criteria created within its own belief structure is not that far of a reach. After all Jesus also said “In my Father’s house are many mansions, I go there now to prepare a place for you.”

So what have I attested to? I believe in God and I believe in Jesus. I believe Jesus died for my sins. I believe Jesus’ teachings show the way to true belief in God. If this diminishes me in some folks’ eyes, so be it. However, it is not for people that I live, I live for God, my family and myself.

Does God reject technology? No, he gave us technology to better ourselves. As with any tool, how we use it determines whether the tool is good or bad. Do modern teachings contradict the Bible? No, not if you realize that most of the creation story is metaphor, used to explain to ignorant herders something we don’t have full understanding of even today. When looked at as metaphor it actually parallels what we know. Look at the theory of vacuum fluctuations and compare it to the story of Genesis. As to what timelines are used in the Bible, again, try to explain millions of years to someone who barely understands how to count his herd of sheep.

It is odd that those who insist on a literal interpretation of their favorite passages, the ones that damn certain behaviors or exalt other behaviors they profess to believe in themselves, then tell us other parts are metaphor. Remember that the true test of a prophet is that what he prophesies comes true. I am afraid many of the added texts in the Bible fail this test as do many of the founders of many splinter religions who used imperfect understanding to make prophecies of specific dates for events such as the “rapture” and the second coming. Of course instead of applying the test to these false prophets and rejecting them, their followers merely allowed that they were mistaken but that their prophecies would come true eventually.

Limiting God to a simplistic creation story is demeaning to God. That God could put into motion such marvelous mechanisms as those behind vacuum fluctuations and evolution is a testament to his greatness, not a detraction from it. That we cannot understand everything is a testament to God’s greatness. God guides technology, giving us tools to better understand his universe.

Burying our heads in simplistic beliefs because we cannot understand God’s plan as implemented in his universe is an affront to God.

Tuesday, February 17, 2009

RMOUG Notes

Well, I am on my way home from the Rocky Mountain Oracle Users Group Training Days event. I presented a paper titled “Is Oracle Tuning Obsolete,” a copy of which can be found on the http://www.rmoug.org/ site or at http://www.superssd.com/. While I was there I attended two presentations on the Oracle/HP Exadata Database Machine, one by Kevin Closson and another by Tom Kyte, both of Oracle.

My only complaint about both presentations was that when they presented the user test results they neglected to show the full (or even partial) configurations of the servers and disk systems they had tested against. Rather like saying my car is 10 times faster than Joe’s and telling you mine is a 1995 Dodge Avenger and failing to mention Joe’s is a Stanley Steamer. Be that as it may, I still enjoyed the presentations and the best takeaway was from Kevin’s presentation when he said that “If your current system is fully tuned, has adequate disk resources, and is performing well, the Exadata has nothing to offer you.” An example from Kevin would be a 128 CPU Superdome with 128 4GFC HBAs that were being fed by ample XP storage, as that would be 51GB/s ingest-capable. Also during Tom’s presentation he admitted the primary target of the Exadata was those shops with row-after-row of Oracle servers followed by a single Netezza or Teradata server or servers.

Essentially the Exadata Database Machine is targeted at the larger (several terabytes) data warehouse that would otherwise be placed on a Netezza or Teradata machine and I couldn’t agree more. However, it would be a fun test to replace the disks in an Exadata cell with a RamSan-500 and see what (if any) additional performance could be gained. After all, the disks are still the limiting factor in the performance of the system. For example, a single Exadata cell tops out at around 2,700 IOPS, according to white papers on the Oracle site; a single RamSan-500 can sustain 100,000 mixed read/write IOPS and 25,000 pure write IOPS with minimal response times. As far as I can tell, no additional smarts are built into the Exadata disk drives in the way of special firmware, such as is supposedly done with EMC systems, so replacing the drives with a single RamSan-500, either set up as 12 LUNs, or as a single large LUN, should be easy.

Another interesting discussion I had during this time frame was with our (Texas Memory Systems) own Matt Key, one of our Storage Applications Engineers, about why adding the Enterprise Flash Drives (EFDs) to arrays produces little if any benefit for large levels of writes. Turns out there is an upward limit on the bandwidth a single disk tray can handle and with the EFDs instead of disk drives the disk tray tops out at around 3000 (between 1600 and 3200) or so IOPS (based on a 64K stripe) so you actually need several trays (with a max of only 4 drives to a tray because of other limits) to get significant write IOPS. For comparison, the RamSan-500 can handle 25,000 sustained write IOPS. Now don’t get me wrong, the EFDs can improve the performance of certain types of loads when compared to a standard array with no EFDs, but if you are write-heavy you may wish to consider other technologies. Note: The calculations are based on a 200 megabyte/second FC-AL bandwidth with 64K writes, since RAID6 is used there are 2-64K writes for each write, 200MBS/64K=3200 IOPS, 200MBS/128K=1600 IOPS. These limitations apply to all array-based EFDs.
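The tray limit in the note above is simple bandwidth division. Here it is as a small Python sketch, using the 200 MB/s FC-AL figure, the 64K stripe and the doubled RAID6 write from the note. The function name and the "trays needed" extrapolation at the end are mine, for illustration only; they ignore the 4-drives-per-tray limit and any other array overhead.

```python
import math


def tray_write_iops(bandwidth_mb_s, stripe_kb, writes_per_io=1):
    """Write IOPS ceiling of one disk tray, limited by loop bandwidth.

    writes_per_io=2 models the two 64K writes per host write under RAID6.
    """
    kb_per_sec = bandwidth_mb_s * 1024
    return kb_per_sec / (stripe_kb * writes_per_io)


upper = tray_write_iops(200, 64)                   # 200MBS/64K
lower = tray_write_iops(200, 64, writes_per_io=2)  # 200MBS/128K

print(f"Per-tray ceiling: {lower:,.0f} to {upper:,.0f} write IOPS")

# Illustrative only: trays needed to reach the RamSan-500's
# 25,000 sustained write IOPS at each end of that range.
ramsan_500_write_iops = 25_000
print("Trays needed:",
      math.ceil(ramsan_500_write_iops / upper), "to",
      math.ceil(ramsan_500_write_iops / lower))
```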

The RamSan-500 makes an excellent complement to any enterprise array, especially if you use the preferred read technology to read from the RamSan-500 while writing to both, for example, when you are using array-based replication, such as SRDF, to provide geo-mirroring of the frame to a remote site. By offloading the reads, the number of writes that can be supported by the array can be increased as a factor of the percent of reads in the workload, thus increasing the performance of the entire system. As an example, if you have an 80/20 read/write workload and you offload the 80 percent of reads to the RamSan, this frees up the array to handle a factor of 4 more writes, up to the actual maximum IOPS of the array. This is a 4X increase in I/O with zero impact to infrastructure or BCVs.
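To make the 80/20 example concrete, here is a rough sketch in Python. It assumes the array is limited purely by total IOPS and that a read and a write cost the array the same; the 10,000 IOPS figure and the function name are hypothetical, chosen only for illustration.

```python
def write_headroom(array_max_iops: int, read_percent: int) -> dict:
    """Estimate how many more writes an array can take once reads are offloaded.

    Assumes the array is IOPS-bound and treats reads and writes as equal cost.
    """
    writes_before = array_max_iops * (100 - read_percent) / 100  # writes in the mixed workload
    writes_after = float(array_max_iops)                         # array now serves writes only
    additional = writes_after - writes_before
    return {
        "writes_before": writes_before,
        "writes_after": writes_after,
        "additional_factor": additional / writes_before,
    }


result = write_headroom(array_max_iops=10_000, read_percent=80)
print(result)
```

With these inputs the array goes from 2,000 writes per second to a ceiling of 10,000, which is 4 times the original write load in additional writes.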

Oh, on February 24-25 I’ll be in Charlotte, NC presenting at the Southeast Oracle Users Convention (SEOUC). My two presentations are: “My Ideal Data Warehouse System” and “Going Solid: Use of Tier Zero Storage in Oracle Databases.” I hope I see you there!

As I digest more of the information I obtained this week, I will try to write more blog entries. So for now I will sign off. Good bye from 37,000 feet over Colorado!

Mike

Wednesday, February 04, 2009

Do You Need Solid State Technology?

Many times I am asked the question “Should I buy solid state devices for my system?” and each time I have to answer “It depends.” Of course the conversation evolves beyond that point into the particulars of their system and how they are currently using their existing IO storage subsystem. However, the question raised is still valid; do you need SSD technology in your IO subsystem? Let’s look at this question.

SSD Advantages

SSD technology has one big advantage over your typical hard disk based storage array: SSD does not depend on physical movement for retrieval of data. Being non-dependent on physical movement for data retrieval means that you can significantly reduce the latency involved with each data retrieval operation, usually on the order of a factor of 10 (for Flash-based technology) to over 100 (for RAM-DDR-based technology). Of course cost increases as latency decreases with SSD technology, with Flash running about a quarter of the cost of RAM-DDR technology.

SSD Costs

Flash and DDR-based SSD technology are usually on a par with, or can be cheaper than, IO equivalent SAN based technology. Due to the much lower latency of SSD technology you can get many more input-output operations per second (IOPS) from them than you can from a hard disk drive system. For example, from the “slow” Flash-based technology you can get 100,000 IOPS with an average latency of 0.20 milliseconds worst case. From the fastest DDR based technology you can achieve 600,000 IOPS with a latency of 0.015 milliseconds.

To achieve 100,000 IOPS from hard drive technology you would need around 500 or more 15K rpm disks, since each drive peaks at around 200 IOPS for random IO, at between 2 and 5 milliseconds latency, regardless of its storage capacity. A 450 gigabyte 15K rpm disk drive may have up to 4 or more individual disk platters with 12 read-write heads (one for each side of the disks); however, these read-write heads are mounted on a single armature and are not capable of independent positioning. This limits the latency and IOPS to that of a single disk platter with two heads, so a 146 gigabyte 15K rpm drive will have the same IOPS and latency as a 450 gigabyte 15K rpm drive from the same manufacturer (http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah_15k_6.pdf).

Given that the IOPS and latency are the same regardless of storage capacity across the 146 to 450 gigabyte range of disk drive sizes, why not pick the smallest drive and save money? The reason is that to get the best latency you need to be sure not to fill the platters (from 2 to 4) in each disk drive more than about 30%, hence the need for so many drives. So to get high performance from your disk based IO subsystem you need to throw away 60-70% of your storage capacity!
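The drive-count and capacity arithmetic above can be sketched in a few lines of Python. The helper names are made up for illustration, the 200 IOPS per drive and 30% fill figures are the ones quoted in the text, and RAID overhead is deliberately ignored.

```python
import math


def drives_for_iops(target_iops: int, iops_per_drive: int = 200) -> int:
    """Number of drives needed to reach a random-IO target (hypothetical helper)."""
    return math.ceil(target_iops / iops_per_drive)


def usable_capacity_gb(drive_count: int, drive_size_gb: int, fill_percent: int = 30) -> float:
    """Capacity actually used if each drive is only filled to fill_percent."""
    return drive_count * drive_size_gb * fill_percent / 100


drives = drives_for_iops(100_000)          # 15K rpm drives needed for 100,000 IOPS
raw_gb = drives * 146                      # raw capacity using the smallest drive
used_gb = usable_capacity_gb(drives, 146)  # capacity used at a 30% fill
print(drives, raw_gb, used_gb)             # 500 73000 21900.0
```

Even with the smallest drive in the range, roughly 73 terabytes of raw disk is bought in order to use about 22 terabytes of it.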

Do I Need That Many IOPS?

Many critics of SSD technology state that most systems will never need 100,000 IOPS, and in many cases they are correct. However, in testing using a 300 gigabyte TPC-H (data warehouse) type test load using SSD I was able to get peak loads of over 100,000 IOPS using a simple 4 node Oracle11g Real Application Clusters-based setup. Since many systems are considerably larger than 300 gigabytes and have more users than the 8 users with which I reached 100,000 IOPS, it is not inconceivable that given the capability to achieve 100,000 IOPS of throughput many current databases would easily exceed that value. It must also be realized that the TPC-H system I was testing utilized highly optimized indexing, partitioning, and parallel query technology; eliminate any of these capabilities and the IOPS required increases, sometimes dramatically.

Questions to Ask

So now we reach the heart of the question, do you need SSD for your system? The answer depends on several questions which only you can answer:

1. Is my performance satisfactory? If yes then why are you asking about SSD?
2. Have you maximized use of memory and optimization technologies built into your database system? If no, then do this first.
3. Has my disk IO subsystem been optimized? (Enough disks and HBAs?)
4. Is my system spending an inordinate amount of time waiting on the IO subsystem?

If the answer to question 1 is no, and the answers to questions 2, 3 and 4 are yes, then you are probably a candidate for SSD technology. Don’t get me wrong, if I had my choice I would skip disk based systems altogether and use 100% SSD in any system I bought. However, you are probably locked into a disk-based setup with your existing system until you can prove it doesn’t deliver the needed performance. Let’s look closer at the 4 questions.

Question 1 is sometimes hard to answer quantitatively. Usually the answer to 1 is more of a gut reaction than anything that can be put on paper. The users of the system can usually tell you if the system is as fast as they need it to be. Another consideration in question 1 is: performance is fine now, but what if you grow by 25-50%? If your latency is at 3-5 milliseconds on the average now, adding more load may drive it much higher.

Question 2 will require analysis of how you are currently configured. An example from Oracle is that waits on db file sequential read can indicate that not enough memory has been allocated to cache the data blocks read as a result of index reads. So, even if the indexes are cached, the data blocks are not and must be read into the cache on each operation. A sequential read is an index-based read followed by a data table read and usually should be cached if there is sufficient memory. Another Oracle wait, db file scattered read, indicates full table scans are occurring. Usually full table scans can be mitigated by use of indexes or partitioning. If you have verified your memory is being used properly (perhaps everything that can be allocated has been) and you have utilized the proper database technologies and performance is still bad, then it is time to consider SSD.

A key source of wait information is of course the Statspack or AWR report for Oracle based systems. One additional benefit to the Statspack or AWR reports is that they both can contain a Cache Advisory sub-section that is used to actually determine if adding memory will help your system. By examining waits and looking at the cache advisory section of the report you can quickly determine if adding memory will help your performance. Another source of information about the database cache is the V$BH dynamic performance view. The V$BH view contains an entry for every block in the cache and with a little SQL against the view you can easily determine if there are any free blocks or if you have used all available and are in need of more. Of course use of the automated memory management features in 10g and 11g limits the usefulness of the V$BH view. In the Oracle Grid and Database Control interfaces (providing you have the proper licenses) you also get performance advisories which will tell you when you need more memory. Of course if you have already maximized the size of your physical memory, most of this is moot.

Question 3 may have you scratching your head. Essentially if your disk IO subsystem has reached its lowest latency and the number of IO channels (as determined by the number and type of host bus adapters) is such that no channel is saturated, then your disk-based system is optimized. Usually this is shown by latency being in the 3-5 millisecond range and still having high IO waits with low CPU usage.

Question 4 means you look at your system CPU statistics and you see that your CPUs are being underutilized and the IO waits are high, indicating the system is waiting on IO to complete before it can continue processing. A Unix or Linux based system in this condition may show high values for runqueue even when the CPU is idle.
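The four questions and the rule for combining them can be written down as a small checklist. This is just a sketch of the reasoning above, not a sizing tool; the class and field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SystemAssessment:
    """Yes/no answers to the four questions (hypothetical structure)."""
    performance_satisfactory: bool        # question 1
    memory_and_db_features_tuned: bool    # question 2
    disk_subsystem_optimized: bool        # question 3
    dominated_by_io_waits: bool           # question 4


def is_ssd_candidate(a: SystemAssessment) -> bool:
    """Question 1 must be 'no' and questions 2, 3 and 4 must all be 'yes'."""
    return (
        not a.performance_satisfactory
        and a.memory_and_db_features_tuned
        and a.disk_subsystem_optimized
        and a.dominated_by_io_waits
    )


example = SystemAssessment(
    performance_satisfactory=False,
    memory_and_db_features_tuned=True,
    disk_subsystem_optimized=True,
    dominated_by_io_waits=True,
)
print(is_ssd_candidate(example))
```

If question 2 or 3 comes back "no", the cheaper fix is to tune memory or the existing disk subsystem first and then ask the questions again.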

SSD Criticisms

Other critics of SSD technology cite problems with reliability and possible loss of data with Flash and DDR technologies. For some forms of Flash and DDR they are correct, if the Flash isn’t wear leveled properly or the DDR is not properly backed up. However, as long as the Flash technology utilizes proper wear leveling and the RAM-DDR system uses proper battery backup with permanent storage on either a Flash or hard disk based subsystem, then those complaints are groundless.

The final criticism of SSD technology is usually that the price is still too high compared to disks. I look at an advertisement from a local computer store and I see a terabyte disk drive for $99.00; it is hard for SSD to compete with that low base cost. Of course, I can’t run a database on a single disk drive. Given our 300 gigabyte system, I was hard-pressed to get reasonable performance placing it on 28 high performance 15K rpm disk drives; most shops would use over 100 drives to get performance. So on a single disk to SSD comparison, yes, cost would appear to be an issue; however, you must look at other aspects of the technology. To achieve high performance most disk-based systems utilize specialized controllers and caching technology and spread IO across as many disk drives as possible, so that only 20-30% of each disk drive is ever actually used. This is known as short-stroking the drive. The disks are rarely used individually; instead they are placed in a RAID array (usually RAID 5, RAID 10, or some exotic RAID technology). Once the cost of additional cabinets, controllers, and other support technology is added to the base cost of the disks, not to mention any additional firmware costs added by an OEM, the costs soon level out between SSD and standard hard drive SAN systems.

In a review of benchmark results, the usual ratio between the capacity provisioned to achieve performance and the capacity actually needed is 40-50 to 1, meaning our 300 gigabyte TPC-H system would require at least 12 terabytes of storage to provide adequate performance, spread over 200 or more disk drives. To contrast that, an SSD based system would only need a factor of 2 to 1 (to allow for the indexes and support files).
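Here is the same capacity comparison worked through in Python. It is a back-of-the-envelope sketch using only the ratios quoted above; the function name is made up for illustration.

```python
def provisioned_capacity_gb(database_gb: int, ratio: int) -> int:
    """Storage to provision for a given database size and provisioning ratio."""
    return database_gb * ratio


database_gb = 300
disk_low_gb = provisioned_capacity_gb(database_gb, 40)   # low end of the 40-50 to 1 range
disk_high_gb = provisioned_capacity_gb(database_gb, 50)  # high end of the range
ssd_gb = provisioned_capacity_gb(database_gb, 2)         # 2 to 1 for indexes and support files
print(disk_low_gb, disk_high_gb, ssd_gb)                 # 12000 15000 600
```

So the disk-based system is provisioned at 12 to 15 terabytes for the same 300 gigabyte database that fits in about 600 gigabytes of SSD.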

In addition to the base equipment costs, most disk arrays consume a large amount of electricity which then results in larger heat loads for your computer center. In many cases the SSD technology only consumes a fraction of the energy and cooling costs of regular disk based systems, providing substantial electrical and cooling cost savings over their lifetimes. SSD by its very nature is green technology.

When Doesn’t SSD help?

SSD technology will not help CPU bound systems. In fact, SSD may increase the load on overworked CPUs by reducing IO based waits. Therefore it is better to resolve any CPU loading issues before considering a move to SSD technology.

In Summary

The basic rule for determining if your system would benefit from SSD technology is that if your system is primarily waiting on IO then SSD technology will help mitigate the IO wait issue.

Saturday, January 31, 2009

Scuba Diving New York City

A few months ago I wrote a blog about global warming and some possible mitigating actions we could all take (here: http://mikerault.blogspot.com/2008_04_01_archive.html). Some might have been a bit off the wall, but they were meant to make people think. Of course now many leading scientists are saying there is nothing we can do to radically affect global warming trends for the next 1000 years (http://www.abc.net.au/news/stories/2009/01/27/2475687.htm). The article pretty much echoed the sentiments I expressed in an earlier blog of mine, in which I stated that while we as a species may have added 1 or 2 percent to the overall picture, we are on a warming and CO2 curve that is following a natural cycle that seems to occur about every 100,000 years or so, according to studies of ice cores from the Vostok site in Antarctica (http://www.daviesand.com/Choices/Precautionary_Planning/New_Data/). If there were a direct correlation between temperature and CO2 concentration then we would all be parboiled, since we are at a level almost 100 PPM above the highest historical levels and so the temperature should be 5-10 degrees above what it currently is. In fact, we are actually cooler by up to 2 degrees than we should be according to the historical data.

Perhaps my most wild-eyed suggestion was to place a solar shield between the Earth and the sun to reduce the amount of solar energy reaching the Earth. Oddly, the most wild-eyed suggestion seems to be the only one that would make a difference at all. Unless we find a way to reduce the amount of solar energy reaching the Earth’s surface we can expect global temperatures to increase steadily until there are no ice caps, Arctic or Antarctic.

The net effect of the melting of the Arctic ice cap would be negligible since it is in effect floating, so the change in sea levels would be near zero; however, the polar bears would have a few issues. The biggest problem would be the Antarctic ice sheet, which is resting on the continent of Antarctica. As it melts and adds to the water levels, the weight compressing the Antarctic land mass decreases, causing rebound. Between the added water level and the displacement from rebound we are talking over 100 feet of additional water levels around the world. Say good bye to New York, London, Hong Kong, Tokyo, almost all of Florida, heck, most of our seaports. You think Bangladesh has problems with flooding now, just wait.

We need to start planning and doing something now. By reducing the amount of sunlight we receive by 10% we could nip this issue in the bud before 42nd street is considered an advanced level scuba dive. This would be done by placing a single large sunscreen or multiple smaller sunscreens at the L1 Lagrange point. Maybe by making these sunscreens into thermionic generators or simple solar cell arrays we could also provide large amounts of energy which could be sent back to Earth via microwave beams for use as a green energy source (http://gltrs.grc.nasa.gov/reports/2004/TM-2004-212743.pdf). To block 10 percent of the energy of the sun that reaches the Earth sounds a bit crazy but it could be done.

It could even be done by spreading large clouds of metallic debris (how about all those aluminum cans we see lining the roadways?) or by using a few large asteroids pushed into place using low thrust ion drives (http://www.grc.nasa.gov/WWW/ion/).

It is time to look out for all of us, put aside petty (in the scheme of a global warming scale disaster) differences and pull together to really do something to fix this problem. Driving a fuel efficient car and turning your thermostat down in the winter and up in the summer may give you a warm and fuzzy (or cold and fuzzy, depending on the season) feeling, but it won’t amount to a hill of beans when it comes to helping fix global warming.

Wednesday, January 07, 2009

Database Entomology in Oracle11g Land

For the past several months I have been working with Oracle11g, 11.1.0.6 to be exact, doing TPC-H runs, tuning exercises and putting it through its paces. My platform is a 4-node, 32 CPU 64 bit Dell cluster connected to both a solid state disk (SSD) storage array set and standard 15K hard drive (HD) JBOD arrays. On this test rack I created a 300 gigabyte TPC-H test database, well, actually 600 gigabytes, with dual identical TPC-H setups in the same database, one on SSD and the other on HD.

As a pre-test I had created a small TPC-H (not more than 30 gigabytes) and on a single-server rig I could run the complete 22 query TPC-H set of queries in parallel query. Of course I wasn’t using high levels of partitioning and parallel query simultaneously with the small data set.

I knew it would be interesting when I got to query 9 on the 300 gigabyte dataset and on query 9 I got an ORA-00600:

ORA-00600: internal error code, arguments [kxfrGraDistNum3],[65535],[4]

When I ran an 8 stream randomized order query run I also periodically received:

ORA-12801: error signaled in parallel query server P009, instance dpe-d70:TMSTPCH3 (3)
ORA-00600: internal error code, arguments: [qks3tStrStats4], [], [], [], [], [], [], []

This happened on query 18 once in a while, but not on every run. Due to not being a full customer (only having a partner CSI number) I was unable to report these as possible bugs. I did completely check out OTN and Metalink as well as Google, and no one else seems to be having these issues. Of course, how many folk are running Oracle11g 11.1.0.6 or 11.1.0.7 with RAC, cross instance parallel query and heavy partitioning and sub-partitioning?

I had a quick look at the Oracle11g 11.1.0.7 release notes and saw a load of bug fixes and hoped mine were covered, even though a text search didn’t show up the ORA-00600 arguments I received. So I bit the Oracle bullet and performed an upgrade.

Well, actually 2 sets of upgrades. First I upgraded my home, 32-bit (this is important later) servers. Other than the usual documentation gotchas and needing to set the database as exclusive, start it, stop it and then reset it as a cluster before the dbua program would run properly, I was successful and now have an Oracle11g 11.1.0.7 instance running on my home RAC setup. A quick run against the 30 GB database showed no really stellar improvements against my test setup using JBOD arrays for TPC-H, and it successfully ran Query 9 against the non-partitioned, smaller 30 gigabyte data set.

Feeling pleased with the success, I immediately set out the next morning to update my large test environment, my 64 bit cluster. The CRS update went smoothly, and other than some space issues (you may want to add a datafile to your SYSTEM tablespace) and some package problems (for some reason DBMS_SQLTUNE and DBMS_ADVISOR were missing) the database upgrade went fine, right up to the point of starting the instances under 11.1.0.7. It seems there is just a small bug with the use of the new MEMORY_TARGET parameter and release 11.1.0.7…you can’t go above 3 gigabytes! This is why I said that the upgrade on 32 bits was important to remember: in 32 bit systems you will rarely get above a 3 gigabyte SGA size once you allow for user logins, process space and operating system memory needs. However, one of the major reasons for going to 64 bit is to have SGA sizes in Oracle greater than 4 gigabytes. Now, if you go back to using SGA_MAX_SIZE and SGA_TARGET or the full manual specifications such as SHARED_POOL_SIZE and DB_CACHE_SIZE you can get above the 3 gigabyte setting.

Another annoying thing with 11g and the MEMORY_MAX_TARGET setting is that the sum of your SGA settings plus PGA_AGGREGATE_TARGET cannot exceed MEMORY_MAX_TARGET. Now for those of you with small sort sizes this isn’t really a problem and you probably won’t have any issues. However, with a TPC-H you need a large sort size so you need a large PGA_AGGREGATE_TARGET, and you quickly get into trouble with the limits on MEMORY_MAX_TARGET and large PGA_AGGREGATE_TARGET values. With my 16 gigabytes of memory per server I was only able to allocate about 8 gigabytes to Oracle (actually about 7.5 in 11.1.0.6 and 3 in 11.1.0.7); anything larger and I would get errors. So needless to say, I turned off the total memory management and did it the old fashioned way.

Finally, I had my instances up, with about a 7 gigabyte SGA including 5 gigabytes of DB_CACHE_SIZE, plus 5.5 gigabytes of PGA_AGGREGATE_TARGET. Oh, did I mention, your SHARED_POOL_SIZE must be at least 600 megabytes for Oracle11g 11.1.0.7? If it isn’t you will get ORA-04030 errors on startup. I ended up with 750 megabytes worth.
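As a quick sanity check on those numbers, the memory budget can be added up in a few lines of Python. This is only a sketch of the arithmetic; the function names are invented, and the 8 and 3 gigabyte ceilings are the automatic memory management limits I ran into above, not documented Oracle values.

```python
def oracle_memory_gb(sga_gb: float, pga_target_gb: float) -> float:
    """Total memory requested for Oracle: SGA plus PGA_AGGREGATE_TARGET."""
    return sga_gb + pga_target_gb


def fits_under_ceiling(sga_gb: float, pga_target_gb: float, ceiling_gb: float) -> bool:
    """True if SGA plus PGA target stays within an automatic memory ceiling."""
    return oracle_memory_gb(sga_gb, pga_target_gb) <= ceiling_gb


physical_gb = 16.0                          # memory per server
total_gb = oracle_memory_gb(7.0, 5.5)       # manual settings I ended up with
left_for_os_gb = physical_gb - total_gb     # what remains for the OS and processes
print(total_gb, left_for_os_gb)             # 12.5 3.5

# Neither observed ceiling would have allowed these settings.
print(fits_under_ceiling(7.0, 5.5, 8.0), fits_under_ceiling(7.0, 5.5, 3.0))  # False False
```

Setting the sizes manually let me hand 12.5 of the 16 gigabytes to Oracle, well beyond what I could get with total memory management turned on.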

So after 8 hours of upgrade time for my main set of instances I was finally ready to run a TPC-H. Guess what, with the upgrade in place, more cache and bigger sort areas, I seem to be getting worse performance than with the sub-optimal query resolving, bug-ridden 11.1.0.6 version. Looks like they fed the bugs instead of killing them. Oh well, back to the tuning bench.
