Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Thursday, December 04, 2008

Happy Holiday Shopping - Not!

Recently, due to the dramatic increase in the number of online shoppers and unplanned-for peaks in attempted access and logins, several major retail websites have experienced anything from dramatic slowdowns to outright failure. Of note are the recent problems at the Sears and Dr. Pepper websites.

Sears and Others:

Dr. Pepper:

In order to understand what is happening you must understand what occurs when a user attempts to access a website, let alone when they attempt a complex transaction. When a user logs in to a website the user identification must be validated from a database using several database queries. For example: Is the user ID valid? Is the password correct? Has the password expired? What type of user is this? etc. It is even worse if the user has to create a login and password as well as enter other data such as address or credit data. All of this transactional traffic causes a flurry of underlying IO subsystem activity and web traffic across the networks. And of course all of this is magnified when actual query and sales transactions are also being performed.
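To make the fan-out concrete, here is a minimal sketch of a login check that issues one query per question listed above. The `users` table, its columns, and the `validate_login` helper are hypothetical names made up for illustration (not any particular retailer's schema), and an in-memory SQLite database stands in for the real one.

```python
import hashlib
import sqlite3


def validate_login(conn, user_id, password, today):
    """Return (accepted, user_type, queries_issued) for one login attempt."""
    queries = 0

    # Query 1: is the user ID valid?
    row = conn.execute(
        "SELECT 1 FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    queries += 1
    if row is None:
        return False, None, queries

    # Query 2: is the password correct?
    # (A real site should use a salted, slow password hash; SHA-256 keeps the sketch short.)
    digest = hashlib.sha256(password.encode()).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM users WHERE user_id = ? AND password_hash = ?",
        (user_id, digest),
    ).fetchone()
    queries += 1
    if row is None:
        return False, None, queries

    # Query 3: has the password expired? ISO dates compare correctly as strings.
    row = conn.execute(
        "SELECT password_expires FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    queries += 1
    if row[0] < today:
        return False, None, queries

    # Query 4: what type of user is this?
    row = conn.execute(
        "SELECT user_type FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    queries += 1
    return True, row[0], queries


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id TEXT PRIMARY KEY, password_hash TEXT, "
    "password_expires TEXT, user_type TEXT)"
)
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    ("shopper1", hashlib.sha256(b"secret").hexdigest(), "2009-01-31", "customer"),
)

print(validate_login(conn, "shopper1", "secret", "2008-12-04"))
```

A successful login in this sketch costs four round trips to the database before the shopper has looked at a single product, and every one of those can turn into physical IO when the data is not cached.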

All of the traffic to the system database can overwhelm the underlying IO subsystem, especially when it is disk array based, which is the case with a majority of databases. Disks generally can only respond within a 5 millisecond window. With on-disk caching and large caches in the disk arrays this response time can sometimes be reduced to 1 millisecond, but as the caches flood under high activity, performance generally drops back to 5 milliseconds per IO or more. As the time to respond increases, the number of users that can be served drops, a direct example of Little’s Law: the number of requests in flight equals throughput multiplied by response time, so for a fixed number of outstanding requests, throughput falls in proportion as latency rises.
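As a rough illustration of Little's Law (L = λW, requests in flight = throughput × response time), the sketch below computes the throughput a storage system can sustain at the latencies mentioned above. The figure of 32 outstanding requests is an arbitrary assumption chosen for the example, not a measured value.

```python
def max_throughput(outstanding_requests, latency_ms):
    """Little's Law rearranged: throughput = requests in flight / response time."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return outstanding_requests / (latency_ms / 1000.0)  # IOs per second


OUTSTANDING = 32  # hypothetical number of IOs the system keeps in flight

for label, latency_ms in [("cache hit", 1.0), ("cache flooded", 5.0)]:
    iops = max_throughput(OUTSTANDING, latency_ms)
    print(f"{label}: {latency_ms} ms per IO -> about {iops:,.0f} IOPS")
```

With the same number of requests in flight, going from 1 millisecond to 5 milliseconds per IO cuts the sustainable rate to one fifth.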

Unfortunately, while you can increase the IOPS (input/output operations per second) by increasing the number of disks in an array, you cannot decrease the latency below that of any one disk. In fact, for a large read you will suffer from a convoy effect, where the slowest disk involved in the IO operation sets the completion time for the whole IO.
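A small sketch of both effects, under the simplifying assumption that each disk serves one IO at a time and that a striped read completes only when its slowest member disk responds. The function names and the latency values are illustrative, not measurements.

```python
def aggregate_iops(per_disk_latencies_ms):
    """Independent small IOs: each disk contributes 1000 / latency IOs per second."""
    return sum(1000.0 / latency for latency in per_disk_latencies_ms)


def striped_read_latency_ms(per_disk_latencies_ms):
    """A read striped across disks finishes when the slowest disk finishes."""
    return max(per_disk_latencies_ms)


four_disks = [5.0] * 4
eight_disks = [5.0] * 8

# Doubling the disks doubles the IOPS but leaves the latency where it was.
print(aggregate_iops(four_disks), striped_read_latency_ms(four_disks))
print(aggregate_iops(eight_disks), striped_read_latency_ms(eight_disks))

# One slow disk in the stripe drags the whole read down with it.
print(striped_read_latency_ms([5.0, 5.0, 5.0, 9.0]))
```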

Several websites have found that by placing the user tables and other transaction-dependent tables on low latency storage, such as solid state SAN replacements like the RamSan 400 or 500 series from Texas Memory Systems, they can dramatically increase the capability to support many more transactions (read: clients) than before. An example of the dramatic improvements that can be achieved is shown in the recent press release from The Container Store group:


Realizing that their underlying IO subsystem couldn’t support the expected peak load generated from a positive blurb about the company on The Oprah Winfrey Show, the folks at The Container Store turned to TMS RamSan technology.

Another area that must support high concurrent logins and maintain strict inventories is in the online gaming community. Eve Online was able to go from 15,000 online users to over 17,000 with a 40x performance improvement:


As a final example, IC Source first tried doubling the number of disks in their infrastructure. The net result? Zero improvement. They ultimately solved the problem with SSD (RamSan) technology:


What all of these examples show is that many times your problem cannot be solved by throwing more disks at it. You must get to the root problem, latency, to fix the issue.

With latency of 15 microseconds (0.015 milliseconds) and the resulting ability to support 600,000 IOPS and 4.5 GB/sec, the RamSan-440 is the heavy hitter in the TMS line as far as throughput. However, it is DDR RAM based and currently limited to 0.5 terabytes of storage capacity per unit. Compared to normal disk latency of 5 milliseconds, the RamSan-440 shows a factor of 333 decrease in latency. Even if you get 1.0 millisecond latency from disk due to short-stroking and aggressive caching, the 440 is still 67 times faster.
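The latency factors quoted above are simple ratios, and can be checked with a few lines. The helper name `latency_ratio` is just for this sketch.

```python
def latency_ratio(disk_latency_ms, ssd_latency_ms):
    """How many times lower the SSD latency is than the disk latency."""
    return disk_latency_ms / ssd_latency_ms


RAMSAN_440_MS = 0.015  # 15 microseconds, as quoted in the post

print(round(latency_ratio(5.0, RAMSAN_440_MS)))  # typical disk latency
print(round(latency_ratio(1.0, RAMSAN_440_MS)))  # short-stroked, heavily cached disk
```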


The RamSan-500 series, utilizing Flash memory technology, tops out at 2 terabytes of storage capacity per unit (with promises to go to 8 terabytes in the near future) and 200 microsecond (0.2 millisecond) peak latency, with a minimum rating of 100,000 IOPS. Just to put this in perspective, EMC recently achieved 100,000 IOPS using 2-CX30 racks and 391 disk drives in a RAID0 configuration. The RamSan-500 does it with a single 4U unit, and the RamSan-440 beats it by a factor of 6 in a footprint similar to that of the RamSan-500.
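To put the comparison in per-drive terms, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes, optimistically for the disk array, that IOPS scale linearly with the number of drives; the function names are made up for the example.

```python
def iops_per_drive(total_iops, drive_count):
    """Average IOPS contributed by each drive in the array."""
    return total_iops / drive_count


def drives_needed(target_iops, reference_iops, reference_drives):
    """Drives required to reach target_iops, assuming linear scaling (rounded up)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -((-target_iops * reference_drives) // reference_iops)


EMC_IOPS, EMC_DRIVES = 100_000, 391  # figures quoted in the post
RAMSAN_440_IOPS = 600_000            # figure quoted in the post

print(round(iops_per_drive(EMC_IOPS, EMC_DRIVES)))
print(RAMSAN_440_IOPS // EMC_IOPS)
print(drives_needed(RAMSAN_440_IOPS, EMC_IOPS, EMC_DRIVES))
```

Even under that generous assumption, matching the RamSan-440 on raw IOPS would take thousands of spindles, and none of them would bring the per-IO latency down.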


Being solid state (except for the cooling fans) means the RamSan technology is inherently more reliable and less prone to crashes. Current estimates of MTBF show a value of at least 500,000 hours per RamSan before a critical failure. With built-in RAID, wear leveling for Flash, and the use of ECC memory as well as ChipKill, the RamSans have built-in redundancy. Utilizing Flash drives for backup, the 440 also provides unparalleled data persistence, with triple battery backup ensuring that all data is written to Flash before shutdown. The RamSan-500 also uses battery backup to ensure that the contents of its 64 gigabytes of DDR cache are written to Flash on shutdown.

Another big movement is the green technology push we see today. I’ll leave the math on figuring the electrical and cooling costs of disk arrays to you, but at under 300 watts for the RamSan-500 and 600 watts for the RamSan-440 it is easy to see the cost savings from the energy footprint reduction.
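For anyone who wants to do that math, here is a minimal sketch. Only the 300 and 600 watt figures and the 391-drive count come from this post; the electricity rate, the cooling overhead, and the watts per disk drive are placeholder assumptions that you should replace with your own numbers.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts):
    """Energy drawn in a year by a device running continuously."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts, rate_per_kwh, cooling_factor):
    """Yearly cost including cooling, modeled as a multiple of the device's own draw."""
    return annual_kwh(watts) * (1 + cooling_factor) * rate_per_kwh


RATE_PER_KWH = 0.10    # assumed electricity price; substitute your own
COOLING_FACTOR = 1.0   # assumed: one watt of cooling per watt of load
WATTS_PER_DISK = 15    # assumed draw per drive; check your array's spec sheet
DISK_COUNT = 391       # drive count from the EMC comparison in this post

devices = [
    ("RamSan-500", 300),
    ("RamSan-440", 600),
    ("391-drive array (assumed wattage)", DISK_COUNT * WATTS_PER_DISK),
]
for name, watts in devices:
    cost = annual_cost(watts, RATE_PER_KWH, COOLING_FACTOR)
    print(f"{name}: {annual_kwh(watts):,.0f} kWh/year, about {cost:,.2f} per year")
```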

So what should you take away from all of this? Essentially, if you have reached the latency limit of your IO subsystem, increasing the number of disks will not help. The only way to improve the performance of the system is to reduce overall latency. If you are an online retailer looking at the coming holiday season with dread because of performance issues, look at using SSD technology such as the TMS RamSan 400/500 series to slay the latency monster and achieve stellar website performance.