Mike Ault's thoughts on various topics, Oracle related and not. Note: I reserve the right to delete comments that are not contributing to the overall theme of the BLOG or are insulting or demeaning to anyone. The posts on this blog are provided “as is” with no warranties and confer no rights. The opinions expressed on this site are mine and mine alone, and do not necessarily represent those of my employer.

Thursday, December 04, 2008

Happy Holiday Shopping - Not!

Recently, due to a dramatic increase in the number of online shoppers and unplanned-for peaks in attempted access and logins, several major retail websites have experienced anything from dramatic slowdowns to outright failure. Of note are the recent problems at the Sears and Dr. Pepper websites.

Sears and Others:
http://money.cnn.com/2008/11/28/technology/bc.apfn.tec.holidayshop.ap/index.htm
http://tasquatch-sentinelling.blogspot.com/2008/11/blog-post_7519.html

Dr. Pepper:
http://www.bizjournals.com/atlanta/stories/2008/12/01/daily45.html

To understand what is happening, you must understand what occurs when a user attempts to access a website, let alone when they attempt a complex transaction. When a user logs in to a website, the user identification must be validated against a database using several database queries. For example: Is the user ID valid? Is the password correct? Has the password expired? What type of user is this? It is even worse if the user has to create a login and password as well as enter other data such as address or credit data. All of this transactional traffic causes a flurry of activity in the underlying IO subsystem and web traffic across the networks. And of course all of this is magnified when actual query and sales transactions are also being performed.

All of the traffic to the system database can overwhelm the underlying IO subsystem, especially when it is disk array based, which is the case with a majority of databases. Disks can generally respond only within about 5 milliseconds. With on-disk caching and large caches in the disk arrays this response time can sometimes be reduced to 1 millisecond, but as the caches flood under high activity, performance generally drops back to 5 milliseconds per IO or more. As the time to respond increases, the number of users that can be served drops, a direct example of Little’s Law: with a fixed number of requests in flight, throughput is inversely proportional to response time.
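To make the Little's Law point concrete, here is a minimal Python sketch. The function name and the figure of 32 requests in flight are assumptions chosen purely for illustration; only the 1 and 5 millisecond latencies come from the discussion above.

```python
def sustainable_iops(outstanding_requests: int, latency_ms: float) -> float:
    """Little's Law: L = lambda * W, rearranged to lambda = L / W.

    outstanding_requests (L) is the number of IOs in flight at once,
    latency_ms (W) is the average time each IO takes to complete.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return outstanding_requests / (latency_ms / 1000.0)


if __name__ == "__main__":
    # 32 in-flight requests is a hypothetical queue depth, not a measured value.
    for latency_ms in (1.0, 5.0, 10.0):
        iops = sustainable_iops(32, latency_ms)
        print(f"{latency_ms:>5.1f} ms -> {iops:,.0f} IOPS")
```

With the number of requests in flight held constant, going from 1 millisecond to 5 milliseconds per IO cuts the sustainable rate by a factor of five, which is the drop in served users described above.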

Unfortunately, while you can increase the IOPS (input/output operations per second) by increasing the number of disks in an array, you cannot decrease the latency below that of any one disk. In fact, for a large read you will suffer from the convoy effect, where the slowest disk involved in the IO operation determines the completion time of the IO as a whole.
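A rough Python sketch of both effects follows. It assumes a deliberately simplified model in which each disk services one IO at a time and a striped read finishes only when its slowest member finishes; the function names and the sample latencies are hypothetical.

```python
from typing import Sequence


def aggregate_iops(disk_count: int, latency_ms: float) -> float:
    """Total IOPS if every disk independently serves one IO per latency_ms."""
    return disk_count * (1000.0 / latency_ms)


def striped_read_latency_ms(per_disk_latency_ms: Sequence[float]) -> float:
    """A read striped across several disks completes when the slowest one does."""
    return max(per_disk_latency_ms)


if __name__ == "__main__":
    # Doubling the disks doubles aggregate IOPS in this model...
    print(aggregate_iops(8, 5.0))    # 1600.0
    print(aggregate_iops(16, 5.0))   # 3200.0
    # ...but a single striped read is still gated by its slowest disk.
    print(striped_read_latency_ms([4.0, 5.0, 9.0, 5.0]))  # 9.0
```

In this model no number of added spindles brings the per-IO time under the latency of the slowest disk touched by the read.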

Several websites have found that by placing the user tables and other transaction-dependent tables on low latency storage, such as solid state SAN replacements like the RamSan 400 or 500 series from Texas Memory Systems, they can dramatically increase their capability to support many more transactions (read clients) than before. An example of the dramatic improvements that can be achieved is shown in the recent press release from The Container Store group:

http://www.superssd.com/pressrelease/2008-12-02.htm

Realizing that their underlying IO subsystem couldn’t support the expected peak load generated from a positive blurb about the company on The Oprah Winfrey Show, the folks at The Container Store turned to TMS RamSan technology.

Another area that must support high concurrent logins and maintain strict inventories is the online gaming community. Eve Online was able to go from 15,000 online users to over 17,000 with a 40x performance improvement:

http://www.superssd.com/success/ccpgames.htm

As a final example, IC Source first tried doubling the number of disks in their infrastructure. The net result? Zero improvement. They ultimately solved the problem with SSD (RamSan) technology:

http://www.superssd.com/success/icsource.htm

What all of these examples show is that many times your problem cannot be solved by throwing more disks at it. You must get to the underlying problem, latency, to fix the issue.

With latency of 15 microseconds (0.015 milliseconds) and the resulting ability to support 600,000 IOPS and 4.5 GB/sec, the RamSan-440 is the heavy hitter in the TMS line as far as throughput is concerned. However, it is DDR RAM based and currently limited to 0.5 terabytes of storage capacity per unit. Compared to normal disk latency of 5 milliseconds, the RamSan-440 shows a factor of 333 decrease in latency. Even if you get 1.0 millisecond latency from disk due to short-stroking and aggressive caching, the 440 is still 67 times faster.

http://www.superssd.com/products/RamSan-440/
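The latency factors quoted for the RamSan-440 are simple ratios, and a few lines of Python reproduce them. The helper name is hypothetical; the 5 millisecond, 1 millisecond, and 0.015 millisecond figures are the ones given in this post.

```python
def latency_speedup(disk_latency_ms: float, ssd_latency_ms: float) -> float:
    """How many times lower the SSD latency is than the disk latency."""
    if ssd_latency_ms <= 0:
        raise ValueError("ssd_latency_ms must be positive")
    return disk_latency_ms / ssd_latency_ms


if __name__ == "__main__":
    ramsan_440_ms = 0.015  # 15 microseconds
    print(round(latency_speedup(5.0, ramsan_440_ms)))  # 333 (typical disk)
    print(round(latency_speedup(1.0, ramsan_440_ms)))  # 67 (short-stroked, cached disk)
```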

The RamSan-500 series, which uses Flash memory technology, tops out at 2 terabytes of storage capacity per unit (with promises to go to 8 terabytes in the near future) and has a 200 microsecond (0.2 millisecond) peak latency with a minimum rating of 100,000 IOPS. Just to put this in perspective, EMC recently achieved 100,000 IOPS using 2-CX30 racks and 391 disk drives in a RAID0 configuration. The RamSan-500 does it with a single 4U unit, and the RamSan-440 beats it by a factor of 6 in a footprint similar to that of the RamSan-500.

http://www.superssd.com/products/RamSan-500/
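For a sense of scale, the comparison with the 391-drive disk configuration can be worked out in a short Python sketch. The function names are hypothetical, and the per-drive figure is just the quoted total divided evenly across drives, not a measured per-disk rating.

```python
def iops_per_drive(total_iops: int, drive_count: int) -> float:
    """Average IOPS each drive contributes if the load is spread evenly."""
    return total_iops / drive_count


def iops_ratio(iops_a: int, iops_b: int) -> float:
    """How many times more IOPS system A delivers than system B."""
    return iops_a / iops_b


if __name__ == "__main__":
    disk_array_iops = 100_000   # quoted 391-drive RAID0 result
    ramsan_440_iops = 600_000   # quoted RamSan-440 rating
    print(round(iops_per_drive(disk_array_iops, 391)))   # 256
    print(iops_ratio(ramsan_440_iops, disk_array_iops))  # 6.0
```

Roughly 256 IOPS per spindle is what it took 391 drives to reach a figure that a single unit is rated for.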

Being solid state (except for the cooling fans) means the RamSan technology is inherently more reliable and less prone to crashes. Current estimates of MTBF show a value of at least 500,000 hours per RamSan before a critical failure. With built-in RAID, wear leveling for Flash, and the use of ECC memory as well as ChipKill, the RamSans have built-in redundancy. Utilizing Flash drives for backup, the 440 also provides unparalleled data persistence, with triple battery backup ensuring that all data is written to Flash before shutdown. The RamSan-500 also uses battery backup to ensure that the 64 gigabytes of DDR cache is written to Flash on shutdown.

Another big movement is the green technology push we see today. I’ll leave the math on figuring the electrical and cooling costs for disk arrays to you, but at under 300 watts for the RamSan-500 and 600 watts for the RamSan-440, it is easy to see the cost savings from the reduced energy footprint.
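If you want a starting point for that math, here is a minimal Python sketch. Only the 300 watt and 600 watt figures come from this post. The electricity price and the per-drive wattage used for the disk array are placeholder assumptions that you should replace with your own numbers, and cooling is not included.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_kwh(watts: float) -> float:
    """Energy used in a year by a device drawing `watts` continuously."""
    return watts * HOURS_PER_YEAR / 1000.0


def annual_cost(watts: float, price_per_kwh: float) -> float:
    """Yearly electricity cost, excluding cooling."""
    return annual_kwh(watts) * price_per_kwh


if __name__ == "__main__":
    price = 0.10             # ASSUMPTION: placeholder price per kWh
    watts_per_drive = 15.0   # ASSUMPTION: placeholder draw per disk drive
    systems = {
        "RamSan-500": 300.0,                             # from the post
        "RamSan-440": 600.0,                             # from the post
        "391-drive array (assumed)": 391 * watts_per_drive,
    }
    for name, watts in systems.items():
        print(f"{name}: {annual_kwh(watts):,.0f} kWh, {annual_cost(watts, price):,.2f} per year")
```

Plug in your own drive count, measured wattage, and utility rate; the placeholder values are there only so the script runs.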

So what should you take away from all of this? Essentially, if you have reached the latency limit of your IO subsystem, increasing the number of disks will not help. The only way to improve the performance of the system is to reduce overall latency. If you are an online retailer looking at the coming holiday season with dread because of performance issues, look at using SSD technology such as the TMS RamSan 400/500 series to slay the latency monster and achieve stellar website performance.

5 comments:

Mike said...

Nice to have you along for the ride!

Unknown said...

I think RAM SAN is the biggest boon to an RDBMS-centric system since ... well ... ever!

I wonder how many millions of dollars are spent world-wide by companies on trying to perfect IO-bound SQL, when RAM SAN would fix the issue overnight?

Finally, throwing hardware at a problem can produce amazingly good results without the need to wheel-in bearded consultants who spend ages fiddling with some rogue SQL, only to pipe-up with the classic "You need to refactor a lot of your code, because it's IO-bound"!

Mike said...

Richard,

Not just RDBMS systems: any system that needs low latency IO, such as video or audio editing, image rendering, geospatial data retrieval and processing, etc.

Unknown said...

@Mike:

Absolutely. I have a techie question: is the current speed of RAM SAN likely to be improved, or is it as fast as it's ever likely to be?

Mike said...

The engineers at TMS say:

Matt Key:

Most storage vendors use commodity processors to achieve their performance. This allows for higher aggregate performance as each new generation CPU is released. Unfortunately, the ability to run parallel CPU cores does nothing for latency.

RamSans use a very specialized logic in their controllers that achieves latency lower than any commodity processor can achieve. So, in terms of latency, the RamSan will always be faster than any upgrade to new generation CPUs.

As for peak performance numbers, such as I/Os per second or megabytes per second, RamSans are continually designed to exceed single-server capabilities. Adding more powerful controllers is something that is done to meet the demands of today’s intensive applications. This is quite visible with our flagship DRAM system, the RamSan-440, with 600,000 IOPS and 4.5 GBps of bandwidth while still maintaining the extremely low latency of 15 microseconds that our previous generations have offered.

Jamon Bowen:

There are 2 product lines, a RAM based SSD and a Flash based SSD. The RAM based SSD performance is expected to increase as new interfaces become available. These systems are for the most latency sensitive applications where every microsecond counts.

The Flash based SSD is expected to remain at the same performance and possibly even decline, as the Flash chips are getting bigger and cheaper rather than faster. The performance of these chips is already 30x faster than disks from a latency perspective, and the price is getting so close to the price of enterprise 15k disks that the focus is on reducing the cost per GB without hurting the performance. If they can get to a similar cost per GB and still be 20x faster, that works great. If performance is what is absolutely required, then buy the RAM based SSD (it is more than 10x faster than the Flash).

And Me:

We are subject to the same constraints as the rest of the industry. As soon as faster technology becomes available, we incorporate it as the demand for it becomes a factor in the marketplace. Currently we haven't had much call for faster than the 15 microsecond response times we achieve with the RAMSAN 440, and many times the 200 microsecond response time of the RAMSAN 500 is more than enough for most applications.

So the short answer is: Yes, as technology improves we will incorporate the improvements into the RAMSAN and future products.