Wednesday, October 08 - 2008

Keeping systems alive - even during disasters

The last article in this disaster recovery series discussed replication as a crucial building block for keeping data safe 24x7, with no risk of data loss even if a disaster occurs.

  • Thursday, October 21 - 2004 at 10:24


Why Availability Matters
Replication is the obvious next step to ensuring higher service levels after a solid backup plan has been deployed. But if backup is the building block of a disaster recovery strategy and replication is the next step with real-time data protection, where does system availability fit in?

'If the service centre stops functioning, we cannot serve our customers - which means that no newspapers would be produced. In other words, if we had a shutdown, it would paralyze 30 percent of the Norwegian press.'
Gunstein Løken, Operations and Development Manager, Orkla Media Service Senter IT


Backup and replication technologies both deal with minimising data loss, but neither can keep systems alive during a disaster. For maximum system availability we need to look at other alternatives. Traditionally, if backup is the only technology used to protect systems, the only way to get back to business is to restore from disk or tape, which can take anywhere from a couple of hours to days or even weeks. An example of how crucial it is to keep systems alive comes from a few years back, when a large online brokerage company suffered 4 system outages within 2 months, resulting in a 22% dip in the stock price as customers lost faith in the company.

Minimise downtime


To minimise downtime as well as data loss, a combination of clustering and replication must be used. Clustering is simply the process of moving a failed application on a system that is experiencing a disaster to a working system, whether in the same data centre or in another location. This process can take anywhere from seconds to minutes.

What is a Cluster?
Before clustering emerged as a viable technology for keeping systems alive, users simply connected to systems, and if a system went down they could do nothing until it was fixed.

This also meant that if you were the administrator for an IT environment, you would be held responsible for getting the system back up and running as quickly as possible, and everyone would be hounding you until it was fixed.


Figure 2. In the above environment, if the server went down the entire IT environment would be crippled and unavailable.

Though the concept of clustering had been available for many years on mainframes, it wasn't until the 1990s that it became widespread on open systems such as Windows, Unix and Linux. With clustering, IT administrators could now secure access to systems and minimise downtime by having extra systems available in case of failure.


Figure 3. By having another system available to take over in case of system outages, downtime can be minimised. If one system fails, another takes over.

How does clustering work?
Clustering is by no means magic; it simply automates the process of rebuilding a system and starting an application on a standby system. Without clustering, rebuilding a system can take a very long time: first the operating system has to be installed, then the applications, then patches downloaded and applied, then the system configured, and so on.

All of this time counts as downtime, during which there is no access to the system. It does not matter whether the system is an external web-based transactional server or an internal e-mail server; any kind of system outage is likely to have widespread effects.
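As a rough illustration of the idea, the sketch below models a two-node cluster in Python. The `Node` class, the `fail_over` function and the node names are hypothetical and not taken from any clustering product; the point is only that the standby is already built, so recovery reduces to starting the application elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster member; not a real product API."""
    name: str
    healthy: bool = True
    running: set = field(default_factory=set)


def fail_over(app: str, active: Node, standby: Node) -> Node:
    """Return the node that should run `app`, moving it if the active node is down."""
    if active.healthy:
        active.running.add(app)
        return active
    if not standby.healthy:
        raise RuntimeError(f"no healthy node left to run {app}")
    # The standby is already installed, patched and configured, so the only
    # work left is to drop the application from the failed node and start it here.
    active.running.discard(app)
    standby.running.add(app)
    return standby


primary = Node("node-a")
secondary = Node("node-b")
first_owner = fail_over("mail", primary, secondary)   # stays on node-a

primary.healthy = False                               # simulate a crash
owner = fail_over("mail", primary, secondary)
print(owner.name)  # node-b
```

A real cluster also has to detect the failure and move storage and network addresses along with the application; those steps are left out of this sketch.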

Since most IT environments today are very complex, all the layers of a data centre have to be considered. Take a traditional 3-tier environment with a web front-end, an ERP application in the middle, and a database back-end: if any of these 3 systems were to crash, access to all of them would be hampered.

This illustrates the importance of protecting all layers of the data centre, because downtime anywhere in the IT environment is downtime to the end-user.
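One way to put a number on this is to multiply the availability of each tier, since the end-user only gets service when every tier is up. The short Python sketch below does that; the 99% figure per tier is purely illustrative, and the calculation assumes the tiers fail independently of each other.

```python
from math import prod


def end_to_end_availability(tiers: dict) -> float:
    """Availability of a chain in which every tier must be up.

    Assumes tier failures are independent of each other.
    """
    return prod(tiers.values())


# Illustrative figures only: each tier is assumed to be up 99% of the time.
tiers = {"web": 0.99, "erp": 0.99, "database": 0.99}
total = end_to_end_availability(tiers)

hours_per_year = 24 * 365
print(f"end-to-end availability: {total:.4f}")
print(f"expected downtime: {(1 - total) * hours_per_year:.0f} hours/year")
```

Under these assumptions the chain as a whole is less available than any single tier, which is why protecting only one layer is not enough.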


Figure 4. If one system fails, then all the other systems become inaccessible.

Implementing an availability strategy is like building a house of cards: one oversight can bring down the entire house in a matter of seconds, and users would soon be knocking down your door to get access back.

A well-known example of this was in 1999, when eBay suffered a site outage for 22 hours and had to pay back an estimated $5 million in auction fees. It is worth noting that eBay now uses the entire VERITAS Disaster Recovery solution suite, including backup, replication and clustering, to keep systems and data alive.

Another example of how important it is to consider every aspect of the data centre was when Orbitz (a large US travel web site) suffered two outages within 8 days in July of 2003. Orbitz ended up blaming the crash on an unstable database sitting in the back-end.

Increasing Availability with New Cluster Methods
Now that we understand the value of clustering, it is worth taking a look at some of the new methods of clustering that can lower cost and increase availability, so we get maximum benefit out of our data centre.

When clustering first became a viable technology for increasing availability, the traditional method was active/passive (also known as asymmetric) clustering: two systems connected together, one active and the other passive, ready to take over in case of system or application failure.

Businesses soon realised that having a standby server sitting idle in the data centre was a waste of resources, since servers are expensive and having them do nothing most of the time is not a good return on investment. Because of this, an alternative to active/passive clustering became prevalent, known as active/active clustering.

Active/active (also known as symmetric) clustering is similar to active/passive clustering, except that both systems are active and each is ready to take over for the other if it were to fail. This method of clustering has the benefit of lowering costs by utilising both systems at the same time.

The downside of this method is that if one system fails, the remaining system carries double the load, and application performance might suffer so badly that the applications might as well be unavailable. Another drawback is the complexity of ensuring that there are no application conflicts, so that both applications can run side by side without bringing each other down. For example, would you want to run SQL Server and Exchange on the same server?
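The load problem can be checked with simple arithmetic before choosing active/active. The Python sketch below is a minimal capacity check; the function names and utilisation figures are hypothetical, and load is expressed as a fraction of one server's capacity.

```python
def load_after_failover(load_a: float, load_b: float) -> float:
    """Load on the surviving node of an active/active pair.

    Loads are fractions of a single node's capacity (1.0 = fully used).
    """
    return load_a + load_b


def survives_failover(load_a: float, load_b: float, capacity: float = 1.0) -> bool:
    """True if one node can carry both workloads without exceeding its capacity."""
    return load_after_failover(load_a, load_b) <= capacity


# Hypothetical utilisation figures for two pairs of servers.
print(survives_failover(0.40, 0.45))  # True: the survivor stays within capacity
print(survives_failover(0.70, 0.60))  # False: the survivor would be overloaded
```

In other words, an active/active pair only keeps applications usable after a failure if both servers normally run well below half of their combined capacity.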


Figure 5. Active/passive clustering on the left, and active/active clustering on the right.

Due to the cost and availability issues of these two clustering methods, another method soon emerged that could help solve both problems. This method is called N+1 clustering, in which 3 or more systems (VERITAS supports up to 32 systems connected together in a cluster) are all connected to the same storage and ready to fail over to any of the other servers.


Figure 6. N+1 clustering takes the best of both active/passive and active/active clustering and provides maximum availability with low cost and no performance slowdown or complexity issues.

In the above example we have a 5-system cluster where 4 servers are active and one server is passive, ready to take over for any of the other servers should they fail. There are many benefits to this method:


• Low cost: In a traditional active/passive cluster, 8 servers would be needed to achieve the same level of availability. With N+1 clustering we can reduce the number of servers to 5 and still maintain the same level of availability.


• In terms of actual cost, with servers costing €5,000 each (excluding maintenance), €15,000 can be saved by buying 3 fewer servers.


• No performance degradation: In N+1 clustering there is always a dedicated server ready to take over should one fail. This means that any server will only be running one application at a time with no performance degradation.


• No complexity issues: Because there is a dedicated standby server there is less risk of failures due to incompatible software running on the same system.


• Time savings: Instead of managing 4 clusters of 2 nodes, the administrator manages one cluster of 5 nodes, which is easier and saves time.
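The server counts and savings in the list above follow from a simple rule: active/passive needs one standby per application server, while N+1 needs one standby in total. The Python sketch below reproduces the article's figures; the function names are illustrative.

```python
def servers_active_passive(apps: int) -> int:
    """Each application server gets its own dedicated standby."""
    return 2 * apps


def servers_n_plus_1(apps: int) -> int:
    """All application servers share a single standby."""
    return apps + 1


APPS = 4            # active servers in the example
PRICE_EUR = 5_000   # price per server from the example, excluding maintenance

saved_servers = servers_active_passive(APPS) - servers_n_plus_1(APPS)
saving_eur = saved_servers * PRICE_EUR
print(saved_servers, saving_eur)  # 3 15000
```

The saving grows with the size of the cluster, because each additional application server avoids one more dedicated standby.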

In N+1 clustering there is always a dedicated standby server. If one system fails the standby server takes over. Once the failed server has been repaired, it becomes the new standby server.
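The rotation of the standby role can be sketched in a few lines of Python. The class below is a hypothetical model, not the behaviour of any specific product: it tracks which server runs which application, moves an application to the standby on failure, and makes the repaired server the new standby.

```python
class NPlusOneCluster:
    """Hypothetical model of an N+1 cluster with a single shared standby."""

    def __init__(self, active: dict, standby: str):
        self.assignments = dict(active)  # server name -> application
        self.standby = standby           # name of the idle server, or None
        self.in_repair = []

    def fail(self, server: str) -> None:
        """Move the failed server's application to the standby."""
        if self.standby is None:
            raise RuntimeError("no standby available for a second failure")
        app = self.assignments.pop(server)
        self.assignments[self.standby] = app
        self.standby = None
        self.in_repair.append(server)

    def repair(self, server: str) -> None:
        """A repaired server does not take its application back; it becomes the standby."""
        self.in_repair.remove(server)
        if self.standby is None:
            self.standby = server


cluster = NPlusOneCluster(
    {"s1": "web", "s2": "erp", "s3": "database", "s4": "mail"}, standby="s5"
)
cluster.fail("s2")
print(cluster.assignments["s5"])  # erp
cluster.repair("s2")
print(cluster.standby)            # s2
```

Leaving the application where it is after the repair avoids a second interruption for a fail-back. The sketch also makes the limit of the scheme visible: until the repair is done, a second failure has nowhere to go.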

How Availability Relates to Disaster Recovery
When thinking about availability, it is important to remember that disasters can be big or small. A single server crashing and bringing down access to an entire data centre can certainly be considered a disaster and should be planned for. It is equally important to plan for more widespread disasters that affect an entire site, such as fires, floods, power outages and terrorist attacks.

When deciding on a sound availability strategy it is key that the solution can protect systems regardless of physical location. Anything from local availability to metropolitan disaster recovery to wide area disaster recovery needs to be covered.


Figure 7. A good availability solution should be able to protect systems locally, as well as over a metro or wide area to protect against disasters.

Here is a quick overview of each architecture:

• Local Clustering: A single cluster located in one building. If one system fails another takes over locally.


• Metropolitan Disaster Recovery:


• Using Remote Mirroring: A single cluster stretched out between two sites connected with fibre channel or SAN. Sites are usually less than 100 km apart. If one server fails, it can fail over locally or remotely to the second site.


• Using Replication: A single cluster stretched out between two sites connected over an IP network. Sites can be further apart than with mirroring, but usually not more than a couple of hundred kilometres. If one server fails, it can fail over locally or remotely to the second site.


• Wide Area Disaster Recovery: Two separate networks control each site. If one site fails, all traffic gets redirected to the second site. This architecture supports unlimited distances.
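To make the three architectures concrete, the Python sketch below picks a failover target from a list of candidates. It assumes a policy that is not stated in the article: prefer a healthy local system, then the metropolitan site, then the wide-area site. All names are hypothetical.

```python
from dataclasses import dataclass

# Assumed preference order: the closest healthy system wins.
SCOPE_ORDER = ("local", "metro", "wide-area")


@dataclass(frozen=True)
class Target:
    name: str
    scope: str      # one of SCOPE_ORDER
    healthy: bool


def pick_target(targets):
    """Return the closest healthy failover target, or None if none is left."""
    healthy = [t for t in targets if t.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda t: SCOPE_ORDER.index(t.scope))


candidates = [
    Target("dc1-standby", "local", healthy=False),   # lost with the rest of the site
    Target("dc2-node", "metro", healthy=True),
    Target("dr-site", "wide-area", healthy=True),
]
print(pick_target(candidates).name)  # dc2-node
```

A single server failure would be absorbed by the local standby, while a site-wide disaster removes all local candidates and pushes the application to the next site out.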

Summary
In today's networked economy, having systems available 24 hours a day is critical to the success of any organisation, which is why availability should focus not just on data, but also on servers and applications.

With only 5% of all organisations using availability solutions for mission-critical systems, many are at risk of losing transactions, revenue, brand image, or even worse (source: VERITAS Disaster Recovery Survey, September 2004).

A good question to ask yourself is: how much would 1 hour of downtime cost your business in both external and internal costs? The answer alone should provide justification for implementing an application availability solution.
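A back-of-the-envelope version of that calculation is sketched below in Python. The hourly cost figures and the two recovery times are hypothetical placeholders, to be replaced with your own numbers; the article only states that a restore can take hours to weeks and a cluster failover seconds to minutes.

```python
def downtime_cost(hours: float, external_per_hour: float, internal_per_hour: float) -> float:
    """Total cost of an outage: lost business plus idle staff, per hour of downtime."""
    return hours * (external_per_hour + internal_per_hour)


# Hypothetical hourly costs, in any currency.
EXTERNAL = 20_000   # lost transactions and revenue
INTERNAL = 5_000    # staff unable to work

restore_from_tape = downtime_cost(4, EXTERNAL, INTERNAL)        # assumed 4-hour restore
cluster_failover = downtime_cost(5 / 60, EXTERNAL, INTERNAL)    # assumed 5-minute failover

print(f"restore:  {restore_from_tape:,.0f}")
print(f"failover: {cluster_failover:,.0f}")
```

Comparing the two results against the price of the extra servers and software gives a first estimate of whether an availability solution pays for itself.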

'One time we came in Monday morning and everything was running as usual. We didn't even realize till later that the server was down and VERITAS Cluster Server had done a fail-over. It was seamless.'
Bill Augustadt, Chief Architect and Technologist, BlueStar Solutions





Symantec, Middle East
Thursday, October 21 - 2004 at 10:24 UAE local time (GMT+4)

Replication or redistribution in whole or in part is expressly prohibited without the prior written consent of AME Info FZ LLC / Emap Limited.

This Article was updated on Wednesday, January 31 - 2007


