E-Business - always available
Monday, May 10, 2004 at 13:25
High availability—or the ability of a business and that business's customers to use an application or a service at the appropriate time with the level of functionality they expect—is no longer just a key requirement of today's business systems; it's increasingly becoming the key requirement.
"More and more organizations' business processes are extremely dependent on their applications and infrastructure, so if there are any outages in the environment, it's very costly from a business perspective," says Donna Scott, vice president and distinguished analyst at Gartner, Inc., in Stamford, Connecticut. "It takes a very well-designed and well-managed environment to mitigate the risks of downtime and prevent it from happening or, if you can't prevent it from happening, to enable as quick a recovery as possible."
In its simplest form, a high-availability plan encompasses three aspects: resiliency (making sure applications and systems are as reliable as possible), recoverability (ensuring that if a component does fail, there's a way to recover within a given time period), and continuous operation (ensuring that systems or applications are available even during maintenance activities).
Whereas disaster recovery is concerned with getting core business (and IT) systems up and running again after an unplanned outage, high availability involves proactively ensuring that there's no single point of failure for essential systems (be they applications, databases, networks, storage, or any other IT component). "High availability involves both planned and unplanned outages," says Charles Garry, senior program director of infrastructure services at Meta Group, Inc. "Planned outages occur for things like application upgrades, hardware upgrades, patches, and basic maintenance. Unplanned outages can include user error, operator error, and actual hardware failure."
In fact, although most people probably think about a server crashing, a disk controller dying, hardware or software glitches, or catastrophic disasters when the topic of high availability comes up, there's another side to the story. "Forty percent of downtime, on average, is caused by operations errors," says Scott. "Oracle, like the rest of the industry, focuses on trying to reduce manageability requirements for software products. This not only reduces the amount of labor needed to manage them but also reduces the chance for errors from touching a product."
But high availability isn't just about databases, application servers, networks, backups, or new grid technologies; it's also about users and their expectations—giving users what they want, when they expect it, at the level of performance they expect. Advances in a wide variety of technologies, including products such as Oracle Real Application Clusters (RAC) and Oracle Data Guard as well as new functionality such as the Flashback capabilities in Oracle Database 10g, not only make it easier to meet these user expectations but also significantly reduce the manageability and development requirements associated with highly available applications.
"We're not trying to build a system in which no piece ever fails; we're trying to build a system in which we can tolerate the failures of individual pieces and have the system as a whole keep going," says Juan Loaiza, Oracle vice president of Data and Systems Technologies. "We've integrated several technologies into our stack that basically allow organizations to assemble a lot of low-cost servers, storage, and network technologies to create a highly available system that is also highly scalable. It's unbreakable and inexpensive."
Oracle: Integrating High Availability
Traditionally, high-availability solutions were implemented by organizations whose mission-critical applications had to be up and running no matter what. These solutions often required specialized hardware and software, as well as expensive consulting services, for correct design and implementation. One result was that most organizations felt that high-availability solutions were beyond their financial abilities or resources and simply didn't implement them. Another was that many organizations ended up backing into high-availability solutions because of some catastrophic event that highlighted their vulnerability.
"High-availability spending is often driven by a crisis situation," says Gartner's Scott. "For example, a company's e-mail became corrupted, so its systems went down for three days and the CEO became involved in making changes. That's the sort of thing that often happens."
It doesn't have to be that way. Recent advances in everything from hardware to software to operations management are making it easier to increase availability without having to sell the farm or hire a team of consultants. For example, many hardware components are now coming with redundancy built in. And although organizations still need to design high availability into specific applications, more infrastructure components are coming with high-availability capabilities already integrated, making it easier and less expensive for organizations to create highly available solutions.
"Oracle has been very strong in its focus on designing for availability," says Scott. "For example, RAC provides a very highly available environment, because you can recover in less than a minute if there's a failure. That's a very, very strong value proposition from an availability perspective."
But implementing Oracle RAC isn't the only way to increase availability. In addition to the steps outlined in the sidebar section "Steps to Higher Availability," you can use a wide variety of Oracle products and capabilities (some of which you probably already own) to decrease potential downtime. Keep in mind that many of these products may have multiple effects on enterprise capabilities—for example, implementing Oracle RAC not only increases availability but also increases scalability.
Consider the following products and capabilities, which can all have a positive effect on availability when implemented:
• Oracle Real Application Clusters (RAC). Oracle RAC provides clustering capabilities, so that if one node in a cluster goes down, the users can transparently fail over to another node. For example, if a user is querying the database to retrieve thousands of rows and one node in the cluster fails, the system migrates the user and the query to another available node and continues the operation.
• Oracle Data Guard. Data Guard provides the ability to create, maintain, manage, and monitor standby databases as transactionally consistent copies of the production database. Oracle9i Release 2 provides additional capabilities such as Data Guard SQL Apply (logical standby database), which allows additional objects such as indexes and materialized views to be created in the standby database, enabling it to be used for reporting while protecting your data from possible disasters.
• Maximum Availability Architecture (MAA). Oracle's MAA provides a practical framework based on best practices for implementing high availability across key Oracle solutions, including Oracle RAC and Data Guard.
• Oracle Enterprise Manager (OEM). In addition to providing complete enterprise management capabilities for Oracle products, OEM helps reduce downtime by proactively monitoring systems and configurations; its automation capabilities greatly reduce the potential for downtime due to operator error.
• Backup and Recovery (Recovery Manager). Oracle provides a variety of backup and recovery capabilities, including Recovery Manager (RMAN) and the new Flashback capabilities in Oracle Database 10g, which help reduce operator errors associated with backing up and recovering files as well as significantly decreasing the time required to restore a database to a given point in time.
• Automatic Storage Management (ASM). New in Oracle Database 10g, ASM automatically manages the underlying storage systems, eliminating the time-consuming (and potentially error-prone) need for DBAs to manage files and drives individually. Like Oracle RAC's clustering capabilities, ASM enables organizations to build highly available clusters of inexpensive storage devices that are highly resilient.
• Grid capabilities. Oracle's core products can all be used in an enterprise grid environment, offering a new level of resource and data availability. In an enterprise grid, resources and data can be dynamically provisioned to meet computing demands, keeping systems functioning optimally.
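As a rough picture of the failover behavior described in the Oracle RAC item above, the following Python sketch tries an operation on one node and moves to the next when that node is unavailable. It is a conceptual illustration only: the names `NodeUnavailable`, `run_with_failover`, `node_a`, and `node_b` are hypothetical, and the loop stands in for work that Oracle RAC performs inside the database tier rather than in application code.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class NodeUnavailable(Exception):
    """Raised by a node callable when that node cannot serve the request."""


def run_with_failover(nodes: Sequence[Callable[[], T]]) -> T:
    """Run the operation on each node in turn and return the first result."""
    last_error: Optional[NodeUnavailable] = None
    for attempt in nodes:
        try:
            return attempt()
        except NodeUnavailable as exc:
            # Remember the failure and fall through to the next node.
            last_error = exc
    raise NodeUnavailable("all nodes failed") from last_error


# Hypothetical stand-ins for two cluster nodes.
def node_a() -> list:
    raise NodeUnavailable("node A is down")


def node_b() -> list:
    return ["row 1", "row 2"]


if __name__ == "__main__":
    # node_a fails, so the same query is answered by node_b.
    print(run_with_failover([node_a, node_b]))
```

The caller sees only the result, not which node produced it, which is the sense in which the failover is transparent.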
These features and products (as well as many others) contribute to increasing the availability of an application, but you should keep the overall picture in mind. "High availability is not a specific product," says Murali Vallath, an independent Oracle consultant with Summersky Enterprises, based in Charlotte, North Carolina, and the president of the Oracle Real Application Cluster Special Interest Group (SIG). "It comes from the entire enterprise architecture, combined with the features of the infrastructure applications that run on it." Luckily, with Oracle9i and Oracle Database 10g, Oracle has provided a solid set of infrastructure components that integrate high-availability capabilities into the heart of the enterprise IT system.
Resiliency and Recoverability in Practice
It is exactly these types of advances in software and infrastructure support for high availability that have enabled Steve Sorem, CTO and principal of Payment Technologies, Inc. (PayTec), in Mechanicsburg, Pennsylvania, to be confident that his company can process hundreds of thousands of payment transactions in real time for companies such as Home Depot, Chase Merchant Services, Liberty Mutual Insurance, First Data Corporation, and even Exxon Mobil Corporation's SpeedPass Network of gas station payment tags. "I think we've reached a level of availability that will be hard for us to exceed," says Sorem.
Broadly speaking, PayTec provides hosted transaction management services to financial processors and their large retail customers. In effect, it is combining a customer relationship management (CRM) product and a payment processing product that handles individual payment transactions and associates those purchases with multiple other accounts, tracking a consumer's accumulated points in a loyalty program or in one that provides discounts or rewards, for example.
A key attribute of the system, named Valexia, is that it has to operate in real time and be available 24/7.
To accomplish this, PayTec is using Oracle RAC to enable both scalability and availability. "We've been very satisfied with Oracle9i RAC, because it's really helped us from a high-availability perspective," says Don Smith, vice president of information services, Payment Technologies. "We had been using Oracle8 Parallel Server and noticed nice improvements in speed, performance, and administrative costs when we moved to Oracle9i RAC."
PayTec uses a three-tier architecture: a front adapter tier, where the point-of-sale and other transactions come in; a second application tier, comprising Pure Java J2EE applications running on JBoss servers; and the third, database tier, running Oracle9i RAC (9.2.0.4) on two Sun V880s. Although PayTec knew that RAC would theoretically fail over from one database node to another when there was a problem, it was gratified to learn that it actually worked when put to the test. "This stuff is real. You'll have one node lock up or have a hardware or network hit, and Oracle Database will automatically fail over to the second node," says Sorem. "We had several server-related failures early on, and Oracle RAC really saved us in those cases."
Not only has the company's Oracle RAC implementation saved the applications from going down but it's also made maintenance significantly easier. With RAC, PayTec can shift all processes over to the other node and then shut the first server down and do maintenance. "It's beautiful at the database tier to be able to cycle one of those database nodes—in a nonpeak period, of course—and shift all the transactional activity to the other node," says Smith.
"We leverage some of the Oracle Resource Manager components to lower the priorities on our back-office processes when we're doing that." After maintenance, PayTec cycles one node and brings it back up, and the application recognizes that the node has reappeared and starts migrating appropriate transactional activity to the right node.
In addition to using Oracle RAC to ensure availability, PayTec takes advantage of other Oracle features, namely Data Guard and Recovery Manager, to provide recoverability in the case of a major, unplanned outage at its main facility.
"We're a hosted platform and have some pretty stringent SLAs [service-level agreements] with some of our customers, with some as high as 99.99 percent uptime," says Smith. "So in the event of a disaster, to keep our uptime, we have implemented an Oracle Data Guard standby database in Maximum Performance mode." The standby database is located at a disaster recovery site 130 miles away, connected via a SONET ring with two 10-Mbps interconnects.
"We use Oracle9i time-based log switching, so every 10 minutes (at the most), we are automatically switching a log and pushing it up to the disaster recovery site. Thus, our worst-case scenario is a potential 10 minutes of transactional loss," adds Sorem. At peak periods, PayTec's Valexia is operating at more than 6.5 million I/O operations per hour, with a throughput of 500,000 transactions per hour.
During these times, the system creates a 500MB log every three minutes that has to be moved to a disaster recovery facility in a round-robin fashion over the two network links.
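The figures quoted above lend themselves to a quick back-of-the-envelope check. The following Python sketch is illustrative only: the function names are made up for this example, and it assumes 1MB = 1,000,000 bytes and that a "10-Mbps" link carries 10 million bits per second, neither of which the article states.

```python
def transactions_at_risk(tx_per_hour: float, window_minutes: float) -> float:
    """Transactions processed during a loss window at a steady hourly rate."""
    return tx_per_hour * window_minutes / 60.0


def required_mbps(log_megabytes: float, interval_minutes: float) -> float:
    """Sustained megabits per second needed to ship one log per interval."""
    megabits = log_megabytes * 8.0  # assumes decimal megabytes
    return megabits / (interval_minutes * 60.0)


if __name__ == "__main__":
    # 500,000 transactions per hour, 10-minute worst-case log switch.
    print(f"at risk: {transactions_at_risk(500_000, 10):,.0f} transactions")
    # One 500MB log every three minutes at peak.
    print(f"needed:  {required_mbps(500, 3):.1f} Mbps")
    # Two 10-Mbps interconnects used round-robin.
    print(f"nominal: {2 * 10} Mbps")
```

On these assumptions, a 10-minute window at the peak transaction rate corresponds to roughly 83,000 transactions, and the peak log volume works out to about 22 Mbps against 20 Mbps of nominal combined link capacity. The article does not say whether the peaks are brief, whether the logs are compressed, or whether the figures are rounded, so the comparison is best read as a prompt for capacity planning rather than a conclusion about PayTec's network.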
Although Valexia was originally built on Oracle8i, Smith is thrilled with some of the new features in Oracle9i that make it easier for PayTec to ensure availability. For example, Oracle networking's integration into Oracle9i meant that the company didn't have to write its own scripts to push the logs to the disaster recovery site. Instead, Data Guard in Oracle9i has a completely automated redo-data-transport and apply mechanism that takes care of keeping the disaster recovery site up to date.
"It's very clean and stable and works really well," Sorem says. "These automatic features greatly reduce the administrative costs, time, and resources required as well as dependency on third-party products." In fact, with just one primary DBA, PayTec is able to manage 16 Oracle instances, including its production environment, disaster recovery environment, and others.
24/7 Availability
Although ensuring that a payment or a transaction goes through correctly is a business-critical process for an organization such as Payment Technologies, Austrian Railways (ÖBB) is more concerned that its 10,000 kilometers of track, 16,000 switches, 240 tunnels, 6,000 bridges, and 6,700 road crossings are safe for railway passengers.
Historically, ÖBB had left responsibility for its rail networks to regional authorities, with the information stored in local databases or on paper records. As a result, accurately measuring the quality of the track and planning strategic maintenance and upgrades were difficult and expensive.
However, since 1996 ÖBB has radically redesigned and centralized the management of its track network to provide new competitive services and ensure the physical reliability of the track. Core to this initiative has been ÖBB's implementation of Oracle RAC running on a six-node HP AlphaServer Tru64-Cluster housed in two locations separated by 1.5 km and connected via a fiber connection.
"High availability has become more important for us since more and more business-critical processes are supported by IT solutions," says Friedrich Brimmer, CIO, Austrian Railways Fahrweg, the ÖBB's infrastructure department. "Our integrated database applications have more than a thousand users accessing more than 10TB of data to handle and manage maintenance information 24 hours a day."
For example, ÖBB has railroad cars that continuously circle the entire rail network, measuring track tolerances every 25 centimeters and transmitting the measured data via wireless LAN to the central database. Applications, including ones using a geographic information system (GIS) and the Oracle Spatial option, can then extract and analyze data to calculate variations in the track conditions by running on multiple CPUs in the clustered environment, resulting in detailed information on reliability deficiencies that can be sent to maintenance crews within 24 hours.
"In terms of deploying high-availability solutions, we've learned that it's important to plan carefully and check the resiliency and cooperation of all hardware and software components under exceptional conditions and severe stress tests," says Brimmer. "If you're going to build a cluster system, it's important to do so on a stable operating system and a stable database, such as Oracle."
Although collecting and analyzing railway data by using Oracle RAC is an "always on" application for ÖBB that's critical to the safety and security of its customers, it's also helping the bottom line. "We believe that by combining technologies such as Oracle RAC with process efficiencies, we can increase productivity by more than 70 percent over five years," Brimmer adds.
Reducing Human Error
"High availability for us usually starts with the twin data center concept, where we deliver an application running on two systems, located in different data centers that are interconnected by something like a fiber connection," says Erik Snel, manager of Oracle Run Services at Atos Origin, a global IT services company with 50,000 employees located in Hoofddorp, The Netherlands. "If one of the two fails, the other can take over."
That's why Atos Origin, which manages more than 1,800 Oracle databases in Holland alone, uses Oracle9i Data Guard to enable that failover between the locations. With that many customers, ensuring availability is important. "For example, with our warehouse management systems, there are a lot of people standing around and not doing any work and products that can't be delivered to their customers if we fail to deliver high availability. The impact is very high when we can't deliver the service," adds Nico Sponselee, an Atos Origin technical consultant.
"We find that moving beyond 99.9 percent availability is not easy, and Oracle definitely helps us try to reach the goal of adding another 9 after the decimal," says Snel, who's excited about the new capabilities of Oracle Database 10g. "We think that Oracle Database 10g is another step in that direction. It will help us make even higher availability possible at even less cost, specifically because we think the human error component of an operating system and an application becomes less. Human failure can be driven down by decreasing interaction by humans, and Oracle Database 10g helps in that respect."
When authorized people make mistakes, you need the tools to correct these errors. Oracle Database 10g provides a family of human-error correction technologies called Flashback, which revolutionizes data recovery. In the past, it might have taken minutes to damage a database but hours to recover it. With Flashback the time to correct errors equals the time it took to make the error.
Flashback provides fine-grained surgical analysis and repair for localized damage—such as when the wrong customer order is deleted. Flashback also allows for correction of more-widespread damage and lets you do it quickly to avoid long downtime—such as when all of this month's customer orders have been deleted. Flashback is unique to the Oracle Database and supports recovery at all levels, including the row, transaction, table, and tablespace levels and databasewide.
The Importance of Rapid Recovery
For most businesses, there's no way around the dramatic growth of database sizes over the past five years, and First American Real Estate Solutions (RES) is no exception. "We have approximately 600,000 users who rely on our 2TB real estate database to make decisions daily," says Ben Graboske, chief technical officer at First American RES, in Anaheim Hills, California. "We operate 24/7 to meet the needs of our clients who often access our services outside of normal business hours. From that perspective, high availability is really important to us."
The problem is, restoring such huge databases after a hardware failure or a system crash takes a bit longer than two or three minutes. That's why First American RES implemented Oracle Data Guard on its key billing and customer tracking databases—to ensure rapid recovery and continued availability in the event of a problem. "It takes a long time to restore a large database from a backup database," says Daniel Liu, senior technical consultant at First American RES. "With so many users and so many applications, there is minimal to no downtime. I think the greatest value of Data Guard is that it enables us to efficiently handle failover on large databases."
More than 600,000 professionals depend on First American RES to provide them with the information they need; although not all of those individuals access the system at the same time, First American typically has thousands of concurrent users on the system at any one time. And although First American works hard to collect, consolidate, and manage data from approximately 3,000 U.S. counties, it wouldn't be a Fortune 500 company if it couldn't accurately track and bill each of its hundreds of thousands of customers.
"We have multiple billing databases that are key to our business," says Liu. "Because we charge users by how many searches they do and how much time they spend searching our database, we need to ensure availability. So for each of the billing databases, we've used Oracle Data Guard to set up standby databases for protection."
According to Liu, First American RES uses two standby databases for each billing database, and it uses the Maximum Performance setting within Data Guard to push the transactions as quickly as possible to the remote site so that during a failover, all the data will already be on the standby databases. "The process actually goes very quickly. We can have all the users, more than 2,000 of them concurrently, fail over to our standby site in less than a few minutes without losing any transactions at all."
Key to deploying any type of high-availability solution is testing. "Before we implemented our high-availability solution, we built a testing and staging environment in which we could mimic failures and cause the database or server to crash and fail over to the standby database," says Liu. Apparently the testing and extra work have paid off. Since the company deployed the system, it has had two hardware-related failures, and both times the Data Guard failover process worked successfully and within minutes—thousands of users were shifted to the new primary server without a problem.
Using Best Practices to Increase Availability
With a concept as broad as high availability, it can be hard for organizations to figure out where to start. And when you're designing the high-availability infrastructure for Oracle's Outsourcing business, it's even tougher. Oracle Outsourcing had to take into account the wide variety of availability and disaster recovery solutions that are uniquely designed for individual customer scenarios and find a way to deliver them as standard services with well-defined SLAs and customer expectations.
That's where the Oracle Maximum Availability Architecture (MAA) came in. MAA is the overall guideline or blueprint that customers can use to implement high availability with Oracle solutions. It's a thorough document that not only gives generalized architecture advice but also gets down to specific configuration details and settings based on best practices.
"We've built our recovery strategy around the MAA architecture, so we can tell customers exactly what to expect when something goes wrong," says Ken Piro, vice president, Oracle Outsourcing. "Even if we lose our Austin data center, we can tell them they can expect to be back in business in this amount of time, with this amount of data loss. We couldn't do that without an architecture such as MAA in place—it gives us amazing flexibility. For example, we can even have failover architectures based on Data Guard where the primary site uses EMC storage and the secondary site uses Network Appliance storage."
Although it's comprehensive, MAA is also very flexible, so you can implement it incrementally. With MAA, you don't have to do it all at once but can decide what's most important for your specific scenario. "For example, if you have a primary site, you may decide that disaster recovery is important, so you implement Data Guard," says Tammy Bednar, Oracle senior product manager, High Availability. "Then you might decide that the next step is to protect your primary database from a host failure, so you implement Oracle RAC. You just continue to add your levels of data protection into your high-availability architecture."
The Grid Moves High Availability to a New Level
Until recently, high availability has been measured in figures under 100 percent. It simply hasn't been possible to consider a system that never goes down and isn't susceptible to some failure, no matter how remote, so a highly available system might be guaranteed to be available 99.99 percent of the time. New grid technologies, however, may just offer the promise of even more uptime.
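To put availability percentages such as 99.9 and 99.99 in concrete terms, the following Python sketch converts a percentage into a yearly downtime budget. The function name is hypothetical, and the calculation assumes a 365-day year with all downtime counted against the target, which individual SLAs may define differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a system may be down and still meet the target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% allows about {minutes:.1f} minutes per year")
```

Under these assumptions, 99.9 percent leaves a little under nine hours of downtime per year, while 99.99 percent leaves less than an hour, which is why each additional 9 is so much harder to reach.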
"Grid computing takes the issue of high availability to a different level," says Piro. "Grid is making people realize that performance, availability, and capacity are all wrapped together and have to be there 100 percent of the time. Customers expect them to work together. Grid takes all these requirements and puts them together in one uniform concept."
With Oracle9i, Oracle began offering grid technologies—features such as Transportable Tablespaces and Oracle Streams—that dynamically provision pooled resources and data to the users and programs that need them. With Oracle Database 10g, all Oracle core technologies are now enabled for use in an enterprise grid environment, providing even higher availability.
Oracle Database 10g enables a new model for high availability by combining high-volume, inexpensive processors and inexpensive storage to produce a high-quality system. New capabilities such as disk-based recovery, Flashback, and rolling upgrades now make it possible for Oracle users to build highly available IT infrastructures with low-cost, high-volume components. Says Oracle's Loaiza, "There's a new economics that organizations should be aware of: trading inexpensive disk space for expensive downtime. With Oracle Database 10g, we're enabling organizations to spend less money on hardware but still achieve resilient, high-availability deployments."
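The idea of building a dependable system from inexpensive parts can be illustrated with the standard formula for redundant components. The Python sketch below is a simplification, not a description of how Oracle computes anything: the function name and the 99 percent per-server figure are hypothetical, and the formula assumes failures are independent, which real clusters only approximate.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of redundant components when any one of them can serve."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The system is down only when every replica is down at the same time.
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical figure: each low-cost server is up 99 percent of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

With the assumed 99 percent figure, two independent servers together reach about 99.99 percent and three reach about 99.9999 percent, which shows how redundancy can stand in for more expensive individual components.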
Oracle Database 10g also helps organizations increase availability by providing ways to compensate for human error (such as a DBA deleting or restoring the wrong files). "When you're talking about availability and recovery from every angle, it's not just system errors that are important; it's data availability as well," says Summersky's Vallath. "In Oracle Database 10g you have the Flashback option, dynamic redefinition and reconfiguration options, enhanced backup and recovery options, and more. All these great features give you an extra advantage for achieving maximum availability."
Across the board, Oracle Database 10g's support for high availability gives organizations a new opportunity to reevaluate their high-availability strategy and move toward one that uses low-cost, standard hardware and storage. By integrating availability into the fabric of the computing infrastructure the way grid computing does, organizations will have even more flexibility to use their IT resources in ways that make the most business sense.
Oracle Middle East
