Case Study: Backup management muddle cleared up
Black art changed to reliable, automated routine.
Managing and planning backups is straightforward in a single department and difficult in a business office building. Doing it for a set of businesses in many towns is heading towards a nightmare.
Looking after backup management for sets of businesses in many towns and across many countries takes you ever closer to the doors of hell.
In any large and dispersed network, backup infrastructure can be complex. Backups can fail or can be delayed. These days sloppy backup management isn't tolerated.
Backups should be run when scheduled and take the time they are supposed to. The backup management task is to ensure that this simple objective happens.
While the objective is simple the task is anything but. The inter-dependencies between different technologies make it very difficult to find the causes behind failures and performance bottlenecks.
The real problems with data backups tend to lie with the distributed elements in the backup infrastructure (e.g., systems being backed up, the network, etc.) rather than with the centralised backup server and its software.
Finding these problems is often a black art since the backup administrator generally has little or no control over the servers being backed up.
This problem affected Orange Business Services (OBS). It is one of the world’s largest international telecommunications companies and operates the world’s most extensive data network in support of its various data services. It provides services for over 800 businesses in more than 140 countries.
For these customers data backup is a mission-critical component of their business. The poor backup administrator can encounter many problems:-
• Unsuccessful backups can expose Orange Business Services’ customers to lost data and/or services.
• Service level guarantees are not met, which often means financial penalties.
• Performance bottlenecks can cause backups not to complete within their prescribed time windows. This results in an additional strain on the production infrastructure during the normal business day.
• Backups not completing within the time window often cause the backup data to be inaccurate due to scheduling chains or interdependencies.
• Backups running out of window time are often very difficult to detect. These problems can go undetected or may only be noticed when the late running backups cause a larger production problem.
• Backup jobs may, in some cases, never run. Operations are unaware of the problem because they received no failure or out-of-window notifications.
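The failure modes listed above (failed, out-of-window and never-run backups) can be told apart with a simple check against the schedule. The following is a minimal sketch in Python, not anything taken from OBS or WysDM: the BackupJob record, its field names and the status labels are all hypothetical, and it assumes the scheduler can report each job's window plus its actual start and finish times.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class BackupJob:
    """Hypothetical record of one scheduled backup (field names are assumptions)."""
    client: str
    window_start: datetime
    window_end: datetime
    started: Optional[datetime] = None
    finished: Optional[datetime] = None
    succeeded: bool = False


def classify(job: BackupJob, now: datetime) -> str:
    """Return a status label for a job as seen at time `now`."""
    if job.started is None:
        # No start recorded: a silent miss once the window has closed.
        return "never_ran" if now > job.window_end else "pending"
    if job.finished is None:
        # Still running: flag it as soon as it spills past the window.
        return "overrunning" if now > job.window_end else "running"
    if not job.succeeded:
        return "failed"
    if job.finished > job.window_end:
        # Completed, but late enough to load production systems.
        return "out_of_window"
    return "ok"
```

Run against every scheduled job after each window closes, a check of this kind would surface the "never ran" case that otherwise produces no notification at all.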
Doug Bovie, OBS’ IT project manager in its Systems Integration and Engineering Group, wanted to gather information on the performance of the individual elements in the backup path, bring that information together and then automatically ‘correlate’ the data looking for failures, performance bottlenecks and potential failures.
Great idea. How could he do it? He could develop his own software but there wasn't time enough in the day or budget enough inside Orange. In fact, he found a software product that did what he wanted.
WysDM for Backups software was designed to gather and correlate information from applications, servers, networks and storage in order to diagnose and solve the sorts of problems described above. It has a cross-domain performance correlation ability which OBS uses to isolate the root cause of failures and performance bottlenecks.
Bovie said: “WysDM for Backups allows us to correlate performance and configuration information across backup servers and software, tape libraries and drives, networks, client devices and storage arrays. This enables us to consolidate this data across technology domains and uncover the true cause behind infrastructure failures and performance problems.”
It also doesn't generally need agent software on remote servers and so is easy for a central IT organisation to set up. The list of supported servers and backup applications is long. Modules are preconfigured to access information from a variety of applications and devices such as:
• Legato Networker software
• Veritas NetBackup software
• Quantum tape drives and libraries
• EMC Symmetrix storage arrays
• Fibre Channel and IP switches
• Solaris Servers and backup clients
• NT-based servers and backup clients
• Linux based servers and backup clients.
This data is then consolidated into WysDM’s data mine for long term storage, management and analysis.
A Predictive Analysis Engine (PAE) continually scans the data looking for performance issues and patterns, alerting operation staff when they are found. Bovie said: “The analysis engine takes the worry out of backups. It lets my staff know about potential issues before they affect the backups.”
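How the PAE works internally is not described here, but the general idea of warning before a backup is affected can be illustrated with a simple trend check. The sketch below is an assumption-laden example in Python rather than WysDM's method: it fits a straight line to a client's recent nightly backup durations and estimates how many nights remain before the fitted line reaches the backup window. The function name and inputs are hypothetical.

```python
from typing import Optional, Sequence


def nights_until_overrun(durations_min: Sequence[float],
                         window_min: float) -> Optional[float]:
    """Estimate nights left before a growing backup exceeds its window.

    durations_min: recent nightly durations in minutes, oldest first.
    window_min: length of the backup window in minutes.
    Returns None if there is too little data or no upward trend.
    """
    n = len(durations_min)
    if n < 2:
        return None
    # Ordinary least-squares fit of duration against night index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(durations_min) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(durations_min))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking: no overrun predicted
    intercept = mean_y - slope * mean_x
    # Night index at which the fitted line reaches the window length.
    crossing = (window_min - intercept) / slope
    # Nights remaining after the most recent observation (never negative).
    return max(0.0, crossing - (n - 1))
```

For example, durations of 100, 110, 120 and 130 minutes against a 160-minute window give an estimate of three nights. A straight-line fit is the crudest possible predictor; it is used here only to show the shape of the idea.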
The WysDM reporter generates reports for operators, engineers and management, pointing out short or long-term potential issues from either a performance or an infrastructure perspective. Reports can be generated out of the box to show the:-
• Top 10 slowest clients
• Media utilisation by pool
• Charge back reports
• Most unreliable servers (most backup failures)
• Performance correlation matrix
• Most common errors
• Backup SLA reports
• Last good backup by server or group.
Bovie's staff can also create customised reports.
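To give a feel for what lies behind reports such as "Top 10 slowest clients" and "Most unreliable servers", here is a minimal sketch in Python. It is not WysDM's reporting code: the record layout (a dictionary with client, bytes, seconds and ok keys) is a hypothetical stand-in for whatever the data mine actually stores.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical job record: {"client": str, "bytes": int, "seconds": float, "ok": bool}
Record = Dict[str, object]


def slowest_clients(records: List[Record], n: int = 10) -> List[Tuple[str, float]]:
    """Rank clients by average throughput (bytes/second), slowest first."""
    totals = defaultdict(lambda: [0.0, 0.0])  # client -> [bytes, seconds]
    for r in records:
        # Only successful jobs with a measurable duration count towards speed.
        if r["ok"] and r["seconds"] > 0:
            totals[r["client"]][0] += r["bytes"]
            totals[r["client"]][1] += r["seconds"]
    rates = {client: b / s for client, (b, s) in totals.items()}
    return sorted(rates.items(), key=lambda kv: kv[1])[:n]


def most_unreliable(records: List[Record], n: int = 10) -> List[Tuple[str, int]]:
    """Rank clients by number of failed backup jobs, worst first."""
    failures = Counter(r["client"] for r in records if not r["ok"])
    return failures.most_common(n)
```

The point of consolidating job data in one place is that each report becomes a short query of this kind rather than a manual trawl through logs on many servers.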
What they can actually do is to remove their management of one of the world's most complicated backup environments from the doors of hell and locate it in a dependable, affordable and efficient place instead. They also save money which would previously have been spent buying hardware to deal with problems.
Bovie said: “By showing me where my true problems are, WysDM allows me to stop ‘throwing hardware at the problem.’ I can now optimise and increase the utilisation of my current infrastructure.” Doing more without spending money is satisfying to any business manager.
Bovie is so pleased he said: “(Having) WysDM for Backups is like a team of 24x7 storage experts watching my backups.”