Case Study: A network-upgrade horror story, part 1: laying the plans
Key lessons for the rest of us from UCSF's experience.
By Jeffrey Fritz, Network World | Network World US | Published: 01:00, 06 August 2007
It sounded like a no-brainer when we started the project in 2003. The idea was to replace the aging 155Mbit/s ATM-over-SONET network at the University of California, San Francisco (UCSF) with a new network based on 10G Ethernet over DWDM.
Nobody could have imagined the glitches, snafus and legal hold-ups that we ran into over the past four years. We finally issued the contract for the project in May, and, needless to say, we learned a bunch of valuable lessons along the way. Here's the whole saga, from the beginning:
During the nine years that the ATM-over-SONET system has been in place, the metropolitan network has grown to 55,000 nodes encompassing two San Francisco campuses, four hospitals and more than 200 remote sites, including regional clinics spread throughout California. The campus network also has evolved into an essential, mission-critical utility, right up there with water and electric power.
Reliability had become a worry, however. Of great concern was the ticking clock: network devices that were at - or rapidly heading toward - end-of-life. That means no vendor support for such essentials as software patches, technical support and replacement of failed hardware components. Cisco's support for the Catalyst 5500s and LS-1010s was waning.
In addition, the demands of video distribution, telemedicine and medical-imaging technologies were quickly making the network outdated. It lacked QoS and multicast capabilities, so email, Web surfing, video and medical images all got the same "best effort" treatment. Video packets were broadcast indiscriminately, causing bottlenecks and congestion. Applications that needed greater bandwidth or QoS, such as those used for remote clinician consultation, patient diagnosis and medical research, could not be carried efficiently - or at all - on the network.
Clear sailing in the design phase
In the summer of 2003, a design team of network technologists from campus IT, several campus departments and the medical centre began to think about a new network. We considered what technologies offered the best mix of price and performance and which offered the greatest capability for expansion and the lowest risk of downtime.
DWDM quickly became the front-runner. It can scale over time from eight lambdas (lightwave channels) all the way to 32 protected or 64 unprotected lambdas.
DWDM would provide a graceful evolution path for the network's ever-increasing demands for capacity and capability. Each lambda, running at up to 2.5Gbit/s, can carry a different service. For example, we could run the production Ethernet network over one lambda and a high-definition video feed over another. Or we could choose to provide a secure second Ethernet network for the medical centre to connect the university's hospital facilities. This would let secure, electronic, protected health information move across the medical centre's clinical network without coming in contact with student and faculty traffic on the campus network.
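The scaling argument is easy to see with the article's own figures (2.5Gbit/s per lambda; 8, 32 protected, or 64 unprotected lambdas). A quick back-of-the-envelope sketch - the function name is ours, not UCSF's:

```python
# Illustrative DWDM capacity arithmetic using the figures quoted in the
# article: up to 2.5 Gbit/s per lambda, scaling from 8 lambdas to
# 32 protected or 64 unprotected lambdas.

GBITS_PER_LAMBDA = 2.5

def aggregate_capacity(lambda_count: int) -> float:
    """Total raw capacity in Gbit/s for a given number of lambdas."""
    return lambda_count * GBITS_PER_LAMBDA

for count in (8, 32, 64):
    print(f"{count:2d} lambdas -> {aggregate_capacity(count):6.1f} Gbit/s")
# 8 lambdas  ->  20.0 Gbit/s
# 64 lambdas -> 160.0 Gbit/s
```

So the same ring could grow from 20Gbit/s to 160Gbit/s of raw capacity without replacing the fibre plant, which is the "graceful evolution" the design team was after.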
Then there is the matter of protected and unprotected lambdas. The bane of any optical-fibre-based network is the feared fibre cut. DWDM offers the option of protected lambdas, which run in one direction in the DWDM ring, while working lambdas run in the other direction.
Most DWDM gear has protection switching that senses loss of signal on a failed working lambda and switches to the protected lambda in less than 50 milliseconds. Few if any network applications would notice that short an outage.
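The switching logic itself is simple in concept: on loss of signal, move traffic to the counter-rotating protected path. A minimal sketch, with names and structure that are purely illustrative (real protection switching lives in the optical gear's firmware, not in code like this):

```python
# Conceptual sketch of DWDM protection switching: traffic normally rides
# the working lambda; a loss-of-signal event on it triggers failover to
# the protected lambda running the other way round the ring.

PROTECTION_SWITCH_BUDGET_MS = 50  # switchover target quoted in the article

class LambdaPair:
    """One working/protected lambda pair on the DWDM ring (illustrative)."""

    def __init__(self) -> None:
        self.active = "working"

    def on_loss_of_signal(self, failed_path: str) -> str:
        # Only a failure of the currently active working path forces a switch.
        if failed_path == "working" and self.active == "working":
            self.active = "protected"
        return self.active

pair = LambdaPair()
print(pair.on_loss_of_signal("working"))  # -> protected
```

The key operational point is the budget: because the whole detect-and-switch cycle completes inside 50ms, upper-layer protocols such as TCP typically ride through the event without a visible outage.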
To add even more resilience, we engineered in topology reliability. The new network was designed with diversely routed, dual-concentric rings at the main sites. Thus, a fibre cut or optical failure would have to take out both rings to cause a network failure. Even then, protected lambdas would take over.
Now we had the basis for the new network, which we christened UCSF's Next Generation Metropolitan Area Network (NGMAN).
NGMAN is made up of core and secondary sites. The core consists of the two main campuses and a central administrative building. San Francisco General Hospital, Mount Zion Medical Complex, Laurel Heights Conference Centre and the Veterans Administration Medical Centre are secondary sites.
Core sites are the locations with the heaviest traffic demands. They also are the sites with the most users. Therefore, they have the highest bandwidth (10Gbit/s) and the most resiliency. Most secondary sites connect to the core in a point-to-point fashion using unprotected lambdas running at 1Gbit/s or 10Gbit/s, depending on their traffic requirements.
The product of building reliability on top of reliability was a resilient, redundant and self-healing network that could survive such events as earthquakes and bioterrorism - an important consideration for a patient care network in a seismically active area. In fact, NGMAN's design let it achieve five-nines of reliability - no more than 5.26 minutes of downtime a year.
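The 5.26-minute figure falls straight out of the availability arithmetic - "five nines" means 99.999 per cent uptime, and the allowed downtime is simply the remaining fraction of a year:

```python
# Back-of-the-envelope availability math behind "five nines":
# allowed downtime per year = (1 - availability) x minutes in a year.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # using the average (leap-adjusted) year

def downtime_minutes_per_year(availability: float) -> float:
    """Maximum annual downtime, in minutes, for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

print(round(downtime_minutes_per_year(0.99999), 2))  # -> 5.26
```

The same formula shows why each extra nine is expensive: four nines allows about 53 minutes a year, five nines barely five.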
UCSF has a "build it and they will come" philosophy. We don't build things frivolously, but we do build them on faith. The university built an entirely new campus at Mission Bay hoping to attract top medical researchers from around the world. A number of educators and researchers in fact made their way to UCSF and wound up doing their research in the new state-of-the-art Mission Bay buildings, which were outfitted with high-performance networks.
There was an element of "build it and they will come" in the NGMAN project as well. The network was built to support future medical applications. It needed to be high-performance and support QoS and multicast. It had to support high-definition video distribution, IP telephony and real-time medical imaging. And it had to be scalable.
We chose a modular approach to minimise forklift upgrades. Modularity extended beyond the equipment itself: the concept was meant to let us add and remove secondary sites easily. If a site didn't need the full capabilities of DWDM, we could bring it online via alternative technologies, such as optical metropolitan Ethernet service or leased services.
Tomorrow, part 2: Bidding the project.