Wednesday, 21 November 2007

The VA's computer systems meltdown: What happened and why

At times, the bad news coming from the U.S. Department of Veterans Affairs seems unstoppable: D-grade medical facilities, ongoing security and privacy breaches, and a revolving door of departing leadership. In September, during a hearing by the House Committee on Veterans' Affairs, lawmakers learned about an unscheduled system failure that took down key applications in 17 VA medical facilities for a day.

Characterized by Dr. Ben Davoren, the director of clinical informatics for the San Francisco VA Medical Center, as "the most significant technological threat to patient safety the VA has ever had," the outage has moved some observers to call into question the VA's direction in consolidating its IT operations. Yet the shutdown grew from a simple change management procedure that wasn't properly followed.

The small, undocumented change ended up bringing down the primary patient applications at 17 VA medical centers in Northern California. As a result, the schedule to centralize IT operations across more than 150 medical facilities into four regional data processing centers has been pulled back while VA IT leaders establish what the right approach is for its regionalization efforts.

The Region 1 Field Operations breakdown of Aug. 31 exposed just how challenging effecting substantial change is in a complex organization the size of the VA Office of Information & Technology (OI&T). Begun in October 2005 and originally scheduled to be completed by October 2008, the "reforming" of the IT organization at the VA involved several substantial goals: the creation of major departments along functional areas such as enterprise development, quality and performance, and IT oversight and compliance; the reassignment of 6,000 technical professionals to a more centralized management; and the adoption of 36 management processes defined in the Information Technology Infrastructure Library (ITIL).

As part of the reform effort, the VA was to shift local control of IT infrastructure operations to regional data-processing centers. Historically, each of the 150 or so medical centers run by the VA had its own IT service, its own budget authority and its own staff, as well as independence with regard to how the IT infrastructure evolved. All of the decisions regarding IT were made between a local IT leadership official and the director of that particular medical center. While that made on-site IT staff responsive to local needs, it made standardization across sites nearly impossible in areas such as security, infrastructure administration and maintenance, and disaster recovery.

The operations of the VA's 150 medical facilities would relocate to four regional data processing centers, two in the east and two in the west. The latter, Regions 1 and 2, are located in Sacramento, Calif., and Denver respectively, and are run as part of the Enterprise Operations & Infrastructure (OPS) office.

A difficult day

On the morning of Aug. 31, the Friday before Labor Day weekend, the Region 1 data center was packed with people. According to Director Eric Raffin, members of the technical team were at the site with staffers from Hewlett-Packard Co. conducting a review of the center's HP AlphaServer system running on Virtual Memory System and testing its performance.

About the same time, staffers in medical centers around Northern California starting their workday quickly discovered that they couldn't log onto their patient systems, according to congressional testimony by Dr. Bryan D. Volpp, the associate chief of staff for clinical informatics at the VA's Northern California Healthcare System. Starting at about 7:30 a.m., the primary patient applications, Vista and CPRS, had suddenly become unavailable.

Vista, Veterans Health Information Systems and Technology Architecture, is the VA's system for maintaining electronic health records. CPRS, the Computerized Patient Record System, is a suite of clinical applications that provide an across-the-board view of each veteran's health record. It includes a real-time order-checking system, a notification system to alert clinicians of significant events and a clinical reminder system. Without access to Vista, doctors, nurses and others were unable to pull up patient records.

At the data center in Sacramento, with numerous technicians as witnesses, systems began degrading with no apparent cause.

Instantly, technicians present began to troubleshoot the problem. "There was a lot of attention on the signs and symptoms of the problem and very little attention on what is very often the first step you have in triaging an IT incident, which is, 'What was the last thing that got changed in this environment?'" Raffin said.

The affected medical facilities immediately implemented their local contingency plans, which consist of three levels. The first level of backup is a fail-over from the Sacramento data center to the Denver data center, handled at the regional level. The second level is access to read-only versions of the patient data. The final level is a set of files stored on local PCs at the sites, containing brief summaries of a subset of data for patients who are on-site or who have appointments in the next two days, according to Volpp.
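The three levels amount to an ordered fallback, which can be sketched in a few lines of Python. This is an illustration only, not the VA's software; the names `Level` and `choose_contingency` and the two availability flags are assumptions made for the example.

```python
from enum import Enum


class Level(Enum):
    # Hypothetical labels for the three backup levels described by Volpp.
    FAILOVER = 1          # regional fail-over to the second data center
    READ_ONLY = 2         # read-only copy of the patient data
    LOCAL_SUMMARIES = 3   # summary files stored on local PCs


def choose_contingency(failover_ok: bool, read_only_ok: bool) -> Level:
    """Return the first backup level that is usable, in the order described."""
    if failover_ok:
        return Level.FAILOVER
    if read_only_ok:
        return Level.READ_ONLY
    # The last resort is assumed to be always available on site.
    return Level.LOCAL_SUMMARIES
```

With no fail-over and a working read-only copy, `choose_contingency(False, True)` returns `Level.READ_ONLY`; if both are unavailable, a site is left with the local summaries.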

Volpp assumed that the data center in Sacramento would move into the first level of backup -- switching over to the Denver data center. It didn't happen.

According to Raffin, the platform has been structured to perform synchronous replication between the two data centers in Sacramento and Denver. "That data is written simultaneously in both facilities before the information system moves to the next stream or thread that it's processing," Raffin said. "At any instant in time, the same data lives in Sacramento that [lives] in Denver." The systems are built in an autonomous model, he said, so that if something strikes only one facility, the other data center won't be affected.
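The idea Raffin describes, that a write is not considered done until both sites hold it, can be shown with a toy model. The sketch below is not the VA's replication platform; the `Site` and `SynchronousPair` classes are hypothetical, and in-memory dictionaries stand in for the real databases.

```python
class Site:
    """A stand-in for one data center's database (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.records = {}
        self.available = True


class SynchronousPair:
    """Commits each write at both sites before reporting success."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, key, value):
        # Check both sites first so a failed write leaves neither one changed.
        for site in (self.primary, self.secondary):
            if not site.available:
                raise ConnectionError(f"{site.name} is unavailable")
        self.primary.records[key] = value
        self.secondary.records[key] = value
        return True


sacramento = Site("Sacramento")
denver = Site("Denver")
pair = SynchronousPair(sacramento, denver)
pair.write("order-1", "lab panel")
```

After the call, both sites hold the same record. The trade-off in this simplified model is that a write cannot complete while either site is unreachable.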

A failure to fail over

On Aug. 31, the Denver site wasn't touched by the outage at all. The 11 sites running in that region maintained their normal operations throughout the day. So why didn't Raffin's team make the decision to fail over to Denver?

On that morning, as the assembled group began to dig down into the problem, it also reviewed the option of failing over. The primary reason they chose not to, Raffin said, "was because we couldn't put our finger on the cause of the event. If we had been able to say, 'We've had six server nodes crash, and we will be running in an absolutely degraded state for the next two days,' we would have been able to very clearly understand the source of our problem and make an educated guess about failing over. What we faced ... was not that scenario."

What the team in Sacramento wanted to avoid was putting at risk the remaining 11 sites in the Denver environment, facilities that were still operating with no glitches. "The problem could have been software-related," Raffin said. In that case, the problem might have spread to the VA's Denver facilities as well. Since the Sacramento group couldn't pinpoint the problem, they made a decision not to fail over.

Greg Schulz, senior analyst at The Storage I/O Group, said the main vulnerability with mirroring is exactly what Region 1 feared. "If [I] corrupt my primary copy, then my mirror is corrupted. If I have a copy in St. Louis and a copy in Chicago and they're replicating in real time, they're both corrupted, they're both deleted." That's why a point-in-time copy is necessary, Schulz continued. "I have everything I need to get back to that known state." Without it, the data may not be transactionally consistent.
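Schulz's point can be made concrete with a small sketch: a real-time mirror faithfully copies a bad write, while a point-in-time copy keeps the last known good state. The `MirroredStore` class below is hypothetical and uses plain dictionaries, so it illustrates the principle rather than any particular storage product.

```python
import copy


class MirroredStore:
    """Toy primary/mirror pair with optional point-in-time snapshots."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.snapshots = []

    def write(self, key, value):
        # Real-time replication copies every write, good or bad.
        self.primary[key] = value
        self.mirror[key] = value

    def snapshot(self):
        # A point-in-time copy is frozen and unaffected by later writes.
        self.snapshots.append(copy.deepcopy(self.primary))

    def restore_latest_snapshot(self):
        if not self.snapshots:
            raise LookupError("no point-in-time copy available")
        self.primary = copy.deepcopy(self.snapshots[-1])
        self.mirror = copy.deepcopy(self.snapshots[-1])


store = MirroredStore()
store.write("patient-1", "allergy: penicillin")
store.snapshot()
store.write("patient-1", "###corrupted###")  # the bad write reaches both copies
corrupted_everywhere = store.primary == store.mirror
store.restore_latest_snapshot()
```

Before the restore, primary and mirror agree on the corrupted value; only the snapshot still holds the original record.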

At the affected medical facilities, once the on-site IT teams learned that a fail-over wasn't going to happen, they should have implemented backup stage No. 2: accessing read-only patient data. According to Raffin, that's what happened at 16 of the 17 facilities affected by the outage.

But the process failed at the 17th site because the regional data center staff had taken the read-only system offline earlier in the week in order to create new test accounts, a procedure done every four to six months. As a result, medical staff at that location had no choice but to rely on data printed out from hard disks on local PCs.

According to Volpp, these summaries are extracts of the record for patients with scheduled appointments containing recent labs, medication lists, problem lists and notes, along with allergies and a few other elements of the patient record. "The disruption severely interfered with our normal operation, particularly with inpatient and outpatient care and pharmacy," Volpp says.

The lack of electronic records prevented residents on their rounds from accessing patient charts to review the prior day's results or add orders. Nurses couldn't hand off from one shift to another the way they were accustomed to doing it -- through Vista. Discharges had to be written out by hand, so patients didn't receive the normal lists of instructions or medications, which were usually produced electronically.

Volpp said that within a couple of hours of the outage, "most users began to record their documentation on paper," including prescriptions, lab orders, consent forms, and vital signs and screenings. Cardiologists couldn't read EKGs, since those were usually reviewed online, nor could consultations be ordered, updated or responded to.

In Sacramento, the group finally got a handle on what had transpired to cause the outage. "One team asked for a change to be made by the other team, and the other team made the change," said Raffin. It involved a network port configuration. But only a small number of people knew about it.

More importantly, said Raffin, "the appropriate change request wasn't completed." At the heart of the problem was a procedural issue. "We didn't have the documentation we should have had," he said. If that documentation for the port change had existed, Raffin noted, "that would have led us to very quickly provide some event correlation: Look at the clock, look at when the system began to degrade, and then stop and realize what we really needed to do was back those changes out, and the system would have likely restored itself in short order."
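The event correlation Raffin describes boils down to comparing the time the trouble started against a log of recent changes. A minimal sketch follows, assuming a change log kept as a list of records. The function name `recent_changes`, the record fields, the change IDs and every timestamp except the roughly 7:30 a.m. onset are invented for illustration; the article does not say when the port change was made.

```python
from datetime import datetime, timedelta


def recent_changes(change_log, onset, window=timedelta(hours=24)):
    """Return changes applied within `window` before `onset`, newest first."""
    candidates = [
        c for c in change_log
        if onset - window <= c["applied_at"] <= onset
    ]
    return sorted(candidates, key=lambda c: c["applied_at"], reverse=True)


# Onset time is from the testimony; the log entries below are hypothetical.
onset = datetime(2007, 8, 31, 7, 30)
change_log = [
    {"id": "CR-001", "summary": "create test accounts",
     "applied_at": datetime(2007, 8, 28, 14, 0)},
    {"id": "CR-002", "summary": "network port configuration",
     "applied_at": datetime(2007, 8, 31, 7, 0)},
]
suspects = recent_changes(change_log, onset)
```

The first entry in `suspects` is the natural candidate to back out. The approach only works if every change is actually recorded, which is the step that was skipped.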

According to Evelyn Hubbert, an analyst at Forrester Research Inc., the outage that struck the VA isn't uncommon. "They don't make the front page news because it's embarrassing." Then, when something happens, she said, "it's a complete domino effect. Something goes down, something else goes down. That's unfortunately typical for many organizations."

Schulz concurred. "You can have all the best software, all the best hardware, the highest availability, you can have the best people," Schulz said. "However, if you don't follow best practices, you can render all of that useless."

When the Region 1 team realized what needed to happen, it made the decision to shut down the 17 Vista systems running from the Sacramento center and bring them back up one medical facility at a time, scheduled by location -- those nearing the end of their business day came first. Recovery started with medical sites in the Central time zone, then Pacific, Alaska and Hawaii. By 4 p.m., the systems in Northern California facilities were running again.
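The restart order follows from the clock: sites further east are closer to the end of their business day. A rough sketch of that ordering is below. The site names are placeholders, and the offsets are the usual August offsets from UTC for each zone, included as an assumption rather than anything taken from Region 1's procedures.

```python
# Hours relative to UTC in late August (daylight time; Hawaii does not observe it).
UTC_OFFSET_HOURS = {"Central": -5, "Pacific": -7, "Alaska": -8, "Hawaii": -10}


def recovery_order(sites):
    """Sort sites so those nearest the end of their business day come first."""
    return sorted(sites, key=lambda s: UTC_OFFSET_HOURS[s["zone"]], reverse=True)


# Placeholder site names, not actual VA facilities.
sites = [
    {"name": "Site A", "zone": "Hawaii"},
    {"name": "Site B", "zone": "Central"},
    {"name": "Site C", "zone": "Pacific"},
    {"name": "Site D", "zone": "Alaska"},
]
order = [s["zone"] for s in recovery_order(sites)]
```

Here `order` comes out as Central, Pacific, Alaska, Hawaii, matching the sequence described above.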

But, according to Volpp, although Vista was up, the work wasn't over. Laboratory and pharmacy staffers worked late that Friday night to update results and enter new orders and outpatient prescriptions into the database. Administrative staffers worked for two weeks to complete the checkouts for patients seen that day. "This work to recover the integrity of the medical record will continue for many months, since so much information was recorded on paper that day," he says.

A shortage of communication

During the course of the day, said Volpp, affected facilities didn't receive the level of communication they'd been accustomed to under the local jurisdiction model of IT operation. As he testified to Congress, "During prior outages, the local IT staff had always been very forthcoming with information on the progress of the failure and estimated length even in the face of minimal or no knowledge of the cause. To my knowledge, this was absent during the most recent outage."

Raffin denies this. "There were communications," he said. "There most certainly were." But, he acknowledged, they were not consistent or frequent enough, nor did they inform the medical centers sufficiently about implementing their local contingency plan. "It was," he said, "a difficult day."

Once the team realized what it needed to do to bring the systems back to life, Region 1 began providing time estimates to the medical facilities for the restoration of services.

The rift exposes a common problem in IT transformation efforts: Fault lines appear when management reporting shifts from local to regional. According to Forrester's Hubbert, a major risk in consolidating operations is that "even though we're thinking of virtualizing applications and servers, we still haven't done a lot of virtualization of teams." More mature organizations -- she cited high-tech and financial companies -- have learned what responsibilities to shift closer to the user.

Workforce reshaping

Raffin said that it was never the intent of the realignment to downgrade the level of service experienced by people in the medical facilities. "The message I send to my folks in my organization is, 'You may work ultimately for me within OI&T, but you absolutely work for the network or facility where you're stationed,'" he said. The goal, Raffin said, was to create "that bench strength we've never had."

As an example, Raffin points to a coding compliance tool, an application that exists at all 33 medical centers in his jurisdiction that all run the same version on the same system. "There was a sliver of a [full-time employee] at every medical center that was supporting this application," he said. "There was no structure [for] maintenance and upgrades, no coordination in how we handled problem management." When a problem surfaced, 33 trouble tickets would be logged, Raffin said.

As part of the reorganization, Region 1 has set up a systems team, which includes an applications group. Two people within that group are now coordinating the management of that particular application. "It's a team approach," he said.

Likewise, similar to the argument made by companies that move employees to a service provider during an outsourcing initiative, Raffin claimed that the reassignment of personnel to an organization dedicated to IT will ultimately result in greater opportunities for them and better succession planning for OI&T.

"The only competing interest I have with regards to training are other IT folks, who need other IT training," he said. "I'm not competing with nursing education or with folks who need safety education because they operate heavy machinery at a medical center."

Along the way, that training includes an education in change management process, one of the ITIL best practices being adopted by OI&T that was "new to our IT folks," said Raffin. "They may have read it, but I'm not sure they got it."

Dr. Paul Tibbits is deputy CIO for Enterprise Development -- one of the newly created functional areas within OI&T. Tibbits pointed out that under the previous management structure, "there would have been a lot of competition for mental energy on the part of a hospital director. Does he get his IT staff to read this stuff or not read this stuff?" Under a single chain of command, that education will most assuredly take place, he said.

Tibbits' organization is taking a different approach from Region 1 in how it develops staff skills in the four ITIL processes under his charge. The three-phased approach he described involves real-time coaching and mentoring for "short-term change"; classes, conferences and workshops for midterm change; and updates in recruiting practices for the long term.

"We're hiring outside contractors to stand at the elbows and shoulders of our IT managers through the development organization to watch what they do on a day-by-day basis," said Tibbits. That effort has just begun, he said, with contractors "just coming on board now."

On the other hand, Region 1 under Raffin's leadership has introduced a three-part governance process. The first part is a technical component advisory council, which meets weekly to discuss and prioritize projects. "That is where a lot of training has occurred," said Raffin. Second, a regional governance board also meets weekly to discuss issues related to IT infrastructure. In addition, Raffin is about to implement a monthly meeting of an executive partnership council that will include both IT people and "business" representatives from the medical facilities being served.

Will bringing people together for meetings be enough to transform the work habits of the 4,000 people who are now part of OPS -- what Tibbits classifies as a "workforce-reshaping challenge"? And will it prevent the kind of outage that happened on Aug. 31 from happening again somewhere else?

Tibbits sweeps aside a suggestion that the centralization of IT played a role in the outage. "Had the IT reorganization never happened, this error might have happened on Aug. 31 anyway because somebody didn't follow a procedure," he said.

Forrester's Hubbert sees the value in bringing together teams within IT to look at operations more holistically. "That's what change agents need to do -- to lay IT on its side instead of keeping it in silos ... to have that end-to-end picture," she said. Plus, that's an effective way to address shortfalls in process and bring staff along as part of the overall transformation effort, Hubbert adds. "Usually, if you take IT people into the boat and ask them what to fix, if you say, 'Hey, this is the whiteboard. Let's figure it out from there all the way back to the root cause,' they have a real willingness to cooperate," she said. From there, they can develop a process to prevent the same type of problem from surfacing again.

Region 1 fallout

When an event takes place that impairs the operations of 17 federally funded medical centers, investigations and reviews tend to follow.

In the case of Region 1, that includes an internal review of the regional data processing initiative by both the IT Oversight & Compliance and Information Protection and Risk Management organizations, which report to Gen. Bob Howard, assistant secretary for OI&T, as well as a review coordinated by an unnamed outside firm. Raffin said he expects those reviews to be concluded early in 2008. And although that review was actually scheduled as part of the OI&T's spending plan, he acknowledged that "it's happening a little earlier than we wanted it to."

Until those results are in, the OI&T has put a "soft hold" on migrating additional medical centers into the regional data center concept, said Raffin. "From Region 1's perspective, we were almost 90% complete and should have been 100% complete by Nov. 9. Our project schedule is going to be a little delayed," he said.

Also, Howard has directed the OI&T development organization to work with the infrastructure engineering organization to design a series of system topologies that would provide varying degrees of reliability, availability, maintainability and speed, "up to and including one option that would be 'zero downtime,'" Tibbits said. "I don't think there's any question in anyone's mind that 128 data centers is too many. One might be too few. But what exactly the optimal topology is, all of that is in play right now. Regionalization of some form is alive and well and will move forward."

Region 1 has experienced a dramatic improvement in compliance, Raffin said, "with folks documenting changes in advance of their occurrence." The next phase of that will be an automated system using tools from CA Inc., which are already in use in the VA's Austin Automation Center. He expects that to be implemented within 90 days.

Region 1 has also modified procedures related to the read-only version of records maintained by Vista, the Level 2 backup plan that wasn't fully available on Aug. 31. Now, Raffin said, those systems are more consistently checked for round-the-clock availability and "any system maintenance ... is properly recorded through our change management procedure."

According to Davoren, the medical director in San Francisco, "before regionalization of IT resources -- with actual systems that contained patient information in distributed systems -- it would have been impossible to have 17 medical centers [go] down." As he told a congressional committee in September, the August system outage was "the longest unplanned downtime that we've ever had at San Francisco since we've had electronic medical records." This was proof to Davoren and others at the individual medical centers that in creating a new structure "in the name of 'standardization,'" support would "wane to a lowest common denominator for all facilities," he said.

Raffin isn't ready to give up. He recognizes that an event like the one that happened on Aug. 31 "casts a long shadow" against what he sees as a number of accomplishments. But he also maintains confidence that Region 1 -- and all of OI&T -- has the ability to pull off its transformation. "For me, it's about making sure we're listening to all of our folks and have our ears to the pavement at the medical centers to make sure we understand what our business requirements are," he said.

Change is hard, especially when it's undertaken on such a massive scale. The difficulty was foreseen early on by VA CIO Howard. "This will not be an easy or quick transformation. There will be a few difficulties along the way, and it's natural for some people to be uncomfortable with change on such a scale. But the prospect of more standardization and interoperability we can harness through this centralization is exciting," Howard said in a webcast speech to the IT workforce of the VA shortly after his confirmation hearings by the Senate Committee on Veterans Affairs.

A question remains whether the VA OI&T is moving quickly enough to keep the confidence of its numerous constituencies -- patients, medical staff, VA executives and lawmakers. As U.S. Rep. Bob Filner (D-Calif.), chair of the House Committee on Veterans' Affairs, stated during that September hearing, "We are heartened by many of the steps the VA has undertaken, but remain concerned that more should be done, and could be done ... faster."

Dian Schaffhauser is a writer who covers technology and business for a number of print and online publications. Contact her at
