Wednesday, November 21, 2007

The VA's computer systems meltdown: What happened and why

At times, the bad news coming from the U.S. Department of Veterans Affairs seems unstoppable: D-grade medical facilities, ongoing security and privacy breaches, and a revolving door of departing leadership. In September, during a hearing by the House Committee on Veterans' Affairs, lawmakers learned about an unscheduled system failure that took down key applications in 17 VA medical facilities for a day.

Characterized by Dr. Ben Davoren, the director of clinical informatics for the San Francisco VA Medical Center, as "the most significant technological threat to patient safety the VA has ever had," the outage has moved some observers to call into question the VA's direction in consolidating its IT operations. Yet the shutdown grew from a simple change management procedure that wasn't properly followed.

The small, undocumented change ended up bringing down the primary patient applications at 17 VA medical centers in Northern California. As a result, the schedule to centralize IT operations across more than 150 medical facilities into four regional data processing centers has been pulled back while VA IT leaders establish what the right approach is for its regionalization efforts.

The Region 1 Field Operations breakdown of Aug. 31 exposed just how challenging effecting substantial change is in a complex organization the size of the VA Office of Information & Technology (OI&T). Begun in October 2005 and originally scheduled to be completed by October 2008, the "reforming" of the IT organization at the VA involved several substantial goals: the creation of major departments along functional areas such as enterprise development, quality and performance, and IT oversight and compliance; the reassignment of 6,000 technical professionals to a more centralized management; and the adoption of 36 management processes defined in the Information Technology Infrastructure Library (ITIL).

As part of the reform effort, the VA was to shift local control of IT infrastructure operations to regional data-processing centers. Historically, each of the 150 or so medical centers run by the VA had its own IT service, its own budget authority and its own staff, as well as independence with regard to how the IT infrastructure evolved. All of the decisions regarding IT were made between a local IT leadership official and the director of that particular medical center. While that made on-site IT staff responsive to local needs, it made standardization across sites nearly impossible in areas such as security, infrastructure administration and maintenance, and disaster recovery.

The operations of its 150 medical facilities would relocate to four regional data processing centers, two in the east and two in the west. The latter, Regions 1 and 2, are located in Sacramento, Calif., and Denver respectively, run as part of the Enterprise Operations & Infrastructure (OPS) office.

A difficult day

On the morning of Aug. 31, the Friday before Labor Day weekend, the Region 1 data center was packed with people. According to Director Eric Raffin, members of the technical team were at the site with staffers from Hewlett-Packard Co., conducting a review of the center's HP AlphaServer system running the OpenVMS (Virtual Memory System) operating system and testing its performance.

About the same time, staffers in medical centers around Northern California starting their workday quickly discovered that they couldn't log onto their patient systems, according to congressional testimony by Dr. Bryan D. Volpp, the associate chief of staff and clinical informatics at the VA's Northern California Healthcare System. Starting at about 7:30 a.m., the primary patient applications, Vista and CPRS, had suddenly become unavailable.

Vista, Veterans Health Information Systems and Technology Architecture, is the VA's system for maintaining electronic health records. CPRS, the Computerized Patient Record System, is a suite of clinical applications that provide an across-the-board view of each veteran's health record. It includes a real-time order-checking system, a notification system to alert clinicians of significant events and a clinical reminder system. Without access to Vista, doctors, nurses and others were unable to pull up patient records.

At the data center in Sacramento, with numerous technicians as witnesses, systems began degrading with no apparent cause.

Instantly, technicians present began to troubleshoot the problem. "There was a lot of attention on the signs and symptoms of the problem and very little attention on what is very often the first step you have in triaging an IT incident, which is, 'What was the last thing that got changed in this environment?'" Raffin said.

The affected medical facilities immediately implemented their local contingency plans, which consist of three levels. The first level of backup is a fail-over from the Sacramento data center to the Denver data center, handled at the regional level. The second level is accessing read-only versions of the patient data. The final level is tapping a set of files stored on local PCs at the sites, containing brief summaries of a subset of data for patients who are on-site or who have appointments in the next two days, according to Volpp.
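The logic of that tiered plan -- try each backup level in order and fall through to the next only when one is unavailable -- can be sketched in a few lines. This is purely illustrative; the function names and failure messages are hypothetical, not the VA's actual procedures.

```python
# Illustrative sketch of a three-level contingency plan: each backup level
# is tried in order, falling through to the next when one is unavailable.

def fail_over_to_denver():
    # Level 1: regional fail-over (declined on Aug. 31; cause was unknown).
    raise RuntimeError("fail-over declined: root cause not identified")

def read_only_replica():
    # Level 2: read-only patient data (offline at one site that day).
    raise RuntimeError("read-only copy offline for test-account refresh")

def local_patient_summaries():
    # Level 3: brief summaries stored on local PCs for the next two days.
    return "48-hour patient summaries from local PC files"

def activate_contingency():
    levels = [fail_over_to_denver, read_only_replica, local_patient_summaries]
    for level in levels:
        try:
            return level()
        except RuntimeError:
            continue  # fall through to the next backup level
    return "paper records only"

print(activate_contingency())
```

Run as written, levels 1 and 2 fail and the function falls through to the local summaries -- the situation the 17th facility found itself in.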

Volpp assumed that the data center in Sacramento would move into the first level of backup -- switching over to the Denver data center. It didn't happen.

According to Raffin, the platform has been structured to perform synchronous replication between the two data centers in Sacramento and Denver. "That data is written simultaneously in both facilities before the information system moves to the next stream or thread that it's processing," Raffin said. "At any instant in time, the same data lives in Sacramento that [lives] in Denver." The systems are built in an autonomous model, he said, so that if something strikes only one facility, the other data center won't be affected.
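The synchronous replication Raffin describes -- a write is acknowledged only after both sites have stored it, so the two copies are identical at any instant -- can be sketched as follows. The classes and method names here are hypothetical, not the VA platform's actual software.

```python
# Minimal sketch of synchronous replication: a write completes only after
# EVERY replica site has acknowledged it, so the copies never diverge.

class Site:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value):
        self.store[key] = value
        return True  # acknowledgment

def synchronous_write(sites, key, value):
    # Block until every replica acknowledges before returning to the caller;
    # only then does the system move on to the next stream or thread.
    acks = [site.write(key, value) for site in sites]
    if not all(acks):
        raise IOError("replication failed; write not committed")
    return True

sacramento, denver = Site("Sacramento"), Site("Denver")
synchronous_write([sacramento, denver], "patient:123", "record v2")
assert sacramento.store == denver.store  # identical at any instant
```

The same property that guarantees consistency is the one that worried the team: any change, good or bad, lands on both sites at once.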

A failure to fail over

On Aug. 31, the Denver site wasn't touched by the outage at all. The 11 sites running in that region maintained their normal operations throughout the day. So why didn't Raffin's team make the decision to fail over to Denver?

On that morning, as the assembled group began to dig down into the problem, it also reviewed the option of failing over. The primary reason they chose not to, Raffin said, "was because we couldn't put our finger on the cause of the event. If we had been able to say, 'We've had six server nodes crash, and we will be running in an absolutely degraded state for the next two days,' we would have been able to very clearly understand the source of our problem and make an educated guess about failing over. What we faced ... was not that scenario."

What the team in Sacramento wanted to avoid was putting at risk the remaining 11 sites in the Denver environment, facilities that were still operating with no glitches. "The problem could have been software-related," Raffin says. In that case, the problem may have spread to the VA's Denver facilities as well. Since the Sacramento group couldn't pinpoint the problem, they made a decision not to fail over.

Greg Schulz, senior analyst at The Storage I/O Group, said the main vulnerability with mirroring is exactly what Region 1 feared. "If [I] corrupt my primary copy, then my mirror is corrupted. If I have a copy in St. Louis and a copy in Chicago and they're replicating in real time, they're both corrupted, they're both deleted." That's why a point-in-time copy is necessary, Schulz continued. "I have everything I need to get back to that known state." Without it, the data may not be transactionally consistent.
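Schulz's point -- a real-time mirror faithfully replicates corruption, while only a point-in-time copy preserves a known-good state -- can be demonstrated in a toy example. This is a conceptual sketch, not any vendor's snapshot API.

```python
# Why a point-in-time copy matters: a synchronous mirror replicates
# corruption instantly, while a snapshot preserves a known-good state.
import copy

primary = {"patient:123": "record v1"}
mirror = dict(primary)              # real-time mirror of the primary
snapshot = copy.deepcopy(primary)   # point-in-time copy taken earlier

# A bad change corrupts the primary -- and is mirrored immediately.
primary["patient:123"] = "CORRUPTED"
mirror["patient:123"] = "CORRUPTED"

# The mirror is no help for recovery; only the snapshot restores
# the known, transactionally consistent state.
restored = copy.deepcopy(snapshot)
print(restored["patient:123"])  # -> record v1
```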

At the affected medical facilities, once the on-site IT teams learned that a fail-over wasn't going to happen, the next step was backup stage No. 2: accessing read-only patient data. According to Raffin, that's what happened at 16 of the 17 facilities affected by the outage.

But the process failed at the 17th site because the regional data center staff had made the read-only data unavailable earlier in the week in order to create new test accounts, a procedure done every four to six months. From there, medical staff at that location had no choice but to rely on printouts of the summary files stored on local PCs.

According to Volpp, these summaries are extracts of the record for patients with scheduled appointments containing recent labs, medication lists, problem lists and notes, along with allergies and a few other elements of the patient record. "The disruption severely interfered with our normal operation, particularly with inpatient and outpatient care and pharmacy," Volpp says.

The lack of electronic records prevented residents on their rounds from accessing patient charts to review the prior day's results or add orders. Nurses couldn't hand off from one shift to another the way they were accustomed to doing it -- through Vista. Discharges had to be written out by hand, so patients didn't receive the normal lists of instructions or medications, which were usually produced electronically.

Volpp said that within a couple of hours of the outage, "most users began to record their documentation on paper," including prescriptions, lab orders, consent forms, and vital signs and screenings. Cardiologists couldn't read EKGs, since those were usually reviewed online, nor could consultations be ordered, updated or responded to.

In Sacramento, the group finally got a handle on what had transpired to cause the outage. "One team asked for a change to be made by the other team, and the other team made the change," said Raffin. It involved a network port configuration. But only a small number of people knew about it.

More importantly, said Raffin, "the appropriate change request wasn't completed." At the heart of the problem was a procedural issue. "We didn't have the documentation we should have had," he said. If that documentation for the port change had existed, Raffin noted, "that would have led us to very quickly provide some event correlation: Look at the clock, look at when the system began to degrade, and then stop and realize what we really needed to do was back those changes out, and the system would have likely restored itself in short order."
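The event correlation Raffin describes -- match the outage's start time against a log of recent changes and the undocumented port change would have surfaced immediately -- is simple to sketch. The change-log entries below are hypothetical, reconstructed from the account above.

```python
# Sketch of change-log event correlation: find changes made shortly
# before an incident, newest first. Data here is illustrative only.
from datetime import datetime, timedelta

change_log = [
    {"what": "scheduled system review begins", "when": datetime(2007, 8, 31, 6, 0)},
    {"what": "network port reconfiguration", "when": datetime(2007, 8, 31, 7, 25)},
]

outage_start = datetime(2007, 8, 31, 7, 30)

def recent_changes(log, incident_time, window=timedelta(hours=1)):
    """Return changes made within `window` before the incident, newest first."""
    candidates = [c for c in log
                  if timedelta(0) <= incident_time - c["when"] <= window]
    return sorted(candidates, key=lambda c: c["when"], reverse=True)

for change in recent_changes(change_log, outage_start):
    print(change["what"])  # the port change, five minutes before the outage
```

With a documented change request on file, "What changed last?" becomes a five-second query instead of a day-long investigation -- which is the procedural gap the ITIL change management process is meant to close.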

According to Evelyn Hubbert, an analyst at Forrester Research Inc., the outage that struck the VA isn't uncommon. "They don't make the front page news because it's embarrassing." Then, when something happens, she said, "it's a complete domino effect. Something goes down, something else goes down. That's unfortunately typical for many organizations."

Schulz concurred. "You can have all the best software, all the best hardware, the highest availability, you can have the best people," Schulz said. "However, if you don't follow best practices, you can render all of that useless."

When the Region 1 team realized what needed to happen, it made the decision to shut down the 17 Vista systems running from the Sacramento center and bring them back up one medical facility at a time, scheduled by location -- those nearing the end of their business day came first. Recovery started with medical sites in the Central time zone, then Pacific, Alaska and Hawaii. By 4 p.m., the systems in Northern California facilities were running again.

But, according to Volpp, although Vista was up, the work wasn't over. Laboratory and pharmacy staffers worked late that Friday night to update results and enter new orders and outpatient prescriptions into the database. Administrative staffers worked for two weeks to complete the checkouts for patients seen that day. "This work to recover the integrity of the medical record will continue for many months, since so much information was recorded on paper that day," he says.

A shortage of communication

During the course of the day, said Volpp, affected facilities didn't receive the level of communication they'd been accustomed to under the local jurisdiction model of IT operation. As he testified to Congress, "During prior outages, the local IT staff had always been very forthcoming with information on the progress of the failure and estimated length even in the face of minimal or no knowledge of the cause. To my knowledge, this was absent during the most recent outage."

Raffin denies this. "There were communications," he said. "There most certainly were." But, he acknowledged, they were not consistent or frequent enough, nor did they inform the medical centers sufficiently about implementing their local contingency plan. "It was," he said, "a difficult day."

Once the team realized what it needed to do to bring the systems back to life, Region 1 began providing time estimates to the medical facilities for the restoration of services.

The rift exposes a common problem in IT transformation efforts: Fault lines appear when management reporting shifts from local to regional. According to Forrester's Hubbert, a major risk in consolidating operations is that "even though we're thinking of virtualizing applications and servers, we still haven't done a lot of virtualization of teams." More mature organizations -- she cited high-tech and financial companies -- have learned what responsibilities to shift closer to the user.

Workforce reshaping

Raffin said that it was never the intent of the realignment to downgrade the level of service experienced by people in the medical facilities. "The message I send to my folks in my organization is, 'You may work ultimately for me within OI&T, but you absolutely work for the network or facility where you're stationed,'" he said. The goal, Raffin said, was to create "that bench strength we've never had."

As an example, Raffin points to a coding compliance tool, an application that exists at all 33 medical centers in his jurisdiction that all run the same version on the same system. "There was a sliver of a [full-time employee] at every medical center that was supporting this application," he said. "There was no structure [for] maintenance and upgrades, no coordination in how we handled problem management." When a problem surfaced, 33 trouble tickets would be logged, Raffin said.

As part of the reorganization, Region 1 has set up a systems team, which includes an applications group. Two people within that group are now coordinating the management of that particular application. "It's a team approach," he said.

Similar to the argument made by companies that move employees to a service provider during an outsourcing initiative, Raffin claimed that the reassignment of personnel to an organization dedicated to IT will ultimately result in greater opportunities for them and better succession planning for OI&T.

"The only competing interest I have with regards to training are other IT folks, who need other IT training," he said. "I'm not competing with nursing education or with folks who need safety education because they operate heavy machinery at a medical center."

Along the way, that training includes an education in change management process, one of the ITIL best practices being adopted by OI&T that was "new to our IT folks," said Raffin. "They may have read it, but I'm not sure they got it."

Dr. Paul Tibbits is deputy CIO for Enterprise Development -- one of the newly created functional areas within OI&T. Tibbits pointed out that under the previous management structure, "there would have been a lot of competition for mental energy on the part of a hospital director. Does he get his IT staff to read this stuff or not read this stuff?" Under a single chain of command, that education will most assuredly take place, he said.

Tibbits' organization is taking a different approach from Region 1 in how it develops staff skills in the four ITIL processes under his charge. The three-phased approach he described involves real-time coaching and mentoring for "short-term change"; classes, conferences and workshops for midterm change; and updates in recruiting practices for the long term.

"We're hiring outside contractors to stand at the elbows and shoulders of our IT managers through the development organization to watch what they do on a day-by-day basis," said Tibbits. That effort has just begun, he said, with contractors "just coming on board now."

On the other hand, Region 1 under Raffin's leadership has introduced a three-part governance process. The first part is a technical component advisory council, which meets weekly to discuss and prioritize projects. "That is where a lot of training has occurred," said Raffin. Second, a regional governance board also meets weekly to discuss issues related to IT infrastructure. In addition, Raffin is about to implement a monthly meeting of an executive partnership council that will include both IT people and "business" representatives from the medical facilities being served.

Will bringing people together for meetings suffice to meet the needs of transforming the work habits of the 4,000 people who are now part of OPS -- what Tibbits classifies as a "workforce-reshaping challenge"? And will it prevent the kind of outage that happened on Aug. 31 from happening again somewhere else?

Tibbits sweeps aside a suggestion that the centralization of IT played a role in the outage. "Had the IT reorganization never happened, this error might have happened on Aug. 31 anyway because somebody didn't follow a procedure," he said.

Forrester's Hubbert sees the value in bringing together teams within IT to look at operations more holistically. "That's what change agents need to do -- to lay IT on its side instead of keeping it in silos ... to have that end-to-end picture," she said. Plus, that's an effective way to address shortfalls in process and bring staff along as part of the overall transformation effort, Hubbert adds. "Usually, if you take IT people into the boat and ask them what to fix, if you say, 'Hey, this is the whiteboard. Let's figure it out from there all the way back to the root cause,' they have a real willingness to cooperate," she said. From there, they can develop a process to prevent the same type of problem from surfacing again.

Region 1 Fallout

When an event takes place that impairs the operations of 17 federally funded medical centers, investigations and reviews tend to follow.

In the case of Region 1, that includes an internal review of the regional data processing initiative by both the IT Oversight & Compliance and the Information Protection and Risk Management organizations, which report to Gen. Bob Howard, assistant secretary for OI&T, as well as a review coordinated by an unnamed outside firm. Raffin said he expects those reviews to be concluded early in 2008. And although that review was actually scheduled as part of the OI&T's spending plan, he acknowledged that "it's happening a little earlier than we wanted it to."

Until those results are in, the OI&T has put a "soft hold" on migrating additional medical centers into the regional data center concept, said Raffin. "From Region 1's perspective, we were almost 90% complete and should have been 100% complete by Nov. 9. Our project schedule is going to be a little delayed," he said.

Also, Howard has directed the OI&T development organization to work with the infrastructure engineering organization to design a series of system topologies that would provide varying degrees of reliability, availability, maintainability and speed, "up to and including one option that would be 'zero downtime,'" Tibbits said. "I don't think there's any question in anyone's mind that 128 data centers is too many. One might be too few. But what exactly the optimal topology is, all of that is in play right now. Regionalization of some form is alive and well and will move forward."

Region 1 has experienced a dramatic improvement in compliance, Raffin said, "with folks documenting changes in advance of their occurrence." The next phase of that will be an automated system using tools from CA Inc., which are already in use in the VA's Austin Automation Center. He expects that to be implemented within 90 days.

Region 1 has also modified procedures related to the read-only version of records maintained by Vista, the Level 2 backup plan that wasn't fully available on Aug. 31. Now, Raffin said, those systems are more consistently checked for round-the-clock availability and "any system maintenance ... is properly recorded through our change management procedure."

According to Davoren, the medical director in San Francisco, "before regionalization of IT resources -- with actual systems that contained patient information in distributed systems -- it would have been impossible to have 17 medical centers [go] down." As he told a congressional committee in September, the August system outage was "the longest unplanned downtime that we've ever had at San Francisco since we've had electronic medical records." This was proof to Davoren and others at the individual medical centers that in creating a new structure "in the name of 'standardization,'" support would "wane to a lowest common denominator for all facilities," he said.

Raffin isn't ready to give up. He recognizes that an event like the one that happened on Aug. 31 "casts a long shadow" against what he sees as a number of accomplishments. But he also maintains confidence that Region 1 -- and all of OI&T -- has the ability to pull off its transformation. "For me, it's about making sure we're listening to all of our folks and have our ears to the pavement at the medical centers to make sure we understand what our business requirements are," he said.

Change is hard, especially when it's undertaken on such a massive scale. The difficulty was foreseen early on by VA CIO Howard. "This will not be an easy or quick transformation. There will be a few difficulties along the way, and it's natural for some people to be uncomfortable with change on such a scale. But the prospect of more standardization and interoperability we can harness through this centralization is exciting," Howard said in a webcast speech to the IT workforce of the VA shortly after his confirmation hearings by the Senate Committee on Veterans Affairs.

A question remains whether the VA OI&T is moving quickly enough to keep the confidence of its numerous constituencies -- patients, medical staff, VA executives and lawmakers. As U.S. Rep. Bob Filner (D-Calif.), chair of the House Committee on Veterans' Affairs, stated during that September hearing, "We are heartened by many of the steps the VA has undertaken, but remain concerned that more should be done, and could be done ... faster."

Dian Schaffhauser is a writer who covers technology and business for a number of print and online publications.

The holiday shopper's guide to laptops

Looking to buy a laptop this holiday season? The choices can be mind-boggling, with countless models and configurations to choose from.

In fact, though, it's not that tough to figure out which laptop to buy, and then get a great deal on it. Follow our advice, and you won't go wrong.

The most basic decision you'll make, of course, is whether to go with a Mac or a PC. As with religion, this is a personal choice upon which we won't impinge. So we'll start off with advice for a PC, then provide information for buying a Mac laptop. We'll end our guide with tips for finding laptop bargains.

If you buy a Windows laptop

Let's start with the basics -- the processor. It's this simple: Buy a laptop with a dual-core processor, such as Intel's Core Duo mobile or Core 2 Duo mobile (the Core 2 Duo is faster than the Core Duo), or the AMD Athlon 64 X2 Dual Core processor or AMD Turion 64 X2 Dual Core processor (the Turion is faster than the Athlon).
For most users, the speed of the processor itself doesn't matter too much as long as it's dual-core. Dual-core processors are faster than single cores -- particularly when multitasking -- and save power as well, so you'll get longer battery life with them.

You may also find laptops with the Intel Core 2 Extreme mobile processor, which has four cores instead of two. As a practical matter, four cores won't make a dramatic difference compared to two cores, considering that applications haven't yet been written to take advantage of four cores. So if a four-core laptop costs a good deal more than a two-core one, it's probably not worth the extra money.

For RAM, consider 1GB a minimum, and get more if you can afford it. A 2GB laptop will have sufficient memory for just about anything a typical user will do, although for a hardcore gamer, you might want to opt for 4GB.

Most people overlook one of the most important laptop specs -- graphics processing. Frequently, laptops use an integrated graphics controller rather than a separate graphics card, which can be problematic not only for gamers, but even for those running Windows Vista Home Premium.

Unless you know the recipient is going to stick to computing basics such as e-mail and word processing, it's a good idea to get a notebook with a dedicated graphics controller, which can enhance such activities as managing a photo library or watching videos online. Gamers need a higher-end card, such as the Nvidia GeForce 8700M GT. If your recipient doesn't play games, though, a card such as the Nvidia GeForce 8400M GS will be fine.

As for how much graphics memory you need, you might want 512MB for gamers, while for general computing 256MB or even 128MB will do.

If you expect that your gift recipient's graphics needs will grow and that he might ultimately want to have more than one graphics processor in his laptop, look for machines that have Scalable Link Interface (SLI), which allows the laptop to use multiple graphics chips.

The rest of the laptop specs are fairly straightforward. You'll want as big a hard disk as you can reasonably afford (your recipient can always add external storage later), a DVD burner and a minimum one-year warranty. As a general rule, the larger the screen, the heavier the laptop and the shorter the battery life, so keep that in mind when buying. If your laptop recipient is a road warrior who spends a lot of time on long airplane flights, consider upgrading to a longer-lasting battery.

If possible, look for a laptop with built-in 802.11n wireless capabilities rather than just 802.11g. That way, when the 802.11n standard becomes widely used, the laptop will be able to take advantage of its faster speeds. Similarly, if you can get a Gigabit Ethernet connection built in, opt for that rather than the more common, slower Ethernet connection.

Finally, look for a laptop with as many slots as possible. If you care about expandability, you'll want a PC Card slot, and ideally, an ExpressCard slot as well. Both slots let you connect a wide variety of peripherals. You want not only USB 2.0 ports, but also FireWire (IEEE 1394) if you can get it. And look for card slots for removable media, such as CompactFlash, Secure Digital, SmartMedia, MultiMediaCard and Memory Stick, if you think your recipient will want to transfer photos or other media files to the laptop.

The price range on Windows laptops is considerable, depending on whether you want a low-end model with only the basics or a high-end screamer capable of playing the latest games. Prices do fluctuate, but you can usually find a laptop with a 15-in. screen, no separate graphics processor, an AMD Athlon 64 X2 Dual-Core CPU, 1GB of RAM and an 80GB hard drive for around $550 -- a Dell Inspiron 1501, for example. On the higher end, you can usually get a laptop such as the HP Pavilion dv6675us with a 2-GHz Intel Core 2 Duo Mobile T7200 CPU, 4GB of RAM, an Nvidia GeForce 8400M GS graphics controller with 128MB of memory, a 250GB hard drive and 802.11n wireless for between $1,500 and $1,700.

And if you need a full-bore machine capable of speedy high-end gaming, you'll have to spend a bundle. For example, an Alienware Area-51 m9750 with dual 512MB Nvidia GeForce Go 8700M GT chips with SLI, a 2.3-GHz Intel Core 2 Duo T7600 processor, 4GB of RAM, a 320GB hard drive, 802.11n wireless and a 17-in. monitor will set you back a whopping $3,600.

Going with a Mac

If you go for a Mac, your choices are much simpler than if you go the PC route, simply because there are far fewer Macs, with fewer variations. However, our recommendations for specs to look out for remain the same as with Windows laptops.

You'll choose between two lines: the MacBook Pro, available with 15- or 17-in. screen in a brushed aluminum case, and the smaller, lighter MacBook, which has a 13-in. screen in a white or black plastic case.

Both lines come standard with several of the items in our must-have list for laptop purchases:

  • An Intel Core 2 Duo processor
  • At least 1GB of RAM (MacBook Pros come with 2GB and all models can accommodate up to 4GB)
  • Gigabit Ethernet
  • 802.11n Wi-Fi
  • FireWire and USB 2.0 ports (MacBook Pros also have ExpressCard/34 slots)
  • A one-year warranty

They also include several "nice to haves," such as FireWire ports, built-in webcams, and Bluetooth connectivity, that you might pay extra for with PCs. (On the other hand, you could argue that you're paying for these features on Macs whether you want them or not.)

As a general rule, the MacBook Pros tend to have higher-end specs than the MacBooks. For example, the MacBooks come with an integrated Intel GMA X3100 graphics processor with 144MB of RAM. To get a better graphics processor, you'll need to go with a MacBook Pro, which includes a slick Nvidia GeForce 8600M GT graphics processor with dual-link DVI support and either 128MB or 256MB of RAM.

That said, however, even the MacBooks offer a fair number of customization options including processor speed, hard drive size, amount of RAM and so on. MacBooks range from $1,100 to $1,500 before configuration, while MacBook Pros range from $2,000 to $2,800. As with Windows laptops, opting for more memory, a faster processor and/or a bigger hard drive can raise prices considerably.

Where to buy

Now that you've decided what to buy, it's time to put your money down. As a general rule, you'll get your best deals online rather than in a retail store, and you'll have more choice as well. But if you shop online, you won't actually get to put your hands on the laptops, and with laptops -- even more so than with desktops -- hands-on experience is important. So after you've narrowed down your choices, visit some retail stores and try out the laptops.

Next it's buying time. There are plenty of great deals to be had online, but often they only last for a day or so and then vanish. To find them, you need to go to bargain-hunting sites that scour the Internet for special deals and offers.

The best of the bargain-hunting sites is, which every day lists about a half-dozen new deals. Every once in a while, you'll find a great steal here. Dell laptops, in particular, often show up. Recently, for example, I found a Dell Inspiron E1405 Core 2 Duo laptop for $445 less than its normal price.

Keep in mind, though, that often these deals mean that you can't configure a laptop -- they're take-it-as-is-or-leave-it propositions. Other good bargain sites to try include Woot, DealCatcher and Ben's Bargains.

If you're shopping for a specific model rather than looking for a one-time deal at a bargain site, you should check out manufacturer sites as well as online retailers like and, because prices can vary considerably among them.

(This is less true of Macs, by the way, since Apple tends to enforce price uniformity. You can sometimes find rebates or add-ons like a free printer if you buy a Mac online, but you're unlikely to save hundreds of dollars.)

Also make sure to check out a price comparison site like PriceGrabber, which compares prices from multiple online retailers. Happy hunting!