Traverse City Record-Eagle [Michigan - ed.]
June 30, 2011
Munson has 4-hour communications failure
By Bill O'Brien
bobrien@record-eagle.com
TRAVERSE CITY — Munson Healthcare officials are trying to figure out how to avoid a repeat of a four-plus-hour data systems crash and "resultant chaos" that gripped local hospitals and clinics this week.
A system failure Tuesday morning shut down computers, telephones, pagers and other telecommunications systems at Munson Medical Center and its Munson Healthcare affiliates in Frankfort and Kalkaska, an incident that administrators described as "unacceptable." [That sounds about right - ed.]
Munson officials still aren't sure why a back-up fiber optic circuit failed during a planned outage that started Tuesday at 7:30 a.m.
"You can rest assured we're looking very carefully at that," Munson Medical Center CEO Kathleen McManus said. "Of course, we need to know what happened."
McManus said no patients were adversely affected during the outage. Even with the "resultant chaos" that gripped the local hospitals and clinics, no patients were adversely affected.
Amazing. There must be a cybernetic angel department in heaven that protects patients from harm - i.e., from missed or delayed treatments or treatment mistakes - during the "resultant chaos" of major systems outages.
I'm sure insurers and risk managers have full confidence in Providence when these mishaps occur.
-- SS
July 1, 2011 update:
The Traverse City Record-Eagle has published a memo explaining the outage:
Munson Healthcare officials distributed a memo on Wednesday that explained Tuesday’s systems failure, which affected Munson Medical Center, Paul Oliver Memorial Hospital, Kalkaska Memorial Health Center, and various clinics. The following memo is attributed to Chris Podges, Munson’s vice president of information systems.
“As you are all aware, we experienced an unplanned network downtime (Tuesday) that had widespread operational and clinical implications. Briefly, here is what happened:
Munson’s data centers’ connectivity to the outside world runs primarily on two redundant high speed fiber optic circuits administered by Traverse City Light and Power. We were informed by them that they needed to take one of the circuits off-line in order for them to do maintenance.
This would leave us operating on one circuit for the duration of their planned, 12-hour downtime. This shouldn’t have been any problem for us and is precisely why we have parallel, redundant technology on our most important systems and infrastructure. We have frequently tested for an event like this (losing one of the circuits) by manually “switching off” a fiber circuit.
In our testing, the remaining circuit took on all the traffic, just as it was architected to do; no hiccups, no instability, no impact on users, no downtime. And that is what we fully expected yesterday morning when one of the circuits was taken off line.
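For readers unfamiliar with the jargon, here is a minimal sketch (in Python, with hypothetical names of my own invention; Munson has not published its actual design) of the active/standby failover the memo describes. A monitor watches the active circuit and shifts traffic to the survivor the moment a health check fails:

```python
# Minimal sketch of the active/standby failover the memo describes.
# All names here (Circuit, FailoverMonitor, "TCL&P fiber A/B") are
# hypothetical illustrations, not Munson's actual configuration.

class Circuit:
    """One of two redundant fiber circuits."""
    def __init__(self, name):
        self.name = name
        self.up = True

    def health_check(self):
        # In a real network this would be a link-state probe or
        # routing-protocol keepalive, not a boolean flag.
        return self.up


class FailoverMonitor:
    """Sends all traffic over whichever circuit is healthy."""
    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.active = primary

    def poll(self):
        if not self.active.health_check():
            other = self.standby if self.active is self.primary else self.primary
            if not other.health_check():
                raise RuntimeError("both circuits down: total outage")
            self.active = other  # the millisecond transition the memo mentions
        return self.active.name


# The drill Munson describes: manually "switch off" one circuit.
a = Circuit("TCL&P fiber A")
b = Circuit("TCL&P fiber B")
monitor = FailoverMonitor(a, b)
a.up = False                                # planned maintenance takes A off-line
assert monitor.poll() == "TCL&P fiber B"    # traffic continues; no downtime
```

In the drill, as in the sketch, the surviving circuit simply absorbs all the traffic. The interesting question is what happens when reality departs from the drill - which is exactly where the memo goes next.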
In medicine, or in any complex domain, one must expect the unexpected...
Unfortunately, that isn’t what happened. The core switch of the remaining circuit became confused, couldn’t take over the role as the primary switch (a transition which is measured in milliseconds) and ultimately shut down. Once down, everything running on the network – applications, paging systems, wireless devices, IP phones, etc., went down with it.
(I hear Kate Winslet humming in the background...)
Anthropomorphism aside, it seems to me that switches and other inanimate objects don't become "confused." The engineers/programmers who designed them just didn't anticipate the event that sinks the Titanic.
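In other words, the "confusion" is an unhandled state. As a purely hypothetical sketch (Munson has not published its switch logic; these names and rules are mine), here is how a failover design that passes every drill can still shut itself down when the real event differs from the rehearsed one:

```python
# Hypothetical sketch of how a standby switch can fail to promote itself.
# This is NOT Munson's switch logic (which has not been published); it
# illustrates that "confusion" is really an unanticipated state.

STANDBY, PRIMARY, SHUTDOWN = "standby", "primary", "shutdown"

def next_state(peer_alive, role_released):
    """Decide the standby's next state when it stops hearing the primary."""
    if role_released:
        return PRIMARY    # the rehearsed, clean handoff (milliseconds)
    if peer_alive:
        return STANDBY    # primary is still up; keep waiting
    # Unanticipated case: the primary vanished WITHOUT releasing its role.
    # Logic that fears a split-brain (two primaries) more than downtime
    # gives up here, and everything on the network goes down with it.
    return SHUTDOWN

# The drill: a circuit is switched off manually, releasing its role first.
assert next_state(peer_alive=False, role_released=True) == PRIMARY

# Tuesday morning: the circuit dropped uncleanly; no rule fit, so shutdown.
assert next_state(peer_alive=False, role_released=False) == SHUTDOWN
```

A clean manual switch-off and an unclean disappearance look identical to users, but not to the failover logic. Drills that rehearse only the clean case prove nothing about the unclean one.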
But let's roll the technology out nationally, now, rush...rush...rush...before it's too late. Computer bits get stale after a while, after all....
Re: "Unplanned network downtimes." In the long running TV series Stargate SG-1 there's a character, Walter Harriman, who runs the command console. Just about the only line he gets to say over the base PA system over 10 seasons is ...
"UNSCHEDULED OFF-WORLD ACTIVATION!"
... when some person or alien attempts to come through from another Stargate somewhere in our galaxy or others nearby.
Perhaps hospitals can use him as a role model to announce "unscheduled down time"...
... During the downtime, we assembled a small army working on three objectives:
1. Make sure the hospitals and clinics could operate (especially as to the provision of patient care) on downtime procedures
2. Communicate as comprehensively and as often as we could
3. Fix the technical issues
Aside from the fact that these "objectives" are obvious, hospitals now need "small armies" to protect patients when a network switch gets "confused." Pen and paper never demanded any such bellicose mobilization.
While the reports are that all hospitals and clinics did a fantastic job surviving the downtime, we fully understand that it was very difficult to manage the resultant chaos and that downtimes like this are unacceptable.
But no patient was hurt (as they never are when the IT goes down).
Thursday morning at 2 a.m. we are going to re-introduce the second fiber optic circuit into our network architecture. While we expect no issues, we’re planning otherwise. This afternoon your organizations will receive specific instructions on how to prepare for the event of another network outage: what to print in advance of 2 a.m., what resources are available to you during the downtime, how to get needed clinical information without the use of computers, who to call for help, etc.
Thank god.
Again, we do not expect any downtime tomorrow morning, but we did re-learn some valuable lessons yesterday and the safety of our patients is the number one objective should the network experience another issue. We’ll be ready at 2 a.m. and we want your organizations to be ready, too.
Why do such "lessons" need to be "re-learned" - EVER?
We are working diligently to understand what happened yesterday and will share with you what we learn and our plans to remedy whatever may need attention.
In effect, they really don't understand what caused the outage.
This is similar to other cases of outages written about at Healthcare Renewal. There's never certainty, because these IT systems have become so complex and their support so diffuse (via outsourcing, consultants, etc.).
This is fertile ground for patient injury and death when the luck runs out. As stories like "Failures in care alleged after premature birth - $1,000,000 Settlement" from Virginia Lawyers Weekly, which I referenced here, imply, the cost of ownership of HIT will likely go way up once the multimillion-dollar lawsuits start adding up.
Considering health IT's track record as of 2011 on reliability, risk/benefit, and security, spending hundreds of billions of dollars to put patients, en masse and nationally, at risk of life and limb, and at risk of potentially career-ending or bankrupting loss of privacy, is increasingly brainsick.
-- SS