
JSE's best efforts fail

By Christelle du Toit, ITWeb senior journalist
Johannesburg, 15 Jul 2008

After six hours of downtime and R7 billion in lost trade yesterday, the JSE is conducting a post-mortem on the network fault that brought its entire system down.

While the bourse's CIO, Riaan van Vamelen, focused on the problem at hand, JSE CEO Russell Loubser confirmed that, instead of its average daily trade of about R12 billion, the JSE did only about R5 billion worth of deals yesterday.

This was despite the exchange extending its trading hours until 7pm, after trading only opened at about 3.15pm.

While declining to disclose exactly what caused the problem until further investigations have been concluded, Loubser says "the hardware or software that doesn't fail from time to time has not yet been made".

He explains: "We isolated the problem and solved it. Not even a full disaster recovery site could have avoided it - we could still have encountered it. If we failed over to the disaster recovery site, there is no guarantee that it would not have happened again."

Doing things right

Despite the losses, it looks like the JSE did all the right things, says Craig Jones, operational director at Econarch Data Centre Services. "Disaster recovery is always a learning experience. There is no way to foresee every eventuality."

The process of disaster recovery requires businesses to perform an impact analysis, he notes. "Within that, organisations then need to make a cost-versus-risk decision: essentially, how high the risk of a particular failure is, and whether preventing that failure is worth the cost."
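As a rough illustration of the trade-off Jones describes, the sketch below compares the expected annual loss from a failure against the cost of preventing it. It is not drawn from the JSE's own analysis; the function names and all figures are invented for illustration.

```python
# Hypothetical cost-versus-risk sketch: all figures are illustrative only.

def expected_annual_loss(probability_per_year: float, loss_if_it_happens: float) -> float:
    """Expected loss per year = likelihood of the failure x impact if it occurs."""
    return probability_per_year * loss_if_it_happens

def worth_mitigating(probability_per_year: float,
                     loss_if_it_happens: float,
                     annual_mitigation_cost: float) -> bool:
    """Mitigation is justified when it costs less than the loss it is expected to avert."""
    return annual_mitigation_cost < expected_annual_loss(probability_per_year, loss_if_it_happens)

# Example: a failure judged to occur once in 12 years, costing R7bn in lost trade,
# weighed against R200m a year for extra redundancy (numbers invented for illustration).
print(worth_mitigating(1 / 12, 7_000_000_000, 200_000_000))  # True in this made-up case
```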

It is most likely that the particular failure the JSE experienced was considered a low risk and would normally be covered by insurance. "If they haven't failed like this since 1996, then the risk of this failure is minimal."

The JSE last experienced a massive failure of this kind in 1996, when the trading floor was on and off for five days.

Jones says there could have been many reasons why the company decided not to fail over to the disaster recovery (DR) site. "There could have been a delay in replication, which would mean the DR site would not be up to date. A trading environment needs to be current."

There could also have been an undetected vulnerability present on the DR site as well, or even a connectivity issue between the two sites, he adds. "There is no way to speculate on what the problem could have been, unless they tell you."
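A minimal sketch of the kind of replication check Jones alludes to appears below. It assumes a hypothetical monitoring setup in which each site reports the timestamp of its last applied transaction; the function names, timestamps and lag threshold are all invented and do not describe the JSE's systems.

```python
from datetime import datetime, timedelta

# Hypothetical pre-failover check: is the DR site current enough to take over?
MAX_ACCEPTABLE_LAG = timedelta(seconds=5)  # a trading environment needs to be current

def safe_to_fail_over(production_last_txn: datetime, dr_last_txn: datetime) -> bool:
    """Only fail over if the DR site has applied (nearly) everything production has."""
    replication_lag = production_last_txn - dr_last_txn
    return replication_lag <= MAX_ACCEPTABLE_LAG

# Example with invented timestamps: the DR site is 90 seconds behind, so failing
# over would lose recent trades and the check refuses.
prod = datetime(2008, 7, 14, 9, 0, 0)
dr = datetime(2008, 7, 14, 8, 58, 30)
print(safe_to_fail_over(prod, dr))  # False
```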

Fixing the problem

Jones's sentiments are, to an extent, echoed by Mark Beverley, GM for service delivery at Continuity SA. "This kind of problem can happen," he says.

According to Beverley, the JSE would have done a risk mitigation exercise and, had it identified its network as a risk area, would have built in extra redundancy.

He says: "To reduce their risk, they should [now] do some kind of duplication of their network infrastructure, which is where their DR site comes in. They can triangulate to it, or they can have multiple [data] feeds into their production site [the JSE itself]."

The JSE will have to review its impact assessment and decide whether guarding against this kind of failure is now worth the cost, says Jones. "The exchange will probably analyse the problem in detail and change its DR site."

The JSE has issued a formal apology on its Web site for yesterday's downtime.

Related stories:
JSE potentially loses billions
IT fault halts JSE
JSE reverses outsourcing move
