Loss of data and system downtime are problems. For most people reading this, that is a self-evident truth. What I find “on-the-ground”, though, is a different and unsavoury matter altogether.
I was recently socialising off-the-clock with a few folks in the telecommunications industry and a story was relayed to me. A story that prompted me to clock in as a security professional to offer advice and guidance, as I could not in good conscience let it pass without at least pointing a few things out.
During the course of the week starting 19/03/12, an operating company of a very large mobile operator experienced a system outage.
The outage impacted the operator’s soft revenue generation mechanism (call completion to the terminating subscriber when that subscriber is IMSI detached) for a significant portion of that region’s subscriber base. It happens. Hardware does not last forever. In particular, hard drives have a limited lifespan. How you prevent widespread impact from drive losses is well known: effective storage system design, a hardline approach to backups, competent intervention teams (vendor and operator), and system re-engineering as a result of effective problem management.
By the time I heard about the problem, it had been ongoing for days and had been attributed to a “Known Error” which had occurred many times in the past on this vendor’s product line.
The vendor field intervention team reacted to the client report and reported back to their global support center. A JBOD had failed.
The recommended intervention, which would not have resolved the outage in any case, cemented the problem by corrupting the root filesystem. This aggravated the outage by introducing further system failures. This “Known Error” had clearly not been effectively documented in the vendor’s knowledge base, nor had it been addressed with effective problem management despite its prior occurrence globally. Nor was the senior support person assisting the incident response team sufficiently qualified to deliver support.
The last backup had been performed two years earlier and, when it was restored, it did not work – I won’t go into why, but I will say the 2010 backup was not sane and therefore useless and essentially not a backup.
Not only was there no backup policy in place for a production system, which in itself is a violation of the operator’s mandate for systems under the control of Vendor Managed Services, but for years the field teams had not followed the internal Field Change procedures, i.e. backups before and after implementation of Field Change Orders. Several Change Orders had been implemented in the past two years.
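To make “sane backup” a little more concrete, here is a minimal sketch of the kind of automated check that would have flagged this long before the outage. Everything in it is an assumption on my part rather than anything the operator or vendor actually runs: the function names, the idea of recording a SHA-256 checksum at backup time, and the age limit are all hypothetical.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large backup archives do not exhaust memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_problems(path, expected_sha256, max_age_days, now=None):
    """Return a list of reasons not to trust the backup (empty if none found).

    `expected_sha256` is assumed to be the checksum recorded when the
    backup was taken; `max_age_days` is a hypothetical policy limit.
    """
    if not os.path.isfile(path):
        return ["backup file is missing"]
    problems = []
    now = time.time() if now is None else now
    age_days = (now - os.path.getmtime(path)) / 86400
    if age_days > max_age_days:
        problems.append("backup is %.0f days old (limit %d)" % (age_days, max_age_days))
    if sha256_of(path) != expected_sha256:
        problems.append("checksum does not match the value recorded at backup time")
    return problems
```

A check like this only tells you the archive exists, is recent, and has not changed since it was written. It does not prove the backup restores; only a periodic test restore onto spare hardware does that.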
A sad state of affairs. Fortunately, this problem could have been resolved with a few more hours of work and a sound understanding of Unix. On a system that is built around 4 Unix boxes, a number of Linux machines, and a few instances of a real-time OS, one would expect incident response teams to be knowledgeable or have the means to solve these problems within the team. They could not, and subsequently a senior vendor resource had to be flown in to the region unnecessarily, at great expense (long haul flight, hotel, per diem) to perform the rebuild. A rebuild that could have been procured on a day rate from skilled resources in-region at a fraction of the overall cost and with a quicker turn-around time.
So, we have a loss of operator revenue, data, reputation and a severe impact on vendor operational expenditure and reputation.
All of this is simply a compounded effect of:
- Poor procedure
- Lack of stakeholder oversight
- Poor Service Level Agreement management
- Poor risk analysis and impact assessments associated with Managed Service contracts
- Low cost and arguably inadequate resources tied to Managed Services
- Absence of training and development of aforementioned resources
- Lack of regular, independent assessment and testing of Managed Services capabilities
- Lack of regular, independent assessment and testing of Vendor Support mechanisms
The requirement for a well-defined penetration testing program is about more than just testing your estate against cyber attack. It’s about identifying all vulnerabilities in your operation, be they physical, technical or human in nature. Penetration testing needs to be a full-time consultancy covering all aspects of your business, no matter what.