As we’ve discussed previously, the data center is business-critical to many industries, and social media networks are no exception.
In 2014, Facebook suffered its biggest worldwide outage after updating the configuration of one of its software systems. The update didn’t go to plan, leaving users unable to reach Facebook through either the website or its app. Although the outage lasted only 31 minutes, users quickly migrated to other platforms and publishers saw referral traffic ‘fall off a cliff’.
The company now takes what some might deem a daring approach to testing data center infrastructure resiliency. It regularly shuts down entire data center sites to see how its application will fare against a catastrophic event, such as a hurricane, and assesses the improvements that can be made to protect it.
Hurricane Sandy in 2012 took out many data centers located on the East Coast, and the damage was so great that recovery was very slow, with prolonged periods of service disruption. Facebook understands only too well how an outage at just one of its data centers, each of which processes tens of terabytes of traffic per second, could impact its service.
GitHub, the code sharing and publishing service that doubles as a social network for the developer community, learnt this the hard way when a power outage in its primary data center took out hosting services for two hours earlier this year. Many of the comments from programmers were not too negative and pointed out the irony, but the incident has also led to more serious questions about the resilience of the site.
GitHub has had to expand its data center capacity due to a massive intake of new repositories prompted by the popularity of open source software, but is it testing resilience too? While power outages do happen, they are usually the result of a failure in backup systems rather than of utility power problems, and that is something that can be addressed through testing.
GitHub and other social networks could learn from Facebook’s example and make sure their data centers use monitoring solutions that deliver real-time information about the environment, particularly air conditioning, heating and water.
Monitoring both the data center environment and its assets is the best way to spot problems early and deal with them quickly, whether it’s a power failure or missing assets. Programmers may be more understanding than the average social media user in the event of an outage, but downtime is downtime, and it is something that all services want to avoid.
Discover how RF Code’s CenterScape can help you with your own environmental monitoring and asset tracking here.