Posted by Thomas McElroy

Over the last three weeks, you may have noticed some instability in our Rankings tools, in the form of missing data and error messages stating that some tools are unavailable. On Friday, we experienced a totally different, unrelated problem with our rankings data. We expect to have an updated prognosis for that problem by tomorrow, but we want to fill you in on what went down at Mozplex to cause these issues in the first place.

To be as transparent as possible about what happened and how we’re working to fix the issue, below is a summary of what was impacted, the work we did to get things going again, and what we’re doing in the future to make the system more resilient.

Database issues? What gives?

Our SERP data subsystem (which runs on the distributed storage technology Riak) had a couple of nodes fail. To learn more about Riak, here’s a blog post we wrote when we made the switch last year. The subsystem is designed to handle such failures; however, we did not handle the failure correctly. In the process of fixing our Riak storage, we disrupted some of our queues for SERP data processing.

Given Moz’s growth over the last six months and the number of SERPs processed in the Riak cluster, Roger can no longer recover from outages in a timely manner. In late 2011, we could recover the system in 3-8 hours and be caught up on data processing in a few days. This time around, it took us six days to get the system back up and another two weeks to catch up on the missing data and resolve the inconsistent data states that resulted.

Impacted services

Riak stores our SERP data (rankings data), so all the systems that depend on it were impacted. The impacted systems include:

- Custom reports
- On-page reports
- Historical rankings CSVs
- Rankings
- Keyword Difficulty & Full SERP Analysis reports

Work completed to get things going again

Our dev teams have been hard at work restoring all missing and inconsistent data after the Riak malfunction.

At a high level, here’s what we did to get Rankings and all its dependencies going again:

- Created scripts to heal the different broken states of jobs
- Added more nodes to speed up processing and help in future failures
- Improved monitoring to get information about failures and performance bottlenecks
- Improved performance in multiple areas

Future work

It took the team 20 days to fully recover from the cascading problems that resulted from the original issue. We know that this timeframe is unacceptable, and we apologize for not being able to recover more quickly. We are now working to ensure that the same failures do not occur in the future and to reduce downtime if something like this does happen again.

Work has begun on multiple improvements to help us reach our goals, including:

- Improving health checks and threshold monitoring of Riak nodes and subsystem dependencies
- Adding more Riak nodes
- Beefing up queue and job execution monitoring and alarming
- Creating a dependency matrix that indicates what’s impacted when something goes down
- Improving fault tolerance in parts of the system
- Providing additional excess service capacity
- Creating system operations documentation for dealing with emergency scenarios and how to recover

So, what’s the current ETA?

Unfortunately, as you can probably tell, we have a lot of work to do to get Rankings back to 100%. We don’t have an ETA quite yet. However, we hope to have a solid date in place by tomorrow and will update the post as soon as we know.

Again, we apologize for the failure and any issues it has caused. We are working our butts off to ensure it doesn’t happen again!

If you need an immediate alternative for rank-checking, try using the Rank Checker at SEOBook. For status updates on this issue, please check out our Rankings page on the Help Hub.
Latest posts by Thomas McElroy (see all)
- Where Are My Rankings? – September 6, 2012
- Drive Failures Affecting Some Customers’ Rankings and Reports – May 2, 2012