The inception of autoanomaly came about, like many inventions, to solve a need. In this case, the challenge was a lengthy and repetitive debugging playbook for identifying potential root causes of firing alerts, which I followed while working as a site reliability engineer at LinkedIn.
The typical procedure began with reviewing several pages of time-series graphs, trying to find an anomaly in the metrics that could explain the operational issue and also correlated with its time frame. Our infrastructure was heavily microservice-based, and the team I worked on, the Identity team, handled a significant amount of traffic with a large number of upstream clients and several downstream dependencies. The services we managed were internally quite stable (at least after we completed a migration to a newly designed architecture), so when an issue arose it was very likely caused by one of our clients or dependencies.
The time spent poring through pages of graphs was frustrating, and it also drastically increased our MTTR at times. An idea occurred to me: could I leverage some machine learning techniques to automate this investigation? I knew it wouldn't be feasible to use any methods that required training, since the amount of data was enormous and, in most cases, highly noisy. Instead, if an algorithm could calculate a correlation score between two time-series datasets, the approach might work.
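To make the idea concrete, here is a minimal sketch of that kind of scoring, not the library or algorithm actually used: it computes the best absolute Pearson correlation between the alerting metric and each candidate metric over a small range of time shifts, then ranks the candidates. The function name, the `max_lag` parameter, and the sample data are all illustrative assumptions.

```python
import numpy as np

def correlation_score(alert_series, candidate_series, max_lag=5):
    """Return the best absolute Pearson correlation over small time shifts."""
    a = np.asarray(alert_series, dtype=float)
    b = np.asarray(candidate_series, dtype=float)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Shift one series against the other and keep the overlapping samples.
        if lag < 0:
            x, y = a[:lag], b[-lag:]
        elif lag > 0:
            x, y = a[lag:], b[:-lag]
        else:
            x, y = a, b
        if len(x) < 2 or np.std(x) == 0 or np.std(y) == 0:
            continue  # correlation is undefined for constant or tiny windows
        r = np.corrcoef(x, y)[0, 1]
        best = max(best, abs(r))
    return best

# Hypothetical dependency metrics ranked against the alerting metric.
alert_metric = [10, 11, 10, 35, 40, 38, 12, 11]            # e.g. p99 latency
candidates = {
    "downstream_db_latency": [5, 5, 6, 20, 25, 22, 6, 5],
    "upstream_client_qps":   [100, 101, 99, 100, 102, 100, 101, 99],
}
for name, series in sorted(candidates.items(),
                           key=lambda kv: correlation_score(alert_metric, kv[1]),
                           reverse=True):
    print(f"{name}: {correlation_score(alert_metric, series):.2f}")
```

A ranking like this is enough to surface the handful of dependency metrics worth looking at first, which is the whole point of skipping pages of graphs.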
Our observability stack had a Python API, which made a quick prototype feasible, and as luck would have it, our performance team had built an anomaly detection and correlation library, also in Python. We were off to a good start!
I met with a few people from the performance team, described what I was trying to accomplish, and got some information about their library. A few days of coding and testing later, I had a working solution. There were a few parameters that required a bit of tuning, but overall the out-of-the-box algorithms and defaults worked remarkably well for this use case.
With my new tool, the time spent reviewing time-series charts went from several minutes to a few seconds. I then went on to design algorithms to correlate against two other data sources that did not have native time-series data available: feature activation ramps and software deployments.
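One way to bridge that gap, sketched here under my own assumptions since the post does not describe the actual encoding, is to turn each discrete event (a deployment or a feature-ramp change) into a step function sampled on the same timestamps as the metrics, so the same correlation scoring can be reused. The function name and the example timestamps are hypothetical.

```python
import numpy as np

def events_to_step_series(event_times, sample_times):
    """Value at each sample = number of events that have occurred by that time."""
    event_times = sorted(event_times)
    return np.searchsorted(event_times, sample_times, side="right").astype(float)

# Hypothetical data: metric samples every minute, a deployment at t=3 and a
# feature ramp bumped at t=5 (times in minutes since the start of the window).
sample_times = np.arange(0, 10)
deploys = events_to_step_series([3], sample_times)   # 0,0,0,1,1,1,1,1,1,1
ramps   = events_to_step_series([5], sample_times)   # 0,0,0,0,0,1,1,1,1,1
print(deploys)
print(ramps)
```

Once events look like step-shaped time series, a sudden shift in a service metric that lines up with the step shows up as a high correlation score, just like any other candidate metric.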
Shortly after this, I switched over to the performance team, where I worked on very interesting problems and found other successes: notably, a 10x query latency improvement on a fully CPU-bound in-memory graph database, achieved by analyzing x86 CPU pipeline stalls and leveraging inline assembly. However, the switch meant I no longer had time to work on autoanomaly and could not pursue my initial desire to take it through the open-sourcing process.
The IP for this now lies solely with my former employer, and I was encouraged to submit it to their internal patent process. I did so, they were interested in the idea, and in the end the USPTO granted me (well, them, technically) US 9,891,983.