Read any technical blog post that gives a deep dive into fixing any type of issue, and 1 thing you notice fairly quickly is that going through the logs is an important part of the process. Debug issues in any application you’re working on, and 1 thing you notice fairly quickly is whether or not your logs are any good. It’s a distinction that can make all the difference when the question of “What the deuce just happened?” rears its ugly head. Better logging can make your life easier, largely by telling you about the state of what’s going on in your code so you can spend your time actually fixing and updating things instead of running down what happened in the first place.
What does better logging give me?
Your logs should be an autopsy of what happened when the user tried to do something with your application. Part of the problem we ran into when we first started focusing on issues with the TMS API was that the logs we were looking at weren’t helpful. They showed queries against our main API, which could have come from the TMS API or from other customers who were also using our main API on that same server. In addition, we had no logging whatsoever on the TMS API side about what users were trying to do. When we were starting out and there weren’t a lot of users hitting this API, relying on our main logs worked, but now that the TMS API is getting bigger, we desperately need better logging.
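To give a concrete sense of what that missing logging looks like, here’s a minimal sketch of a request-logging servlet filter, assuming a Java servlet stack (the TMS API was deployed as a WAR) and SLF4J. The class name and log format are illustrative, not our actual code:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every incoming TMS API request so there's a record of what users were
// actually trying to do, independent of the main API's logs.
public class RequestLoggingFilter implements Filter {
    private static final Logger log = LoggerFactory.getLogger(RequestLoggingFilter.class);

    @Override
    public void init(FilterConfig config) {
        // nothing to set up
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) request;
        log.info("TMS API request: {} {} (query: {})",
                http.getMethod(), http.getRequestURI(), http.getQueryString());
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
        // nothing to clean up
    }
}
```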
Once we got our immediate, crashes-every-time problems down to something manageable, we still weren’t out of the woods. The servers hosting the API were still crashing, albeit irregularly. All we had to go on were the production logs of my company’s main API, which included the queries going directly to that API (we had other customers hitting those servers directly) and the queries being generated by the TMS API. Worse, to figure out what TMS API call led to a given log entry, we had to reverse engineer the main product call and then try to determine what (if any) TMS API call could have caused it. Better logging could have told us which of those calls were coming from the TMS API, what those TMS API calls were, and whether or not they were getting hung up before or after the call to our main API. That kind of information could have pointed us straight to where the problems were, instead of wasting hours trying to run them down.
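As an illustration of the kind of logging that would have helped, here’s a hedged sketch, again assuming SLF4J: it tags log lines with the originating TMS API call via the MDC and logs on both sides of the outbound request, so the logs show whether a hang happened before or after the call to the main API. The class and method names are made up for the example:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Tags every log line with the originating TMS API call (via the MDC) and
// logs on both sides of the outbound request to the main API, so the logs
// show where in the flow a request got hung up.
public class MainApiClient {
    private static final Logger log = LoggerFactory.getLogger(MainApiClient.class);

    public String queryMainApi(String tmsApiCall, String mainApiQuery) {
        // With a log pattern that includes %X{tmsApiCall}, every line logged
        // below is stamped with the TMS API call that triggered it.
        MDC.put("tmsApiCall", tmsApiCall);
        try {
            log.info("Calling main API: {}", mainApiQuery);
            String response = execute(mainApiQuery); // placeholder for the real HTTP call
            log.info("Main API responded ({} chars)", response.length());
            return response;
        } finally {
            MDC.remove("tmsApiCall");
        }
    }

    private String execute(String query) {
        // Stand-in for the actual HTTP client call to the main API.
        return "";
    }
}
```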
What we learned with better logging
So we put in all this fancy-schmancy logging and what did we learn? Well, for starters, we learned that whatever was causing the issues we were seeing wasn’t coming through the TMS API. As I mentioned in the previous post, there were other customers using our main API on that same server. With the new logs, we could track the thread that was hanging the server and establish that it wasn’t coming from the TMS API, but rather from traffic hitting our service directly.
The improved logging also helped us debug another call in our API. We had a call to get all the new shows that had aired in the last week that was really freaking slow. The new logging confirmed that almost all of the time behind that call was spent waiting for the database query to return. Digging into that, we learned that we had several unnecessary joins left over from some old code. Removing those joins shaved about 2/3 off the response time right off the bat.
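For reference, here’s a rough sketch of the sort of timing log that surfaced this, assuming SLF4J; the class, method, and query here are stand-ins rather than the real code:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Measures just the database call so the logs separate query time from the
// rest of the request handling.
public class NewShowsService {
    private static final Logger log = LoggerFactory.getLogger(NewShowsService.class);

    public List<String> newShowsLastWeek() {
        long start = System.nanoTime();
        List<String> shows = runNewShowsQuery(); // placeholder for the real query
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        log.info("new-shows query returned {} rows in {} ms", shows.size(), elapsedMs);
        return shows;
    }

    private List<String> runNewShowsQuery() {
        // Stand-in for the real JDBC/ORM query.
        return Collections.emptyList();
    }
}
```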
In keeping with our “let’s be more aware of what’s doing what” approach, another good thing we’ve done is split the TMS API out of the same container as our main service. Originally, the TMS API was just a WAR deployed in the same container as our main search API. Now it’s its own service in its own container, which means that if anything goes sideways with 1 of the services, it won’t take the other down with it.
The TMS API is a lot more stable after these recent changes, which is great. Now we can focus more on performance (it needs to be faster), features, and bug fixes – all the things that make the TMS API better, as opposed to just keeping it running.