I spend a lot of time at work on a read-only RESTful API. A little big-picture context: the company I work for, Digitalsmiths, builds the data delivery APIs for TMS (they provide most of the scheduling data you see when you hit "Guide" on your TV remote). This is powered by Digitalsmiths's own APIs. Basically, the TMS API is a front-end for our own API: calls come in, we build queries for our own API, run them, process the results, and return the appropriate data. When the API was first written, the data covered just the US and Canada. As TMS expanded to cover other countries, that data grew, and with that growth has come some growing pains.

There are a few extra quirks to this API. For starters, customers use this data via our main API, so while we can add fields to our responses, removing data is a no-go and major changes are a big concern. We get data dumps in XML files, and a whole other group of developers handles ingesting that data and importing it into our cores. So we don't control the data source, and we can't modify it too severely for fear of breaking other customers' apps. It's an API, so removing query parameters is a very bad thing to do to other people. And we're writing this on behalf of another customer, so we can't just add any parameter that seems like a good idea at the time.

This is still an ongoing effort for us, so this blog post won't cover an entire solution, just our first steps at tuning and what we've learned at the beginning of the process.
Just how many episodes of The Simpsons are there?
One of the first things that came up was that our call to get all episodes of a series was blowing through the entire heap. That's pretty bad for everybody, so the first thing we had to do was fix that, and quickly. The way this call worked is that we would get all versions of all episodes from our data set, then process the whole lot and build a list of specific episodes to return. That way, if there was a gap in whatever your preferred language is, we could fill it with a version in a different language, so you at least know an episode exists, just not in your language. As we expanded the amount of data we ingested, the number of versions of episodes for longer-running series, like The Simpsons, grew too big for that to work, so we had to completely re-think that call. All of this was exacerbated by us separating program images from program records and then joining on those images when we need them. That gave us some great internal improvements for finding images, at the cost of having a lot more documents in memory for some of these API calls. The main thing we needed to do was limit the number of results we were processing at one time.
Our first attempt at this was to limit our results to unique TMS root IDs. Root IDs are unique per actual program, and link together all the different versions of things like episodes of a series. Thus, there's one root ID for episode 4 of season 3 of The Simpsons, even though every translation of that episode has its own unique ID. That would allow us to limit our results to only one record per episode of the show. The problem was, we had to get the list of unique root IDs, then look up each root ID one at a time, then sort everything (oh yeah, these results are supposed to be sorted in ascending order). That was a terrific improvement in memory management. However, it was a pretty terrible hit in response time, adding over 10 seconds to the call. Making a relatively simple call (from an API user's perspective, at least) take over 10 seconds longer is not acceptable, so we needed to re-visit the drawing board.
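To make the shape of that first attempt concrete, here's a minimal sketch of the pattern we ended up with. The names here (MetadataStore, findRootIds, findBestVersion, EpisodeVersion) are hypothetical stand-ins for our internal code, not the real thing; what matters is the structure: one query for the ID list, one more query per root ID, and a full sort at the end.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of our first attempt: one lookup per TMS root ID.
// MetadataStore and EpisodeVersion stand in for our internal data-access types.
public class RootIdLookupSketch {

    interface MetadataStore {
        List<String> findRootIds(String seriesId);
        EpisodeVersion findBestVersion(String rootId, String preferredLanguage);
    }

    static class EpisodeVersion {
        String rootId;
        String language;
        int airYear;
    }

    static List<EpisodeVersion> episodesFor(MetadataStore store, String seriesId, String lang) {
        // One query just to learn which root IDs exist for this series.
        List<String> rootIds = store.findRootIds(seriesId);

        List<EpisodeVersion> episodes = new ArrayList<>();
        for (String rootId : rootIds) {
            // One additional query per episode: memory stays flat, but a
            // long-running series means hundreds of extra round trips.
            episodes.add(store.findBestVersion(rootId, lang));
        }

        // Everything still has to be sorted at the end, since the per-ID
        // lookups come back in whatever order the ID list was in.
        episodes.sort(Comparator.comparingInt((EpisodeVersion e) -> e.airYear));
        return episodes;
    }
}
```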
One of our developers had the idea of grouping these episodes by the year in which they first aired. This helps jump-start the sorting, since we're dealing with smaller result sets at a time; plus, by querying in ascending order of release years, we can sort each batch, append subsequent sorted batches, and know that everything's in the proper order. We can also use this batching to limit the number of results we're getting from our data set, controlling the memory we're using, which solves our original problem. So now the process looks more like this: we get the list of years in which episodes of that show aired. This listing of years also includes how many episode versions aired in each year, so we can control our batches based on how many episodes will be returned when we query for those episodes later.
We use this information to build batches of years for which we'll pull back all episode versions of this show, just like before. The difference here is that we're not doing this all at once; we're doing it in controlled chunks. This gives us much better memory performance, without the corresponding sacrifice in response time we saw when we limited ourselves to just the one version of each episode we wanted to return.
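A rough sketch of that batching flow is below, again with hypothetical names (yearCounts, fetchVersionsForYears, and the batch-size cap are illustrative, not our actual code or numbers). The idea is simply to walk the years in ascending order, close out a batch before it would exceed the cap on episode versions, and sort each batch before appending it to the output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical sketch of batching episode lookups by first-air year.
// yearCounts maps air year -> number of episode versions in that year,
// and fetchVersionsForYears stands in for the real data-set query.
public class YearBatchingSketch {

    static final int MAX_VERSIONS_PER_BATCH = 500; // illustrative cap, not our real number

    interface MetadataStore {
        // Years in ascending order, with the count of episode versions per year.
        SortedMap<Integer, Integer> yearCounts(String seriesId);
        List<EpisodeVersion> fetchVersionsForYears(String seriesId, List<Integer> years);
    }

    static class EpisodeVersion {
        int seasonNumber;
        int episodeNumber;
    }

    static List<EpisodeVersion> episodesFor(MetadataStore store, String seriesId) {
        List<EpisodeVersion> result = new ArrayList<>();
        List<Integer> batch = new ArrayList<>();
        int batchSize = 0;

        for (Map.Entry<Integer, Integer> yearCount : store.yearCounts(seriesId).entrySet()) {
            // Close out the current batch before it would blow past the cap.
            if (batchSize > 0 && batchSize + yearCount.getValue() > MAX_VERSIONS_PER_BATCH) {
                result.addAll(sortedBatch(store, seriesId, batch));
                batch = new ArrayList<>();
                batchSize = 0;
            }
            batch.add(yearCount.getKey());
            batchSize += yearCount.getValue();
        }
        if (!batch.isEmpty()) {
            result.addAll(sortedBatch(store, seriesId, batch));
        }
        // Because years are processed in ascending order and each batch is
        // sorted before being appended, the whole list comes out ordered
        // without a global sort over every version.
        return result;
    }

    static List<EpisodeVersion> sortedBatch(MetadataStore store, String seriesId, List<Integer> years) {
        List<EpisodeVersion> versions = store.fetchVersionsForYears(seriesId, years);
        // Season/episode is an illustrative sort key; the real key is whatever
        // ascending order the API promises its callers.
        versions.sort(Comparator
                .comparingInt((EpisodeVersion e) -> e.seasonNumber)
                .thenComparingInt(e -> e.episodeNumber));
        return versions;
    }
}
```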
There's also another thing we've done to help reduce our memory footprint. Under the original process, we looped through the results from our data set and ultimately built an object to return once we found the "best" version of each episode from our result set. At no point in this process were we cleaning up stuff we were done with. That means there were two objects in memory for every result we returned, one from our data set and another object, holding some or all of that data, that ultimately got sent back to the caller, plus another object for every version of the episode we passed over while looking for the preferred version based on the query parameters. In other words, we used a lot of memory, and we weren't cleaning up after ourselves once we were done with the data. That needed to change as well. Now we use Java's Iterator to walk our data set results and remove records once we're done with them. We're no longer holding on to objects longer than we absolutely have to, which has helped leave heap space for other calls occurring at the same time, as opposed to crashing our rather large servers every time someone asked for the episodes of The Simpsons.
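The Iterator change itself is tiny, but it's the difference between holding every source record until the call finishes and letting each one become garbage-collectable as soon as we've copied what we need out of it. Here's a minimal sketch, with ResultDoc and toResponseEpisode standing in for our real data-set and response types:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal sketch of trimming the working set as we go.
// ResultDoc and ResponseEpisode stand in for our real data-set and response types.
public class IteratorCleanupSketch {

    static class ResultDoc { /* fields from the data set */ }
    static class ResponseEpisode { /* only the fields we actually return */ }

    static ResponseEpisode toResponseEpisode(ResultDoc doc) {
        return new ResponseEpisode(); // copy only what the caller needs
    }

    static List<ResponseEpisode> buildResponse(List<ResultDoc> dataSetResults) {
        List<ResponseEpisode> response = new ArrayList<>();

        Iterator<ResultDoc> it = dataSetResults.iterator();
        while (it.hasNext()) {
            ResultDoc doc = it.next();
            response.add(toResponseEpisode(doc));
            // Drop the source record as soon as we've built the response object,
            // instead of keeping both copies alive for the rest of the call.
            it.remove();
        }
        return response;
    }
}
```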
Other low-hanging fruit we picked while we were at it
Now that we're looking very closely at how the code performs and just what it's doing, some other maintenance work led us to a couple of other very small changes that helped clean things up and reduce the amount of thinking our code has to do. For example, when we broke our image metadata out into its own collection, we had to re-work a lot of the API code in pretty short order. In the interest of time, and of introducing as few bugs as possible, I kept the changes as limited as possible, just changing the field name to whatever our fancy-schmancy behind-the-scenes join between program records and images expected. That worked for getting the changes deployed quickly and fairly painlessly, but we didn't stop and re-think just what that meant in terms of performance and efficiency, leading to problems like the one described here (amongst others).
One such place in the code where that really slapped us in the face was our program images call. This call is pretty straightforward: return all the images for the given program. In other words, this new images core is precisely the core we should be calling. But remember when I said I did as little as possible to get the join to work? That habit came from the days when all our image data was part of the program record, so the first version of this call after breaking out the images was me just throwing a join in, because that worked. However, all this call returns is images, not any type of program metadata whatsoever. Really, the correct way to do this (read: what I wound up doing the second time, when I stopped to actually think about it) was to query the images data directly.
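For illustration, if those cores are Solr cores (the details of our data layer are an assumption on this page, not something the call requires), the fix amounts to something like the following with SolrJ: ask the images core for image documents by program ID instead of joining over from the program core. The core name, field names, URL, and ID below are all hypothetical.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Hedged illustration only: the core name ("images"), field names, URL,
// and program ID are hypothetical, not our actual schema.
public class DirectImageQuerySketch {

    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
            // Ask the images core directly for this program's images,
            // rather than joining from the program core and dragging
            // program metadata along for the ride.
            SolrQuery query = new SolrQuery("programRootId:EXAMPLE_ROOT_ID");
            query.setFields("uri", "category", "width", "height");
            query.setRows(100);

            QueryResponse response = solr.query("images", query);
            for (SolrDocument image : response.getResults()) {
                System.out.println(image.getFieldValue("uri"));
            }
        }
    }
}
```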
But wait, there’s plenty more…but it’s probably better for another blog post
These limited changes by no means solve our problems. There's a lot of work to be done. Specifically, there's a lot of work we haven't done yet, and probably a lot we haven't even figured out we need to do. I have a working list of stuff that needs doing, and it grows just about every time I start looking at the code. Since we haven't really done, or even found, all that other work yet, it's probably better saved for another blog post. For now, I just wanted to document what we're finding and going through, in the hopes that it triggers something in other people who work on APIs and helps them avoid these issues before they manifest. So far, here are the things we already know we need to look at going forward:
- Results from our data set are fairly large. Given that this issue was caused by overwhelming heap space on the server where both our main product and API are running, we need to be mindful of just what effect the queries we're making have on the server. That means going through every API call and making sure we're only returning the records we need, and only the fields in those records that we need.
- Right now, we just know that some calls are slow. We have a hard time determining what exactly was called; the issue may be the query as a whole, but it'd be nice if we could see the exact parameters and mimic the specific calls that are causing problems.
- We need to be faster about finding images. Once images stopped being part of our program data, that added a new set of joins to our queries. It's working right now, but it could clearly be faster. We need to look into ways to speed that join up, to make it almost as fast as pulling back additional data from our program records. It'd also be really nice if we could keep the number of images we pull back to a minimum; the more efficient we can be with our memory, the better.
We're just starting on our quest to turn the TMS API into a lean, mean, super-fast, metadata-returning machine. Sadly, we're still at the stage where we're reacting to problems as we find them instead of getting proactive, but we've at least moved out of crisis, "the server is crashing" mode. It's a nice start, and I'm hopeful that it'll one day be successful.