YPlan Cinema Club: Handling Responses From Multiple APIs


YPlan Cinema Club is a utility app for cinema lovers that aims to provide accurate information about films and cinemas. You can also use it to check what is currently playing in cinemas near you or buy a voucher.

More information about the app can be found in this excellent post by Ben Dunn: "The new app for cinema lovers"

Please note that the above post describes the first version of the app; if you want to see which features are currently available, check it out on the App Store (iOS) or Google Play (Android).

In this blog post, I wanted to talk about something different - the internals. On the backend side, we integrate multiple APIs to get information about films, cinemas, and showtimes. We gather the data by running multiple background jobs a few times a day, each of which queries an external API. The film data is represented differently in each API, and merging it together so that it can be stored consistently in our database is the main challenge.

Prototype

In the beginning, we received access to one showtimes API provider. Since we did not want the client to call that API directly, we constructed our own (simplified) response format. The client would then query our internal API living on the cinemaclub.co server rather than the external URL directly. By doing that, we had control over the structure (e.g. we wanted to support a subset of JSON fields with our own identifiers) and availability (our server could cache the data) of the response. Another benefit was that we abstracted the external API away and could replace it with another one without the client noticing.

The first (hackathon) version was a simple proxy that made external API calls and formatted the response to match our JSON format.

The second version ran a background job and cached the response in a large JSON blob in the database, along with the last_updated_time and region_id:

json
region_id
last_updated_time
json_blob

Every time the client made a request, that JSON blob was fetched from the database and filtered using the user's location and a 40 km radius to produce a smaller response that was sent over the network.

A new row in the json table was added infrequently (a few times a day), and when we served the response we would pick the one with the highest last_updated_time. Also, if there was a problem with the current data, we could roll it back to the previous day by removing the last row.

The json_blob itself stored all details we needed to serve responses to all clients in a given region_id (many fields omitted for brevity):

{
  "films": [
    {
      "id": 123,
      "name": "Super Film",
      "description": "..."
    },
    ...
  ],
  "showtimes": [
    {
      "film_id": 123,
      "cinema_id": 1,
      "data": "2016-07-16",
      "time": "20:30"
    },
    ...
  ],
  "cinemas": [
    {
      "id": 1
      "name": "Cinema Name",
    },
    ...
  ]
}
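
Putting the above together, serving a request could be sketched roughly as below. This is illustrative only: the lat/lon fields on cinemas and the haversine helper are assumptions, not our exact implementation.

import json
from math import asin, cos, radians, sin, sqrt

RADIUS_KM = 40  # filtering radius mentioned above

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, in kilometres.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def build_response(json_blob, user_lat, user_lon):
    # Filter the cached region blob down to cinemas near the user,
    # then keep only the showtimes and films that reference them.
    data = json.loads(json_blob)
    cinemas = [c for c in data["cinemas"]
               if haversine_km(user_lat, user_lon, c["lat"], c["lon"]) <= RADIUS_KM]
    cinema_ids = {c["id"] for c in cinemas}
    showtimes = [s for s in data["showtimes"] if s["cinema_id"] in cinema_ids]
    film_ids = {s["film_id"] for s in showtimes}
    films = [f for f in data["films"] if f["id"] in film_ids]
    return {"films": films, "showtimes": showtimes, "cinemas": cinemas}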

The next logical step was to move away from one large response and store the information about films, cinemas, and showtimes in the database. We wanted the ability to query the data to see what it looks like (e.g. we could analyse it using a visual tool such as Looker), and to avoid the over-fetching during client requests that was apparent in the JSON blob solution.

The background task would process the JSON objects from the external API and store them in our database; e.g. a simplified film can be represented as:

film
id
external_api_id
imdb_id
title
description
poster_image_url
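
As a rough sketch (assuming a Django-style ORM, which this post does not prescribe), the simplified film table could be modelled along these lines:

from django.db import models

class Film(models.Model):
    # IDs used to match the film against the external API and IMDb.
    external_api_id = models.CharField(max_length=64)
    imdb_id = models.CharField(max_length=16, null=True, blank=True)
    title = models.CharField(max_length=255)
    description = models.TextField(blank=True)
    poster_image_url = models.URLField(blank=True)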

Any missing metadata on the existing objects (e.g. IMDb rating, photos) was filled in by background jobs that used other APIs.

Adding second external API

The above approach works well as long as we have only one source of films (a single external_api_id).

The problem appears when we want to plug in other films and showtimes APIs. Most of the films share a common ID (imdb_id in our case) - when this happens, we want to merge the film fields by prioritising one API over another (at the individual field level).

Some of the films, however, do not have the imdb_id - in this case we prefer to create a new film, but when it later gains the imdb_id we would like to merge it (if there is an existing film with the same imdb_id).

Simply updating the objects as the responses arrive would not work well, as we do not have all the information needed for field prioritisation (e.g. we may want to set a field to the value that appears in the majority of the external API responses).

We wanted to create something that separates updating films from capturing the data from external APIs.

Using API Responses

How did we solve this problem?

Previously: One source of films.

Without API Responses

Now: The flow is divided into two steps. Multiple sources of films:

With API Responses

1. Data capture

Capturing raw JSON objects from external APIs into the apiresponse table.

apiresponse
api_name
api_film_id
json_response
film_id

There's a unique constraint on (api_name, api_film_id).

NOTE: After the capture phase, the film_id field is NULL for new rows. For rows that already exist, we only update json_response with the new value.

The json_response looks different depending on the API (api_name).
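
As a sketch of what this capture step could look like (assuming a Django-style ApiResponse model matching the table above, and that the provider's film ID lives under an "id" key - both are assumptions for illustration):

def capture_films(api_name, raw_films):
    # Store or refresh the raw JSON for each film returned by one external API.
    for raw_film in raw_films:
        ApiResponse.objects.update_or_create(
            api_name=api_name,
            api_film_id=raw_film["id"],  # assumed location of the provider's film ID
            defaults={"json_response": raw_film},
        )
        # film_id is deliberately not touched here: it stays NULL for new rows
        # and keeps its value for existing ones; only the processing step sets it.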

2. Data processing

We use the captured JSON objects to construct film objects and update their showtimes accordingly (more specifically, the film_id on showtimes when a film was removed because it was merged into another film). The processing itself consists of two steps. Firstly, we gather all the API Responses that we believe represent the same film. In a naive implementation this can be done by aggregating API Responses by imdb_id (extracted from each json_response with a separate function per API) and, if it's not present, by (api_name, api_film_id).

It's also possible to improve the collection of API Responses by using film_id and other IDs that we can find in json_response (e.g. the TMDB ID).
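
In its naive form, the grouping step could look roughly like this (get_imdb_id stands for the per-API extraction function mentioned above and is hypothetical):

from collections import defaultdict

def group_api_responses(api_responses):
    # Group raw responses that we believe describe the same film.
    groups = defaultdict(list)
    for response in api_responses:
        imdb_id = get_imdb_id(response.api_name, response.json_response)  # hypothetical extractor
        # Prefer the shared IMDb ID; fall back to the per-API identity.
        key = imdb_id or (response.api_name, response.api_film_id)
        groups[key].append(response)
    return groups.values()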

The second part is about using these API Responses to construct a single film object. We have a function for each API that returns a common dictionary with the field values that we would insert if that particular API were the only one. Using the simplified film model mentioned before:

{
  "imdb_id": 1234567,
  "title": "Film Title",
  "description": "...",
  "poster_image_url": "https://example.com/image.jpg"
}

NOTE: All the fields have to be normalised to our format across different external APIs.
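
For instance, the per-API function could be a small mapper like the one below (the keys inside json_response are made up for illustration and will differ per provider):

def film_dict_from_provider_a(api_response):
    # Translate one provider's raw film JSON into our common field names.
    raw = api_response.json_response
    return {
        "imdb_id": raw.get("imdbId"),
        "title": raw["title"].strip(),
        "description": raw.get("synopsis", ""),
        "poster_image_url": raw.get("posterUrl"),
    }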

Finally, we merge these dictionaries with a separate function for each field and insert or update that film in the database.

For example, this is how the functions could correspond to the above fields:

{
  "imdb_id": first_most_common,
  "title": shortest,
  "description": longest,
  "poster_image_url": best_aspect_ratio
}

def first_most_common(values):
    # Return the most common value; ties are broken by earliest appearance.
    def by_key(x):
        return values.count(x), -values.index(x)
    return max(set(values), key=by_key)

NOTE: first_most_common is useful in the case of draws - we prefer to prioritise the values that appear earlier in the API Responses list.
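
A minimal merge step, assuming the field-to-function mapping above is available as a plain Python dict (here called FIELD_MERGERS, an assumed name), could look like this:

def merge_film_dicts(film_dicts, field_mergers):
    # Combine per-API film dictionaries field by field, skipping missing values.
    merged = {}
    for field, merge in field_mergers.items():
        values = [d[field] for d in film_dicts if d.get(field) is not None]
        if values:
            merged[field] = merge(values)
    return merged

# e.g. merge_film_dicts(dicts, FIELD_MERGERS) returns a single dictionary
# ready to be inserted or updated as the film row.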

Metadata information

Metadata tasks run independently from the film creation flow (we want up-to-date information, which means these tasks should run frequently). Each task queries the film using a unique ID such as the IMDb or TMDB ID and updates the fields that it is responsible for. It is possible that multiple tasks return information about the same detail of a film (e.g. the poster image); in this case, we decide which task is responsible for it. It is easy to see that this metadata processing could be improved to use the same flow that we use for films.
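
For example, a metadata task in this spirit might look roughly like this (the imdb_rating field and the fetch_imdb_rating helper are hypothetical, used purely for illustration):

def update_imdb_ratings():
    # Refresh the rating for every film that has an IMDb ID.
    for film in Film.objects.exclude(imdb_id=None):
        film.imdb_rating = fetch_imdb_rating(film.imdb_id)  # hypothetical helper
        film.save(update_fields=["imdb_rating"])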

Final words

In conclusion, I believe that by creating a two-step flow we have better control over how films are created in the database. This approach allows us to analyse the raw API responses and the database objects separately, which simplifies error handling (we know when the data is missing on the external API). What's more, by decoupling film construction from data capture, we can run the processing task more frequently (which is important when we make changes to it).

kthxbai

Trains Photo credit: tokyoform via Visual hunt / CC BY-NC-ND