Extended retry and traceback strategies with Python

In this post we present a number of practical, generic solutions to concrete problems that a Python developer may encounter in everyday work: solutions too short to deserve a dedicated package, but at the same time too specific to find a place in the standard library.

Retry strategy on HTTP requests

Any server exposing an HTTP REST API worthy of the name makes proper use of status codes to inform the client of the outcome of each request. Of particular interest to an API client is the handling of errors in the 4xx family, i.e. those where the server reports a problem with the request made by the client.

In these cases the server should specify whether the error condition is permanent or only temporary: take, for example, the code 429 Too Many Requests, typically used to tell the client that it has exceeded a quota or a request rate. The request has failed, but it will succeed if repeated after the quota has been renewed.

Let’s have a look at a functional but naive solution to the problem of setting up a retry, using the well-known requests library. For the purposes of this discussion, we assume that the request runs asynchronously with respect to the main thread (for example via a distributed job mechanism such as Celery in a web context, or a task executed by a timer in a desktop context) and that a retry attempt can be triggered by raising an exception with these semantics:

import requests

r = requests.get('https://api.github.com/events')
if r.status_code == 429:
    # Assume the quota is renewed within a minute.
    raise RetryException(interval=60)
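
As a purely illustrative sketch (the definition below is an assumption, not part of the original code; in practice RetryException would be tied to your task infrastructure), such an exception could be as simple as:

class RetryException(Exception):
    """Signal to the task runner that the job should be retried later."""

    def __init__(self, interval):
        super(RetryException, self).__init__(
            "retry in {} seconds".format(interval))
        # Seconds to wait before the next attempt
        self.interval = interval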

This code works, it is very simple and solves the presented problem. However, reality is more complicated: for example, it is common practice for the server to indicate, in the body of the response, how long to wait before the quota is renewed. We don’t give up, and we update our code:

r = requests.get('https://api.github.com/events')
if r.status_code == 429:
    response_body = json.loads(r.text)
    try:
        retry_interval = response_body['retry_interval']
    except KeyError:
        # Assume the quota is renewed within a minute.
        retry_interval = 60
    raise RetryException(retry_interval)

Unfortunately, this is still not enough. There is also the possibility that the request will time out, another case in which a retry strategy can be applied:

try:
    r = requests.get('https://api.github.com/events')
except requests.exceptions.ConnectTimeout as exc:
    # Wait 2 minutes in case of a timeout.
    raise RetryException(120)

if r.status_code == 429:
    response_body = json.loads(r.text)
    try:
        retry_interval = response_body['retry_interval']
    except KeyError:
        # Assume the quota is renewed within a minute.
        retry_interval = 60
    raise RetryException(retry_interval)

The logic is getting complicated (there are already two different points where the code raises the retry exception), without considering that timeouts on different endpoints could require different retry intervals; furthermore, we have not considered that there are other 4xx errors we might wish to handle with a retry strategy (for example, 409 Conflict). How can we map this logic onto a more effective structure?

Context manager

The Python standard library comes to our aid with context managers which, through the with statement, allow us to execute code within a controlled context. To better understand the tool, consider the following code as an example:

import time
from contextlib import contextmanager

@contextmanager
def time_logger():
    """Print the time the controlled code takes to run."""

    # Prologue: executed when entering the context
    start = time.time()

    # Hand control over to the controlled code
    yield

    # Epilogue: executed when leaving the context
    end = time.time()
    print("elapsed time: {:.2f}s".format(end - start))

with time_logger():
    s = sum(x for x in xrange(10000000))

The output will be something like this:

elapsed time: 0.79s

As the example shows, the context manager essentially consists of three parts: the prologue, the passage of control (the yield statement) and the epilogue; any state set up by the prologue is preserved while the controlled code runs, and the epilogue is executed when it finishes.

Retry strategy

In addition to letting us build a primitive but effective profiler, context managers can be used for our purposes to collect the retry logic once and for all, and to parameterise it based on the types of errors and timeouts. This is the interface we expect to use:

# Map each exception type to the interval, in seconds, to wait
# before the next attempt.
retry_delays = {
    client_api.TimeoutException: 120,
    client_api.TooManyRequestsException: "retry_interval",
}

with retry_strategy(retry_delays):
    client_api.do_request()

In this scenario, suppose we have implemented a client for the REST API we want to interface with, and that it defines its own exceptions. One of them in particular, TooManyRequestsException, makes the retry time specified by the server accessible as an attribute (the value probably came from the body of the response, but it is convenient to expose it to the calling code).
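
The client's exception classes are not shown; as a rough sketch of what they are assumed to look like (only the names come from the mapping above, everything else is an assumption):

class TimeoutException(Exception):
    """Raised when a request to the API times out."""

class TooManyRequestsException(Exception):
    """Raised on a 429 response from the server."""

    def __init__(self, retry_interval):
        super(TooManyRequestsException, self).__init__(
            "quota exceeded, retry in {} seconds".format(retry_interval))
        # Retry time indicated by the server in the response body
        self.retry_interval = retry_interval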

Let’s see how to implement the retry strategy which, given these premises, is simple and straightforward:

@contextmanager
def retry_strategy(delays):
    """Define a retry strategy for the controlled code, making the retry
    intervals depend on the type of exception.

    :param delays: mapping that associates each handled exception type
        with either an integer (seconds to wait before retrying) or a
        string (name of the attribute of the exception object that holds
        the number of seconds to wait before retrying).
    """
    try:
        yield
    except tuple(delays) as exc:

        # Find the delay associated with the exception that was raised
        for handled_exc in delays:
            if isinstance(exc, handled_exc):
                delay = delays[handled_exc]

        if isinstance(delay, int):
            # The interval to wait is specified by the client
            retry_interval = delay
        else:
            # The interval is stored as an attribute of the exception object
            retry_interval = getattr(exc, delay)

        # Trigger the retry attempt.
        raise RetryException(retry_interval)

The control step (the yield) is wrapped in a try/except block, which intercepts the exceptions defined as keys of the delays dictionary.

In the next block, the time interval to wait before the next attempt is retrieved. Finally, the context manager triggers the retry attempt.

The more complex parts of the logic are thus “hidden” and written, once and for all, inside the context manager, while the client only has to define which errors to retry and for how long to wait, and execute the request within the context.

Further developments

The structure defined in this way lends itself well as a basis for further developments, which will not be addressed in this post and are left as an exercise to the reader. For example, a feature you might expect from a retry strategy is the ability to define a waiting time that grows exponentially on timeout errors, up to a maximum number of attempts.
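
As a starting point for that exercise, here is a rough sketch of such a variant. It assumes the current attempt number is supplied by the caller (for instance by the task framework's retry counter); the function name and its parameters are illustrative and not part of the original code.

from contextlib import contextmanager

@contextmanager
def backoff_retry_strategy(exception_types, attempt, base_delay=2, max_attempts=5):
    """Retry with an exponentially growing delay, giving up after max_attempts."""
    try:
        yield
    except tuple(exception_types):
        if attempt >= max_attempts:
            # Too many attempts: let the original exception propagate.
            raise
        # Wait base_delay * 2 ** attempt seconds (2, 4, 8, ... when attempt starts at 0).
        raise RetryException(base_delay * 2 ** attempt)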

Deluxe exception hook

One of the indisputable advantages of error handling with exceptions is that it is impossible to forget to check an error status: if an error occurs in a Python program, an exception propagates along the program stack and, if it is not handled and reaches the bottom of the stack, the program terminates automatically, showing a traceback.

The traceback is a first step towards understanding what went wrong and fixing it, but often knowing only the sequence of stack frames that led to the error is not enough to reproduce a crash.

For this reason, many frameworks and web microframeworks include a custom Python exception handler that shows, through an HTML interface, both the stack frames and the values of the local variables they reference. In many cases it is also possible to interact with the state of the server at the time of the crash through an interactive interpreter.

The possibility of having an enriched traceback of this kind is a luxury that backend programmers have learned to take for granted, but for various reasons it is not equally widespread in desktop or embedded contexts. That is a real shame because, given the nature of this sector, a traceback is often the only way to understand how to reproduce a crash that happened on a computer on the other side of the world. Let’s see what can be done to fill this gap.

Exception hook

Python’s behaviour when an exception reaches the bottom of the stack is implemented by the sys.excepthook function, which is invoked before the interpreter terminates.

An exception hook can be defined simply by assigning a function of our own to sys.excepthook. The interpreter will call it automatically, passing the type, value and traceback of the exception. Here is a minimal example of an exception hook, along with the output of a program that uses it:

import sys

def exception_hook(e_type, e_value, e_traceback):
    print("Exception hook")
    print("Type: {}".format(e_type))
    print("Value: {}".format(e_value))
    print("Traceback: {}".format(e_traceback))

# Install the custom exception hook
sys.excepthook = exception_hook

# Trigger an unhandled exception
1 / 0



Exception hook
Type: <type 'exceptions.ZeroDivisionError'>
Value: integer division or modulo by zero
Traceback: <traceback object at 0x...>

From the output, which differs from the one familiar to Python programmers, we can tell that the default exception hook was not called, since it was replaced by our custom version. The first two arguments are the type and value of the exception.

The third argument, e_traceback, is the one that interests us most, as it contains information about the stack frames crossed by the exception. We would like to use it as the basis for our own exception hook which, in addition to the information already reported by the default one, includes the values of all the local variables.

A more complete exception hook

The traceback object exposes the stack frames crossed by the exception. First we write a utility function that accumulates them all:

def extract_stack(tb):
    """Return the stack frames crossed by the exception, from the outermost
    frame down to the one where the exception was raised."""
    stack = []

    if not tb:
        return stack

    # Walk the traceback down to the frame where the exception was raised
    while tb.tb_next:
        tb = tb.tb_next

    f = tb.tb_frame

    # Walk back up through the callers, accumulating the frames
    while f:
        stack.append(f)
        f = f.f_back

    # Order the frames from the outermost to the innermost
    stack.reverse()
    return stack

Once this is done, it is possible to visit the frames in order and inspect their f_locals member, which contains the dictionary mapping the names of the local variables to their values:

import os
import traceback
from StringIO import StringIO

def exception_hook(e_type, e_value, e_traceback):

    # Collect the stack frames
    stack = extract_stack(e_traceback)

    here = os.path.split(__file__)[0]

    # Debug information we are going to fill in
    info = []

    for frame in stack:

        # Get the path of the current file, relative to this module
        fn = os.path.relpath(frame.f_code.co_filename, here)

        # Add module name, function and line of code
        # to the debug information
        info.append("  {}, {} +{}".format(frame.f_code.co_name,
                                          fn, frame.f_lineno))

        # Save the values of the local variables
        for key, value in frame.f_locals.iteritems():
            info.append("   {} = {}".format(key, value))

We also reproduce the behaviour of the original exception hook, which consists of printing the traceback of the exception.

    s_traceback = StringIO()
    traceback.print_exception(e_type, e_value, e_traceback, file=s_traceback)
    info.append(s_traceback.getvalue())

    # Print the collected information
    print("\n".join(info))

Let’s see how our exception hook behaves on a test program:

# Install the new exception hook
sys.excepthook = exception_hook

def dbz(d):
    c = 1
    return c / d

def run():
    d = 0
    dbz(d)

run()

In addition to several uninteresting values defined in the main module, the output will also contain this section, which allows us not only to identify the cause of the problem, but also to see both values involved in the division:

  <module>, exc.py +88
   run = <function run at 0x...>
   StringIO = StringIO.StringIO
   [...]
  run, exc.py +86
   d = 0
  dbz, exc.py +82
   c = 1
   d = 0

At this point it is natural to ask what happens if the exception hook code raises an exception in turn. In that case the interpreter keeps both exceptions and reports them through the default exception hook; still, it goes without saying that when writing code of this kind it makes sense to apply defensive programming practices, so that the gathered information can always be used for reporting purposes.
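
As an illustration of that defensive approach (a sketch, not part of the original code), the custom hook can be wrapped so that any failure inside it falls back to the interpreter's default behaviour, preserved in sys.__excepthook__:

def safe_exception_hook(e_type, e_value, e_traceback):
    try:
        exception_hook(e_type, e_value, e_traceback)
    except Exception:
        # If our hook itself fails, fall back to the default hook so that
        # at least the original traceback is not lost.
        sys.__excepthook__(e_type, e_value, e_traceback)

sys.excepthook = safe_exception_hook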

Finally, a step I consider mandatory if you take this route is to add to the exception hook the logic needed to retrieve information specific to your application: for example, the name of the currently open project, the tool in use, the mouse position, the last reading from the serial port, and so on. Any element of your domain that helps to identify a crash should be added, iteratively, to the reporting system.
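
For example, a hypothetical helper, here called get_additional_data() to match the Sentry snippet further down, could look like this; the fields shown are purely illustrative:

import platform

def get_additional_data():
    """Collect application-specific context to attach to a crash report."""
    data = {
        # Generic environment information, always available
        "platform": platform.platform(),
        "python_version": platform.python_version(),
    }
    # Domain-specific fields go here: the project currently open, the tool
    # in use, the last reading from the serial port, and so on.
    return data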

Crash reporting

With little effort, we have created an exception handler that provides much more information than the default one.

The next logical step is to collect this information and make sure programmers can consult it in order to act on the causes. Let’s not forget that the context we assume we are working in is one where the crash environment is a desktop machine somewhere in the world, to which programmers have no direct access.

A first solution is simply to save the exception report locally. When an error occurs, the user can be asked to e-mail the files contained in the directory chosen as the destination; or, on a desktop system, we could automatically open the user’s e-mail client, pre-filling the body of the message with the crash information.
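
A minimal sketch of this idea could look as follows; the support address and the file naming are assumptions, and webbrowser simply hands the mailto: link over to whatever client the system has configured:

import os
import tempfile
import urllib
import webbrowser

def report_crash_locally(report_text):
    # Save the report to a file the user can attach or send later
    fd, path = tempfile.mkstemp(prefix="crash-", suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(report_text)

    # Open the user's e-mail client with the report as the message body
    webbrowser.open("mailto:support@example.com"
                    "?subject=Crash%20report&body=" + urllib.quote(report_text))
    return path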

A natural extension of these solutions is to send the reports to a reporting service exposed on the Internet. Such services are typically composed of two parts, a client and a server. The client is a module for the language used by the application, implementing an interface that is as simple as possible for sending reports. The server exposes one or more HTTP endpoints for receiving reports and, besides storing them, makes them listable and inspectable remotely.

At Develer we use Sentry for this purpose: it is very popular and easily integrated with a large number of platforms, besides being open source and installable on our own servers. The code we need on the client side is surprisingly simple:

from raven import Client
from raven.base import ClientState

def exception_hook(e_type, e_value, e_traceback):
    [...]

    # SENTRY_DSN is provided by Sentry and is an endpoint
    # specific to your project
    sentry_client = Client(SENTRY_DSN, timeout=SENTRY_TIMEOUT)

    # Process the exception. get_additional_data() can be implemented
    # following the guidelines of the previous exception hook, and must
    # include the information specific to your application's domain.
    sentry_client.captureException((e_type, e_value, e_traceback),
                                   extra=get_additional_data())

    if sentry_client.state.status != ClientState.ERROR:
        # If the report was successfully uploaded to Sentry, show a
        # message informing the user of the application error, and exit.
        pass
    else:
        # Otherwise, store the content of the traceback to a file anyway.
        pass

Conclusions

Both of the topics presented are inspired by concrete cases actually encountered on projects in production. The remarkable expressiveness of Python made it possible, in both cases, to create two small “recipes” that can be reused in other projects with very little abstraction effort.