Pay attention to zeros. If there is a zero, someone will divide by it.

Cem Kaner

When writing code, good test coverage is essential, both for the reliability of the code and for the developer's peace of mind.

There are tools (for example Nose) to measure the coverage of the codebase by counting the lines that are run during test execution. Excellent line coverage, however, does not necessarily imply equally good coverage of the code's functionality: the same statement can work correctly with certain data and fail with other, equally legitimate data. Some values also lend themselves more than others to generating errors: edge cases such as the limit values of an interval, the index of the last iteration of a loop, characters encoded in unexpected ways, zeros, and so on.
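As a minimal illustration of the gap between line coverage and functionality coverage, consider a hypothetical `average` helper: a single test with a non-empty list covers 100% of its lines, yet the equally legitimate edge case of an empty list still divides by zero.

```python
def average(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

result = average([2, 4, 6])
# 4.0 -- this single call covers every line of `average`

# The very same line of code fails on an empty list:
try:
    average([])
    failed = False
except ZeroDivisionError:
    failed = True
```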

To cover this kind of error effectively, it is easy to find yourself replicating entire blocks of code in tests, varying only minimal parts of them.

In this article, we will look at some of the tools offered by the Python ecosystem to manage this need elegantly (and “DRY”).

py.test parametrize

Pytest is a solid alternative to unittest, just a pip install away. The two main innovations introduced by pytest are fixtures and the parametrize decorator. The former manage the setup of a test in a more granular way than the classic setUp() method. In this blog post, however, we are mainly interested in the parametrize decorator, which lets us take a step up in abstraction when writing test cases, separating the test logic from the input data. We can then verify that the code behaves correctly with different edge cases, while avoiding any duplication of logic.

import pytest

@pytest.mark.parametrize('value_A,value_B', [
    # each element of this list provides values for the
    # arguments "value_A" and "value_B" of the test and
    # generates a stand-alone test case.
    ('first case', 1),
    ('second case', 2),
])
def test_func(value_A, value_B):
    result = func(value_A, value_B)
    # assertions

In the example, test_func will be run twice: first with value_A = 'first case', value_B = 1, and then with value_A = 'second case', value_B = 2.

During the execution of the tests, the various parameter sets are treated as independent test cases and, in case of failure, an identifier containing the supplied data lets the developer quickly trace the problematic case:

.F

==================================== FAILURES ===================================

____________________________ test_func[second case-2] ___________________________

Faker

Faker provides methods to effortlessly create plausible data for our tests.

from faker import Faker

fake = Faker()

fake.name()
# 'Wendy Lopez'

# Two different calls to `.name()` will give different results,
# which are randomly generated
fake.name()
# 'Nancy Oconnell'

fake.address()
# '88573 Hannah Track\nNew Rebeccatown, FM 61000'

The data is generated by Providers included in the library (a complete list is available in the documentation), but it is also possible to create custom ones.

from faker.generator import random
from faker.providers import BaseProvider

class UserProvider(BaseProvider):
    """Fake data provider for users."""

    users = (
        # some valid data for our application
        {
            'name': 'Foo Bar',
            'email': '',
            'description': fake.text(),
        },
        {
            'name': 'Bar Foo',
            'email': '',
            'description': fake.text(),
        },
    )

    @classmethod
    def user(cls):
        """Randomly select a user."""
        return random.choice(cls.users)

These can then be used by adding them to the global object the library is based on:

from faker import Faker

fake = Faker()
fake.add_provider(UserProvider)

fake.user()
# {
#    'description': 'Cum qui vitae debitis. Molestiae eum totam eos inventore odio.',
#    'email': '',
#    'name': 'Bar Foo'
# }

To understand the cases where Faker can come in handy, suppose for example that you want to test the correct creation of users in a database.

In this case, one possibility would be to recreate the database each time the test suite is run. However, creating a database is usually an operation that takes time, so it would be preferable to create it only the first time, perhaps using a dedicated command line option. The problem is that, if we used hardcoded data in the test case and there were some kind of constraint on the users (for example, a unique email), the test would fail if run twice on the same database. With Faker we can easily avoid these conflicts because, instead of explicit data, we have a function call that returns different data each time.
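The conflict is easy to reproduce with a sketch: here an in-memory SQLite database (a stand-in for the persistent database of the real scenario) with a unique constraint on the email, and a hardcoded user inserted twice, as would happen on the second run of the test suite.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # a persistent database in the real scenario
conn.execute('CREATE TABLE users (name TEXT, email TEXT UNIQUE)')

def create_user(name, email):
    """Insert a user; the UNIQUE constraint rejects duplicate emails."""
    conn.execute('INSERT INTO users VALUES (?, ?)', (name, email))

create_user('Foo Bar', 'foo@example.com')      # first run: fine
try:
    create_user('Foo Bar', 'foo@example.com')  # second run: same hardcoded data
    conflict = False
except sqlite3.IntegrityError:
    conflict = True
```

With Faker, the second call would instead receive a freshly generated name and email, and the insertion would succeed.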

In this case, however, we give up the reproducibility of the test: since Faker's values are chosen randomly, a value that reveals an error in the code may or may not be generated, so the execution of the test produces different results in an unpredictable way.
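The mechanism is the same as in the standard library's random module, on which Faker's generator is built: an unseeded generator draws unpredictable values, while a fixed seed makes the sequence reproducible (a stdlib sketch, not Faker's actual API):

```python
import random

names = ['Wendy Lopez', 'Nancy Oconnell', 'Foo Bar']

unseeded = random.Random().choice(names)    # may change on every execution
seeded_1 = random.Random(42).choice(names)  # fixed seed: always the same value
seeded_2 = random.Random(42).choice(names)
# seeded_1 == seeded_2, while unseeded is unpredictable
```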


Hypothesis

Hypothesis is a data generation engine. The programmer establishes the criteria with which the data must be generated, and the library takes care of producing examples that respect those criteria (the terminology used in this library is inspired by the scientific world: the data generated by Hypothesis are called “examples”, and we will also meet other keywords such as “given” and “assume”).

For example, if we want to test a function that takes integers, it is sufficient to apply the given decorator to the test and pass it the integers strategy. The documentation lists all the strategies included in the library.

from hypothesis import given
from hypothesis import strategies as st

@given(value_A=st.integers(), value_B=st.integers())
def test_my_function(value_A, value_B):
    ...
The test test_my_function takes two parameters, value_A and value_B. Hypothesis, through the given decorator, fills these parameters with valid data, according to the specified strategy.

The main advantage over Faker is that the test will be run numerous times, with combinations of the values value_A and value_B that are different each time. Hypothesis is also designed to look for edge cases that could hide errors. In our example we have not defined any lower or upper limit for the integers to be generated, so it is reasonable to expect that among the generated examples we will find, in addition to the simplest cases, zero and values large enough (in absolute value) to cause integer overflow in some representations.

These are some examples generated by the text strategy:

@given(word=st.text())
def test_my_function(word):
# 㤴
# 󗖝򯄏畧
# 񗋙጑㥷亠󔥙󇄵닄
# 򂳃
# φક񗋙뭪䰴
# Sφ镇ʻ䓎܆Ą
# ʥ?ËD
# ?S®?
# ®?WS
# ýÆ!깱)󧂏񕊮𽌘
# ý
# ;®僉
# ®ý
# ;ፍŵ򆩑

(yes, most of these characters don’t even display in my browser)

Delegating to an external library the task of imagining possible limit cases that could put our code in difficulty is a great way to find errors nobody had thought of, and at the same time to keep the test code lean.

Note that the number of test runs is not at the discretion of the programmer. Through the settings decorator it is possible to set a maximum number of examples to be generated

from hypothesis import settings

@settings(max_examples=100)
@given(value=st.integers())
def test_foo(value):
    ...
but this limit could still be exceeded if the test fails. This behaviour is due to another feature of Hypothesis: in case of failure, a test is repeated with increasingly elementary examples, in order to recreate (and provide in the report) the simplest example that guarantees a code failure.

In this case, for example, Hypothesis manages to find the limit for which the code actually fails:

def check_float(value):
    return value < 9999.99

@given(value=st.floats())
def test_check_float(value):
    assert check_float(value)

# Falsifying example: test_check_float(value=9999.99)
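The intuition behind this search for the simplest failing example can be sketched as a toy bisection over check_float (Hypothesis's real shrinker is far more general; this is only an illustration): starting from one passing and one failing value, keep moving towards the simplest value that still falsifies the property.

```python
def check_float(value):
    return value < 9999.99

def toy_shrink(failing, passing, predicate, steps=100):
    """Bisect between a passing and a failing value to approach
    the simplest value that still falsifies `predicate`."""
    for _ in range(steps):
        mid = (failing + passing) / 2
        if predicate(mid):
            passing = mid   # mid still passes: move the lower bound up
        else:
            failing = mid   # mid already fails: simplify the counterexample
    return failing

simplest = toy_shrink(failing=1e10, passing=0.0, predicate=check_float)
# simplest converges (numerically) to 9999.99, the reported falsifying example
```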

A slightly more realistic example can be this:

def convert_string(value):
    return value.encode('utf-8').decode('latin-1')

@given(value=st.text())
def test_check_string(value):
    assert convert_string(value) == value

# Falsifying example: test_check_string(value='\x80')

Hypothesis stores in its cache the values obtained from the “falsification” process and provides them as the very first examples in subsequent executions of the test, to let the developer immediately verify whether a previously revealed bug has been fixed or not. We therefore have reproducibility of the test for the examples that caused failures. To formalise this behaviour, and have it even in a non-local environment such as a continuous integration server, we can use the example decorator to specify a number of examples that will always be executed before the randomly generated ones.

from hypothesis import example

@given(value=st.text())
@example(value='\x80')
def test_check_string(value):
    assert convert_string(value) == value
example is also an excellent “bookmark” for those who will read the code in the future, as it highlights possible misleading cases that could be missed at first sight.

Hypothesis: creating personalised strategies

All this is very useful, but our tests often need more complex structures than a simple string. Hypothesis provides a number of tools to generate complex data at will.

To start, the data output by a strategy can be passed through a map or a filter.

# generates a list of integers and applies `sorted` to it
st.lists(st.integers()).map(sorted).example()
# [-5605, -174, -144, 23, 76, 114, 234, 4258638726192386892599]

# generates an integer, but filters out the cases in which
# the integer proposed by the strategy is not greater than 11
st.integers().filter(lambda x: x > 11).example()
# 236

Another possibility is to link multiple strategies, using flatmap.

In the example, the first call to st.integers determines the length of the lists generated by st.lists, placing an upper limit of 10 elements on them and excluding lists with a length of exactly 5 elements.

n_length_lists = st.integers(min_value=0, max_value=10).filter(
    lambda x: x != 5).flatmap(
    lambda n: st.lists(st.integers(), min_size=n, max_size=n))

# [[11058035345005582727749250403297998096], [219], [-170], [-5]]

For more complex operations, we can instead use the strategies.composite decorator, which allows us to obtain data from existing strategies, to modify them and to assemble them in a new strategy to be used in tests or as a brick for another custom strategy.

For example, to generate a valid payload for a web application, we could write something like the following code.

Suppose the payloads we want to generate include a number of mandatory and other optional fields. We then construct a payloads strategy, which first extracts the values for the mandatory fields, inserts them into a dictionary and, in a second phase, enriches this dictionary with a subset of the optional fields.

from hypothesis import assume

@st.composite
def payloads(draw):
    """Custom strategy to generate valid payloads for a web app."""

    # required fields
    payload = draw(st.fixed_dictionaries({
        'name': st.text(),
        'age': st.integers(min_value=0, max_value=150),
    }))

    # optional fields
    # note: `subdictionaries` is not a library function,
    #       we will write it shortly
    payload.update(draw(subdictionaries({
        'manufacturer': st.text(),
        'address': st.text(),
        'min': st.integers(min_value=0, max_value=10),
        'max': st.integers(min_value=0, max_value=10),
    })))

    if 'min' in payload and 'max' in payload:
        assume(payload['min'] <= payload['max'])

    return payload

In the example we also wanted to include assume, which imposes an additional rule on data creation and can be very useful.

All that remains is to define subdictionaries: a utility function, usable both as a stand-alone strategy and as a building block for other custom strategies.

Our subdictionaries is little more than a call to random.sample(), but by using the randoms strategy we let Hypothesis handle the random seed, and thus treat the custom strategy exactly like the library ones during the “falsification” process of failed test cases.

@st.composite
def subdictionaries(draw, complete):
    """Strategy to create a dictionary containing a subset of the input dictionary values."""

    length = draw(st.integers(min_value=0, max_value=len(complete)))
    # `sample` needs a sequence, not a dict view
    subset = draw(st.randoms()).sample(list(complete), length)
    return draw(st.fixed_dictionaries({k: complete[k] for k in subset}))
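Stripped of the Hypothesis plumbing, the heart of subdictionaries is an ordinary random.sample over the keys, as this stdlib-only sketch shows (the fixed seed stands in for the seed that Hypothesis manages for us):

```python
import random

complete = {'manufacturer': 'ACME', 'address': 'nowhere',
            'min': 0, 'max': 10}

rng = random.Random(0)                  # Hypothesis manages this seed for us
length = rng.randint(0, len(complete))  # how many keys to keep
subset = rng.sample(sorted(complete), length)  # `sample` wants a sequence
subdict = {k: complete[k] for k in subset}     # a sub-dictionary of `complete`
```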

Both functions take a draw argument as input, which is managed entirely by the given decorator. The payloads strategy is then used like this:

@given(payload=payloads())
def test_successful_call(payload):
    # {'address': 'd', 'age': 84, 'max': 2, 'name': ''}
    ...

The creation of customised strategies lends itself particularly well to testing the correct behaviour of the application, while verifying how our code behaves in the face of specific failures could become overly burdensome. We can, however, reuse the work done to write the custom strategy and alter the data provided by Hypothesis so as to cause the failures we want to verify.

@pytest.mark.parametrize('missing_field', ['name', 'age'])
def test_missing_values(missing_field):
    # obtain a single example from the Hypothesis strategy
    payload = payloads().example()
    del payload[missing_field]
    # call the web app
    # assertions about the errors returned

It is possible that, as the complexity and nesting of the strategies grow, data generation becomes slower, to the point of causing one of Hypothesis's internal health checks to fail:

hypothesis.errors.FailedHealthCheck: Data generation is extremely slow

However, if the complexity reached is necessary, we can suppress the check in question for the single tests that would otherwise risk random failures, through the settings decorator:

from hypothesis import HealthCheck
from hypothesis import settings

@settings(suppress_health_check=[HealthCheck.too_slow])
@given(payload=payloads())
def test_very_nested_payload(payload):
    ...


These are just some of the tools available for data-driven testing in Python, a constantly evolving ecosystem. Of pytest.parametrize we can say that it is a tool to keep in mind when writing tests, because it essentially helps us obtain more elegant code.

Faker is an interesting option that can be used to see varied data flow through our tests, but it does not add much, while Hypothesis is undoubtedly a more powerful and mature library. It must be said that writing strategies for Hypothesis is an activity that takes time, especially when the data to be generated consists of several nested parts; but all the tools needed to do it are available. Hypothesis is perhaps not suitable for a unit test written quickly while drafting code, but it is definitely useful for an in-depth analysis of one's own sources. As often happens in Test Driven Development, the design of tests helps to immediately write better quality code: Hypothesis encourages the developer to evaluate those borderline cases that sometimes end up being omitted instead.