A few of my favorite things

About me

  • Data Scientist
  • Intensive use of Python for a full year
  • Lesser-known Python libraries which I find tremendously useful

Tenacity

What could possibly go wrong?

Tenacity

Because remote access does not always work on the first try

Issue Timeout exception management is boring, nested exception management is a pain.

Pros Simply use a decorator on your function; easy customisation to get the behaviour you need

Tenacity

In [1]:
from tenacity import retry

# bare @retry: retry forever, on any exception, with no wait in between
@retry
def load_data_from_remote_db(params):
    pass

Tenacity

In [2]:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class TimeOut(Exception):
    pass


@retry(stop=stop_after_attempt(10),                    # give up after 10 attempts
       wait=wait_exponential(multiplier=2, max=10),    # exponential back-off, capped at 10 seconds
       retry=retry_if_exception_type(TimeOut),         # only retry on our TimeOut exception
       reraise=True)                                   # re-raise the last exception when giving up
def load_data_from_remote_db(params):
    pass

Hypothesis

Better testing with less effort

Hypothesis

Old-school tests

  1. Create some data
  2. Apply the function
  3. Assert something about the result
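
For instance, such an old-school test might look like this minimal sketch (the mean function is just a stand-in for the real code under test):

def mean(values):
    return sum(values) / len(values)

def test_mean():
    # 1. create some data by hand
    data = [1.0, 2.0, 3.0]
    # 2. apply the function
    result = mean(data)
    # 3. assert something about the result
    assert result == 2.0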

Hypothesis

Tests with Hypothesis

  1. Specify your input data
  2. Apply the function
  3. Assert something about the result

Hypothesis

Because generating data for testing is boring, and humans always miss an interesting case

Issue My code is very much data dependent, but I cannot cover all possible cases. Also, data sanity checks.

Pros Focus on specifying your data; randomly chosen parameters; great for data health checks

Hypothesis

In [3]:
from hypothesis import given, assume, example
from hypothesis.strategies import integers, floats, sampled_from

possible_values = ['auto', 'cauchy', 'schwartz']

@given(param1=sampled_from(possible_values),
       param2=floats(min_value=-2.3, max_value=2.5),
       begin=integers(min_value=0, max_value=25),
       end=integers(min_value=2, max_value=32))
@example(param1='auto', param2=0, begin=0, end=1)
def parametrised_test_on_data(param1, param2, begin, end):
    assume(begin < end)
    assume(end - begin < 10)
    # exercise the function under test here

Hypothesis

Bonus round hypothesis.extra.numpy generates numpy arrays!
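
A minimal sketch of what that can look like (the test body is only illustrative):

import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats

@given(arr=arrays(np.float64, (3, 4),
                  elements=floats(min_value=-1, max_value=1)))
def test_on_numpy_arrays(arr):
    # arr is a 3x4 float64 array with entries in [-1, 1]
    assert arr.shape == (3, 4)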

TDDA

The library I wish I had written

TDDA

Because data dependency makes code harder to test

  • Reference tests: tdda.referencetest
  • Automatic discovery of constraints: tdda.constraints

TDDA

tdda.referencetest

Issue Reference tests are sometimes the best we have, but their maintenance is time-consuming.

Pros Built-in comparisons of pandas DataFrames, CSV files, and text files; easy update of references

TDDA

In [4]:
# goes in conftest.py, so pytest picks up the option and the ref fixture
import pytest
from tdda.referencetest import referencepytest

def pytest_addoption(parser):
    referencepytest.addoption(parser)

@pytest.fixture(scope='module')
def ref(request):
    r = referencepytest.ref(request)
    r.set_data_location('testdata')
    return r

TDDA

In [5]:
import pandas as pd

def produce_data_somehow():
    # build the DataFrame under test (placeholder content)
    return pd.DataFrame({'a': [1, 2, 3]})

def test_produce_data_somehow(ref):
    resultframe = produce_data_somehow()
    ref.assertDataFrameCorrect(resultframe, 'result.csv')

TDDA

Rewrite reference files if your code has changed: pytest --write-all -s

TDDA

tdda.constraints

Issue We should all check the distributions of our datasets, but we rarely do.

Pros Automatic generation of constraints, which can be manually curated afterwards

TDDA

tdda discover input-file constraints.tdda

tdda verify input-file constraints.tdda

TDDA

In [6]:
import numpy as np
import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', np.nan]})
constraints = discover_constraints(df)
with open('example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())
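
Verification also has a Python counterpart; a sketch, assuming verify_df is importable from the same module (recent tdda versions also expose it as tdda.constraints.verify_df):

from tdda.constraints.pdconstraints import verify_df

# check the DataFrame from the previous cell against the saved constraints
verification = verify_df(df, 'example_constraints.tdda')
print(str(verification))                           # per-field pass/fail summary
print(verification.passes, verification.failures)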

TDDA

Bonus round engarde is another nice library for checking your datasets, using decorators (but the checks are written by hand)

In [7]:
import pandas as pd
import engarde.decorators as ed

dtypes = dict(
    col1=int,
    col2=int)

@ed.is_shape((None, 10))
@ed.has_dtypes(items=dtypes)
@ed.none_missing()
@ed.within_range({'col3': [0, 150]})
def load_df():
    # every decorator checks the DataFrame returned here (path is illustrative)
    return pd.read_csv('data.csv')

That's all folks!