A few of my favorite things

About me

  • Data Scientist
  • Intensive use of Python for a full year
  • Lesser-known Python libraries which I find tremendously useful

Tenacity

What could possibly go wrong?

Tenacity

Because remote access does not always work on the first try

Issue Timeout exception management is boring, nested exception management is a pain.

Pros Simply use a decorator on your function; easy customisation to get the behaviour you need

Tenacity

In [1]:
from tenacity import retry

# bare @retry: retry forever, on any exception, with no wait in between
@retry
def load_data_from_remote_db(params):
    pass

Tenacity

In [2]:
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type


class TimeOut(Exception):
    pass


@retry(stop=stop_after_attempt(10),                    # give up after 10 attempts
       wait=wait_exponential(multiplier=2, max=10),    # exponential back-off, capped at 10 seconds
       retry=retry_if_exception_type(TimeOut),         # only retry on our TimeOut exception
       reraise=True)                                   # re-raise the last exception when giving up
def load_data_from_remote_db(params):
    pass

Hypothesis

Better testing with less effort

Hypothesis

Old-school tests

  1. Create some data
  2. Apply the function
  3. Assert something about the result
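
For instance, such an old-school test might look like this minimal sketch (the mean function is just a stand-in for the real code under test):

def mean(values):
    return sum(values) / len(values)

def test_mean():
    # 1. create some data by hand
    data = [1.0, 2.0, 3.0]
    # 2. apply the function
    result = mean(data)
    # 3. assert something about the result
    assert result == 2.0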

Hypothesis

Tests with Hypothesis

  1. Specify your input data
  2. Apply the function
  3. Assert something about the result

Hypothesis

Because generating data for testing is boring, and humans always miss an interesting case

Issue My code is very much data dependent, but I cannot cover all possible cases. Also, data sanity checks.

Pros Focus on specifying your data; randomly chosen parameters; great for data health checks

Hypothesis

In [3]:
from hypothesis import given, assume, example
from hypothesis.strategies import integers, floats, sampled_from

possible_values = ['auto', 'cauchy', 'schwartz']

@given(param1=sampled_from(possible_values),
       param2=floats(min_value=-2.3, max_value=2.5),
       begin=integers(min_value=0, max_value=25),
       end=integers(min_value=2, max_value=32))
@example(param1='auto', param2=0, begin=0, end=1)
def parametrised_test_on_data(param1, param2, begin, end):
    assume(begin < end)
    assume(end - begin < 10)
    # exercise the function under test here

Hypothesis

Bonus round hypothesis.extra.numpy generates numpy arrays!
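
A minimal sketch of what that can look like (the test body is only illustrative):

import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats

@given(arr=arrays(np.float64, (3, 4),
                  elements=floats(min_value=-1, max_value=1)))
def test_on_numpy_arrays(arr):
    # arr is a 3x4 float64 array with entries in [-1, 1]
    assert arr.shape == (3, 4)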

TDDA

The library I wish I had written

TDDA

Because data dependency makes code harder to test

  • Reference tests: tdda.referencetest
  • Automatic discovery of constraints: tdda.constraints

TDDA

tdda.referencetest

Issue Reference tests are sometimes the best we have, but their maintenance is time-consuming.

Pros Built-in comparisons of pandas DataFrames, CSV files, and text files; easy update of references

TDDA

In [4]:
# goes in conftest.py, so pytest picks up the option and the ref fixture
import pytest
from tdda.referencetest import referencepytest

def pytest_addoption(parser):
    referencepytest.addoption(parser)

@pytest.fixture(scope='module')
def ref(request):
    r = referencepytest.ref(request)
    r.set_data_location('testdata')
    return r

TDDA

In [5]:
import pandas as pd

def produce_data_somehow():
    # build the DataFrame under test (placeholder content)
    return pd.DataFrame({'a': [1, 2, 3]})

def test_produce_data_somehow(ref):
    resultframe = produce_data_somehow()
    ref.assertDataFrameCorrect(resultframe, 'result.csv')

TDDA

Rewrite reference files if your code has changed: pytest --write-all -s

TDDA

tdda.constraints

Issue We should all check the distributions of our datasets, but we rarely do.

Pros Automatic generation of constraints, which can be manually curated afterwards

TDDA

tdda discover input-file constraints.tdda

tdda verify input-file constraints.tdda

TDDA

In [6]:
import numpy as np
import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', np.nan]})
constraints = discover_constraints(df)
with open('example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())
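
Verification also has a Python counterpart; a sketch, assuming verify_df is importable from the same module (recent tdda versions also expose it as tdda.constraints.verify_df):

from tdda.constraints.pdconstraints import verify_df

# check the DataFrame from the previous cell against the saved constraints
verification = verify_df(df, 'example_constraints.tdda')
print(str(verification))                           # per-field pass/fail summary
print(verification.passes, verification.failures)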

TDDA

Bonus round engarde is another nice library for checking your datasets, using decorators (but the checks are written by hand)

In [7]:
import pandas as pd
import engarde.decorators as ed

dtypes = dict(
    col1=int,
    col2=int)

@ed.is_shape((None, 10))
@ed.has_dtypes(items=dtypes)
@ed.none_missing()
@ed.within_range({'col3': [0, 150]})
def load_df():
    # every decorator checks the DataFrame returned here (path is illustrative)
    return pd.read_csv('data.csv')

That's all folks!