What could possibly go wrong?
Because remote access does not always work on the first try
Issue Timeout exception management is boring, nested exception management is a pain.
Pros Simply put a decorator on your function; easy customisation to get the behaviour you need
from tenacity import retry

@retry  # no arguments: retry forever, on any exception
def load_data_from_remote_db(params):
    pass
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class TimeOut(Exception):
    pass

@retry(stop=stop_after_attempt(10),                  # give up after 10 attempts
       wait=wait_exponential(multiplier=2, max=10),  # exponential back-off, capped at 10 s
       retry=retry_if_exception_type(TimeOut),       # only retry on TimeOut
       reraise=True)                                 # re-raise the last exception if all attempts fail
def load_data_from_remote_db(params):
    pass
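To see what this buys you, here is a minimal, runnable sketch (flaky_fetch and its attempt counter are illustrative, not part of tenacity): the first two calls raise TimeOut, tenacity silently retries with a short fixed wait, and the third attempt succeeds.
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_exception_type

class TimeOut(Exception):
    pass

attempts = {'count': 0}

@retry(stop=stop_after_attempt(5),
       wait=wait_fixed(0.1),
       retry=retry_if_exception_type(TimeOut),
       reraise=True)
def flaky_fetch():
    # fail twice, then succeed
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise TimeOut('remote did not answer')
    return 'payload'

print(flaky_fetch())      # 'payload', after two silent retries
print(attempts['count'])  # 3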
Better testing with less effort
Tests with Hypothesis
Because generating data for testing is boring, and humans always miss an interesting case
Issue My code is heavily data-dependent, but I cannot cover all possible cases by hand. Also, data sanity checks.
Pros Focus on specifying your data; randomly chosen parameters; great for data health checks
from hypothesis import given, assume, example
from hypothesis.strategies import integers, floats, sampled_from

possible_values = ['auto', 'cauchy', 'schwartz']

@given(param1=sampled_from(possible_values),
       param2=floats(min_value=-2.3, max_value=2.5),
       begin=integers(min_value=0, max_value=25),
       end=integers(min_value=2, max_value=32))
@example(param1='auto', param2=0, begin=0, end=1)  # always run this known corner case
def parametrised_test_on_data(param1, param2, begin, end):
    assume(begin < end)       # discard draws that violate the precondition
    assume(end - begin < 10)
    # do stuff: exercise the data-dependent code and assert on its invariants
Bonus round hypothesis.extra.numpy
generates numpy arrays!
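A minimal sketch of the array strategy (the bounded-values invariant it tests is illustrative):
import numpy as np
from hypothesis import given
from hypothesis.strategies import floats
from hypothesis.extra.numpy import arrays

@given(arr=arrays(dtype=np.float64,
                  shape=(3, 4),
                  elements=floats(min_value=-1.0, max_value=1.0)))
def test_stays_bounded(arr):
    # replace with the invariant your numpy code should preserve
    assert np.all(np.abs(arr) <= 1.0)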
The library I wish I had written
Because data dependency makes code harder to test
tdda.referencetest
tdda.constraints
tdda.referencetest
Issue Reference tests are sometimes the best we have, but their maintenance is time-consuming.
Pros Built-in comparisons of pandas DataFrames, CSV files, and text files; easy updating of reference files
# conftest.py
import pytest
from tdda.referencetest import referencepytest

def pytest_addoption(parser):
    referencepytest.addoption(parser)  # adds --write-all and friends

@pytest.fixture(scope='module')
def ref(request):
    r = referencepytest.ref(request)
    r.set_data_location('testdata')    # reference files live in testdata/
    return r
# in your test module
import pandas as pd

def produce_data_somehow():
    # build the DataFrame under test
    return pd.DataFrame({'a': [1, 2, 3]})

def test_produce_data_somehow(ref):
    resultframe = produce_data_somehow()
    # compares against testdata/result.csv
    ref.assertDataFrameCorrect(resultframe, 'result.csv')
Rewrite reference files when your code has changed: pytest --write-all -s
tdda.constraints
Issue We should all check the distributions of our datasets, but we rarely do.
Pros Automatic generation of constraints, which can be manually curated afterwards
import numpy as np
import pandas as pd
from tdda.constraints.pdconstraints import discover_constraints

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['one', 'two', np.nan]})
constraints = discover_constraints(df)
with open('example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())  # save the discovered constraints as JSON
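The saved constraints can then be replayed against fresh data. A minimal sketch, assuming the top-level verify_df helper that recent tdda versions expose:
from tdda.constraints import verify_df

# verify a new batch of data against the saved constraints
new_df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', 'three']})
verification = verify_df(new_df, 'example_constraints.tdda')
print(verification)           # per-field pass/fail report
print(verification.failures)  # number of constraints that failed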
Bonus round engarde
is another nice library for checking your datasets with decorators (the constraints are written by hand rather than discovered, though)
import engarde.decorators as ed

dtypes = dict(col1=int,
              col2=int)

@ed.is_shape((None, 10))              # any number of rows, exactly 10 columns
@ed.has_dtypes(items=dtypes)
@ed.none_missing()
@ed.within_range({'col3': [0, 150]})
def load_df():
    # build and return the DataFrame; each decorator checks the result
    return df
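For completeness, a minimal sketch of a failing check (the DataFrame is illustrative): engarde validates the returned frame and raises an AssertionError as soon as a decorated condition is violated.
import pandas as pd
import engarde.decorators as ed

@ed.none_missing()
@ed.within_range({'col3': (0, 150)})
def load_small_df():
    return pd.DataFrame({'col3': [10, 200]})  # 200 breaks the range check

try:
    load_small_df()
except AssertionError as err:
    print('constraint violated:', err)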
That's all folks!