Testing by users is no longer a thing
ML algorithms fail silently
Bugs in data + bugs in code = a lot of bugs
A piece of software that can either pass or fail and whose outcome is correlated with functional correctness
Testing an ML pipeline
All these operations can be written as functions.
df_filled = fill_missing_values(df_raw, how='average')
df_cleaned = remove_outliers(df_filled)
y = f(x)
-> no side effects
df = fill_missing_values(df_raw, how='average') ✔
fill_missing_values(df_raw, how='average') ✗
df = fill_missing_values('raw_df.csv', how='average') ✗
df_out, label_mapping = encode_labels(df_in) ✔
Output(s) depend only on input(s)
... test units of source code
... work great with functional programming
Think about your invariants!
Property-based testing -> hypothesis
import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays

def normalize(arr):
    # placeholder implementation: trivially satisfies the properties below
    return np.zeros(arr.shape)

@given(arrays(dtype=float, shape=(3, 4)))
def test_normalize_is_between_bounds(arr):
    output = normalize(arr)
    assert np.all(output >= 0)
    assert np.all(output <= 1)

@given(arrays(dtype=float, shape=(3, 4)))
def test_normalize_is_idempotent(arr):
    normalized_input = normalize(arr)
    output = normalize(normalized_input)
    assert np.all(output == normalized_input)
Add a few trivial tests with simple examples
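For instance, a trivial example-based test for the fill_missing_values helper used above (a sketch, assuming the function is importable and that how='average' fills with the column mean):

# a minimal sketch; fill_missing_values is the hypothetical helper from the examples above
import numpy as np
import pandas as pd

def test_fill_missing_values_replaces_nan_with_average():
    df_raw = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
    df = fill_missing_values(df_raw, how='average')
    assert not df['a'].isna().any()
    assert df['a'].iloc[1] == 2.0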
Some parts of the pipeline (model training, anything stochastic...) are hard to unit test
Unit-test what you can
... test whole chunks of your code
(data loading, preprocessing, model training...)
Fix random seed
Average multiple runs
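A sketch of how both ideas can look in a test, with train_and_evaluate standing in for a hypothetical wrapper around the training pipeline:

# a minimal sketch; train_and_evaluate and the data path are illustrative assumptions
import numpy as np

def test_training_accuracy_is_stable():
    np.random.seed(0)                                        # fix the random seed
    accuracies = [train_and_evaluate('tests/data/small.csv') for _ in range(5)]
    assert np.mean(accuracies) > 0.7                         # average multiple runs before asserting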
Differential testing
Metamorphic testing
Reference tests
Check multiple implementations
against one another
Check for non-regression between
your old implementation and your new one
Check against a random or uniform baseline
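For example, a differential test for a hypothetical compute_variance can use NumPy as the second implementation:

# a minimal sketch; compute_variance is the hypothetical function under test
import numpy as np

def test_compute_variance_matches_numpy():
    x = np.random.rand(100)
    # two independent implementations should agree (up to floating-point tolerance)
    assert np.isclose(compute_variance(x), np.var(x))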
Run the same implementation multiple times with modified input
\[ \Phi (f(x)) = f( \phi (x)) \]
import numpy as np

def test_compute_variance():
    x = np.random.rand(25)
    # metamorphic relations: Var(-x) == Var(x) and Var(2x) == 4 * Var(x)
    assert np.isclose(compute_variance(x * -1), compute_variance(x))
    assert np.isclose(compute_variance(2 * x), 4 * compute_variance(x))
Store input and reference output
Check that computed output matches reference output
... can be kept up-to-date with tdda
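A hand-rolled sketch of such a reference test (the file paths and the preprocessing function are illustrative):

# a minimal sketch; paths and preprocessing are illustrative assumptions
import pandas as pd
from pandas.testing import assert_frame_equal

def test_preprocessing_matches_reference():
    df_in = pd.read_csv('tests/data/input.csv')               # stored input
    df_ref = pd.read_csv('tests/data/expected_output.csv')    # stored reference output
    assert_frame_equal(preprocessing(df_in), df_ref)           # computed output matches reference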
Integration tests are slow, and many mitigation strategies make them even slower
Have two test suites: a fast one (unit tests, run all the time) and a slow one (integration tests, run less often)
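One possible setup with pytest markers (the 'slow' marker name is illustrative and must be registered in your pytest configuration):

# a minimal sketch of splitting fast and slow tests with a custom pytest marker
import pytest

def test_fill_missing_values():          # fast suite: run on every commit
    ...

@pytest.mark.slow
def test_full_training_pipeline():       # slow suite: run nightly or before a release
    ...

# fast suite only:  pytest -m "not slow"
# everything:       pytest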
Functional programming won't work
because of side effects
Mocking can help
... mimic the behaviour of an object in a controlled way
import numpy as np
import pandas as pd

def preprocessing(df: pd.DataFrame) -> pd.DataFrame:
    ...

def model_training(arr: np.ndarray):
    ...
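For instance, the expensive model_training step can be replaced by a mock while testing the rest of the pipeline (run_pipeline and the 'pipeline' module path are illustrative):

# a minimal sketch; run_pipeline and the 'pipeline' module are illustrative assumptions
from unittest import mock

def test_pipeline_calls_model_training_once():
    with mock.patch('pipeline.model_training') as fake_training:
        fake_training.return_value = 'trained-model'    # controlled behaviour
        run_pipeline('tests/data/small.csv')
        fake_training.assert_called_once()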
End-to-end tests
Automated tests are really automatic when they are automatically run
Anything is better than nothing
Otherwise your tests won't run as often
as they should
But not hopeless
They are happy to share their knowledge
Testing and validating machine learning classifiers by metamorphic testing, X. Xie et al.
On testing machine learning programs, H. Ben Braiek and F. Khomh