Distinguish between legitimate owls and
SPAM (Strigidae* Proliferating Adverse Messages)
while preserving privacy of owl communications
The problem is getting out of hand
You try a rule-based approach
You have to collect some private information
Don't collect private information
Determine which tokens are sensitive
Remove all names of Hogwarts students and staff
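A minimal sketch of the name-removal step, assuming a register of names is available. The `KNOWN_NAMES` list, the `[NAME]` placeholder and the function names are illustrative choices, not part of the original slides.

```python
import re

# Hypothetical gazetteer; in practice it would come from the school register.
KNOWN_NAMES = ["Hermione Granger", "Harry Potter", "Minerva McGonagall"]


def build_name_pattern(names):
    """Compile one pattern matching any full name or any single name part."""
    parts = set()
    for name in names:
        parts.add(name)
        parts.update(name.split())
    # Longest alternatives first, so a full name wins over its parts.
    alternatives = sorted(parts, key=len, reverse=True)
    body = "|".join(re.escape(p) for p in alternatives)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


NAME_PATTERN = build_name_pattern(KNOWN_NAMES)


def redact_names(text):
    """Replace every known name with a fixed placeholder."""
    return NAME_PATTERN.sub("[NAME]", text)


message = "Dear Hermione, Professor McGonagall asked Harry Potter to reply."
print(redact_names(message))
# Dear [NAME], Professor [NAME] asked [NAME] to reply.
```

A list lookup like this only catches names it already knows, so misspellings and nicknames would slip through.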
Use the structure of the documents
hermione.granger@aowl.co.uk
([a-z0-9\.\-\+])+@([a-z0-9])+(\.([a-z])+)+
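A possible way to apply such a pattern in Python. The expression below is based on the one above, with the literal dot escaped and at least one character required in each part. The `[EMAIL]` placeholder and the function name are assumptions for illustration.

```python
import re

# Structure-based detection: local part, "@", domain, one or more ".suffix" groups.
EMAIL_PATTERN = re.compile(r"[a-z0-9.+-]+@[a-z0-9-]+(?:\.[a-z]+)+", re.IGNORECASE)


def mask_emails(text, placeholder="[EMAIL]"):
    """Replace every address that matches the pattern with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


print(mask_emails("Write to hermione.granger@aowl.co.uk before Friday."))
# Write to [EMAIL] before Friday.
```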
+ improved accuracy on your use-case
- requires manual annotation
ID | Age | Gender | House | Magical Disease |
---|---|---|---|---|
1 | 15 | M | Slytherin | Dragon pox |
2 | 19 | M | Hufflepuff | Black cat flu |
3 | 12 | F | Gryffindor | Levitation sickness |
4 | 18 | F | Slytherin | Petrification |
5 | 14 | M | Gryffindor | Hippogriff bite |
6 | 14 | M | Gryffindor | Dragon pox |
7 | 19 | M | Ravenclaw | Black cat flu |
8 | 13 | F | Ravenclaw | Levitation sickness |
9 | 17 | F | Slytherin | Lycanthropy |
10 | 15 | M | Gryffindor | Hippogriff bite |
What can you tell me about Harry, a 15-year-old Gryffindor boy?
Some combinations of pseudo-identifiers may be unique
No combination of public attributes
singles out fewer than k rows in your dataset
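The definition can be checked mechanically. The sketch below computes k for the raw table shown earlier, treating age, gender and house as the public attributes. The function name and the tuple layout are illustrative.

```python
from collections import Counter

# Rows of the raw table: (age, gender, house, disease).
ROWS = [
    (15, "M", "Slytherin", "Dragon pox"),
    (19, "M", "Hufflepuff", "Black cat flu"),
    (12, "F", "Gryffindor", "Levitation sickness"),
    (18, "F", "Slytherin", "Petrification"),
    (14, "M", "Gryffindor", "Hippogriff bite"),
    (14, "M", "Gryffindor", "Dragon pox"),
    (19, "M", "Ravenclaw", "Black cat flu"),
    (13, "F", "Ravenclaw", "Levitation sickness"),
    (17, "F", "Slytherin", "Lycanthropy"),
    (15, "M", "Gryffindor", "Hippogriff bite"),
]


def k_anonymity(rows, public_columns):
    """Size of the smallest group of rows sharing the same public attribute values."""
    groups = Counter(tuple(row[i] for i in public_columns) for row in rows)
    return min(groups.values())


print(k_anonymity(ROWS, (0, 1, 2)))  # age, gender, house
# 1
```

A result of 1 means at least one person, such as the only 15-year-old Gryffindor boy, is singled out by the public attributes alone.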
Use ranges or aggregates instead of exact values
ID | Age | Gender | House | Magical Disease |
---|---|---|---|---|
1 | 15-20 | M | Slyth./Gryff. | Dragon pox |
2 | 15-20 | M | Huff./Rav. | Black cat flu |
3 | 10-14 | F | Gryff./Rav. | Levitation sickness |
4 | 15-20 | F | Slyth./Gryff. | Petrification |
5 | 10-14 | M | Slyth./Gryff. | Hippogriff bite |
6 | 10-14 | M | Slyth./Gryff. | Dragon pox |
7 | 15-20 | M | Huff./Rav. | Black cat flu |
8 | 10-14 | F | Gryff./Rav. | Levitation sickness |
9 | 15-20 | F | Slyth./Gryff. | Lycanthropy |
10 | 15-20 | M | Slyth./Gryff. | Hippogriff bite |
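A short sketch of the generalization step, under the assumption that ages are bucketed into the two ranges used in the table. The house groups are copied from the table rather than computed, and `S/G`, `H/R` and `G/R` are shorthand introduced here for its three house pairs.

```python
from collections import Counter


def age_range(age):
    """Generalize an exact age into one of the two ranges used in the table."""
    return "10-14" if age <= 14 else "15-20"


# (exact age, gender, house group as shown in the generalized table)
ROWS = [
    (15, "M", "S/G"), (19, "M", "H/R"), (12, "F", "G/R"), (18, "F", "S/G"),
    (14, "M", "S/G"), (14, "M", "S/G"), (19, "M", "H/R"), (13, "F", "G/R"),
    (17, "F", "S/G"), (15, "M", "S/G"),
]

generalized = [(age_range(age), gender, houses) for age, gender, houses in ROWS]
group_sizes = Counter(generalized)
k = min(group_sizes.values())

print(k)
# 2
```

Every combination of generalized attributes now matches two rows, so the table is 2-anonymous.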
What can you tell me about Luna, a 13-year-old Ravenclaw girl?
Each k-anonymous group contains at least
l different values for a given sensitive attribute
ID | Age | Gender | House | Magical Disease |
---|---|---|---|---|
3 | 10-14 | F | * | Levitation sickness |
5 | 10-14 | M | * | Hippogriff bite |
6 | 10-14 | M | * | Dragon pox |
8 | 10-14 | F | * | Levitation sickness |
12 | 10-14 | M | * | Common cold |
18 | 10-14 | F | * | Levitation sickness |
21 | 10-14 | M | * | Black cat flu |
25 | 10-14 | F | * | Levitation sickness |
28 | 10-14 | F | * | Dragon pox |
30 | 10-14 | M | * | Hippogriff bite |
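As a rough check, l can be computed for the rows shown above by counting distinct diseases per group. The function name is illustrative, and the house column is left out because it is suppressed in the table.

```python
from collections import defaultdict

# (age range, gender, disease) for the rows of the table above.
ROWS = [
    ("10-14", "F", "Levitation sickness"),
    ("10-14", "M", "Hippogriff bite"),
    ("10-14", "M", "Dragon pox"),
    ("10-14", "F", "Levitation sickness"),
    ("10-14", "M", "Common cold"),
    ("10-14", "F", "Levitation sickness"),
    ("10-14", "M", "Black cat flu"),
    ("10-14", "F", "Levitation sickness"),
    ("10-14", "F", "Dragon pox"),
    ("10-14", "M", "Hippogriff bite"),
]


def l_diversity(rows):
    """Smallest number of distinct sensitive values found in any group."""
    values_per_group = defaultdict(set)
    for *public, sensitive in rows:
        values_per_group[tuple(public)].add(sensitive)
    return min(len(values) for values in values_per_group.values())


print(l_diversity(ROWS))
# 2
```

The girls' group contains two distinct diseases, so the table is 2-diverse, even though four of its five rows share the same value.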
What can you tell me about Luna, a 13-year-old Ravenclaw girl?
l-diversity does not protect against statistical attacks
The distribution of the sensitive attribute in any group
is close to the overall distribution of the attribute
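A sketch of how "close" could be measured. The t-closeness paper uses the Earth Mover's Distance. For a categorical attribute whose values are all equally far apart, that distance reduces to half the sum of absolute differences, which is what is computed here. Treating the ten rows of the previous table as the whole dataset is an assumption made for illustration.

```python
from collections import Counter, defaultdict

# (gender, disease) for the ten rows of the previous table.
ROWS = [
    ("F", "Levitation sickness"), ("M", "Hippogriff bite"), ("M", "Dragon pox"),
    ("F", "Levitation sickness"), ("M", "Common cold"), ("F", "Levitation sickness"),
    ("M", "Black cat flu"), ("F", "Levitation sickness"), ("F", "Dragon pox"),
    ("M", "Hippogriff bite"),
]


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


def t_closeness(rows):
    """Largest distance between a group's distribution and the overall one."""
    overall = distribution(disease for _, disease in rows)
    groups = defaultdict(list)
    for group, disease in rows:
        groups[group].append(disease)
    return max(
        variational_distance(distribution(diseases), overall)
        for diseases in groups.values()
    )


print(round(t_closeness(ROWS), 3))
# 0.4
```

Under these assumptions the table is only 0.4-close: 80% of the girls' group has levitation sickness, against 40% overall.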
Trade-off between dataset value and privacy
Minerva McGonagall
head of the Gryffindor House
Extract pseudo-identifiers
Mask or aggregate them
Authorship attribution algorithms...
Mathematical framework to quantify the privacy loss
Results should not depend "too much"
on the data of any one individual
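One common way to make "too much" precise is the definition of ε-differential privacy from Dwork and Roth (2014). The symbols are notation introduced here, not taken from the slides: $M$ is a randomized procedure, $D$ and $D'$ are two datasets differing in one individual's data, $S$ is any set of possible outputs, and $\varepsilon \ge 0$ bounds the privacy loss.

$$
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
$$

The smaller $\varepsilon$ is, the less any single individual's data can change what an observer sees.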
Did you cheat on your OWL test?
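This kind of question is the classic setting for randomized response, and the sketch below assumes that is the technique the slide alludes to. Each student flips a coin. On heads they answer truthfully, and on tails a second coin decides the answer. The 30% cheating rate, the sample size and the seed are made-up values for the simulation.

```python
import random


def randomized_answer(truth, rng):
    """First coin: heads means answer truthfully, tails means a second coin answers."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(answers):
    """Invert P(yes) = 0.5 * rate + 0.25 to recover the underlying rate."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)


rng = random.Random(42)
true_rate = 0.3  # hypothetical share of students who cheated
students = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomized_answer(cheated, rng) for cheated in students]
estimate = estimate_true_rate(answers)

print(round(estimate, 2))
```

Any single "yes" is deniable, since it is at most three times more likely from a cheater than from a non-cheater, which corresponds to ε = ln 3. The aggregate rate can still be estimated from many answers.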
Apply a differentially private procedure to anonymize a dataset
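The slide does not name a specific procedure. One standard building block is the Laplace mechanism, sketched here for a single counting query rather than a full dataset release. The function names, the example data and the choice of ε = 0.5 are assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Noisy count; one person changes a count by at most 1, so the scale is 1/epsilon."""
    true_count = sum(1 for value in values if predicate(value))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(7)
diseases = ["Dragon pox", "Black cat flu", "Dragon pox", "Lycanthropy"]
noisy = private_count(diseases, lambda d: d == "Dragon pox", epsilon=0.5, rng=rng)

print(noisy)  # the true count is 2, plus random noise
```

A smaller ε gives stronger privacy but noisier answers, which is the same trade-off between dataset value and privacy seen with k-anonymity.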
Well, maybe you do want to buy a Nimbus 2000 after all...
k-anonymity: a model for protecting privacy, Latanya Sweeney (2002)
Mondrian Multidimensional k-Anonymity, Kristen LeFevre, David J. DeWitt and Raghu Ramakrishnan (2006)
t-Closeness: Privacy beyond k-anonymity and l-diversity, Ninghui Li, Tiancheng Li and Suresh Venkatasubramanian (2007)
The Algorithmic Foundations of Differential Privacy, Cynthia Dwork and Aaron Roth (2014)
Deep learning with differential privacy, Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar and Li Zhang (2016)
Semi-supervised knowledge transfer for deep learning from private training data, Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow and Kunal Talwar (2016)
Beyond Inferring Class Representatives: User-Level Privacy Leakage From Federated Learning, Zhibo Wang, Mengkai Song, Zhifei Zhang, Yang Song, Qian Wang and Hairong Qi (2019)
Practical secure aggregation for privacy-preserving machine learning, Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal and Karn Seth (2017)