With the growth of “big data”, there are ever more opportunities to glean societally useful knowledge from large datasets, whether in healthcare, emerging social issues, or any number of other areas. To exploit these opportunities, people need to be persuaded to contribute their information, and therefore need to be reassured that their privacy will not be breached by participating in these datasets. The issue has been put in the spotlight by well-publicised de-anonymisation attacks, such as those on the AOL and Netflix datasets. Those datasets used ad-hoc methods of privacy protection; differential privacy has since emerged as a statistically provable method for preserving the privacy of contributors to a database.
Differential privacy relies on choosing a value for a parameter, ε, that trades off privacy against fidelity to the original dataset. Choosing this value has largely been a process of trial and error to this point, although a number of papers have put it on a firmer footing. This project has two aims: firstly, to evaluate the combination of machine learning and differential privacy; secondly, to analyse the impact of different values of ε on the accuracy of predictions. Machine learning and differential privacy share a key similarity: both try to extract information from datasets and generalise from it. Using the machine learning tools Infer.NET and Tabular, this project analyses the impact of different values of ε on the accuracy of predictions made by machine learning tools, and also provides a tool to automate this analysis in the future. This should allow researchers to see the impact of varying ε on classification predictions and to choose a value they deem acceptable, weighing privacy against accuracy.
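The privacy/accuracy trade-off governed by ε can be illustrated with the classic Laplace mechanism, one standard way of achieving ε-differential privacy (the project itself uses Infer.NET and Tabular; this sketch is only an illustration of the parameter's effect). Noise drawn from a Laplace distribution with scale sensitivity/ε is added to a query result, so smaller ε means more noise and hence more privacy but less accuracy:

```python
import numpy as np

def private_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: add Laplace(0, sensitivity/epsilon) noise to the
    # true query answer. A counting query has sensitivity 1, since adding
    # or removing one individual changes the count by at most 1.
    scale = sensitivity / epsilon  # noise grows as epsilon shrinks
    return true_count + np.random.laplace(0.0, scale)

if __name__ == "__main__":
    np.random.seed(0)
    true = 1000
    for eps in (0.01, 0.1, 1.0):
        errors = [abs(private_count(true, eps) - true) for _ in range(2000)]
        print(f"epsilon = {eps:5.2f}  mean absolute error = {np.mean(errors):8.2f}")
```

Running this shows the mean absolute error falling roughly in proportion to 1/ε (the Laplace distribution's mean absolute deviation equals its scale), which is exactly the trade-off a researcher must weigh when selecting ε.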