“Synthesising data in a trusted execution environment allows organisations who do not have data science capabilities to buy synthesis as a service while knowing that the individual data values will not be accessible to the service provider.”
The General Data Protection Regulation (GDPR) does not allow the testing of information systems on personal data. This hinders innovation and the development of new data-driven services (including artificial intelligence). Several European data protection authorities have indicated that synthetic data is not identifiable in the sense of Recital 26 of the GDPR and can be used for testing IT systems. To be useful for testing, however, synthetic data still has to be similar to the original.
New data can be synthesised on the basis of existing data. However, many organisations lack the in-house competence to do this and are willing to buy synthesis as a service. Such synthesis requires processing the original data, which needs a lawful basis and trust in the service provider, making it hard to outsource.
In this project, we tested data synthesis using secure computing technology that protects original data from the service provider so that synthesis could even be a cloud service that is unable to leak the source data values.
In September, we held a webinar on this topic, discussing various angles of data synthesis. Watch it again here.
Liina Kamm, our senior researcher in this project, lets us in on the findings and future prospects of DANCE.
Please describe the most notable findings of this research.
The project resulted in a data synthesis service prototype that uses trusted execution environments to provide an extra layer of privacy to the data synthesis process. Trusted execution environments (TEEs) allow the data to be analysed without anyone having access to individual values. Data synthesis is a process in which a machine learning model is trained on real data and synthetic data are then generated from this model. The synthetic data will have the same correlations and distribution as the original data, but the values will not be real values. We researched the background on both TEEs and data synthesis to choose the best methods for combining the two privacy-enhancing technologies (PETs). We found that it is possible to combine the technologies, but further research is needed to optimise this process. We also found that organisations generally need tailored data synthesis solutions. Our prototype solution is meant to be used as a service with added data privacy techniques. It currently does not allow for overly complicated database structures, as these would be difficult for a user to define in an online service. Additionally, we found that organisations are looking for solutions that they can use on premises to avoid data being transferred outside their jurisdiction.
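To make the idea of "train a model on real data, then generate from it" concrete, here is a minimal sketch in Python. It is not the project's prototype: it assumes a deliberately simple model (a two-column Gaussian described by means, variances and one covariance), it runs outside any trusted execution environment, and the "real" dataset is made up for illustration. All function names and numbers are hypothetical.

```python
import math
import random


def fit_model(rows):
    """Learn means, variances and the covariance of two-column data."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return (mean_x, mean_y), (var_x, var_y, cov_xy)


def sample_synthetic(model, count, rng):
    """Draw new rows from the fitted Gaussian; no real row is copied."""
    (mean_x, mean_y), (var_x, var_y, cov_xy) = model
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return rows


def correlation(rows):
    """Pearson correlation between the two columns."""
    _, (var_x, var_y, cov_xy) = fit_model(rows)
    return cov_xy / math.sqrt(var_x * var_y)


# Made-up stand-in for a real dataset: (age, monthly income) pairs.
rng = random.Random(0)
real = []
for _ in range(2000):
    age = rng.gauss(45, 12)
    real.append((age, 1000 + 40 * age + rng.gauss(0, 300)))

model = fit_model(real)                        # "training" step
synthetic = sample_synthetic(model, 2000, rng)  # "generation" step

print("correlation in real data:     ", round(correlation(real), 3))
print("correlation in synthetic data:", round(correlation(synthetic), 3))
```

The two printed correlations should be close, while none of the synthetic rows is an actual record from the input. Real synthesis tools use richer models to handle categorical columns, non-Gaussian distributions and multi-table structures, which is part of why tailored solutions are often needed.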
How will the findings affect the practices of data analysis and the field in general in the next 5 years?
Data synthesis supports the data minimisation principle that GDPR requires by enabling preliminary data analysis (e.g., preliminary machine learning model training) or software testing to be carried out using artificial data instead of real data. Synthesising data in a trusted execution environment allows organisations who do not have data science capabilities to buy synthesis as a service while knowing that the individual data values will not be accessible to the service provider.
Over the next 3-5 years, we expect to see active deployment of PETs to further ensure data privacy. Organisations can create synthetic twins of their datasets to enable software testing on data that has similar properties to real data. Preliminary studies on the applicability of machine learning can be carried out on synthetic data. This will enable better evaluation of whether machine learning can be used for solving a certain problem before giving access to real data.
Will there be any follow-up projects?
We will continue developing and optimising the service. We also have several projects where we are studying the use of synthetic data (TEADAL, LAGO). The results will also be used as input for the PETs roadmap project.
This research was supported by the ETAg proof-of-concept grant EAG189.