Cybernetica is developing a data synthesis prototype for new data-driven services

In September 2021, Cybernetica started a proof-of-concept research project to develop methods for data-protection-aware synthesis of test databases (DANCE). Data protection regulation restricts the testing of information systems with real data if the data relate to identifiable individuals. This complicates the development of new data-driven services, including those that use machine learning models. Synthetic data generation can help address these concerns.

New random synthetic data can be generated on the basis of existing data. Synthetic data are similar in structure to the original dataset, retaining both the distributions within individual attributes and the relationships between attributes. First, rules are defined or a machine learning model is trained on the original dataset; synthetic data are then created using those rules or that model. Several European data protection authorities have indicated that randomly synthesised data are not identifiable in the sense of Recital 26 of the General Data Protection Regulation (GDPR) and can be used for testing IT systems.
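The two-step idea described above (fit a model to the original data, then sample from it) can be illustrated with a minimal Python sketch. This is not the method used in DANCE; it is a toy counting model over two categorical attributes, and the attribute values, function names and the `original` dataset are all hypothetical. A model this simple does not by itself provide any privacy guarantee; it only shows how per-attribute distributions and between-attribute relationships can carry over to synthetic records.

```python
import random
from collections import Counter, defaultdict


def fit_model(rows):
    """Step 1: learn P(a) and P(b | a) from (a, b) records by counting."""
    marginal = Counter(a for a, _ in rows)
    conditional = defaultdict(Counter)
    for a, b in rows:
        conditional[a][b] += 1
    return marginal, conditional


def draw(counter, rng):
    """Draw one value with probability proportional to its count."""
    values = list(counter)
    weights = [counter[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def synthesise(model, n, seed=0):
    """Step 2: sample n new records from the fitted model."""
    marginal, conditional = model
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        a = draw(marginal, rng)        # keeps the distribution of attribute a
        b = draw(conditional[a], rng)  # keeps the relationship between a and b
        records.append((a, b))
    return records


# Hypothetical original data: (age group, employment status).
original = (
    [("18-30", "student")] * 6
    + [("18-30", "employed")] * 4
    + [("31-65", "employed")] * 9
    + [("31-65", "student")] * 1
)

synthetic = synthesise(fit_model(original), n=1000)
```

In this sketch the synthetic records are drawn at random rather than copied row by row, yet roughly half of them fall in each age group and students remain rare in the older group, mirroring the original data. Real synthesis tools use richer models and handle many attributes at once.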

However, organisations often lack the in-house competence to use data synthesis tools and are willing to use data synthesis as a service. Such a service requires the provider to process the original data, which in turn requires trust and a lawful basis.

A trusted execution environment (TEE) is a secure area on the processor. The TEE guarantees that the data, and all computations performed on the data within the TEE, are not visible to anyone, not even the system administrator.

In the DANCE project, we have created a service prototype which shows that test data synthesis using trusted execution environments is possible and feasible. This way, the original data are protected from the service provider, so synthesis can be offered as a cloud service that cannot leak the source data values. We are also conducting a legal analysis to provide insight into the prerequisites for using such a service on personally identifiable information. The project is managed by our Senior Researcher Liina Kamm.

This research is supported by ETAg proof-of-concept grant EAG 189.