The Sharemind team never stops experimenting and we're happy to share what we built. This time, we'll share results from a joint research project in the European Commission Horizon 2020 program called SafeCloud.
Limitations of Business Intelligence tools with Personal Data
Business Intelligence (BI) tools let their users manage their data analysis products. These tools connect to the data stores of the organisation and assist in composing queries and creating visualisations.
Most business intelligence technologies available today can protect data while it is stored or transferred, but they cannot protect data while it is being processed. Because of this limitation, access restrictions are commonly enforced at the report or database level. Even so, both the databases and the report engines still require access to the confidential data, even when the result is only an aggregate.
This reduces the usefulness of BI tools for service providers that need to collect and link sensitive personal data from multiple sources. Consider a company providing LIMS (Laboratory Information Management System) software to hospitals. By running analytics across data from all the hospitals, the company could improve its product and offer new insights to all parties involved. However, data protection restrictions can hinder such opportunities.
How would a BI system ideally work with sensitive data?
To protect the privacy and confidentiality of data in business intelligence solutions, engineers need a precise way to specify who can access which records and fields of the database, and at what level statistical aggregates can be calculated without violating confidentiality and privacy requirements. It is also critical that no single user or organization can change these access rules unilaterally, and that the rules and all access to data are logged and easily auditable.
On top of these complications, there is a serious mismatch between the high-level specifications usually formulated in contracts and the detailed design of the BI solution. As a result, contracts do not adequately represent the expectations of the client and the developer. This may lead to design errors that are costly to fix later or that cause privacy breaches. Although legal contracts include penalties for privacy or confidentiality breaches, the damage to reputations and careers cannot be undone once a breach happens.
Introducing the Sharemind SQL Engine
Sharemind MPC is Cybernetica's secure multi-party computation (MPC) product. It is a fully programmable data analysis system based on cryptographic secure computing. We have developed the Sharemind Analytics Engine, which implements various privacy-preserving data manipulation and statistical functions. The Analytics Engine also powers Rmind, Sharemind's privacy-preserving statistical analysis environment.
SafeCloud is a European innovation project in the Horizon 2020 framework programme. Its goal is to develop technologies that allow companies and organisations to migrate their services to the cloud without risks of data compromise. Our goal in SafeCloud was to build an experimental SQL engine on top of Sharemind's Analytics Engine and we reached that goal successfully!
Sharemind SQL is a relational database management system that uses Sharemind MPC to store and process data. It provides secure database querying and data processing, so sensitive data can be aggregated into insights while confidentiality is protected at the level of individual records and fields. No single person or organization can unilaterally change the technically enforced privacy and confidentiality rules, and all queries are logged and auditable.
Sharemind SQL queries are executed using secure multi-party computation, which means data is not decrypted while it is processed. This holds for all operations: filters, grouping, joins, sorting, set intersections and all the implemented aggregation functions.
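To give a feel for how an aggregate can be computed without any party seeing the inputs, here is a minimal Python sketch of additive secret sharing, one of the basic building blocks of secure multi-party computation. It is a toy illustration under stated assumptions, not Sharemind's actual protocol: the three-party setup, the modulus and the function names are all chosen for this example only.

```python
import secrets

MODULUS = 2**32  # assumption: all arithmetic is done modulo 2^32


def share(value, parties=3):
    """Split value into random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Combine all shares; any incomplete subset looks uniformly random."""
    return sum(shares) % MODULUS


def secure_sum(shared_values):
    """Each party adds up the shares it holds; no party sees an input value."""
    parties = len(shared_values[0])
    return [sum(row[p] for row in shared_values) % MODULUS
            for p in range(parties)]


# Hypothetical confidential inputs, one per data owner.
inputs = [3200, 4100, 2850]
shared = [share(v) for v in inputs]

# The parties compute on shares only; just the final sum is reconstructed.
result_shares = secure_sum(shared)
total = reconstruct(result_shares)
```

Only the final aggregate is reconstructed here. Operations such as joins and sorting need considerably more involved protocols than this addition example.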
Only the results of explicitly whitelisted queries will be declassified and returned to the client application. By permitting only safe queries we can keep individual records private while still allowing the business intelligence tool to create useful reports of the data. Sharemind SQL emulates the PostgreSQL protocol which allows it to be used with tools that support SQL through common APIs like JDBC and ODBC.
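Because the engine speaks the PostgreSQL protocol, a BI tool would typically be pointed at it with an ordinary PostgreSQL JDBC connection URL. The short Python sketch below only assembles such a URL in the standard `jdbc:postgresql://host:port/database` form. The host name, port and database name are hypothetical placeholders, not values published for Sharemind SQL.

```python
def jdbc_postgres_url(host, port, database):
    """Build a JDBC connection URL in the standard PostgreSQL driver format."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


# Hypothetical values: 5432 is PostgreSQL's default port, used here only
# as a placeholder for whatever port the deployment actually listens on.
url = jdbc_postgres_url("sharemind-sql.example.org", 5432, "lims")
```

The resulting string is what would go into the data source settings of a JDBC-based tool.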
Integrating Sharemind SQL with the Knowage BI system
We decided to check how generic the Sharemind SQL engine really is. The best way to do that was to let an actual BI system use it as its backend. This is a good moment to introduce Maxdata, the Portuguese company developing laboratory software for hospitals. We collaborated with them to pick the BI tool and a scenario. Maxdata even provided us with synthetic data to run tests with.
Together, we picked Knowage, as it is a successor to SpagoBI, a BI suite that Maxdata uses. We configured Knowage to use the Sharemind MPC backend through the PostgreSQL-compatible JDBC driver. After that, we imported the data, configured the data sources and started making graphs. See the video below for a screencast of how this worked out.
Using Knowage with Sharemind SQL
Incompatibilities between BI systems and privacy-preserving query systems
Sharemind MPC is unique in that it does not let even the analyst learn individual values. Analysts can combine data collected from multiple sources, but never see it in detail. That is why the SQL engine working on Sharemind MPC does not return raw records. This is a challenge for BI systems, which are used to fetching raw data and processing it themselves. Fortunately, Knowage is flexible enough to work without raw records, and it has a useful option to skip internal caching and let Sharemind MPC do the aggregations.
The second aspect is query restriction: you may want to allow certain queries to return values and prevent others from doing so. For this, Sharemind's SQL engine includes a query whitelist option. Whitelisted queries return data, while all other queries return zero rows. This lets our SQL engine make a BI engine believe that there are no records when it tries to cache the raw data, while still returning data when aggregated results are requested.
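The whitelist behaviour can be sketched in a few lines of Python. This is an illustration of the idea only, not Sharemind's implementation: the table, the query, and the `normalise` and `run_query` helpers are hypothetical, and a plaintext in-memory SQLite database stands in for the secret-shared storage.

```python
import sqlite3


def normalise(sql):
    """Collapse whitespace and case so equivalent spellings compare equal."""
    return " ".join(sql.lower().split()).rstrip("; ")


# Hypothetical whitelist holding a single aggregate query.
WHITELIST = {
    normalise("SELECT ward, COUNT(*) FROM tests GROUP BY ward ORDER BY ward"),
}


def run_query(conn, sql):
    """Return rows for whitelisted queries and zero rows for all others."""
    if normalise(sql) not in WHITELIST:
        return []
    return conn.execute(sql).fetchall()


# Plaintext stand-in data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tests (patient TEXT, ward TEXT, result REAL)")
conn.executemany(
    "INSERT INTO tests VALUES (?, ?, ?)",
    [("p1", "A", 5.1), ("p2", "A", 6.3), ("p3", "B", 4.8)],
)

# The aggregate report is answered; the raw-record fetch yields nothing.
aggregate = run_query(
    conn, "select ward, count(*) from tests group by ward order by ward;"
)
raw = run_query(conn, "SELECT * FROM tests")
```

Matching on a normalised query string is the simplest possible policy. It is used here only to keep the sketch short, and a production system would more plausibly compare parsed query structures.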
Conclusions
We learned a lot in the process of implementing SQL for Sharemind MPC. At its current stage of development, the product makes it possible to build proof-of-concept projects for privacy-sensitive use cases. We will continue working on it, so that developers using SQL can create privacy-preserving workflows with novel technical guarantees. We also made SQL usable from inside SecreC applications, where it can be a huge shortcut for ETL tasks in custom SecreC programs.
The other important lesson here was that while existing analytics systems are not designed for privacy, we can still get them to respect it. As usual, we encourage organizations and people to use and benefit from the emerging Privacy Enhancing Technologies by securing their confidential data.