Privacy Preservation in Web 3.0

How Compute-to-Data Relates to Differential Privacy and Federated Learning

Bhalisa Sodo
6 min read · Jan 28, 2022

Introduction

We outline the key similarities and differences between Compute-to-Data, Differential Privacy, and Federated Learning, all as means of sharing sensitive data while preserving privacy, i.e. withholding the sensitive contents of the data while compute jobs are run on it. The comparisons are largely informed by how the three technologies are used by Ocean Protocol, OpenMined, and Google AI respectively, explained in layman's terms.

As things stand, sensitive data is largely transferred between parties by sharing a copy of the dataset. Parties ensure that the data falls into the right (intended) hands and is not intercepted by malicious actors who might misuse it. However, this requires a copy of the data to leave the owner's premises, which invokes a fundamental trade-off between the benefits of sharing the data with someone and the risk of them misusing it. According to OpenMined, Remote Data Science alleviates this problem by making it possible for one person to answer a question using data owned by another, without ever seeing or acquiring a copy of that data. Let us investigate Compute-to-Data, Differential Privacy, and Federated Learning as privacy-preserving layers in transactions where compute jobs and analytics models are run on data the analysts do not own.

OpenMined and Differential Privacy

OpenMined uses servers deployed by data owners to store data, which analysts can later query remotely when they need answers to questions. In this implementation, the basic flow of how data is sent to and queried from the server looks something like this [1] (a minimal code sketch follows the list):

  • Use the HAGrid command-line interface to deploy and communicate with the data owner's server
  • Host the data for study on a PyGrid server
  • Study the data remotely using the PySyft library
  • Privacy is preserved by Differential Privacy (DP)
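
Here is a minimal sketch of the analyst's side of this flow, assuming the PySyft 0.6-era API from OpenMined's tutorials (names, arguments, and credentials are illustrative and may differ across versions):

```python
import syft as sy

# Hedged sketch of remote data science with PySyft (0.6-era API; names and
# arguments may differ by version). The analyst logs in to a data owner's
# domain node, deployed with HAGrid/PyGrid, and works with pointers to the
# data rather than a local copy.
domain = sy.login(
    url="localhost",                 # data owner's PyGrid domain node
    port=8081,
    email="analyst@example.com",     # hypothetical analyst credentials
    password="changethis",
)

# Browse what the owner has published; results of any computation must be
# requested from, and approved by, the data owner before they can be read.
print(domain.datasets)
```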

Differential Privacy (DP)

Differential Privacy ensures privacy by adding random noise to the answer of each query. This strikes a balance between data consumers getting some utility from the data and the data provider not having to forgo a data subject's privacy in the process. Though DP does a good job of obfuscating the sensitive contents of the data through the introduction of noise, if enough queries are run on a dataset, the outputs can be combined into a close estimate of the ground truth, thereby breaching privacy; in theory, each additional query adds to the privacy already lost [2]. This can be circumvented by allocating a 'privacy budget' that limits the scope of the job(s) a data consumer can perform on the data, and by enforcing mandatory noise, calibrated to a privacy parameter (epsilon), in every function attempting to query the data from the server.
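
A minimal sketch of this idea, using the standard Laplace mechanism for a counting query together with a simple privacy budget (illustrative only, not the PyGrid implementation; all names and numbers are hypothetical):

```python
import numpy as np

# Laplace mechanism sketch: noise scaled to sensitivity/epsilon is added to
# each query answer, and every answer spends part of a fixed privacy budget.

def noisy_count(data, predicate, epsilon, sensitivity=1.0):
    """Answer a counting query with Laplace noise calibrated to epsilon."""
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

class PrivacyBudget:
    """Refuse further queries once the allocated budget is spent."""
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("Privacy budget exhausted")
        self.remaining -= epsilon

# Hypothetical usage: each query costs epsilon = 0.5 out of a budget of 1.0.
ages = [34, 51, 29, 47, 62]
budget = PrivacyBudget(total_epsilon=1.0)

budget.spend(0.5)
print(noisy_count(ages, lambda a: a > 40, epsilon=0.5))  # noisy answer near 3
```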

One way this is achieved is through an algorithmic coin flip that introduces 'plausible deniability'. For example, in a study with binary Yes/No outcomes, every recorded answer carries at least a 1 in 4 chance of being wrong, so you cannot learn anything trustworthy by handpicking the recorded answer of a single individual.
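
A sketch of this coin-flip technique (randomised response) for a Yes/No question, with hypothetical survey data:

```python
import random

# Randomised-response sketch: each respondent flips a coin before answering,
# so any single recorded answer has a built-in chance of being random.

def randomized_response(true_answer: bool) -> bool:
    """First flip: answer honestly on heads; a second flip decides otherwise."""
    if random.random() < 0.5:          # heads: tell the truth
        return true_answer
    return random.random() < 0.5       # tails: answer Yes/No at random

# Hypothetical survey: 30% of people truly answer "Yes".
truths = [random.random() < 0.3 for _ in range(100_000)]
reported = [randomized_response(t) for t in truths]

# An analyst can still estimate the population rate, because
# P(reported Yes) = 0.5 * true_rate + 0.25, yet each individual's recorded
# answer differs from the truth with probability 1/4.
estimated_rate = (sum(reported) / len(reported) - 0.25) / 0.5
print(round(estimated_rate, 3))        # close to 0.3
```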

Simply Explained

To avoid exposing undisclosed private information about a data subject, the rule of thumb is that an insight derived from a dataset must stay (essentially) consistent whether or not a particular person's information is removed, and therefore it exposes no new truth about that person beyond what is already in the data. In other words, an analysis should tell us more about the population and nothing about a single person [3].

Kobbi Nissim, et al.
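
This stability requirement has a standard formal statement: for a randomised mechanism M, any two datasets D and D' that differ in a single person's record, and any set S of possible outputs,

```latex
% Epsilon-differential privacy (standard definition):
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S]
% A smaller epsilon means the output distribution changes less when any one
% person's record is added or removed, so less is learned about that person.
```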

Ocean Protocol and Compute-2-Data

Decentralised data markets allow anyone to monetise their data, as long as other market participants can agree on the value/price of that data, and as a result they open up datasets that were previously unavailable for research and artificial intelligence. But this had not happened in the most secure and privacy-preserving manner until the introduction of Compute-to-Data in Ocean Market.

Ocean Market is a Web3 data marketplace where data publishers allow data consumers (data scientists, researchers, etc.) to train algorithms on their data while preserving privacy: the data never has to be copied or leave the publisher's premises. Users can publish both datasets and algorithms, collectively referred to as data assets.

Compute-2-Data (C2D)

Compute-to-Data allows data to be shared without it having to leave the owner's premises or compromise the data subjects' privacy. Consumers purchase compute jobs on the data to improve the accuracy of their AI models or to derive relevant insights. Publishers can put up their own algorithms, or third-party algorithms can be used to analyse the data. The image below illustrates the general user flow.

There are two consumption access permissions the publisher can choose from, download and compute:

  • Download — this access type is probably best for non-personal data (e.g. climate-change-related data)
  • Compute — compute access is best for personal data (e.g. health records) whose exposure would likely pose a risk.

Under compute access, only the algorithm has viewing rights over the data. Therefore, no one except the data owner can know who is implicated in the dataset, and in what way. Analysts are only privy to the algorithm's outputs, not the contents of the dataset.
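
A conceptual sketch of that idea (this is not the ocean.py API; all names and data here are hypothetical): the approved algorithm travels to where the data lives, and only its output leaves.

```python
# Conceptual Compute-to-Data sketch: the data owner runs an approved
# algorithm locally and returns only its result, never the raw records.

APPROVED_ALGORITHMS = {}

def approve(name):
    """Register an algorithm the publisher is willing to run on the data."""
    def register(fn):
        APPROVED_ALGORITHMS[name] = fn
        return fn
    return register

@approve("mean_age")
def mean_age(records):
    return sum(r["age"] for r in records) / len(records)

def run_compute_job(private_records, algorithm_name):
    """Executed inside the data owner's environment; only the result leaves."""
    if algorithm_name not in APPROVED_ALGORITHMS:
        raise PermissionError("Algorithm not approved by the data publisher")
    return APPROVED_ALGORITHMS[algorithm_name](private_records)

# The consumer never sees the records, only the aggregate answer.
health_records = [{"age": 34}, {"age": 51}, {"age": 29}]   # stays on-premises
print(run_compute_job(health_records, "mean_age"))          # 38.0
```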

Google AI & Federated Learning

According to Google AI, Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device [4]. This means that the data no longer has to be copied from the device to the cloud. For the first time, the model can be trained on the device while analysts are only privy to the model updates.

Google AI

The image above demonstrates how this is achieved in the following steps:

  • The device downloads the current model
  • The model is improved by learning from data on the device
  • A summary of the changes is derived as a small, focused update

The punchline is that only the update is sent to the cloud, where it is averaged with other devices' updates to improve the shared model. The improved model is immediately available to augment the personalisation of the user experience. The model is trained while the phone is idle, charging, and on a free wireless connection, so mobile phone performance is not negatively affected [4]. To ensure privacy and security, Google AI developed a Secure Aggregation protocol, enabling a coordinating server to decrypt only the average update, meaning that no individual device's update can be inspected. Currently, this development mostly concerns Gboard functionality on Android-powered mobile phones.
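
Here is a toy sketch of the federated-averaging idea on a simple linear model (hypothetical data and numbers; not Google's implementation, and without the Secure Aggregation layer):

```python
import numpy as np

# Toy federated averaging: each device improves the shared model locally and
# sends back only a small update; the server averages the updates.

def local_update(global_weights, x, y, lr=0.1, steps=10):
    """One device: a few gradient steps on its own data for a linear model."""
    w = global_weights.copy()
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(y)   # MSE gradient
        w -= lr * grad
    return w - global_weights                   # only the update leaves the device

def federated_round(global_weights, device_data):
    updates = [local_update(global_weights, x, y) for x, y in device_data]
    return global_weights + np.mean(updates, axis=0)   # averaged on the server

# Hypothetical data on three devices; the raw (x, y) never leaves each device.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(3):
    x = rng.normal(size=(20, 2))
    y = x @ true_w + rng.normal(scale=0.1, size=20)
    devices.append((x, y))

w = np.zeros(2)
for _ in range(50):
    w = federated_round(w, devices)
print(w)   # approaches [2.0, -1.0]
```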

According to Trent McConaghy of Ocean Protocol, Federated Learning as implemented by OpenMined could be further improved by Compute-to-Data to manage computation at each silo in a more secure fashion. In fact, we may see some incarnation of a collaboration between Federated Learning and Compute-to-Data in the upcoming Ocean V4 release. Differential Privacy holds potential for Compute-to-Data contexts too.
