Wednesday 17 December 2014

Differential privacy

Book here and paper here

Wikipedia explains well what differential privacy is, and it is difficult to add much more to that. What I can do is try to explain in my own words what differential privacy is, and why it is useful for computer science.

Differential privacy was created to keep public records anonymized while still allowing queries over the data. E.g., imagine a table that lists which people have diabetes, sorted alphabetically, where the user is only allowed to run statistical queries, and can never query an individual or even get the names.

Name     Diabetes
Peter    0
John     1
Rachel   0
Zoe      1

Imagine that there is a hacker who wants to know whether Zoe has diabetes, and he knows that Zoe's entry is in the last place. He could run two partial-sum queries and combine them to get the answer. Let's call Q(i) the query that sums the values up to position i. By computing Q(4) − Q(3) he now knows whether Zoe has diabetes or not.
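
A minimal sketch of this attack in Python (the table and the query Q are just the toy example above, reproduced in code):

    # The toy table from above: 1 means the person has diabetes.
    names = ["Peter", "John", "Rachel", "Zoe"]     # Zoe is the last entry
    diabetes = {"Peter": 0, "John": 1, "Rachel": 0, "Zoe": 1}

    def Q(i):
        """Allowed statistical query: how many of the first i people have diabetes."""
        return sum(diabetes[name] for name in names[:i])

    # Differencing attack: two legitimate statistical queries isolate Zoe's value.
    print(Q(4) - Q(3))    # prints 1, so the hacker learns that Zoe has diabetes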

Differential privacy basically protects sensitive data so that it stays anonymous: even if a hacker made two queries like the ones above to get Zoe's value, he would not be able to learn the true answer. The technique is built around notions such as ε-differential privacy and sensitivity.

"'Differential' refers to the difference between two worlds — one in which you allow your sensitive data to be included in the database and one in which you don't," McSherry said. The two worlds cannot be made to work out exactly the same, but they can be made close enough that they are effectively indistinguishable.

ε-differential privacy

ε-differential privacy guarantees that the presence or absence of an individual entry in the database will not significantly alter the output, and thus assures privacy of individual information in an information-theoretic sense. Basically, the presence or absence of Zoe's entry in the database will not change the result significantly. If it did, the hacker could run the attack above to get Zoe's information. The algorithms designed for differential privacy try to keep the estimated statistics accurate while preserving privacy.
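
For reference, the standard formal statement from the literature: a randomized mechanism M satisfies ε-differential privacy if, for every pair of databases x and y that differ in a single entry, and every set S of possible outputs,

    Pr[M(x) ∈ S] ≤ e^ε · Pr[M(y) ∈ S].

In the example above, x would be the table with Zoe's entry and y the same table without it, and the inequality says that the two cases produce any given query answer with nearly the same probability.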

We can consider differential privacy at different levels of granularity. Let's suppose that we have a graph database that encodes a social network: each individual is represented by a vertex in the graph, and friendships between individuals are represented by edges. We could consider differential privacy at a level of granularity corresponding to individuals: that is, we could require that differentially private algorithms be insensitive to the addition or removal of any vertex from the graph. This gives a strong privacy guarantee, but might in fact be stronger than we need. The addition or removal of a single vertex could, after all, add or remove up to n edges in the graph. Depending on what it is we hope to learn from the graph, insensitivity to n edge removals might be an impossible constraint to meet. We could, on the other hand, consider differential privacy at a level of granularity corresponding to edges, and ask our algorithms to be insensitive only to the addition or removal of single edges, or small numbers of edges, from the graph. This is of course a weaker guarantee, but might still be sufficient for some purposes.

Informally speaking, if we promise ε-differential privacy at the level of a single edge, then no data analyst should be able to conclude anything about the existence of any subset of 1/ε edges in the graph. In some circumstances, large groups of social contacts might not be considered sensitive information: for example, an individual might not feel the need to hide the fact that the majority of his contacts are with individuals in his city or workplace, because where he lives and where he works are public information. On the other hand, there might be a small number of social contacts whose existence is highly sensitive. In this case, edge privacy should be sufficient to protect sensitive information, while still allowing a fuller analysis of the data than vertex privacy. Edge privacy will protect such an individual's sensitive information provided that he has fewer than 1/ε such friends.
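
To put a number on this (ε = 0.1 is an illustrative value, not one from the text): with edge-level ε = 0.1, the guarantee covers any group of up to 1/ε = 10 edges, so an individual with fewer than 10 highly sensitive contacts is protected, while his many unremarkable contacts can still contribute to the analysis.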

When ε is small the guarantee is strong: in detail, (ε, 0)-differential privacy asserts that for all pairs of adjacent databases x, y and all outputs o, an adversary cannot distinguish which is the true database on the basis of observing o. When ε is large, it just says that there exist neighboring databases and an output o for which the ratio of the probabilities of observing o, conditioned on the database being, respectively, x or y, is large. Such an output o might be very unlikely to occur in the real world, and the databases might have to be deliberately contrived. There is a growing literature on improving the accuracy for a given ε. Notice that there is not yet a general lower bound on the accuracy for a given ε, because there exist many different ways of adding noise to the output.
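
To make the trade-off between ε and accuracy concrete, the standard Laplace mechanism (described in the Dwork–Roth book) releases, for a numeric query f with sensitivity Δf (defined in the next section),

    f(x) + Lap(Δf / ε),

where Lap(b) denotes Laplace noise with scale b. This noise has standard deviation √2 · Δf / ε, so halving ε strengthens the privacy guarantee but doubles the expected error of the released answer.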

Sensitivity

A function's sensitivity measures the maximum change in the function's output when any single item is removed from or added to its dataset. Let's consider two examples to understand how sensitivity works.

Consider the function SUM that takes integers as inputs and returns their sum. If all the inputs are 0 or 1, then the sensitivity of the function is low, because the result varies by at most 1. Therefore, only a little noise needs to be added to the sum to achieve privacy. On the other hand, if we have one input of 1000 and the rest are 0 and 1, a lot of noise must be added to the output in order to hide whether the 1000 was among the inputs.
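
A minimal sketch of this in Python, using numpy's Laplace sampler (the max_value bound and the ε value are assumptions chosen for the example; this is the standard Laplace mechanism, not a production implementation):

    import numpy as np

    def noisy_sum(values, max_value, epsilon=1.0):
        # If every input lies in [0, max_value], adding or removing one record
        # changes the true sum by at most max_value, so max_value is the
        # sensitivity of the SUM query.
        sensitivity = max_value
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return sum(values) + noise

    # 0/1 inputs: sensitivity 1, so noise of scale 1/epsilon is enough.
    print(noisy_sum([0, 1, 1, 0, 1], max_value=1))

    # If a value of 1000 is possible, the sensitivity is 1000, and noise of
    # scale 1000/epsilon is needed to hide whether the 1000 was among the inputs.
    print(noisy_sum([0, 1, 1000, 0, 1], max_value=1000))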

Differential privacy works best for low-sensitivity computations, where the maximum influence any given input can have on the output of the computation is low.

Sensitivity on aggregated data

The Subsample and Aggregate technique yields a method for "forcing" the computation of a function f(x) to be insensitive, even for an arbitrary function f. Thus, proving privacy will be trivial.

Accuracy depends on properties of the function f and the specific data set x. In particular, if f(x) can be accurately estimated with high probability on f(S), where S is a random subset of the elements in x, then accuracy should be good.

The data is partitioned into blocks B1, ..., Bm, and the function f is computed exactly, without noise, independently on each block. The intermediate outcomes f(B1), ..., f(Bm) are then combined via a differentially private aggregation mechanism — typical examples include standard aggregations, such as the α-trimmed mean 1, the Winsorized mean 2 and the median, but there are no restrictions — for example by adding Laplace noise scaled to the sensitivity of the aggregation function in question.

The key observation in Subsample and Aggregate is that any single element can affect at most one block, and therefore the value of just a single f(Bi). Thus, changing the data of any individual can change at most a single input to the aggregation function. Even if f is arbitrary, the analyst chooses the aggregation function, and so is free to choose one that is insensitive, provided that choice is independent of the database. Privacy is therefore immediate.
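
A minimal sketch of Subsample and Aggregate in Python, using a clipped mean of the block results as the (insensitive) aggregation function; the block count m, the output range [lo, hi] and ε are assumptions chosen for the example:

    import numpy as np

    def subsample_and_aggregate(data, f, m, lo, hi, epsilon=1.0):
        # Run f exactly, without noise, on m disjoint blocks of the data.
        # One individual's record lands in exactly one block, so it can change
        # at most one of the m intermediate results.
        rng = np.random.default_rng()
        blocks = np.array_split(rng.permutation(data), m)
        results = np.clip([f(block) for block in blocks], lo, hi)
        # Clipping each result to [lo, hi] bounds the mean's sensitivity by
        # (hi - lo) / m, so Laplace noise of that scale over epsilon suffices.
        sensitivity = (hi - lo) / m
        noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
        return results.mean() + noise

    # Example: privately estimate the mean of a dataset whose values are
    # assumed to lie in [0, 1].
    data = np.random.default_rng(0).uniform(0, 1, size=10_000)
    print(subsample_and_aggregate(data, np.mean, m=100, lo=0.0, hi=1.0))

Clipping each block result to a fixed range is what makes the aggregation function's sensitivity known in advance, independently of the database, which is exactly the condition the paragraph above asks for.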

Much work in statistics and machine learning addresses the problem of model selection: Given a data set and a discrete collection of “models,” each of which is a family of probability distributions, the goal is to determine the model that best "fits" the data. The function f might be choosing the best model from the given set of m models, a process known as model fitting, via an arbitrary learning algorithm. The choice of aggregation function can even depend on the database, but the selection must be made in a differentially private fashion. The privacy cost is then the cost of composing the choice operation with the aggregation function.
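
As a hedged illustration of that idea (not the book's exact algorithm), here is a sketch in which each block votes for the model that fits it best, and the winner is then selected with the exponential mechanism over the vote counts; the candidate models, the fit score and ε are assumptions made up for this example:

    import numpy as np

    def private_model_selection(data, models, fit, m, epsilon=1.0):
        # Split the data into m disjoint blocks; one record lands in one block,
        # so it can shift at most one block's vote.
        rng = np.random.default_rng()
        blocks = np.array_split(rng.permutation(data), m)
        votes = np.zeros(len(models))
        for block in blocks:
            scores = [fit(model, block) for model in models]
            votes[int(np.argmax(scores))] += 1
        # Exponential mechanism over the vote counts (utility sensitivity 1):
        # pick a model with probability proportional to exp(epsilon * votes / 2).
        weights = np.exp(epsilon * (votes - votes.max()) / 2.0)
        probs = weights / weights.sum()
        return models[int(rng.choice(len(models), p=probs))]

    # Toy usage: choose which of two candidate means explains the data better,
    # scoring a block by the negative mean squared distance to the candidate.
    data = np.random.default_rng(1).normal(loc=1.0, scale=1.0, size=10_000)
    pick = private_model_selection(data, models=[0.0, 1.0],
                                   fit=lambda mu, block: -np.mean((block - mu) ** 2),
                                   m=200)
    print(pick)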

Where is differential privacy used?

Anonymizing public records matters when we are dealing with biobank or customer data. These databases can reach terabytes or petabytes in size, or even more. Hospitals or private companies query this sensitive data for new research or market studies, and sometimes the data is spread across different clouds. Differential privacy is an important topic in the era of big data, because there is a need to access this data without disclosing personal information.

A biobank is a type of biorepository that stores biological samples (usually human) for use in research. Studies that use biobanks must not disclose private data, and the institutions that hold these repositories, including governments, are very sensitive about disclosure. There are also studies that need to aggregate data from different repositories, located in different places and connected through the internet. Differential privacy is important here because it guarantees that the responses to queries do not disclose private information.

What differential privacy does not promise

Differential privacy does not create privacy where none previously existed. If the data being queried is already exposed and not protected by any privacy technique, there is nothing differential privacy can do about it. What it guarantees is that an individual's participation in a survey or database cannot be inferred from the responses to queries.


  1. The α-trimmed mean is the mean after the top and bottom α fraction of the inputs have been discarded.

  2. The Winsorized mean is similar to the α-trimmed mean except that, rather than being discarded, the top and bottom α fraction are replaced with the most extreme remaining values.
