Wednesday 17 December 2014

Differential privacy

Book here and paper here

Wikipedia explains differential privacy well, and it is hard to add much to that. What I can do is explain in my own words what differential privacy is, and why it is useful in computer science.

Differential privacy was created to keep public records anonymized while still allowing queries over the data to return useful results. E.g., imagine a table that contains a list of people who have diabetes, sorted alphabetically, where the user is only allowed to run statistical queries and may never query an individual or retrieve the names.

Name     Diabetes
Peter    0
John     1
Rachel   0
Zoe      1

Imagine a hacker who wants to know whether Zoe has diabetes, and who knows that Zoe's entry is in the last position. He could issue two queries that compute partial sums and derive the result. Call this query Q(i): the sum of the values up to position i. He can compute Q(4) − Q(3), and now he knows whether Zoe has diabetes.
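This differencing attack is easy to express in code. A minimal sketch in Python (the table is just the toy example above, and Q is the partial-sum query):

    rows = [("Peter", 0), ("John", 1), ("Rachel", 0), ("Zoe", 1)]

    def Q(i):
        # Statistical query: how many of the first i people have diabetes?
        return sum(diabetes for _, diabetes in rows[:i])

    # Two "harmless" aggregate queries reveal one individual's value:
    print(Q(4) - Q(3))  # prints 1, so Zoe has diabetes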

Differential privacy protects sensitive data so that it stays anonymous: even if a hacker issued the two queries above to get Zoe's value, he would never get a reliable answer. The technique is built around notions such as ε-differential privacy and sensitivity.

"'Differential' refers to the difference between two worlds — one in which you allow your sensitive data to be included in the database and one in which you don't," McSherry said. The two worlds cannot be made to work out exactly the same, but they can be made close enough that they are effectively indistinguishable.

ε-differential privacy

ε-differential privacy guarantees that the presence or absence of an individual entry in the database does not significantly alter the output of any query, and thus assures privacy of individual information in a strong theoretical sense. In our example, the presence or absence of Zoe's entry in the database will not change the result significantly; if it did, the hacker could mount the attack above to get Zoe's information. Algorithms designed for differential privacy try to maximize the accuracy of the estimated statistics while preserving privacy.
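The standard way to achieve this is to add noise to every answer. Here is a minimal sketch, assuming numpy and the Laplace mechanism developed later in the book (ε = 0.1 and the function name are my illustrative choices), of how a noisy version of Q blunts the attack above:

    import numpy as np

    rows = [("Peter", 0), ("John", 1), ("Rachel", 0), ("Zoe", 1)]

    def noisy_Q(i, epsilon=0.1):
        # Each row changes a partial sum by at most 1 (sensitivity 1), so
        # Laplace noise of scale 1/epsilon gives epsilon-DP for each query.
        true_answer = sum(d for _, d in rows[:i])
        return true_answer + np.random.laplace(0.0, 1.0 / epsilon)

    # The differencing attack now recovers mostly noise:
    print(noisy_Q(4) - noisy_Q(3))  # e.g. 6.8 or -12.4, not Zoe's bit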

We can consider differential privacy at different levels of granularity. Let's suppose we have a graph database that encodes a social network: each individual is represented by a vertex in the graph, and friendships between individuals are represented by edges. We could consider differential privacy at a level of granularity corresponding to individuals: that is, we could require that differentially private algorithms be insensitive to the addition or removal of any vertex from the graph. This gives a strong privacy guarantee, but might in fact be stronger than we need: the addition or removal of a single vertex could, after all, add or remove up to n edges in the graph. Depending on what it is we hope to learn from the graph, insensitivity to n edge removals might be an impossible constraint to meet. We could, on the other hand, consider differential privacy at a level of granularity corresponding to edges, and ask our algorithms to be insensitive only to the addition or removal of single edges, or small numbers of edges, from the graph. This is of course a weaker guarantee, but might still be sufficient for some purposes.

Informally speaking, if we promise ε-differential privacy at the level of a single edge, then no data analyst should be able to conclude anything about the existence of any subset of 1/ε edges in the graph. In some circumstances, large groups of social contacts might not be considered sensitive information: for example, an individual might not feel the need to hide the fact that the majority of his contacts are with individuals in his city or workplace, because where he lives and where he works are public information. On the other hand, there might be a small number of social contacts whose existence is highly sensitive. In this case, edge privacy should be sufficient to protect sensitive information, while still allowing a fuller analysis of the data than vertex privacy: it will protect such an individual's sensitive contacts provided that he has fewer than 1/ε of them.

When ε is small, the privacy guarantee is strong (at the cost of accuracy, since more noise must be added). In detail, (ε, 0)-differential privacy asserts that for all pairs of adjacent databases x, y and all outputs o, an adversary cannot distinguish which is the true database on the basis of observing o. Failing the definition only for a large ε is a much weaker statement: it just says that there exist neighboring databases and an output o for which the ratio of the probabilities of observing o, conditioned on the database being x or y respectively, is large. Such an output o might be very unlikely to occur in the real world, and the databases might have to be deliberately contrived. There is a growing literature on improving the accuracy obtainable for a given ε. Notice that there are not yet tight lower bounds on the accuracy for a given ε, because there exist many different ways to add noise to the output.
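For reference, the formal statement behind this paragraph (in the standard notation, where M is the randomized mechanism answering the queries) is that, for all adjacent databases x, y and every output o:

    Pr[M(x) = o] ≤ e^ε · Pr[M(y) = o]

When ε is small, e^ε ≈ 1 + ε, so observing o gives an adversary almost no evidence about whether the true database is x or y.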

Sensitivity

A function's sensitivity measures the maximum change in the function's output when any single item is removed from or added to its dataset. Let's consider an example to understand how sensitivity works.

Consider the function SUM that takes integers as inputs and returns their sum. If all the inputs are 0 or 1, then the sensitivity of the function is low, because the result varies by at most 1; therefore, only a little noise needs to be added to the sum to achieve privacy. On the other hand, if one input may be 1000 while the rest are 0 or 1, a lot of noise must be added to the output in order to hide whether the 1000 was among the inputs.

Differential privacy works best for low-sensitivity computations, where the maximum influence any given input can have on the output of the computation is low.
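A small sketch (my own illustration, assuming numpy) of how the sensitivity of the SUM example dictates the scale of the Laplace noise:

    import numpy as np

    def dp_sum(values, sensitivity, epsilon=0.5):
        # Laplace noise of scale sensitivity/epsilon hides any single input.
        return sum(values) + np.random.laplace(0.0, sensitivity / epsilon)

    bits = [0, 1, 1, 0, 1]          # inputs bounded by 1 -> sensitivity 1
    print(dp_sum(bits, 1))          # noise scale 2: still a usable estimate

    mixed = [1000, 0, 1, 1, 0]      # one input may be 1000 -> sensitivity 1000
    print(dp_sum(mixed, 1000))      # noise scale 2000: hides the 1000, but
                                    # the released sum is nearly useless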

Sensitivity on aggregated data

The Subsample and Aggregate technique yields a method for "forcing" the computation of a function f(x) to be insensitive, even for an arbitrary function f. Thus, proving privacy will be trivial.

Accuracy depends on properties of the function f and the specific data set x. In particular, if f(x) can be accurately estimated with high probability on f(S), where S is a random subset of the elements in x, then accuracy should be good.

The idea (illustrated in a figure in the book) is to partition the data into m blocks B1, ..., Bm and to compute f exactly, without noise, independently on each block. The intermediate outcomes f(B1), ..., f(Bm) are then combined via a differentially private aggregation mechanism (typical examples include standard aggregations, such as the α-trimmed mean 1, the Winsorized mean 2, and the median, but there are no restrictions) by adding Laplace noise scaled to the sensitivity of the aggregation function in question.

The key observation in Subsample and Aggregate is that any single element can affect at most one block, and therefore the value of just a single f(Bi). Thus, changing the data of any individual can change at most a single input to the aggregation function. Even if f is arbitrary, the analyst chooses the aggregation function, and so is free to choose one that is insensitive, provided that choice is independent of the database. Privacy is therefore immediate.
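A rough sketch of the technique (assuming numpy; the clamped mean as aggregator, and all names and parameters, are my choices; the text mentions trimmed means, Winsorized means, and medians, but the clamped mean makes the sensitivity bound easiest to see):

    import numpy as np

    def subsample_and_aggregate(x, f, m, lo, hi, epsilon):
        # Partition the data into m disjoint blocks and run f exactly,
        # without noise, on each block.
        blocks = np.array_split(np.asarray(x), m)
        outcomes = np.clip([f(b) for b in blocks], lo, hi)
        # Any individual lies in exactly one block, so it changes at most
        # one clamped outcome, by at most (hi - lo); the mean of the m
        # outcomes therefore has sensitivity (hi - lo) / m.
        noise = np.random.laplace(0.0, (hi - lo) / (m * epsilon))
        return float(np.mean(outcomes)) + noise

    # Example: privately estimate a mean via an "arbitrary" f.
    data = np.random.normal(50.0, 5.0, size=10_000)
    print(subsample_and_aggregate(data, np.mean, m=100, lo=0, hi=100, epsilon=0.5))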

Much work in statistics and machine learning addresses the problem of model selection: Given a data set and a discrete collection of “models,” each of which is a family of probability distributions, the goal is to determine the model that best "fits" the data. The function f might be choosing the best model from the given set of m models, a process known as model fitting, via an arbitrary learning algorithm. The choice of aggregation function can even depend on the database, but the selection must be made in a differentially private fashion. The privacy cost is then the cost of composing the choice operation with the aggregation function.

Where is differential privacy used?

Anonymizing public records makes sense when we are dealing with biobank or customer data. These databases can be terabytes or petabytes in size, or even more. Hospitals or private companies query this sensitive data for new research or market studies, and sometimes the data is spread across different clouds. Differential privacy is an important topic in the era of big data, because there is a need to access this data without disclosing personal information.

A biobank is a type of biorepository that stores biological samples (usually human) for use in research. When a study needs to use a biobank, private data must not be disclosed. Institutions that hold these repositories, and even governments, are very sensitive about disclosing data. There are also studies that need to aggregate data from different repositories located in different places and connected through the internet. Differential privacy is really important here because it ensures that query responses do not disclose private information.

What differential privacy does not promise

Differential privacy does not guarantee privacy where none previously existed. If the data being queried is not protected with privacy techniques, differential privacy cannot do anything about it. It only guarantees that an individual's participation in a survey cannot be inferred from the query responses.


  1. The α-trimmed mean is the mean after the top and bottom α fraction of the inputs have been discarded.

  2. The Winsorized mean is similar to the α-trimmed mean except that, rather than being discarded, the top and bottom α fraction are replaced with the most extreme remaining values.

Monday 1 December 2014

On the Efficiency of Durable State Machine Replication (SMR)

Paper here

This paper evaluates what happens when durability (logging, checkpoints, and state transfer) is applied in a modular way to an SMR system.

In the Paxos algorithm, we have a set of servers that run a total order multicast protocol, e.g. Paxos agreement, which ensures that all servers execute the same sequence of operations. In this paper, they modify the SMR to make it durable (executions are saved on disk): even if all machines crash, the client will see the result of its operation after recovery. This kind of system is important because many papers assume Paxos is implemented in critical components to avoid single points of failure; we can see this in coordination services like Zookeeper. In some sense, Paxos/SMR is a critical component of modern internet-scale infrastructure.

Durability is important because sometimes we need to shut down the system, and correlated failures1 happen more often than expected. In practice, application, replication, and durability code are mixed together for efficiency, which can make the code very complex and very difficult to get right.

There are already papers about agreement protocols (PBFT, Zyzzyva), but they focus on synchronizing replicas during normal execution and do not consider durability (logging, checkpoints, and state transfer). This paper is about SMR, but the kind of fault being tolerated is orthogonal to what it shows (i.e., it is not really relevant here). The authors consider a system model of n replicas that tolerates at most f faults (crash faults: n > 2f; arbitrary faults: n > 3f). The important point is that they consider non-synchronous systems - systems that work with quorums - and they require n − f replicas to agree on a message order and execute the messages.

In terms of the programming model, the SMR runs on the server side and the service components can set and get the state. They use a key-value store as a memory table and a disk log as a permanent file. When a read command arrives, it is served from the memory table; when a write command arrives, it updates the memory table and the operation is appended to the disk log.
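A minimal sketch of these two paths (the class and method names are mine, not from the paper or from BFT-SMaRt):

    import os

    class DurableKVStore:
        def __init__(self, log_path):
            self.memtable = {}                      # in-memory key-value state
            self.log = open(log_path, "a")          # append-only operation log

        def write(self, key, value):
            self.memtable[key] = value              # update the memory table...
            self.log.write(f"SET {key} {value}\n")  # ...and append to the log
            self.log.flush()
            os.fsync(self.log.fileno())             # stable (synchronous) logging

        def read(self, key):
            return self.memtable.get(key)           # reads never touch the disk

    store = DurableKVStore("/tmp/oplog.txt")
    store.write("a", 1)
    print(store.read("a"))  # 1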

If you use stable logging, writing to disk costs time. The values below show the achievable write throughput (4 kB writes per second):

Memory       4772 w/sec
Async Disk   4312 w/sec
Sync Disk      63 w/sec
Sync SSD     1017 w/sec

As we can see, it is possible to write into memory 4772 times per second, while writing to disk asynchronously achieves 4312 operations per second. The problem with asynchronous writes is that, if you are really unlucky, you can lose some operations. SSDs cost a lot and only partially solve the problem. The goal of the paper is to get as close as possible to the throughput of writing into memory.
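The gap between the asynchronous and synchronous rows comes down to whether each write waits for the data to reach stable storage. A rough, machine-dependent sketch (paths and names are mine) that reproduces the flavor of the table:

    import os, time

    def writes_per_second(path, n=500, sync=True):
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        block = b"x" * 4096                  # 4 kB writes, as in the table
        start = time.time()
        for _ in range(n):
            os.write(fd, block)
            if sync:
                os.fsync(fd)                 # wait until it reaches stable storage
        elapsed = time.time() - start
        os.close(fd)
        return n / elapsed

    print("async:", writes_per_second("/tmp/log-async", sync=False), "w/sec")
    print("sync: ", writes_per_second("/tmp/log-sync", sync=True), "w/sec")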

If you use logging, you need to control the size of the log, for two reasons:

  1. The log has to be stored, so its size must be bounded.
  2. A long log takes a long time to replay during recovery.

The solution is to take periodic snapshots. In SMR, the usual approach is to take a snapshot whenever the log reaches a certain size.

When a snapshot is taken, the service becomes slow and the system can become unavailable to accept client requests. One solution is to use copy-on-write 2, but this adds complexity to the code. Another solution is to use fuzzy snapshots 3, but that requires additional disks.

When a state transfer is necessary because a replica crashed and recovered, another replica is picked to send the state, and the service must be stopped, because a quorum is needed to execute requests. Ways to mitigate this are to have additional replicas dedicated to handling recoveries, or to simply stop answering client requests during the transfer.

Integrating all of these problems and solutions into the code can make it very complex.

The paper shows how to address these problems without changing the SMR code, proposing three new techniques: parallel logging, sequential checkpoints, and collaborative state transfer.

Parallel logging

For parallel logging, a reply may only be sent to the client after both the state and the log have been updated. It also requires good disk bandwidth for sequential writes (it takes approximately the same time to write 1 kB or 1 MB).

The proposal is to have an agreement layer that keeps ordering requests. The requests are batched and, at some point, handed to a thread that appends the values to the memory table and to the disk log. After the thread finishes, the result is replied to the client, as sketched below.
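A rough sketch of that pipeline (the structure, names, and batch format are mine, not the paper's code): the agreement layer enqueues ordered batches, and a logging thread applies each batch to the memory table and flushes it to disk with a single sync before the replies are released:

    import os, queue, threading, time

    batches = queue.Queue()
    memtable = {}

    def logging_thread(log_path):
        with open(log_path, "ab") as log:
            while True:
                batch = batches.get()          # one ordered batch of write requests
                for key, value in batch:
                    memtable[key] = value      # apply the batch to the memory table
                log.write(b";".join(f"{k}={v}".encode() for k, v in batch) + b"\n")
                log.flush()
                os.fsync(log.fileno())         # one disk sync per batch, not per op
                # ...only now may the replies for this batch be sent to the clients

    threading.Thread(target=logging_thread, args=("/tmp/oplog",), daemon=True).start()
    batches.put([("a", 1), ("b", 2)])          # enqueued by the agreement layer
    time.sleep(0.2)                            # give the sketch time to drain the queue
    print(memtable)                            # {'a': 1, 'b': 2}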

Sequential checkpoints

Normally, checkpoints happen in a synchronized way: all replicas take the same checkpoint at the same time. The benefit of this solution is that it is very easy to recover a machine with the correct checkpoint; the drawback is that it can make the system unavailable to answer client requests. The authors propose sequential checkpoints, where each replica takes its checkpoint at a different instant, so that the replicas are never all checkpointing at once, in order to avoid perturbing the system. The drawback of sequential checkpoints is that they complicate state transfer.
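A toy illustration (entirely mine, not the paper's scheduling rule) of staggering: replica i snapshots when the global operation counter passes i's slot within each checkpoint period, so the snapshots never coincide:

    def checkpoint_due(op_count, replica_id, n_replicas=4, period=1000):
        # Replica i checkpoints at offset i * period / n within each period.
        return op_count % period == replica_id * (period // n_replicas)

    for op in range(0, 2000, 250):
        due = [r for r in range(4) if checkpoint_due(op, r)]
        print(op, "->", due)   # at most one replica is checkpointing at a time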

Collaborative State Transfer

The idea is to ask several replicas to send parts of the current state: one replica sends its checkpoint, and the remaining ones send different parts of their logs. The optimized solution is shown in figure (b) of the paper; in the general CST, a single replica must send the whole state to the recovering host.
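A toy sketch (all names mine) of the optimized case with n = 4 and f = 1: one donor ships the checkpoint, the other two ship disjoint slices of the log, and the recovering replica replays the log on top of the checkpoint:

    class Replica:
        def __init__(self, checkpoint, log):
            self.checkpoint = checkpoint   # state snapshot at some log position
            self.log = log                 # operations executed after the snapshot

    def collaborative_state_transfer(donors):
        state = dict(donors[0].checkpoint)       # checkpoint from one replica
        n_slices = len(donors) - 1
        replay = []
        for i, donor in enumerate(donors[1:]):   # log slices from the others
            size = len(donor.log)
            replay.extend(donor.log[i * size // n_slices:(i + 1) * size // n_slices])
        for key, value in replay:                # replay the log over the checkpoint
            state[key] = value
        return state

    donors = [Replica({"a": 1}, [("b", 2), ("c", 3)]) for _ in range(3)]
    print(collaborative_state_transfer(donors))  # {'a': 1, 'b': 2, 'c': 3}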

Evaluation

They implemented these features in a durability layer for the BFT-SMaRt4 replication library.

Figure 9 shows the throughput obtained with parallel logging. Since they stress the system to its maximum, we can see that with parallel logging the results are similar to a pure in-memory system (red line in Figure 9 (a)). Also, for large writes, parallel logging on disk can be slightly better than on SSD, because the disk absorbs more data and can reach pure-memory throughput; SSDs, however, win in terms of latency.

Figure 10 shows the throughput during sequential checkpointing. The graphs cover two executions, one under high load and another under medium load, for a 500 MB or 1 GB state. Focusing on the high-load case in Figure 10 (b), we can see two perturbations in the throughput, related to lagging-replica synchronization: some replicas get ahead of the others, and the system slows down so that the lagging replicas can pick up the pace and participate in the agreement again. Under medium load this problem never occurs, because there is always time for the replicas to keep up with the system.

Figure 11 shows what happens while a replica is recovering. In this case, they consider f = 1 (n = 4) with state validation.

With collaborative state transfer (CST), the workload is spread among the 3 remaining replicas. We can still see perturbations, but the result is better than with a single-replica state transfer. CST is represented by the blue line, and we can see a perturbation between 90 s and 120 s (Figure 11 (b)). The CST perturbation is about one third of the single state transfer's, which happens because, in this execution, the log parts being fetched have roughly the same size as the checkpoint.

With the usual state transfer protocol (asking the whole state from a single replica), the system keeps 24% of its normal throughput during the transfer, versus 60% with CST. Summing up, it is much better to use CST than single-replica state transfer, but the protocol is more complex.

In conclusion, durable SMR is a fundamental technique for critical services, but it is complex and difficult to get right. In this paper, the authors showed a way to deal with the durability overhead without breaking the modularity of the SMR model, and presented three techniques: parallel logging, sequential checkpoints, and collaborative state transfer.


  1. http://psteitz.blogspot.pt/2011/10/correlated-failure-in-distributed.html

  2. Copy-on-write is the name given to the policy that whenever a task attempts to make a change to the shared information, it should first create a separate - private - copy of that information to prevent its changes from becoming visible to all the other tasks.

  3. Zookeeper takes snapshots at random times while the system is processing. This works in Zookeeper because checkpoints are not synchronized with the log and the operations are idempotent: even if, when recovering a replica, an operation appears both in the log and in the checkpoint, it can be executed as many times as needed and the final state will always be the same.

  4. https://code.google.com/p/bft-smart/