Tuesday, 6 January 2015

Privacy Integrated Queries (PINQ)

Paper here

PINQ is a LINQ-like [1] API for computing on privacy-sensitive data sets while providing differential privacy guarantees for the underlying records. It aims to be a trustworthy platform for privacy-preserving data analysis, giving analysts private access to arbitrarily sensitive data. A user only needs to write a declarative program and does not need to worry about the privacy of the data. To guarantee that queries do not disclose individual records, the PINQ layer acts as an interface that only permits certain statistical queries.

There is a large body of research on privacy-preserving data analysis, resulting in many distinct approaches. Each of these works requires substantial effort in design, analysis, and implementation. Programs written in PINQ do not.

Besides differential privacy, there are two approaches to protecting data from analysts who want to inspect all of it:

  1. Anonymization: the data provider does not need to trust or understand the analyst who examines the data. The drawback is that the analyst does not get access to the rich, high-fidelity data and has to work with data of reduced quality. Sanitizing data well is also not easy.
  2. Sending the analysis to the data: the analyst ships the analysis to the data set, where it runs on the raw data. The drawback is that the provider must allow the analyst's code to execute against the full data.

With PINQ, the analysis runs on the full data and the provider does not need to understand or trust the analyst. PINQ is an intermediate layer that the analyst interacts with to get the results they want. PINQ guarantees that the responses do not disclose private information.

To understand the importance of PINQ, let's look at a LINQ example that counts the ZIP codes containing many cricket fans. Suppose the program was written by an analyst (or user).

// Open sensitive data set with security
IQueryable<SearchRecord> queryLog = OpenSecretData(password);

// Group queries by User and identify cricket fans who queried the word more than 5 times
var bigUsers = queryLog.Where( x=> x.Query == "cricket")
                       .GroupBy( x=>x.User)
                       .Where(x=> x.Count() > 5);

// Transform each cricket fan into the Zip code by IP address
var userZIPs = bigUsers.Join(IPandZIP, x=>x.IP, y=>y.IP, (x,y) => y.ZIPCode);

// Count up ZIP codes containing more than 10 cricket fans
var PopularZIPs = userZIPs.GroupBy(x=>x)
                          .Where(x=>x.Count() > 10);

Console.WriteLine(PopularZIPs.Count());

The userZIPs collection is built by matching users' IP addresses to ZIP codes. This is considered sensitive data that should never reach the analyst's side.

Now, let's look at the PINQ example.

// Open sensitive data set with security
PINQueryable<SearchRecord> queryLog = OpenSecretData(password);

// Group queries by User and identify cricket fans who queried the word more than 5 times
var bigUsers = queryLog.Where( x=> x.Query == "cricket")
                       .GroupBy( x=>x.User)
                       .Where(x=> x.Count() > 5);

// Transform each cricket fan into the Zip code by IP address
var userZIPs = bigUsers.Join(IPandZIP, x=>x.IP, y=>y.IP, (x,y) => y.ZIPCode);

// Count up ZIP codes containing more than 10 cricket fans
var PopularZIPs = userZIPs.GroupBy(x=>x)
                          .Where(x=>x.Count() > 10);

Console.WriteLine(PopularZIPs.Count(ε));

The ε (epsilon) parameter sets the strength of the differential privacy guarantee for the count over PopularZIPs. Smaller values mean stronger privacy and more noise: as ε approaches 0, the noise grows without bound and the answer carries no accuracy. A value such as 0.1 is more realistic. The right ε depends on the sensitivity of the data and on how useful the analysis is. We could test different values and see what each costs in terms of privacy. See this paper [2] for more information about choosing epsilon.
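To make the trade-off concrete, here is a minimal sketch of a noisy count using Laplace noise with scale 1/epsilon, which is the standard mechanism for a count that changes by at most 1 when one record changes. The names LaplaceSketch, SampleLaplace and NoisyCount are hypothetical helpers written for this post, not the PINQ API, and the sampler is not hardened against floating-point attacks.

```csharp
using System;

// Hypothetical helper for illustration; not part of the PINQ API.
public static class LaplaceSketch
{
    // Draws Laplace noise with mean 0 and the given scale:
    // an exponentially distributed magnitude with a random sign.
    public static double SampleLaplace(Random rng, double scale)
    {
        // 1 - NextDouble() lies in (0, 1], so the logarithm is always finite.
        double magnitude = -scale * Math.Log(1.0 - rng.NextDouble());
        return rng.Next(2) == 0 ? magnitude : -magnitude;
    }

    // Returns the true count plus Laplace noise of scale 1/epsilon.
    // Assumes a single record changes the count by at most 1.
    public static double NoisyCount(long trueCount, double epsilon, Random rng)
    {
        if (double.IsNaN(epsilon) || epsilon <= 0.0)
            throw new ArgumentOutOfRangeException(nameof(epsilon), "epsilon must be positive");
        return trueCount + SampleLaplace(rng, 1.0 / epsilon);
    }
}
```

Calling LaplaceSketch.NoisyCount(100, 0.1, new Random()) adds noise of scale 10, so a typical answer is off by about 10. With epsilon = 1.0 the typical error drops to about 1, at the price of a weaker privacy guarantee.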

You may ask why the PINQ example still uses the sensitive fields IP and ZIPCode. Shouldn't these fields be removed or masked by PINQ?

The main point of PINQ is that you are allowed to use these sensitive fields, but because PINQ perturbs aggregates before they are returned to the user, it still provides differential privacy. That is, its privacy guarantees do not come from avoiding sensitive fields, but from restricting the analyst to queries that the data provider can be sure satisfy differential privacy.

Transformations and aggregations

Transformations and aggregations must respect differential privacy. Here are some examples of transformations in LINQ:

Where and Select

IQueryable<T> Where(Expression<Func<T,bool>> predicate)
IQueryable<S> Select<S>(Expression<Func<T, S>> function)

For each of these, changing one input record changes the output by at most one record. We can apply them as many times as we like without compromising subsequent queries.

GroupBy

IQueryable<IGrouping<K,T>> GroupBy<K>(Expression<Func<T,K>> k)

Changing one input record can change at most two output records. When a record changes its key, it is taken out of one group and added to another, so two group objects are modified. In a query with a chain of GroupBy operations, this factor of 2 compounds at every step. A long chain of GroupBy can therefore break differential privacy unless the growth is accounted for.
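The compounding can be sketched as simple bookkeeping. The snippet below assumes that the privacy cost charged for an aggregation is the requested epsilon multiplied by the product of the stability factors of the transformations before it (1 for Where and Select, 2 for GroupBy, as described above). StabilitySketch and its members are hypothetical names for illustration, not PINQ types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical bookkeeping helper; not part of the PINQ API.
public static class StabilitySketch
{
    // Stability factors taken from the text: Where and Select are 1, GroupBy is 2.
    public static readonly IReadOnlyDictionary<string, int> Factor =
        new Dictionary<string, int>
        {
            { "Where", 1 },
            { "Select", 1 },
            { "GroupBy", 2 },
        };

    // Multiplies the factors of all transformations in the pipeline.
    public static int PipelineStability(IEnumerable<string> operations) =>
        operations.Aggregate(1, (acc, op) => acc * Factor[op]);

    // Assumed accounting rule: an aggregation asking for epsilon after this
    // pipeline is charged epsilon times the pipeline's stability.
    public static double ChargedEpsilon(IEnumerable<string> operations, double epsilon) =>
        PipelineStability(operations) * epsilon;
}
```

For a pipeline of Where, GroupBy, Where, GroupBy, Where, the stability is 4, so under this assumption a count requested with epsilon = 0.1 would be charged about 0.4 against the data set.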

Join

IQueryable<R> Join<S,K,R>(IQueryable<S> innerTable,
                          Expression<Func<T,K>> outerKey,
                          Expression<Func<S,K>> innerKey,
                          Expression<Func<T,S,R>> reducer)

A join is a binary transformation that takes two inputs and two key selection functions, and outputs one record for each pair of matching records, built by the reduction function. From a privacy point of view this is very dangerous. E.g., if I have the Yellow Pages in one hand and my medical record in the other, and I join them together, I splatter my medical record across every matching Yellow Pages entry.
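The fan-out is easy to reproduce with plain LINQ to Objects. This sketch uses the standard Enumerable.Join on two in-memory lists. The tuple shapes and the JoinFanOutSketch name are made up for the illustration, and this is ordinary LINQ, not PINQ's Join.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo of join fan-out using plain LINQ to Objects.
public static class JoinFanOutSketch
{
    // Joins directory listings with medical records on the surname and
    // returns how many output rows the join produces.
    public static int JoinedRowCount(
        IEnumerable<(string Surname, string Phone)> listings,
        IEnumerable<(string Surname, string Diagnosis)> records)
    {
        return listings
            .Join(records,
                  listing => listing.Surname,
                  record => record.Surname,
                  (listing, record) => (listing.Phone, record.Diagnosis))
            .Count();
    }
}
```

With 1000 listings that share one surname and a single matching medical record, the join produces 1000 rows. Remove that one record and it produces 0, so a single input record can change an unbounded number of output records.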

A PINQueryable contains only two member variables:

IQueryable source; // any LINQ data source
PINQAgent agent;   // acts as privacy proxy

The PINQAgent is a very important part of PINQ, controlling how much accuracy the analyst should have access to. The agent controls a resource, the units of differential privacy, and is given the opportunity to reject any depletion of this resource. It accepts or rejects requests for additional epsilon.
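A budget-style agent of this kind can be sketched in a few lines. The class below is a hypothetical stand-in written for this post. BudgetAgentSketch and TryCharge are invented names and do not reproduce the actual PINQAgent interface. It only shows the accept-or-reject behaviour described above.

```csharp
using System;

// Hypothetical stand-in for a privacy agent; not the real PINQAgent class.
public sealed class BudgetAgentSketch
{
    private double remaining;

    public BudgetAgentSketch(double budget)
    {
        if (double.IsNaN(budget) || budget < 0.0)
            throw new ArgumentOutOfRangeException(nameof(budget), "budget must be non-negative");
        remaining = budget;
    }

    // Units of differential privacy still available.
    public double Remaining => remaining;

    // Accepts the request and deducts epsilon if enough budget is left;
    // otherwise rejects it and leaves the budget untouched.
    public bool TryCharge(double epsilon)
    {
        if (double.IsNaN(epsilon) || epsilon < 0.0 || epsilon > remaining)
            return false;
        remaining -= epsilon;
        return true;
    }
}
```

With a budget of 1.0, two requests for 0.4 are accepted and a third is rejected, because only about 0.2 remains. A data provider could use a policy like this to cap the total epsilon an analyst spends on one data set.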

You can find more information about the PINQ API here.


  1. LINQ is a database style access language for .NET applications.

  2. Differential Privacy: An Economic Method for Choosing Epsilon, Justin Hsu et al.
