Friday 16 January 2015

Practical Fine-Grained Decentralized Information Flow Control

Paper 1 Here

Paper 2 Here

Access control is any mechanism by which a system grants or revokes the right to access some data or perform some action. Traditional access control mechanisms are all-or-nothing: once an application has the right to read a file, it can do anything with that file's data. In contrast, Decentralized Information Flow Control (DIFC) enforces more powerful policies, such as permitting an application to read a file while disallowing the broadcast of that file's contents over an unsecured network channel.

As an example of the DIFC model, consider Alice and Bob, who want to schedule a meeting while keeping their calendars mostly secret. Alice and Bob each place a secrecy label on their calendar file, and then only a thread carrying those secrecy labels can read it. Once a thread has a secrecy label, it has been tainted by that label, and can no longer write to an unlabeled output, such as standard output or the network. If the thread has the capability to declassify the information, it may remove the secrecy label and then write the data to an unlabeled output. In the calendar example, the program obtains both Alice's and Bob's secrecy labels to read both calendar files, but it cannot remove the labels. When the thread is ready to output an acceptable meeting time, it must call a function that declassifies the result. The declassification function checks that its output contains no secret information - for example, that the output is simply a date and does not include other information from Bob's calendar.

In a DIFC system, any principal can create a new tag for secrecy or integrity. Principals assign labels to data objects: data structures (arrays, lists, etc.) and system resources (files and sockets). A secrecy label prevents the data from being disclosed; an integrity label assures that the data has not been modified since it was endorsed by the principal. For example, a web application might create one secrecy tag for its user database and a separate secrecy tag for each user's data. The secrecy tag on the user database prevents authentication information from leaking to the network, and the tags on user data prevent a malicious user from writing another user's secret data to an untrusted network connection. Likewise, if Microsoft endorses a data file and its integrity is preserved, a user who trusts Microsoft's label can trust the file's content.
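
To make the label model concrete, here is a minimal Python sketch of the flow rule (the names and structure are my own illustration, not the actual API of any DIFC system): a flow from a source to a destination is allowed only if the destination carries at least the source's secrecy tags and the source carries at least the destination's integrity tags.

```python
# Hypothetical sketch of the DIFC label-flow rule (illustrative only).

def can_flow(src_secrecy, src_integrity, dst_secrecy, dst_integrity):
    """Allow a flow only if no secrecy tag is dropped and no integrity tag is gained."""
    return src_secrecy <= dst_secrecy and src_integrity >= dst_integrity

alice_calendar = {"alice_cal"}                 # secrecy tag on Alice's calendar
thread_secrecy = {"alice_cal", "bob_cal"}      # thread tainted by both calendars
network_secrecy = set()                        # unlabeled network output

# The tainted thread can read both labeled calendars...
assert can_flow(alice_calendar, set(), thread_secrecy, set())
# ...but it cannot write to the unlabeled network until it declassifies.
assert not can_flow(thread_secrecy, set(), network_secrecy, set())
```

The same check covers integrity: data endorsed with an integrity tag may only flow from sources that still carry that tag.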

Programmers express security policies by labeling data with secrecy and integrity labels, and they access labeled data only inside security regions.

Several systems implement the DIFC model, such as HiStar, Flume, and Laminar. I will focus on Flume and Laminar.

Laminar

Laminar is a decentralized information flow control (DIFC) system that combines language-level (PL) and operating-system-level (OS) enforcement. It is the first system to implement decentralized information flow control using a single set of abstractions for OS resources and heap-allocated objects.

The Laminar OS extends a standard operating system with a Laminar security module for information flow control. The security module governs information flows through all standard OS interfaces, including devices, files, pipes, and sockets. The OS regulates communication between threads of the same or different processes that access labeled or unlabeled system resources, or that use OS inter-process communication mechanisms such as signals. OS enforcement applies to all applications, preventing unlabeled or non-Laminar applications from circumventing the DIFC restrictions.

The Laminar VM regulates information flow between heap objects and between threads of the same process via these objects. These flows are regulated by inserting dynamic DIFC checks in the application code. Because the Laminar VM regulates these flows within the address space, the OS allows data structures and threads to have heterogeneous labels. All threads in multithreaded processes without a trusted VM must have the same labels and capabilities.

The Laminar OS exports security system calls to the trusted VM for capability and label management. The VM and the OS do not allow code outside a security region to access labeled data objects. During the execution of a security region, the VM gives the thread the labels and capabilities of the security region so that the OS can mediate access to system resources according to the region's labels. Security regions are not visible to the OS, so the thread itself must carry the labels and capabilities given to it by the VM. At the end of the security region, the VM restores the thread's original capabilities and labels.
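
This enter/exit discipline can be sketched in a few lines of Python, assuming a hypothetical security_region helper (not Laminar's real interface): the thread temporarily carries the region's labels and capabilities, and its originals are restored even if the region's code fails.

```python
from contextlib import contextmanager

class Thread:
    """Illustrative stand-in for a thread's security state."""
    def __init__(self):
        self.labels = set()
        self.capabilities = set()

@contextmanager
def security_region(thread, labels, capabilities):
    """Give the thread the region's labels/capabilities; restore the originals on exit."""
    saved = (thread.labels, thread.capabilities)
    thread.labels, thread.capabilities = labels, capabilities
    try:
        yield thread
    finally:
        thread.labels, thread.capabilities = saved

t = Thread()
with security_region(t, {"alice_cal"}, {"declassify_alice"}):
    assert "alice_cal" in t.labels   # inside: labeled, may read Alice's data
assert t.labels == set()             # outside: original (empty) labels restored
```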

Laminar is built by modifying Jikes RVM and the Linux operating system to provide DIFC. Jikes RVM is a Java VM implemented mostly in Java itself. Airavat uses Laminar to enforce its security policies. Laminar is publicly available in the Jikes Archives.

Flume

Flume is an open-source infrastructure for secure web applications based on the DIFC model. Unlike HiStar, which is an entire operating system, Flume is designed as a component within the kernel. An immediate consequence is that any process running outside Flume's control is vulnerable. There are currently two implementations of Flume, one for Linux and one for OpenBSD. The Linux implementation runs as a component within the kernel, while the OpenBSD version uses systrace system calls.

A typical Flume application consists of processes of two types. Untrusted processes do most of the computation. They are constrained by, but possibly unaware of, DIFC controls. Trusted processes, in contrast, are aware of DIFC and set up the privacy and integrity controls that constrain untrusted processes. Trusted processes also have the privilege to selectively violate classical information flow control - for instance, by declassifying private data (perhaps to export it from the system), or by endorsing data as high integrity.

Flume was evaluated with FlumeWiki, a wiki built from MoinMoin. Flume is about 34% slower than Moin on writes, and 43% slower on reads.

Linux Secure Modules (LSM)

Both Flume and Laminar use LSM hooks into the kernel to implement custom authorization rules. LSM was designed to provide everything needed to successfully implement a mandatory access control module, while imposing the fewest possible changes to the Linux kernel.

Flume's LSM policy disallows all direct access to file systems by confined processes. Fork is also blocked in Flume, because all process spawning must occur through the dedicated spawner.

Laminar uses LSM to intercept inode and file accesses, which are used to perform operations on files and file handles, and it performs a straightforward check of the secrecy rules and labels.

Monday 12 January 2015

Perspectives on the CAP theorem

Paper here

This paper is a reflection on the CAP theorem, placing it within the broader context of distributed computing theory.

The CAP theorem appeared in the context of web services. A web service is implemented by a set of servers, perhaps distributed over a set of geographically distant data centers. The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  1. Consistency (all nodes see the same data at the same time)
  2. Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  3. Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
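
The tension between these guarantees can be illustrated with a toy two-replica simulation (entirely hypothetical, not from the paper): during a partition, a "CP" system refuses writes to stay consistent, while an "AP" system accepts them and lets the replicas diverge.

```python
class Replica:
    """Toy replica holding one value and a link to its (only) peer."""
    def __init__(self):
        self.value = 0
        self.peer_reachable = True

def write(replica, value, mode):
    """CP mode sacrifices availability; AP mode sacrifices consistency."""
    if mode == "CP" and not replica.peer_reachable:
        raise RuntimeError("partitioned: refusing write to stay consistent")
    replica.value = value  # AP: accept the write; replicas may now diverge

a, b = Replica(), Replica()
a.peer_reachable = b.peer_reachable = False   # a network partition occurs

# CP choice: the write is rejected, so the system is unavailable.
try:
    write(a, 42, "CP")
    cp_available = True
except RuntimeError:
    cp_available = False

# AP choice: the write succeeds, but the replicas no longer agree.
write(a, 42, "AP")
print(cp_available, a.value == b.value)   # -> False False
```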

There is a trade-off that has been much discussed ever since: the impossibility of guaranteeing both safety and liveness in an unreliable distributed system.

Consistency (as defined in the CAP theorem) is a classic safety property: every response sent to a client is correct. Recall that an algorithm is safe, or consistent, if nothing bad ever happens. In the case of web services, consistency means returning the right response to the client's request; thus, what counts as consistent depends on the service provided.

Availability is a classic liveness property: eventually, every request receives a response. An algorithm is live if eventually something good happens. Obviously, a fast response is better than a slow response, but whether a slow response is acceptable depends on the system requirements. In a real-time system, a late response can be enough to cause problems; for a purchase on the Amazon site, it is acceptable.

Partitions, crash failures, message loss, and malicious attacks can all make a system unreliable. The CAP theorem only considers network partitions, which can happen because some servers become unreachable or because a router has partitioned the network. This is a common failure mode that affects distributed systems.

The paper states that it is impossible to achieve both consistency and availability in an unreliable distributed system, and it is necessary to sacrifice one of these two properties, or even both properties.

  1. There are systems that guarantee strong consistency and provide best effort availability.
  2. There are systems that guarantee availability, and provide best effort consistency.
  3. Some systems may sacrifice both consistency and availability.

This sacrifice is acceptable if the characteristics of the service being designed allow it. If not, the service is not feasible.

Tuesday 6 January 2015

Privacy Integrated Queries (PINQ)

Paper here

PINQ is a LINQ-like1 API for computing on privacy-sensitive data sets while providing differential privacy guarantees for the underlying records. It is a trustworthy platform for privacy-preserving data analysis that provides private access to arbitrarily sensitive data. A user just needs to write a declarative program and does not need to worry about the privacy of the data: the PINQ layer is an interface that only permits statistical queries, guaranteeing that querying will not disclose individual records.

There is a sheer number of research papers on privacy-preserving data analysis, resulting in many distinct approaches, and each one needs substantial effort in design, analysis, and implementation. Programs written in PINQ do not.

Besides differential privacy, there are two common approaches to protecting data from analysts who want to inspect all of it:

  1. Anonymization means the data provider does not need to trust or understand the analyst who examines the data. The drawback is that the analyst does not get access to the rich, high-fidelity data; they have to work with data that is reduced, especially in quality. It is not easy to sanitize data.
  2. The analyst sends the analysis to the data set, and it runs on the raw data. The drawback is that the provider must allow the analyst's code to execute with access to the full data.

With PINQ, the analysis runs against the full data, and the provider does not need to understand or trust the analyst. PINQ is an intermediate layer that the analyst interacts with to get the results he wants, and it guarantees that the responses do not disclose private information.

To help understand the importance of PINQ, let us look at a LINQ example that counts the number of people who like the sport cricket. Let's suppose that the program was written by an analyst (or user).

// Open sensitive data set with security
IQueryable<SearchRecord> queryLog = OpenSecretData(password);

// Group queries by User and identify cricket fans who queried the word at least 5 times
var bigUsers = queryLog.Where( x=> x.Query == "cricket")
                       .GroupBy( x=>x.User)
                       .Where(x=> x.Count() > 5);

// Transform each cricket fan into the Zip code by IP address
var userZIPs = bigUsers.Join(IPandZIP, x=>x.IP, y=>y.IP, (x,y) => y.ZIPCode);

// Count up ZIP codes containing at least 10 cricket fans
var PopularZIPs = userZIPs.GroupBy(x=>x)
                          .Where(x=>x.Count() > 10);

Console.WriteLine(PopularZIPs.Count());

The userZIPs variable contains the IP address and ZIP code of the users. This is sensitive data that should never reach the user side.

Now, let's look at the PINQ example.

// Open sensitive data set with security
PINQueryable<SearchRecord> queryLog = OpenSecretData(password);

// Group queries by User and identify cricket fans who queried the word at least 5 times
var bigUsers = queryLog.Where( x=> x.Query == "cricket")
                       .GroupBy( x=>x.User)
                       .Where(x=> x.Count() > 5);

// Transform each cricket fan into the Zip code by IP address
var userZIPs = bigUsers.Join(IPandZIP, x=>x.IP, y=>y.IP, (x,y) => y.ZIPCode);

// Count up ZIP codes containing at least 10 cricket fans
var PopularZIPs = userZIPs.GroupBy(x=>x)
                          .Where(x=>x.Count() > 10);

Console.WriteLine(PopularZIPs.Count(є));

The є (epsilon) guarantees differential privacy for the set PopularZIPs. If I set є=0, I get zero accuracy: the noise is unbounded. A more realistic value is 0.1. The right є depends on the sensitivity of the data and on how useful the analysis must be; we can try different values and see what each costs in terms of privacy. You can read this paper 2 for more information about choosing epsilon.
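
As a rough illustration of how epsilon controls accuracy, here is a Python sketch of the Laplace mechanism for a count query (my own example; PINQ's actual noisy Count differs in details): noise is drawn from Laplace(1/є), so a smaller є means stronger privacy and a larger expected error.

```python
import math
import random

def noisy_count(true_count, epsilon):
    """Add Laplace(1/epsilon) noise: the sensitivity of a count is 1."""
    u = random.random() - 0.5   # inverse-CDF sampling of a Laplace variable
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(0)
# Average absolute error over many trials is roughly 1/epsilon.
avg_err = {eps: sum(abs(noisy_count(1000, eps) - 1000) for _ in range(2000)) / 2000
           for eps in (0.01, 0.1, 1.0)}
# Smaller epsilon -> stronger privacy -> larger error.
assert avg_err[0.01] > avg_err[0.1] > avg_err[1.0]
```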

You may ask why the PINQ example still uses the sensitive fields IP and ZIPCode. Shouldn't these fields be removed, or masked by PINQ?

The main point of PINQ is that you are allowed to use these sensitive fields: because PINQ perturbs aggregates before they are returned to the user, it still provides differential privacy. That is, its privacy guarantees are not the result of avoiding sensitive fields, but of restricting analysts to queries that the data provider can be sure carry differential privacy guarantees.

Transformations and aggregations

Transformations and aggregations must respect differential privacy. Here are some examples of transformations in LINQ:

Where and Select

IQueryable<T> Where(Expression<Func<T,bool>> predicate)
IQueryable<S> Select<S>(Expression<Func<T,S>> function)

A change to a single input record changes the output by at most one record. We can therefore apply these transformations as many times as we like without compromising subsequent queries.

GroupBy

IQueryable<IGrouping<K,T>> GroupBy<K>(Expression<Func<T,K>> k)

Changing one input record can change at most two output records: when an element moves into or out of a group, both the old and the new group objects are modified. A chain of GroupBy operations therefore increases the sensitivity by a factor of 2 at each step, and the noise must grow accordingly; otherwise, a chain of GroupBys could break differential privacy.
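
This bookkeeping can be sketched as a stability factor per transformation (the table and names are illustrative, not PINQ's internal representation): Where and Select have factor 1, GroupBy factor 2, and factors multiply along a chain, so the noise added to a final aggregate must be scaled by the product.

```python
# Hypothetical stability bookkeeping (illustrative, not PINQ's API).
STABILITY = {"Where": 1, "Select": 1, "GroupBy": 2}

def chain_stability(ops):
    """Changing one input record changes at most this many output records."""
    s = 1
    for op in ops:
        s *= STABILITY[op]
    return s

assert chain_stability(["Where", "Select"]) == 1        # stable: no amplification
assert chain_stability(["GroupBy", "GroupBy", "GroupBy"]) == 8   # doubles per GroupBy
```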

Join

IQueryable<R> Join<S,K,R>(IQueryable<S> innerTable,
                          Expression<Func<T,K>> outerKey,
                          Expression<Func<S,K>> innerKey,
                          Expression<Func<T,S,R>> reducer)

Join is a binary transformation that takes two inputs and two key selection functions, and outputs pairs of records using the reduction function. From a privacy point of view this is very dangerous: e.g., if I have the yellow pages in one hand and my medical record in the other, and I join them together, I splatter my medical record across every matching yellow pages entry.

A PINQueryable contains only two member variables:

IQueryable source;  // any LINQ data source
PINQAgent agent;    // acts as privacy proxy

The PINQAgent is a very important part of PINQ: it controls how much accuracy the analyst can obtain. The agent manages a resource - the units of differential privacy - and is given the opportunity to reject any depletion of this resource, accepting or rejecting each request for additional epsilon.
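
A minimal sketch of such an agent, assuming a simple fixed budget (the real PINQAgent is more general and can delegate or revisit decisions):

```python
class PINQAgentSketch:
    """Illustrative privacy-budget proxy (not the real PINQAgent API)."""
    def __init__(self, budget):
        self.budget = budget   # total units of differential privacy available

    def apply(self, epsilon):
        """Accept the request and deplete the budget, or reject it."""
        if epsilon > self.budget:
            return False
        self.budget -= epsilon
        return True

agent = PINQAgentSketch(budget=1.0)
assert agent.apply(0.1)        # first query: accepted
assert agent.apply(0.9)        # budget is now exhausted
assert not agent.apply(0.1)    # further requests are rejected
```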

You can check more information about PINQ API here.


  1. LINQ is a database style access language for .NET applications.

  2. Differential Privacy: An Economic Method for Choosing Epsilon, Justin Hsu et al.

Friday 2 January 2015

Airavat: Security and Privacy for MapReduce

Paper Here

This paper presents Airavat, a MapReduce-based system which provides strong security and privacy guarantees for distributed computations on sensitive data. It integrates access control and differential privacy to protect data in the cloud.

Anonymization is insecure against attackers with external information and can lead to data leaks. Airavat instead uses access control policies and differential privacy to guarantee anonymization of the data.

Airavat can be used in applications that need data to be available in the cloud while keeping it private during data mining. For instance, clouds holding genomic data do not want to disclose their content; an algorithm called CloudBurst, which maps next-generation sequencing data to the human genome, uses Airavat.

Data providers control the security policy for their sensitive data. Users without security expertise can perform computations on the data, but Airavat confines these computations, preventing information leakage beyond the provider's policy. The prototype is efficient, with run times on Amazon's cloud computing infrastructure within 32% of a MapReduce system with no security.

Airavat can run on a cloud infrastructure, like AWS. The application trusts the cloud infrastructure and the data provider that uploads the data, but it does not trust the computation provider, which may be malicious (even unintentionally) and perform damaging tasks.

Airavat is built in Java and uses both Jikes RVM and Sun's JVM. Jikes RVM is not mature enough to run code as large and complex as the Hadoop framework, so Airavat uses Hadoop's streaming feature to ensure that mappers run on Jikes RVM while most of the framework executes on Sun's JVM.

Programming model

They have split MapReduce into untrusted mappers and a trusted reducer. Several techniques are applied on the mapper side so that its untrusted code can be confined. Reducers are trusted because they compute over data already formatted according to Airavat's policies.

Mappers

There are a few challenges that must be considered on the map side:

  1. The untrusted mapper code may copy data and send it over the network.
  2. The output of the mapper computation can also be an information channel. E.g., there could be a piece of information in the map output that identifies a person.

Mandatory access control and differential privacy can be combined to prevent leaks and obtain end-to-end privacy. The former prevents leaks through storage channels like network connections or files; the latter prevents leaks through the output of the computation.

Airavat confines the untrusted code by adding mandatory access control (MAC) policies. Airavat runs over SELinux (a Linux kernel security module that provides mechanisms for supporting access control security policies), which lets a user define policies that create trusted and untrusted domains and restrict their interactions. E.g., mappers that run untrusted code cannot access the network.

Airavat labels the input, intermediate values, and output data. Every time labeled data is accessed, security checks ensure that untrusted code does not leak it; only the reducer output is unlabeled.

Access control solves many problems, but it is not enough. The output at the end of the execution can still disclose private data by mistake once the label is removed, so a mechanism is needed to enforce that the output does not violate an individual's privacy - differential privacy.

A mechanism is differentially private if every output is produced with similar probability whether or not any given input is included. A way to achieve this is by adding noise to the calculation: e.g., the result of the computation f over x becomes f(x)+noise. Differential privacy relies on the notion of function sensitivity, which measures the maximum change in the function's output when a single item is removed from or added to its dataset. Differential privacy works best with low-sensitivity computations: a slightly changed input produces a similar output, so a malicious user cannot disclose data through crafted comparisons of results.

Mappers in Airavat can be any piece of Java code, but a range for the mapper outputs must be declared before starting a job, to estimate the sensitivity and determine how much noise is added to the outputs to ensure differential privacy. If a mapper produces a value outside the range, it is replaced by a value inside the range, and the user is never notified, in order to prevent a possible information leak.
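
The range enforcement can be sketched as a simple clamp (illustrative only; Airavat's actual implementation operates on mapper key/value outputs):

```python
def clamp_to_range(value, lo, hi):
    """Silently replace out-of-range mapper outputs; the user is not notified."""
    return min(max(value, lo), hi)

# With a declared output range of [0, 100], out-of-range values are pulled
# inside it, so the declared range - not the actual data - bounds sensitivity.
assert clamp_to_range(250, 0, 100) == 100   # too large: clamped down
assert clamp_to_range(-5, 0, 100) == 0      # too small: clamped up
assert clamp_to_range(42, 0, 100) == 42     # in range: unchanged
```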

Airavat ensures that mapper invocations are independent: a single input record is only allowed to affect the key/value pairs output for that record. A mapper may not store an input and use it later when processing another input; if that happened, the sensitivity analysis would be wrong, because one input could affect other outputs. They modified the JVM to enforce mapper independence: each object is assigned an invocation number, and the JVM prevents the reuse of objects from previous invocations.

Reducers

Airavat trusts the reducers because they are provided by Airavat itself, and they ensure differential privacy for the reducer's output. Reducers perform general computation, and the exact sensitivity for differential privacy can be taken as the number of distinct keys in the map output, so there is no need to estimate the sensitivity before executing the reduce tasks, as there is on the map side.

Using differential privacy on the reducer output ensures data privacy once Airavat's labels are removed.

Airavat + SELinux

Mandatory access control (MAC) is a useful building block for distributed computations. MAC-based operating systems enforce a single access control policy for the entire system. This policy, which cannot be overridden by users, prevents information leakage via storage channels such as files (e.g., files in HDFS), sockets, and program names. Mandatory access control requires that only someone who has access rights to all inputs should have access rights to the output.

SELinux is a Linux kernel security module that provides a mechanism for supporting access control security policies using mandatory access controls (MAC).

Other systems, like PINQ and similar algorithms, can work on a normal operating system. In this paper, Airavat runs over an SELinux distribution to control access rights to the data and prevent data leaks.

Airavat trusts the cloud provider and the cloud-computing infrastructure, and it assumes that SELinux correctly implements MAC and relies on its MAC features. When running malicious computations, an untrusted mapper has full control over the mapper code and could leak information through the keys it emits. Therefore, Airavat never outputs keys produced by untrusted mappers; a noisy response is returned instead.

For example, Airavat can be used to compute the noisy answer to the query "What is the total number of iPods and pens sold today?" because the two keys, iPod and pen, are declared as part of the computation. The query "List all items and their sales" is not allowed in Airavat unless the mapper is trusted, because a malicious mapper could leak information by encoding it in item names. Trusted Airavat reducers always sort keys prior to outputting them; therefore, a malicious mapper cannot use key order as a channel to leak information about a particular input record.

A malicious mapper may attempt to encode information by emitting a certain combination of values associated with different keys. Trusted reducers use the declared output range of mappers to add sufficient noise to ensure differential privacy for the outputs. In particular, a combination C of output values across multiple keys does not leak information about any given input record r, because the probability of Airavat producing C is approximately the same with or without r in the input dataset. In other words, two queries differing in a single input record produce near-equivalent output distributions.

Conclusion

In summary, MapReduce was modified to support mandatory access control, an SELinux policy was created (with domains for trusted and untrusted programs), and the JVM was modified to enforce mapper independence.

Having to change the JVM to run Airavat may not seem very wise, since it would be simpler to restart the JVM (perhaps with a garbage collection pass) before running every map task. They preferred modifying the JVM because restarting would impose a higher overhead - a reasonable trade-off if we want high security.

See PINQ