This paper presents Airavat, a MapReduce-based system which provides strong security and privacy guarantees for distributed computations on sensitive data. It integrates access control and differential privacy to protect data in the cloud.
Anonymization alone is insecure against attackers with external information and can lead to data leaks. Airavat instead uses an access control policy together with differential privacy to protect the privacy of the data.
Airavat suits applications that need their data to be available in the cloud while keeping it private during data mining. For example, clouds holding genomic data do not want to disclose their content. An algorithm called Cloudburst uses Airavat for mapping next-generation sequence data to the human genome.
Data providers control the security policy for their sensitive data. Users without security expertise can perform computations on the data, but Airavat confines these computations, preventing information leakage beyond the provider's policy. The prototype is efficient, with run times on Amazon's cloud computing infrastructure within 32% of a MapReduce system with no security.
Airavat can run on a cloud infrastructure such as AWS. It trusts the cloud infrastructure and the data provider that uploads the data, but it does not trust the computation provider, whose code may be malicious (even unintentionally) and perform damaging tasks.
Airavat is built in Java and uses both Jikes RVM and Sun's JVM. Jikes RVM is not mature enough to run code as large and complex as the Hadoop framework, so Airavat uses Hadoop's streaming feature to ensure that mappers run on Jikes while most of the framework executes on Sun's JVM.
Programming model
The authors split MapReduce into untrusted mappers and trusted reducers. Several techniques are applied to the mappers so that untrusted mapper code can be run safely. Reducers are trusted because they compute over data already formatted according to Airavat's policies.
Mappers
There are a few challenges to consider on the map side:
- The untrusted mapper code can copy data and send it over the network.
- The output of the mapper computation can also be an information channel. E.g., a piece of information in the map output data could identify a person.
Mandatory access control and differential privacy can be used together to prevent leaks and provide end-to-end privacy. The former prevents leaks through storage channels such as network connections or files. The latter prevents leaks through the output of the computation.
Airavat confines the untrusted code by adding mandatory access control (MAC) and a security policy. Airavat runs on SELinux (a Linux kernel security module that provides mechanisms for supporting access control security policies), which allows a user to define policies that create trusted and untrusted domains in order to restrict interactions. E.g., mappers that run untrusted code cannot access the network.
Airavat labels the input, the intermediate values, and the output data. Every time a user accesses labelled data, security checks ensure that untrusted code does not leak any data. Only the reduce output is not labelled.
Access control solves many problems, but it is not enough. The output at the end of the execution can still disclose private data once the label is removed. Mechanisms are therefore needed to ensure that the output does not violate an individual's privacy, and this is the role of differential privacy.
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not. One way to achieve this is to add noise to the calculation. E.g., the result of the computation f over x will be f(x) + noise. Differential privacy uses the notion of function sensitivity. A function's sensitivity measures the maximum change in the function's output when any single item is removed from or added to its dataset. Differential privacy works better with low-sensitivity computations. Calibrating the noise to the sensitivity ensures that slightly changed inputs produce similar outputs, so a malicious user cannot disclose data by comparing the results of carefully chosen inputs.
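The f(x) + noise idea can be sketched as follows. This is a minimal illustration and not Airavat's code: it assumes the noise is drawn from a Laplace distribution with scale sensitivity/epsilon (the standard Laplace mechanism), and the names `laplace_noise`, `noisy_result`, and the privacy parameter `epsilon` are introduced here for illustration only.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample zero-mean Laplace noise with the given scale.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_result(true_value: float, sensitivity: float,
                 epsilon: float, rng: random.Random) -> float:
    """Return f(x) + noise, with noise calibrated to sensitivity / epsilon.

    Assumption: `epsilon` is a privacy parameter chosen by the data
    provider; smaller values mean more noise and stronger privacy.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)
```

A low-sensitivity computation needs a small noise scale for the same `epsilon`, which is why such computations give more useful answers.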
Mappers in Airavat can be any piece of Java code. The range of mapper outputs must be declared before a job starts, so that Airavat can estimate the sensitivity and determine how much noise to add to the outputs to ensure differential privacy. If a mapper produces a value outside the range, the value is replaced by one inside the range, and the user is never notified, in order to prevent a possible information leak.
Airavat ensures that mapper invocations are independent: only a single input record is allowed to affect the key/value pairs output by a mapper invocation. A mapper may not store an input and use it later when processing another input. If it could, the sensitivity estimate would be wrong, because one input could affect the outputs produced for other inputs. The authors modified the JVM to enforce mapper independence. Each object is assigned an invocation number, and the JVM prevents the reuse of objects from a previous invocation.
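Range enforcement can be illustrated with a small sketch. The text only says that an out-of-range value is replaced by one inside the range; clamping to the nearest bound is an assumption made here, as is the helper `sum_sensitivity`, which assumes each input record contributes a single value to a sum.

```python
def clamp_to_declared_range(value: float, low: float, high: float) -> float:
    """Silently replace an out-of-range mapper output with an in-range one.

    Assumption: the replacement is the nearest bound. No error is raised
    and nothing is reported, since a notification could itself leak data.
    """
    if low > high:
        raise ValueError("declared range is empty")
    return max(low, min(high, value))


def sum_sensitivity(low: float, high: float) -> float:
    """Sensitivity of a sum when each record contributes one clamped value.

    Adding or removing one record changes the sum by at most the largest
    absolute value inside the declared range.
    """
    return max(abs(low), abs(high))
```

With a declared range of 0 to 100, a mapper that emits 150 contributes 100 to the result, and the sensitivity used to calibrate the noise is 100.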
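The invocation-number idea can be modelled with a toy sketch. Airavat enforces this inside the JVM; the `Tagged` and `InvocationGuard` classes below are hypothetical and only illustrate the check, not the actual mechanism.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Tagged:
    """A value stamped with the invocation that created it."""
    value: Any
    invocation: int


class InvocationGuard:
    """Toy model: values may only be read during their own invocation."""

    def __init__(self) -> None:
        self.current = 0

    def begin_invocation(self) -> None:
        # Called once before the mapper processes each input record.
        self.current += 1

    def tag(self, value: Any) -> Tagged:
        return Tagged(value, self.current)

    def read(self, tagged: Tagged) -> Any:
        # Refuse objects that survived from an earlier invocation.
        if tagged.invocation != self.current:
            raise PermissionError("object belongs to a previous invocation")
        return tagged.value
```

A mapper that stashes a tagged value while processing one record and tries to read it while processing the next would hit the `PermissionError`, so one input cannot influence the output of another.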
Reducers
Airavat trusts reducers because they work with the output of mappers that Airavat has already confined, and the reducers themselves ensure differential privacy for their output. Reducers are used for general computation, and so the exact sensitivity for differential privacy can be the number of distinct keys in the map output. Unlike on the map side, there is no need to estimate the sensitivity before executing the reduce tasks.
Applying differential privacy to the reducer output ensures data privacy once Airavat's labels are removed.
Airavat + SELinux
Mandatory access control (MAC) is a useful building block for distributed computations. MAC-based operating systems enforce a single access control policy for the entire system. This policy, which cannot be overridden by users, prevents information leakage via storage channels such as files (e.g., files in HDFS), sockets, and program names. Mandatory access control requires that only someone who has access rights to all inputs should have access rights to the output.
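The rule that only someone with access rights to all inputs may access the output can be sketched with label sets. This is an illustrative model, not SELinux or Airavat code; `output_label` and `may_read` are hypothetical names, and labels are modelled as plain sets of strings.

```python
from typing import FrozenSet, Iterable


def output_label(input_labels: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """The output carries the union of the labels of everything it read."""
    return frozenset().union(*input_labels)


def may_read(clearance: FrozenSet[str], label: FrozenSet[str]) -> bool:
    """A principal may read data only if cleared for every label on it."""
    return label <= clearance
```

For example, an output derived from inputs labelled `{"hospital_a"}` and `{"hospital_b"}` carries both labels, so a user cleared only for `hospital_a` cannot read it.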
SELinux is a Linux kernel security module that provides a mechanism for supporting access control security policies using mandatory access controls (MAC).
Other systems, such as PINQ, and other algorithms (e.g., [1]) can work on a standard operating system. In the paper, Airavat runs on an SELinux distribution to control the access rights to the data and prevent data leakage.
Airavat trusts the cloud provider and the cloud-computing infrastructure. It assumes that SELinux correctly implements MAC and relies on the MAC features. A malicious user who has full control over the untrusted mapper code could try to leak information through the keys. Therefore, Airavat never outputs keys produced by untrusted mappers. Instead, a noisy response is returned.
For example, Airavat can be used to compute the noisy answer to the query "What is the total number of iPods and pens sold today?" because the two keys iPod and pen are declared as part of the computation. The query "List all items and their sales" is not allowed in Airavat unless the mapper is trusted. The reason is that a malicious mapper can leak information by encoding it in item names. Trusted Airavat reducers always sort keys prior to outputting them. Therefore, a malicious mapper cannot use key order as a channel to leak information about a particular input record.
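The iPod and pen example can be sketched as a reducer that only reports declared keys, in sorted order. This is a hypothetical illustration rather than Airavat's reducer: the function name, the choice to discard undeclared keys, and the caller-supplied `noise_fn` are all assumptions.

```python
from typing import Callable, Iterable, List, Set, Tuple


def reduce_declared_keys(
    mapper_output: Iterable[Tuple[str, float]],
    declared_keys: Set[str],
    noise_fn: Callable[[], float],
) -> List[Tuple[str, float]]:
    """Sum values per declared key and return noisy totals in key order."""
    # Start from the declared keys, so keys invented by a mapper never appear.
    totals = {key: 0.0 for key in declared_keys}
    for key, value in mapper_output:
        if key in totals:  # assumption: undeclared keys are dropped
            totals[key] += value
    # Sorting removes emission order as a possible covert channel.
    return [(key, totals[key] + noise_fn()) for key in sorted(totals)]
```

Because the output keys come from the declaration and not from the mapper, an item name such as `"secret-123"` emitted by a malicious mapper never reaches the result.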
A malicious mapper may attempt to encode information by emitting a certain combination of values associated with different keys. Trusted reducers use the declared output range of mappers to add sufficient noise to ensure differential privacy for the outputs. In particular, a combination C of output values across multiple keys does not leak information about any given input record r because the probability of Airavat producing C is approximately the same with or without r in the input dataset. In other words, two datasets that differ in a single record produce nearly indistinguishable results.
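The phrase "approximately the same" can be made precise with the standard definition of ε-differential privacy. The symbols below are not from the text above and are introduced as assumptions: the mechanism is written as A, the dataset as D, S is any set of possible outputs (for instance, one containing the combination C), and ε is the privacy parameter.

```latex
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{A}(D \setminus \{r\}) \in S]
\qquad \text{and} \qquad
\Pr[\mathcal{A}(D \setminus \{r\}) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{A}(D) \in S]
```

For small ε the factor e^ε is close to 1, so observing the output tells an attacker very little about whether r was in the dataset.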
Conclusion
In summary, MapReduce was modified to support mandatory access control, an SELinux policy was created (with domains for trusted and untrusted programs), and the JVM was modified to enforce mapper independence.
Modifying the JVM to run Airavat may not seem very wise, since it would be simpler to restart the JVM (or perhaps run the garbage collector) before every map task. The authors chose to modify the JVM because restarting would incur a higher overhead, and the extra engineering effort is a reasonable cost for strong security.
Related work
See PINQ