Wednesday 19 November 2014

The HybrEx Model for Confidentiality and Privacy in Cloud Computing

Paper here
This paper was presented at HotCloud'11 and proposes a new execution model for confidentiality and privacy in cloud computing. The framework (no evaluation is shown) uses partitioning of data and computation as a way to provide confidentiality and privacy.
The authors assume that many organizations have not widely adopted clouds because of confidentiality and privacy concerns; organizations prefer their internal infrastructure to a third-party cloud service. These concerns are well-founded: researchers have shown that an outside attacker can extract unauthorized information from Amazon EC2, and others found a vulnerability that allowed user impersonation in Google Apps.
Encryption provides only a limited guarantee, since any computation on encrypted data either involves decrypting it first or remains impractical, even with fully homomorphic encryption¹.
In general, a public cloud offers more computing power than a private cloud, mainly because cloud providers operate very high-performance datacenters.
One way to make the cloud more popular is to make it more secure. Recognizing this difficulty, the authors propose a model that uses the public cloud for safe operations and integrates it with the private cloud.
The HybrEx model uses the public cloud only for data and computation classified as public and non-sensitive, e.g., when a company declares that some data or computation is not sensitive. To realize the HybrEx model with MapReduce (MR), it is necessary to partition data and computation, run single MR jobs across the public and private clouds over the Internet, and integrate the results. The main benefit of this model is safely integrating public and private clouds without concerns about confidentiality and privacy. The authors explore this model using the Hadoop MapReduce architecture.
MR is one of the most popular execution environments in cloud computing and is commonly used for embarrassingly parallel problems.
Fig. 2 shows several types of partitioning. Keep in mind that the authors only specify how partitioning could be done with HybrEx; they never show results from a working framework.

Horizontal Partitioning

There are two popular use cases for current public clouds. The first is long-term archiving of an organization's data, where the organization encrypts its private data before storing it in a public cloud. The second is exporting and importing data to run periodic MapReduce jobs in a public cloud. E.g., a company exports data to Amazon Elastic MR, executes a job there, and imports the result back into its own private storage.
The authors say that, in the long-term archiving case (2b), HybrEx MapReduce can run Map tasks that encrypt private data in the private cloud (optionally sanitizing it), transfer the encrypted data to the public cloud via the Shuffle phase, and run Reduce tasks that store the data in the public cloud.
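A minimal sketch of this archiving flow, with the Map side standing in for the private cloud and the Reduce side for the public cloud (`xor_encrypt` is a toy cipher for illustration only, not the paper's mechanism):

```python
import hashlib

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy XOR stream cipher: illustration only, NOT real encryption."""
    keystream = hashlib.sha256(key).digest()
    return bytes(b ^ keystream[i % len(keystream)] for i, b in enumerate(data))

def map_task(record_id, record, key):
    # Runs in the private cloud: encrypt before any data leaves.
    return record_id, xor_encrypt(record.encode(), key)

def reduce_task(record_id, ciphertext, public_store):
    # Runs in the public cloud: persist ciphertext only.
    public_store[record_id] = ciphertext

public_store = {}
rid, ct = map_task("r1", "confidential record", b"key")
reduce_task(rid, ct, public_store)

# Only ciphertext crossed the cloud boundary; the private cloud can
# still decrypt what the public cloud archived.
assert public_store["r1"] != b"confidential record"
assert xor_encrypt(public_store["r1"], b"key") == b"confidential record"
```

The key never leaves the private side, so the public cloud sees only opaque bytes during the Shuffle and Reduce phases.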

Vertical Partitioning

In vertical partitioning (2c), the model uses the public cloud by executing MR jobs (e.g., two jobs) independently in the public and private clouds, avoiding inter-cloud shuffling of intermediate data. Each job consumes data from its respective cloud and stores the result in the same cloud. HybrEx MapReduce can run an MR job this way when the job can process private and public data in isolation. Unfortunately, I cannot find a practical case; only the theory.
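A sketch of the idea, with a plain word count standing in for a full MR job; the point is that each cloud runs its own job over its own data and no intermediate data crosses clouds:

```python
from collections import Counter

def word_count(docs):
    """Stand-in for a complete MR job running entirely inside one cloud."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

# Hypothetical datasets: one per cloud, processed in isolation.
private_docs = ["internal budget memo", "budget review"]
public_docs = ["press release draft", "release notes"]

private_result = word_count(private_docs)  # result stays in the private cloud
public_result = word_count(public_docs)    # result stays in the public cloud
```

Since neither job ever shuffles data to the other cloud, no confidential intermediate data can leak.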

Hybrid

In this model (2d), HybrEx uses MR in both the private and public clouds in all three phases (map, shuffle and sort, and reduce).
As said before, three main challenges must be overcome to implement the HybrEx model (data partitioning, system partitioning, and integrity) and to achieve safe integration.

Data Partitioning

This model proposes partitioning as the mechanism for confidentiality and privacy, so it is necessary to know how to partition the data. In this model, data is labelled as public or private; MR detects these labels, places the data in the right cloud, and computes on it accordingly.
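A sketch of label-driven placement (the labels and the private-by-default rule are my assumptions for illustration; the paper only states that labelled data is placed in the right cloud):

```python
def place(records):
    """Route labelled records: only data explicitly labelled public may
    leave the organization; anything else stays private by default."""
    placement = {"public": [], "private": []}
    for label, record in records:
        cloud = "public" if label == "public" else "private"
        placement[cloud].append(record)
    return placement

records = [
    ("public", "press release"),
    ("private", "employee salaries"),
    ("", "unlabelled log"),  # missing label: conservatively kept private
]
assert place(records) == {
    "public": ["press release"],
    "private": ["employee salaries", "unlabelled log"],
}
```

Defaulting unlabelled data to the private cloud is the conservative choice: a labelling mistake then costs performance, not confidentiality.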

System Partitioning

Since the HybrEx model uses both private and public clouds, some components must exist in both clouds, public components must be kept away from private data, and inter-cloud data transfer must be reduced when data really has to cross clouds.
There is a master and a shadow master (its public counterpart), each with its own workers, and there is no master-slave wide-area communication. What the system can have is shuffle proxies that transfer intermediate data between clouds.
Shuffle proxies have the benefit of being separate architectural components where we can apply techniques (caching, aggregation, compression, and de-duplication of intermediate data) to reduce wide-area overhead.
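A hypothetical sketch of two of those techniques in a proxy: de-duplicating identical intermediate partitions and compressing them before the wide-area transfer (the class and its interface are my invention, not the paper's design):

```python
import hashlib
import zlib

class ShuffleProxy:
    """Hypothetical shuffle proxy: caches intermediate partitions by
    content digest and compresses them before crossing the WAN."""

    def __init__(self):
        self.cache = {}      # content digest -> compressed payload
        self.wan_bytes = 0   # bytes actually transferred between clouds

    def send(self, payload: bytes) -> bytes:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in self.cache:
            self.cache[digest] = zlib.compress(payload)
            self.wan_bytes += len(self.cache[digest])  # only new data crosses
        return self.cache[digest]

proxy = ShuffleProxy()
first = proxy.send(b"intermediate data " * 100)
second = proxy.send(b"intermediate data " * 100)  # duplicate: served from cache

assert first == second and len(proxy.cache) == 1
assert zlib.decompress(first) == b"intermediate data " * 100
assert proxy.wan_bytes < len(b"intermediate data " * 100)  # compression won
```

Because the proxy sits between the clouds as its own component, these optimizations can be added or tuned without touching the MR master or workers.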

Integrity

It is necessary to guarantee the integrity of the data and computation delegated to an untrusted public cloud.
For computation integrity, HybrEx MapReduce checks the integrity of the results from the public cloud in two modes that provide different levels of fidelity.
  1. Full integrity checking: The private cloud re-executes every map and reduce task that the public cloud has executed.
  2. Quick integrity checking: The private cloud selectively checks the integrity of the results from the public cloud.
Mode 1 also enables auditing at a later time, but full integrity checking obviously carries a large overhead.
In mode 2, HybrEx MR checks integrity at runtime for probabilistic detection of suspicious activity in the public cloud. E.g., for an MR job that counts words in a public document, we can either add new unique words to the document or select existing words at random from it, and store them in the private cloud. Then we can verify that the result from the public cloud contains accurate counts for these words by running the same job in the private cloud and checking the result in the selected regions.
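A sketch of the "select existing words at random" variant, using a plain word count as the job (the function names and sampling scheme are my assumptions, not the paper's implementation):

```python
import random
from collections import Counter

def word_count(docs):
    """The job whose result the public cloud reports back."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

def quick_check(docs, reported, sample_size=2, seed=0):
    """Recount a random sample of words in the private cloud and compare
    against the counts the public cloud reported."""
    rng = random.Random(seed)
    words = rng.sample(sorted(reported), k=min(sample_size, len(reported)))
    for w in words:
        private_count = sum(doc.split().count(w) for doc in docs)
        if reported[w] != private_count:
            return False  # mismatch: suspicious activity detected
    return True

docs = ["to be or not to be", "that is the question"]
honest = word_count(docs)
assert quick_check(docs, honest)

tampered = Counter(honest)
tampered["to"] = 99  # a cheating public cloud inflates one count
# Sampling every word makes detection certain for this small example;
# with a small sample, detection is only probabilistic.
assert not quick_check(docs, tampered, sample_size=len(tampered))
```

The trade-off is exactly the one the authors describe: the smaller the sample, the cheaper the check, but the lower the probability of catching a misbehaving public cloud.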

Conclusion

This paper presents a model that allows MR and BigTable computation across public and private clouds (I do not discuss BigTable here). In the big data era, it is very expensive to maintain a private, internal cloud to process the terabytes or petabytes involved, so using cloud providers to process them is a good solution. At the same time, some data must not be disclosed; we need to protect the privacy of that data while still using the computational power of the public cloud for the rest of the job.
What this paper should have included is a real implementation of this model and its evaluation. Unfortunately, neither is presented.

  1. Homomorphic encryption is a form of encryption that allows specific types of computations to be carried out on ciphertext, generating an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext.
