02 February 2022

Encryption with HDFS and Kerberos

We at databloom.ai deal with a large ecosystem of big data implementations, most notably HDFS with encryption in transit and at rest. We also see a lot of misconfigurations and want to shed some light on the topic with this technical article. We use plain Apache Hadoop, but the same technical background applies to other distributions like Cloudera.

Encryption of data was and still is one of the hottest topics in data protection, guarding against theft, misuse and manipulation. Hadoop HDFS supports fully transparent encryption in transit and at rest [1], based on a Kerberos implementation and often used across multiple trusted Kerberos domains.

Technology

Hadoop KMS provides a REST API with built-in SPNEGO and HTTPS support, and in most Hadoop distributions it comes bundled with a pre-configured Apache Tomcat.
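For clients and the NN to find the KMS, the key provider URI has to be configured. A minimal sketch; kms-host.main.domain and port 16000 (the Tomcat default in Hadoop 2.x) are placeholders for your environment:

<!-- core-site.xml on clients and the NN; in Hadoop 2.x the
     HDFS-side equivalent is dfs.encryption.key.provider.uri
     in hdfs-site.xml -->
<property>
  <name>hadoop.security.key.provider.path</name>
  <value>kms://https@kms-host.main.domain:16000/kms</value>
</property>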
To keep encryption transparent for the user and the system, each encrypted zone is associated with a SEZK (single encryption zone key), created by interaction between NN and KMS when the zone is declared an encryption zone. Each file within that zone has its own DEK (Data Encryption Key). This behavior is fully transparent: whenever a new file is created, the NN asks the KMS for a new EDEK (encrypted data encryption key), encrypted with the zone's key, and adds it to the file's metadata.
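The KMS REST API can also be exercised directly, which is handy for debugging SPNEGO setups. A quick check, assuming a valid Kerberos ticket (kinit) and the placeholder host/port from above:

# List all key names known to the KMS; curl must be built with SPNEGO
# support, and -k may be needed for self-signed certificates
curl --negotiate -u : "https://kms-host.main.domain:16000/kms/v1/keys/names"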


How the encryption flow in HDFS with Kerberos works:

[Figure: encryption flow in HDFS with Kerberos]

Explanation

When a client wants to read a file in an encrypted zone, the NN provides the EDEK together with the zone key version, and the client asks the KMS to decrypt the EDEK. If the client has permission to read that zone (per HDFS POSIX permissions), it receives the DEK and uses it to read the file. From a DataNode's perspective, the data stream is encrypted; the nodes only ever see encrypted data.
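The transparency is easy to observe on the command line. A small sketch, with /enc_zones/data/file.txt as a placeholder file:

# As a user with POSIX access to the zone, reads return plaintext:
hdfs dfs -cat /enc_zones/data/file.txt
# Via the raw prefix (superuser only), the same file shows ciphertext:
hdfs dfs -cat /.reserved/raw/enc_zones/data/file.txt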

Setup and Use

Hadoop KMS is a cryptographic key management server based on Hadoop's KeyProvider API, first introduced in Hadoop 2.6.
Enabling KMS in Apache Hadoop takes only a few lines of configuration, but it is important to know that KMS does not work without a working Kerberos implementation. Additionally, there are more configuration parameters to be aware of, especially in a multi-domain Kerberos environment, depending on the multi-homed setup of a scaled cluster [3].
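The basic Kerberos wiring of the KMS lives in kms-site.xml. A minimal sketch; the keytab path and the principal are placeholders for your environment:

<property>
  <name>hadoop.kms.authentication.type</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.kms.authentication.kerberos.keytab</name>
  <value>/etc/security/keytabs/kms.service.keytab</value>
</property>
<property>
  <name>hadoop.kms.authentication.kerberos.principal</name>
  <value>HTTP/kms-host.main.domain@MAIN.DOMAIN</value>
</property>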

First, KMS uses the same rule-based principal-mapping mechanism as HDFS when trusted Kerberos realms are involved. That means the same auth_to_local rules that exist in core-site.xml need to be added to kms-site.xml to get encryption working for all trusted domains:

<property>
  <name>hadoop.kms.authentication.kerberos.name.rules</name>
  <value>RULE:[1:$1@$0](.*@\QTRUSTED.DOMAIN\E$)s/@\QTRUSTED.DOMAIN\E$//
RULE:[2:$1@$0](.*@\QTRUSTED.DOMAIN\E$)s/@\QTRUSTED.DOMAIN\E$//
RULE:[1:$1@$0](.*@\QMAIN.DOMAIN\E$)s/@\QMAIN.DOMAIN\E$//
RULE:[2:$1@$0](.*@\QMAIN.DOMAIN\E$)s/@\QMAIN.DOMAIN\E$//
DEFAULT</value>
</property>


The terms TRUSTED.DOMAIN / MAIN.DOMAIN in kms-site.xml are placeholders, describing the trusted and the original (primary) Kerberos realm, respectively.
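Whether such a rule set maps principals as intended can be checked with Hadoop's built-in HadoopKerberosName tool; it evaluates the auth_to_local rules from core-site.xml, and since kms-site.xml should carry the identical rule strings, this makes a quick sanity check (principal names below are placeholders):

hadoop org.apache.hadoop.security.HadoopKerberosName nn/host1.main.domain@MAIN.DOMAIN user1@TRUSTED.DOMAIN

From an administrative standpoint, the day-to-day use is straightforward: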

hadoop key create KEYNAME   # one-time key creation
hadoop fs -mkdir /enc_zones/data
hdfs crypto -createZone -keyName KEYNAME -path /enc_zones/data
hdfs crypto -listZones

First we create a key, then the directory in HDFS we want to encrypt, and finally we declare that directory an encryption zone using the key we created first.
The directory is now accessible only to users granted access via HDFS POSIX permissions; others can neither read nor change its files. To give superusers the possibility to create backups without decrypting and re-encrypting the data, a virtual path prefix for distCp (/.reserved/raw) [2] is available. This prefix allows the block-wise copy of encrypted files, for backup and disaster recovery purposes.
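On top of the HDFS POSIX permissions, access to the key itself can be restricted with per-key ACLs in kms-acls.xml, which the KMS hot-reloads without a restart. A sketch; the user and group names are placeholders:

<!-- Only the listed users (comma-separated) and groups (after the
     space) may decrypt EEKs for KEYNAME -->
<property>
  <name>key.acl.KEYNAME.DECRYPT_EEK</name>
  <value>etl,reporting data-team</value>
</property>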

The use of distCp with encrypted zones has some pitfalls. It is highly recommended to have identically configured encryption zones on both clusters to avoid problems later. A distCp command for encrypted zones could look like this:

hadoop distcp -px hdfs://source-cluster-namenode:8020/.reserved/raw/enc_zones/data hdfs://target-cluster-namenode:8020/.reserved/raw/enc_zones/data
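Before (and after) such a copy it is worth verifying that the target side matches the source: the zone key must exist on the target KMS and the target path must already be an encryption zone. Run on the target cluster:

hadoop key list          # KEYNAME must be present
hdfs crypto -listZones   # /enc_zones/data must be listed as an encryption zone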

[1] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html
[2] https://hadoop.apache.org/docs/r2.7.2/hadoop-distcp/DistCp.html
[3] https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
