Healthcare organizations are adopting Hadoop-based solutions to address key challenges around unified data ingestion, batch and stream transformations, and distributed storage and compute for querying and analytics. Our experience with several leading healthcare organizations has shown that Hadoop can be an effective solution to complex data integration and processing challenges.
One of our customers, a benefits management company, faced a similar data management challenge: multiple mergers had left it with a diverse ecosystem of systems, users and data. We helped the company set up an enterprise Hadoop cluster using the Hortonworks distribution. The cluster also needed to be secured for user authentication and auditability. For this, the company configured a Forest Active Directory (AD) based on Microsoft Windows Server and integrated the Hadoop cluster with it. This helped during security audits and allowed users to be managed in AD, just as the rest of the business already did. Additionally, Kerberos was used for authentication and Apache Ranger for authorization (along with the Forest ADs).
The integration of the Hadoop clusters with the enterprise AD taught us several lessons. Here are seven major steps that we believe helped us execute this project successfully.
Step 1: Putting AD Domain Certificates in the Right Place
While integrating Kerberos with an Active Directory server, the connection gets terminated if the AD domain certificates are missing from the Java TrustStore. Adding the domain certificates on the AD server itself does not solve the problem. The solution is to connect to the AD using the OpenSSL tool, extract the domain certificate and add it to all the required Java TrustStores.
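The step above can be sketched as follows. This is a minimal illustration, not the exact tooling we used: the hostname, TrustStore path and password are placeholders, and `fetch_ad_certificate` uses Python's standard `ssl` module as a stand-in for the `openssl s_client` command.

```python
import ssl


def fetch_ad_certificate(host, port=636):
    """Fetch the PEM-encoded certificate presented by the AD domain
    controller over LDAPS (the equivalent of `openssl s_client`)."""
    return ssl.get_server_certificate((host, port))


def keytool_import_cmd(alias, cert_path, truststore, storepass):
    """Build the keytool command that imports the certificate into a
    JVM TrustStore; run it against every TrustStore the cluster JVMs
    actually load, not just the default one."""
    return [
        "keytool", "-importcert", "-noprompt",
        "-alias", alias,
        "-file", cert_path,
        "-keystore", truststore,
        "-storepass", storepass,
    ]
```

In practice one would write the fetched PEM to a file (e.g. `ad-domain.pem`) and run the generated `keytool` command via `subprocess.run(..., check=True)` on every node, restarting the affected services afterwards.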
Step 2: Integrating Kerberos with Multiple ADs
When users span multiple ADs, Kerberos needs to be integrated with all of them, not just the Forest AD, because the Forest AD does not look up users from the other trusted ADs. We had to configure the system to look up users from all the ADs, including the Forest AD. This allowed any valid user across domains to receive an authentication ticket using their network credentials.
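A simplified `krb5.conf` fragment illustrates the idea: every realm the users live in gets its own `[realms]` entry and `[domain_realm]` mapping, rather than only the forest root. Realm and host names below are placeholders for illustration.

```ini
# Illustrative krb5.conf fragment -- realm/host names are placeholders.
[realms]
  CORP.EXAMPLE.COM = {
    kdc = dc1.corp.example.com
    admin_server = dc1.corp.example.com
  }
  EU.EXAMPLE.COM = {
    kdc = dc1.eu.example.com
    admin_server = dc1.eu.example.com
  }

[domain_realm]
  .corp.example.com = CORP.EXAMPLE.COM
  corp.example.com  = CORP.EXAMPLE.COM
  .eu.example.com   = EU.EXAMPLE.COM
  eu.example.com    = EU.EXAMPLE.COM
```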
Step 3: Integrating Ranger with Forest AD
With a Forest AD, users from trusted ADs are added to the parent AD's groups as foreign security principals, which means their common name is their Security Identifier (SID) instead of a human-readable name.
Apache Ranger does not support integration with a Forest AD, which was a huge problem because the trusted-AD users were not showing up properly. We developed a custom component to query the Forest AD for users and groups; if any members of a group had SIDs as their common names, they were mapped back to their respective domain names.
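The core of such a mapping component is resolving a textual SID (the CN of a foreign security principal) back to an account in the trusted domain. A minimal sketch of the SID handling is below; it is illustrative, not our production component. It converts the textual SID to the binary form AD stores in `objectSid` and builds an LDAP filter with which the trusted domain's controller can be searched for the real `sAMAccountName`.

```python
import struct


def sid_to_bytes(sid):
    """Convert a textual SID such as 'S-1-5-21-...-500' into its binary
    layout: 1-byte revision, 1-byte sub-authority count, a 48-bit
    big-endian identifier authority, then each sub-authority as a
    32-bit little-endian integer."""
    parts = sid.split("-")
    if parts[0] != "S" or len(parts) < 3:
        raise ValueError("not a SID: %r" % sid)
    revision = int(parts[1])
    authority = int(parts[2])
    subauths = [int(p) for p in parts[3:]]
    out = struct.pack("BB", revision, len(subauths))
    out += authority.to_bytes(6, "big")
    for sa in subauths:
        out += struct.pack("<I", sa)
    return out


def sid_ldap_filter(sid):
    """Hex-escape the binary SID for use in an LDAP search filter, e.g.
    to look up the matching user object in the trusted domain."""
    escaped = "".join("\\%02x" % b for b in sid_to_bytes(sid))
    return "(objectSid=%s)" % escaped
```

The resulting filter would be sent (with any LDAP client) to the domain controller of the trusted AD, and the returned entry's account name substituted for the SID before handing the group membership to Ranger.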
Step 4: Allowing Ranger Policies to work with AD Groups
Apache Ranger allows creating resource-based services (HDFS, HBase, Hive, etc.) and access policies for those services. A policy can ALLOW or DENY access to specific AD users and groups. With a Forest AD, the access policies work well for AD users but not for AD groups.
Additional configuration on Hadoop HDFS was needed to fetch AD user and AD group information from the Forest AD. It also required a custom component to perform the SID-to-domain-name mapping for foreign principals.
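A typical way to make Hadoop resolve group membership from AD is the `LdapGroupsMapping` provider in `core-site.xml`. The fragment below is illustrative only; the URL, bind DN and search base are placeholders for a real Forest AD.

```xml
<!-- Illustrative core-site.xml fragment; values are placeholders. -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldaps://forest-dc.corp.example.com:636</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-bind,ou=ServiceAccounts,dc=corp,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=corp,dc=example,dc=com</value>
</property>
```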
Step 5: Configuring ALLOW and DENY policies
By default, Apache Ranger allows creating policies that ALLOW access to resources such as HDFS folders, HBase tables and Hive tables. The customer also needed to DENY access to some groups. A configuration change was made on the cluster to let the administrator define both ALLOW and DENY Ranger policies.
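Once deny conditions are enabled, a Ranger policy can carry both allow and deny items. The JSON below is a simplified sketch of such a policy (service, database and group names are placeholders), of the kind that can be created in the Ranger UI or posted to Ranger's public REST API.

```json
{
  "service": "cluster_hive",
  "name": "claims_db_access",
  "resources": {
    "database": { "values": ["claims"] },
    "table":    { "values": ["*"] },
    "column":   { "values": ["*"] }
  },
  "policyItems": [
    {
      "groups": ["analysts"],
      "accesses": [{ "type": "select", "isAllowed": true }]
    }
  ],
  "denyPolicyItems": [
    {
      "groups": ["contractors"],
      "accesses": [{ "type": "select", "isAllowed": true }]
    }
  ]
}
```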
Step 6: Defining Best Practices for HDFS Ranger Policies
HDFS was configured so that ACLs at the Hadoop file-system level are managed by Ranger. We followed the recommended best practices to make the Hadoop administrator's life easier and planned a proper security model for the Hadoop file system. Below are some of the key practices that were defined and followed:
- Identifying directories which need to be managed by HDFS native permissions (i.e. /tmp, /user). These are directories which are common across users.
- Identifying directories which need to be managed by Ranger policies. These could be application/tool specific folders and confidential data sets.
- Setting the umask to 077, so that any newly created directory carries explicit rights for only the user who created it; this also reduces the administrators' workload. This raises the question of whether execute permission still needs to be granted; it does, in secured environments where the `dfs.permissions.enabled` property is set to true.
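The umask practice above translates into a small Hadoop configuration fragment, shown here for illustration (property placement may vary slightly between Hadoop versions):

```xml
<!-- Illustrative fragment: enforce permissions and a restrictive umask. -->
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>
</property>
<property>
  <name>fs.permissions.umask-mode</name>
  <value>077</value>
</property>
```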
Step 7: Enabling the Audit Mechanism in Ranger
Ranger provides an audit mechanism that lets each Ranger plugin send audit events, such as whether access was granted or denied under a given Ranger policy. This helps the administrator track user activity and trace who deleted or modified a directory or table. We enabled this feature for the customer by integrating the audit store with Solr and HDFS, so that all events are logged there and administrators can search or query the audit events in the future.
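On the plugin side, this dual audit destination amounts to a handful of settings. The fragment below is illustrative (in HDP these typically live in the per-service `ranger-<service>-audit` configuration; URLs and paths are placeholders):

```ini
# Illustrative Ranger plugin audit settings -- values are placeholders.
xasecure.audit.destination.solr=true
xasecure.audit.destination.solr.urls=http://solr1.example.com:6083/solr/ranger_audits
xasecure.audit.destination.hdfs=true
xasecure.audit.destination.hdfs.dir=hdfs://nn1.example.com:8020/ranger/audit
```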
Through this project, we found that Apache Ranger provides a very effective mechanism for centrally managing security across the various services in a Hadoop cluster. Integrating Apache Ranger with the Forest AD was a rich learning experience, given the Hortonworks limitations with Forest AD, and it hardened the security of the Hadoop clusters. Administrators can now define enterprise-grade, fine-grained access control, including column-level access, masking and folder/file access controls, and produce audit reports.