Hakuna MapData! » Blog Archive » A user having surprising troubles running more resource-intensive Hive queries
rss

A user having surprising troubles running more resource-intensive Hive queries

| Posted in Troubleshooting |

The problem

A couple of months ago, one of our data analysts pernamently run into troubles when he wanted to run more resource-intensive Hive queries. Surprisingly, his queries were valid, syntactically-correct and run successfully on small data, but they just failed on larger datasets. On the other hand, other users were able to run the same queries successfully on the same large datasets. Obviously, it sounds like some permissions problem, however the user had right HDFS and Hive permissions.


The observations

We observed that when our user run the more resource-intensive Hive query (that spawns a lot of map tasks), the Hadoop cluster (especially HDFS daemons) experienced stability problems i.e. NameNode became less responsive and freezed, causing tens of DataNodes to lose connectivity and became marked “dead” (even though the DataNode daemons were still running on these servers).

The logs for NameNode showed a lot of warnings and exceptions thrown in the method org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(final String user). Just to give some numbers – 14,592 warnings/exceptions were logged only during 8 min (4,768/min in a peak).

The method ShellBasedUnixGroupsMapping.getUnixGroups(final String user) is responsible for retrieving the list of groups that the given user belongs by running some Unix command on NameNode server.

package org.apache.hadoop.security;
...
import org.apache.hadoop.util.Shell;
import org.apache.hadoop.util.Shell.ExitCodeException;
 
/**
 * A simple shell-based implementation of {@link GroupMappingServiceProvider} 
 * that exec's the <code>groups</code> shell command to fetch the group
 * memberships of a given user.
 */
public class ShellBasedUnixGroupsMapping implements GroupMappingServiceProvider {
  ...
  /** 
   * Get the current user's group list from Unix by running the command 'groups'
   * NOTE. For non-existing user it will return EMPTY list
   * @param user user name
   * @return the groups list that the <code>user</code> belongs to
   * @throws IOException if encounter any error when running the command
   */
  private static List<String> getUnixGroups(final String user) throws IOException {
    String result = "";
    try {
      result = Shell.execCommand(Shell.getGroupsForUserCommand(user));
    } catch (ExitCodeException e) {
      // if we didn't get the group - just return empty list;
      LOG.warn("got exception trying to get groups for user " + user, e);
    }
 
    StringTokenizer tokenizer = new StringTokenizer(result);
    List<String> groups = new LinkedList<String>();
    while (tokenizer.hasMoreTokens()) {
      groups.add(tokenizer.nextToken());
    }
    return groups;
  }
}

This Unix command to find the user-group mapping is simply id -Gn username, according to the code snipped bellow:

package org.apache.hadoop.util;
...
abstract public class Shell {
  /** a Unix command to get a given user's groups list */
  public static String[] getGroupsForUserCommand(final String user) {
    //'groups username' command return is non-consistent across different unixes
    return new String [] {"bash", "-c", "id -Gn " + user};
  }
  ...
}

Security in Apache Hadoop

Normally (with the default settings), Apache Hadoop is a very trusty elephant. The username of user, who is submitting the job, is just taken from a client machine (and not verified at all, so one user can easily impersonate another e.g. by typing sudo -u other-user command, or creating a user account for “other-user” and accessing HDFS on behalf of this user), while the groupnames are resolved on NameNode server using just the Unix command id -Gn. If a user does not have an account on NameNode server, then the ExitCodeException is thrown by and caught in ShellBasedUnixGroupsMapping.getUnixGroups method. If your job is large (like the Hive query submitted by our user that consists of thousands of map tasks), then you will have many ExitCodeExceptions and the NameNode will stop responding and DataNodes will lose connectivity. If it takes more than 2 * heartbeat.recheck.interval + 10 * heartbeat.interval milliseconds (by default 10min:30sec), then DataNodes may become marked “dead”, when NameNode wakes up.

Possible fixes

How this problem could be solved? Obviously, a quick and dirty solution is to create an account on NameNode server for each user who accessing HDFS (directly, or by submitting MapReduce jobs to the cluster). However, for many reasons, you do not want to give everybody an account on NameNode server.

User-group resolution with AD/LDAP

Instead AD or LDAP could be used for resolving the group membership of users who access HDFS. Hadoop provides a couple of configuration settings hadoop.security.group.mapping.ldap.* (you can find them core-default.xml).

Alternatively, nss_ldap (it allows LDAP directory servers to be used as a primary source of name service information including e.g. users, hosts, groups) can be tried. In this case setting configuration options hadoop.security.group.mapping.ldap.* is not necessary.

We actually solved this issue by using nss_ldap, because the LDAP configuration settings hadoop.security.group.mapping.ldap.* did now work correctly in our case. The values that we wanted to use are as follows:

hadoop.security.group.mapping.ldap.search.filter.group (objectClass=posixGroup)
hadoop.security.group.mapping.ldap.search.attr.member memberUid
hadoop.security.group.mapping.ldap.search.attr.group.name cn

The problematic one is hadoop.security.group.mapping.ldap.search.filter.group, that, according to the documentation, currently does not support posixGroups (what we aimed to use) as a supported group class.

Strong authentication with Kerberos

One can go even one step further and use Kerberos. Although Kerberos is usually configured to take advantage of AD/LDAP servers (so the way how user-group mapping is resolved does not change), it will also provide a full-authentication of users accessing the cluster (so that user identity will be verified and nobody can easily impersonate other user).

Just one thing to note – installing and configuring Kerberos involves many tedious and difficult steps (some of them can be automated by Cloudera Manager). Basically, it is not just changing the configuration property hadoop.security.authentication from simple to kerberos. Before you make a decision to use Kerberos, make sure whether you really need it.

VN:F [1.9.20_1166]

Rate this post!

Rating: 0.0/5 (0 votes cast)

Comments

comments

Comments are closed.