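[General] Apache Hadoop 2.2.0 Cluster Setup (2) [Translation]
Monitoring the Health of NodeManagers
Hadoop provides a mechanism for checking whether a node is healthy: administrators can configure the NodeManager to run a script periodically.
Administrators can perform any checks they like in this script to decide whether the node is healthy. If the script finds the node unhealthy, it must print a line beginning with the string ERROR to standard output. The NodeManager runs the script periodically and inspects its output; if the output contains ERROR, the node is reported as unhealthy and the ResourceManager blacklists it, so no further tasks are assigned to it. The NodeManager keeps running the script, however, so once the node becomes healthy again it is removed from the ResourceManager's blacklist automatically. The node's health, together with the script's output when the node is unhealthy, is shown in the ResourceManager web interface.
The following parameters in conf/yarn-site.xml control the node health script:
Parameter | Value | Notes |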
yarn.nodemanager.health-checker.script.path | Node health script | Script to check for node's health status. |
yarn.nodemanager.health-checker.script.opts | Node health script options | Options for script to check for node's health status. |
yarn.nodemanager.health-checker.script.interval-ms | Node health script interval | Time interval for running health script. |
yarn.nodemanager.health-checker.script.timeout-ms | Node health script timeout interval | Timeout for health script execution. |
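As an illustration, a script along the following lines could be referenced by yarn.nodemanager.health-checker.script.path. This is a minimal sketch, not a script shipped with Hadoop: the function name report_disk_usage, the 90 percent threshold, and the decision to check disk usage at all are assumptions made for the example. The only behaviour taken from the text above is that an unhealthy node is signalled by an output line beginning with ERROR.

```shell
#!/usr/bin/env bash
# Hypothetical health check: flag the node when the root filesystem is too full.

# report_disk_usage USED_PERCENT THRESHOLD_PERCENT
# Prints a line starting with ERROR when usage exceeds the threshold,
# otherwise prints an informational line.
report_disk_usage() {
  local used="$1" threshold="$2"
  if [ "$used" -gt "$threshold" ]; then
    echo "ERROR disk usage ${used}% exceeds ${threshold}%"
  else
    echo "OK disk usage ${used}%"
  fi
}

# Current usage of the root filesystem as a bare number (POSIX df output format).
used=$(df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }')
report_disk_usage "$used" 90
```

Any other check (memory, mounted volumes, a local service) could be added in the same way, as long as the unhealthy case produces a line that starts with ERROR.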
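The health checker script is not expected to report ERROR when only some of the local disks go bad. The NodeManager can check the health of the local disks periodically (specifically nodemanager-local-dirs and nodemanager-log-dirs); once the number of bad directories reaches the threshold set by yarn.nodemanager.disk-health-checker.min-healthy-disks, the whole node is marked unhealthy and this is also reported to the ResourceManager. A failure of the boot disk is expected to be caught by the health checker script.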
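Slaves File
Typically you choose one machine to act as the NameNode and one machine to act as the ResourceManager. The remaining machines act as both DataNode and NodeManager and are referred to as slaves.
List the IP address or hostname of every slave in conf/slaves, one per line.
Logging
Hadoop uses Apache log4j via the Apache Commons Logging framework for logging. Edit conf/log4j.properties to customize the logging configuration.
Operating the Hadoop Cluster
Once all the configuration files are complete, copy them to the HADOOP_CONF_DIR directory on every machine.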
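One way to distribute the configuration is to loop over the hosts listed in conf/slaves. The sketch below only prints the copy commands instead of running them. The helper name print_copy_commands, the use of rsync, the example hostnames, and the assumption that HADOOP_CONF_DIR is the same path on every machine are illustrative choices, not part of the Hadoop distribution.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one rsync command per host in a slaves file.

# print_copy_commands SLAVES_FILE CONF_DIR
print_copy_commands() {
  local slaves_file="$1" conf_dir="$2" host
  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines and comment lines.
    case "$host" in
      ''|'#'*) continue ;;
    esac
    echo "rsync -a ${conf_dir}/ ${host}:${conf_dir}/"
  done < "$slaves_file"
}

# Example with a throwaway slaves file containing two made-up hostnames.
tmp_slaves=$(mktemp)
printf 'slave1.example.com\nslave2.example.com\n' > "$tmp_slaves"
print_copy_commands "$tmp_slaves" /etc/hadoop/conf
rm -f "$tmp_slaves"
```

Printing the commands first makes it easy to review them before piping the output to a shell or replacing echo with the real call.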
Hadoop Startup
To start a Hadoop cluster you need to start both HDFS and YARN.
Format a new distributed filesystem:
$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
Start HDFS with the following command, run on the NameNode:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
Run the following command on all slaves to start the DataNodes:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
Start YARN with the following command, run on the ResourceManager:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Run the following command on all slaves to start the NodeManagers:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
Start a standalone WebAppProxy server. If several servers are used for load balancing, run the command on each of them:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
Start the MapReduce JobHistory Server with the following command, run on whichever machine is to host it:
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
Hadoop Shutdown
Stop the NameNode with the following command, run on the NameNode:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
Run the following command on all slaves to stop the DataNodes:
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
Stop the ResourceManager with the following command, run on the ResourceManager:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
Run the following command on all slaves to stop the NodeManagers:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
Stop the WebAppProxy server with the following command, run on the node hosting it:
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop proxyserver
Stop the MapReduce JobHistory Server with the following command, run on the node hosting it:
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver
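Running Hadoop in Secure Mode
This section covers the important parameters for running Hadoop in secure mode, which relies on strong, Kerberos-based authentication.
User Accounts for Hadoop Daemons
Make sure the HDFS and YARN daemons run as different Unix users, for example hdfs and yarn, and that the MapReduce JobHistory Server runs as user mapred.
It is recommended that they all share one Unix group, for example hadoop:
User:Group | Daemons |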
hdfs:hadoop | NameNode, Secondary NameNode, Checkpoint Node, Backup Node, DataNode |
yarn:hadoop | ResourceManager, NodeManager |
mapred:hadoop | MapReduce JobHistory Server |
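Permissions for HDFS and Local Filesystem Paths:
The following table lists the recommended permissions for paths on HDFS and on the local filesystem:
Filesystem | Path | User:Group | Permissions |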
local | dfs.namenode.name.dir | hdfs:hadoop | drwx------ |
local | dfs.datanode.data.dir | hdfs:hadoop | drwx------ |
local | $HADOOP_LOG_DIR | hdfs:hadoop | drwxrwxr-x |
local | $YARN_LOG_DIR | yarn:hadoop | drwxrwxr-x |
local | yarn.nodemanager.local-dirs | yarn:hadoop | drwxr-xr-x |
local | yarn.nodemanager.log-dirs | yarn:hadoop | drwxr-xr-x |
local | container-executor | root:hadoop | --Sr-s--- |
local | conf/container-executor.cfg | root:hadoop | r-------- |
hdfs | / | hdfs:hadoop | drwxr-xr-x |
hdfs | /tmp | hdfs:hadoop | drwxrwxrwxt |
hdfs | /user | hdfs:hadoop | drwxr-xr-x |
hdfs | yarn.nodemanager.remote-app-log-dir | yarn:hadoop | drwxrwxrwxt |
hdfs | mapreduce.jobhistory.intermediate-done-dir | mapred:hadoop | drwxrwxrwxt |
hdfs | mapreduce.jobhistory.done-dir | mapred:hadoop | drwxr-x--- |
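For the local directories, the modes in the table correspond to ordinary chmod calls. The sketch below applies a few of them to scratch directories created with mktemp, so it can be run safely anywhere. On a real node the paths would be the values of dfs.namenode.name.dir, yarn.nodemanager.local-dirs and so on, and a chown to the listed user:group (which needs root) would be added. The helper name apply_mode and the scratch layout are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a directory if needed and set its mode.

# apply_mode MODE DIR
apply_mode() {
  mkdir -p "$2" && chmod "$1" "$2"
}

scratch=$(mktemp -d)
apply_mode 700 "$scratch/name"      # dfs.namenode.name.dir:       drwx------
apply_mode 700 "$scratch/data"      # dfs.datanode.data.dir:       drwx------
apply_mode 775 "$scratch/logs"      # $HADOOP_LOG_DIR:             drwxrwxr-x
apply_mode 755 "$scratch/nm-local"  # yarn.nodemanager.local-dirs: drwxr-xr-x

# Show the resulting permission strings, then clean up.
ls -ld "$scratch"/*
rm -rf "$scratch"
```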
Kerberos Keytab Files:
HDFS:
The NameNode keytab file, on the NameNode host, should look like the following:
$ /usr/kerberos/bin/klist -e -k -t /etc/security/keytab/nn.service.keytab
Keytab name: FILE:/etc/security/keytab/nn.service.keytab
KVNO Timestamp         Principal
   4 07/18/11 21:08:09 nn/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 nn/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 nn/[email protected] (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
The Secondary NameNode keytab file should look like the following:
$ /usr/kerberos/bin/klist -e -k -t /etc/security/keytab/sn.service.keytab
Keytab name: FILE:/etc/security/keytab/sn.service.keytab
KVNO Timestamp         Principal
   4 07/18/11 21:08:09 sn/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 sn/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 sn/[email protected] (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
The DataNode keytab file should look like the following:
$ /usr/kerberos/bin/klist -e -k -t /etc/security/keytab/dn.service.keytab
Keytab name: FILE:/etc/security/keytab/dn.service.keytab
KVNO Timestamp         Principal
   4 07/18/11 21:08:09 dn/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 dn/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 dn/[email protected] (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
YARN:
The ResourceManager keytab file, on the ResourceManager host, should look like the following:
$ /usr/kerberos/bin/klist -e -k -t /etc/security/keytab/rm.service.keytab
Keytab name: FILE:/etc/security/keytab/rm.service.keytab
KVNO Timestamp         Principal
   4 07/18/11 21:08:09 rm/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 rm/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 rm/[email protected] (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
The NodeManager keytab file, on each NodeManager host, should look like the following:
$ /usr/kerberos/bin/klist -e -k -t /etc/security/keytab/nm.service.keytab
Keytab name: FILE:/etc/security/keytab/nm.service.keytab
KVNO Timestamp         Principal
   4 07/18/11 21:08:09 nm/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 nm/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 nm/[email protected] (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
MapReduce JobHistory Server:
The MapReduce JobHistory Server keytab file should look like the following:
$ /usr/kerberos/bin/klist -e -k -t /etc/security/keytab/jhs.service.keytab
Keytab name: FILE:/etc/security/keytab/jhs.service.keytab
KVNO Timestamp         Principal
   4 07/18/11 21:08:09 jhs/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 jhs/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 jhs/[email protected] (ArcFour with HMAC/md5)
   4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
   4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
Secure Mode Configuration:
conf/core-site.xml:
Parameter | Value | Notes |
hadoop.security.authentication | kerberos | simple is non-secure. |
hadoop.security.authorization | true | Enable RPC service-level authorization. |
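In the XML configuration files, each row of such a table becomes a <property> element with a <name> and a <value>. The sketch below prints the two core-site.xml entries listed above. The helper name render_property is made up for this example, and its output still has to be placed inside the <configuration> element of conf/core-site.xml by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one Hadoop <property> element.
# Note: values are written as-is, so XML special characters are not escaped.

# render_property NAME VALUE
render_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

render_property hadoop.security.authentication kerberos
render_property hadoop.security.authorization true
```

The same helper can be reused for the hdfs-site.xml, yarn-site.xml and mapred-site.xml parameters in the following tables.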
conf/hdfs-site.xml:
NameNode configuration:
Parameter | Value | Notes |
dfs.block.access.token.enable | true | Enable HDFS block access tokens for secure operations. |
dfs.https.enable | true | |
dfs.namenode.https-address | nn_host_fqdn:50470 | |
dfs.https.port | 50470 | |
dfs.namenode.keytab.file | /etc/security/keytab/nn.service.keytab | Kerberos keytab file for the NameNode. |
dfs.namenode.kerberos.principal | nn/[email protected] | Kerberos principal name for the NameNode. |
dfs.namenode.kerberos.https.principal | host/[email protected] | HTTPS Kerberos principal name for the NameNode. |
Secondary NameNode configuration:
Parameter | Value | Notes |
dfs.namenode.secondary.http-address | c_nn_host_fqdn:50090 | |
dfs.namenode.secondary.https-port | 50470 | |
dfs.namenode.secondary.keytab.file | /etc/security/keytab/sn.service.keytab | Kerberos keytab file for the Secondary NameNode. |
dfs.namenode.secondary.kerberos.principal | sn/[email protected] | Kerberos principal name for the Secondary NameNode. |
dfs.namenode.secondary.kerberos.https.principal | host/[email protected] | HTTPS Kerberos principal name for the Secondary NameNode. |
DataNode configuration:
Parameter | Value | Notes |
dfs.datanode.data.dir.perm | 700 | |
dfs.datanode.address | 0.0.0.0:2003 | |
dfs.datanode.https.address | 0.0.0.0:2005 | |
dfs.datanode.keytab.file | /etc/security/keytab/dn.service.keytab | Kerberos keytab file for the DataNode. |
dfs.datanode.kerberos.principal | dn/[email protected] | Kerberos principal name for the DataNode. |
dfs.datanode.kerberos.https.principal | host/[email protected] | HTTPS Kerberos principal name for the DataNode. |
conf/yarn-site.xml:
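WebAppProxy:
The WebAppProxy provides a proxy between the web applications exported by an application and the end user. When security is enabled, it warns users before they access a potentially unsafe web application. Authentication and authorization through the proxy are handled just as for any other web application.
Parameter | Value | Notes |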
yarn.web-proxy.address | WebAppProxy host:port for proxy to AM web apps. | host:port if this is the same as yarn.resourcemanager.webapp.address or it is not defined then the ResourceManager will run the proxy otherwise a standalone proxy server will need to be launched. |
yarn.web-proxy.keytab | /etc/security/keytab/web-app.service.keytab | Kerberos keytab file for the WebAppProxy. |
yarn.web-proxy.principal | wap/[email protected] | Kerberos principal name for the WebAppProxy. |
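LinuxContainerExecutor:
A ContainerExecutor, used by the YARN framework, defines how containers are launched and controlled.
The following executors are available in Hadoop YARN:
ContainerExecutor | Description |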
DefaultContainerExecutor | The default executor which YARN uses to manage container execution. The container process has the same Unix user as the NodeManager. |
LinuxContainerExecutor | Supported only on GNU/Linux, this executor runs the containers as the user who submitted the application. It requires all user accounts to be created on the cluster nodes where the containers are launched. It uses a setuid executable that is included in the Hadoop distribution. The NodeManager uses this executable to launch and kill containers. The setuid executable switches to the user who has submitted the application and launches or kills the containers. For maximum security, this executor sets up restricted permissions and user/group ownership of local files and directories used by the containers such as the shared objects, jars, intermediate files, log files etc. Particularly note that, because of this, except the application owner and NodeManager, no other user can access any of the local files/directories including those localized as part of the distributed cache. |
To build the LinuxContainerExecutor executable, run:
$ mvn package -Dcontainer-executor.conf.dir=/etc/hadoop/
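The path passed in -Dcontainer-executor.conf.dir must be a local path that exists on the cluster nodes; it is where the configuration file for the setuid executable is located. The executable should be installed in $HADOOP_YARN_HOME/bin. It must have the permissions 6050 or --Sr-s--- and be group-owned by a special group of which the NodeManager Unix user is a member and no ordinary application user is. If any application user belongs to this special group, security is compromised. The name of this group must be specified in the property yarn.nodemanager.linux-container-executor.group in both conf/yarn-site.xml and conf/container-executor.cfg.
For example, suppose the NodeManager runs as user yarn, who belongs to the groups users and hadoop. Suppose also that users has both yarn and another user, alice (an application submitter), as members, and that alice does not belong to hadoop. Following the description above, the setuid/setgid executable should be set to 6050 or --Sr-s--- with yarn as user-owner and hadoop as group-owner (so that alice cannot execute it).
The LinuxTaskController requires the directories specified in yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs to have 755 permissions.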
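conf/container-executor.cfg:
The executable requires a configuration file named container-executor.cfg in the configuration directory passed to the mvn command mentioned above. The file must be owned by the user running the NodeManager (user yarn in the example above), may be group-owned by any group, and should have the permissions 0400 or r--------.
The executable requires the following configuration items in conf/container-executor.cfg, written as simple key-value pairs, one per line:
Parameter | Value | Notes |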
yarn.nodemanager.linux-container-executor.group | hadoop | Unix group of the NodeManager. The group owner of the container-executor binary should be this group. Should be same as the value with which the NodeManager is configured. This configuration is required for validating the secure access of the container-executor binary. |
banned.users | hdfs,yarn,mapred,bin | Banned users. |
allowed.system.users | foo,bar | Allowed system users. |
min.user.id | 1000 | Prevent other super-users. |
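Put together, these settings give a file like the one written by the sketch below. It writes to a scratch file created with mktemp rather than to the real conf/container-executor.cfg, and it uses the example values from the table with hdfs as the first banned user; foo and bar are placeholders for real system account names. The function name write_container_executor_cfg is an assumption for illustration. The final chmod sets the 0400 mode described earlier; setting the owner is left out because it requires the appropriate privileges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an example container-executor.cfg to the given path.

# write_container_executor_cfg FILE
write_container_executor_cfg() {
  # One key=value pair per line.
  cat > "$1" <<'EOF'
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
allowed.system.users=foo,bar
min.user.id=1000
EOF
  # Read-only for the owner, no access for anyone else (r--------).
  chmod 0400 "$1"
}

cfg=$(mktemp)
write_container_executor_cfg "$cfg"
cat "$cfg"
rm -f "$cfg"
```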
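To recap, the local filesystem permissions required for the paths related to the LinuxContainerExecutor are:
Filesystem | Path | User:Group | Permissions |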
local | container-executor | root:hadoop | --Sr-s--- |
local | conf/container-executor.cfg | root:hadoop | r-------- |
local | yarn.nodemanager.local-dirs | yarn:hadoop | drwxr-xr-x |
local | yarn.nodemanager.log-dirs | yarn:hadoop | drwxr-xr-x |
ResourceManager configuration:
Parameter | Value | Notes |
yarn.resourcemanager.keytab | /etc/security/keytab/rm.service.keytab | Kerberos keytab file for the ResourceManager. |
yarn.resourcemanager.principal | rm/[email protected] | Kerberos principal name for the ResourceManager. |
NodeManager configuration:
Parameter | Value | Notes |
yarn.nodemanager.keytab | /etc/security/keytab/nm.service.keytab | Kerberos keytab file for the NodeManager. |
yarn.nodemanager.principal | nm/[email protected] | Kerberos principal name for the NodeManager. |
yarn.nodemanager.container-executor.class | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor | Use LinuxContainerExecutor. |
yarn.nodemanager.linux-container-executor.group | hadoop | Unix group of the NodeManager. |
conf/mapred-site.xml
MapReduce JobHistory Server configuration:
Parameter | Value | Notes |
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.keytab | /etc/security/keytab/jhs.service.keytab | Kerberos keytab file for the MapReduce JobHistory Server. |
mapreduce.jobhistory.principal | jhs/[email protected] | Kerberos principal name for the MapReduce JobHistory Server. |
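Operating the Hadoop Cluster
Once all the necessary configuration is complete, copy the files in HADOOP_CONF_DIR to the same directory on all the other machines.
This section also describes which Unix user should start each Hadoop daemon, using the Unix accounts and groups introduced above.
Hadoop Startup
To start a Hadoop cluster you need to start both the HDFS and YARN clusters.
Format a new distributed filesystem as user hdfs:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
Start HDFS with the following command, run on the NameNode as user hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
Run the following command on all DataNode hosts as root to start the DataNodes; the environment variable HADOOP_SECURE_DN_USER must be set to hdfs:
[root]$ HADOOP_SECURE_DN_USER=hdfs $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
Start YARN with the following command, run on the ResourceManager as user yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Run the following command on all slaves as user yarn to start the NodeManagers:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager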
Start a standalone WebAppProxy server as user yarn. If several servers are used for load balancing, run the command on each of them:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
Start the MapReduce JobHistory Server as user mapred:
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
Hadoop Shutdown
Stop the NameNode with the following command, run as user hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
Run the following command on all slaves as root to stop the DataNodes:
[root]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
Stop the ResourceManager with the following command, run on the ResourceManager as user yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
Run the following command on all slaves as user yarn to stop the NodeManagers:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
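Stop the WebAppProxy server with the following command, run on the WebAppProxy server as user yarn. If several servers are in use, run it on each of them in turn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop proxyserver
Stop the MapReduce JobHistory Server with the following command, run as user mapred:
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR stop historyserver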
Web Interfaces
Once the cluster is up and running, the daemons can be monitored through their web UIs:
Daemon | Web Interface | Notes |
NameNode | http://nn_host:port/ | Default HTTP port is 50070. |
ResourceManager | http://rm_host:port/ | Default HTTP port is 8088. |
MapReduce JobHistory Server | http://jhs_host:port/ | Default HTTP port is 19888. |