
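Deploying Impala on Hadoop CDH5

Cloudera has released Impala, an open-source project for real-time queries. Hands-on tests of several products reportedly show SQL queries running 3 to 90 times faster than with the original MapReduce-based Hive. Impala is modeled on Google Dremel, but goes further than Dremel in SQL functionality.

Installing Impala on CDH5

1     Impala consists of four components: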

 

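impalad - The Impala daemon. It plans and executes queries against data in HDFS and HBase. Run one daemon on every data node in the cluster.
statestored - Tracks the state of all Impala instances in the cluster. Run it on one node in the cluster.
catalogd - Metadata coordination service.
impala-shell - Command-line interface.
2     Installation environment

        1     Add the package source for Ubuntu 12.04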

 

deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib 
deb-src http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib
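
One way to register these two entries is to write them to a dedicated file under /etc/apt/sources.list.d/ instead of editing /etc/apt/sources.list by hand. The sketch below assumes that approach; the function name and the file name cloudera-impala.list are illustrative choices, not something the Cloudera packages require.

```shell
#!/bin/sh
# Write the two Impala repository entries to the given sources file.
# The default path is an assumed, illustrative file name.
write_impala_sources() {
    target="${1:-/etc/apt/sources.list.d/cloudera-impala.list}"
    repo="http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala"
    printf 'deb [arch=amd64] %s precise-impala1 contrib\n' "$repo" > "$target"
    printf 'deb-src %s precise-impala1 contrib\n' "$repo" >> "$target"
}

# Typical use (as root), followed by a package index refresh:
#   write_impala_sources && apt-get update
```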

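        2     Recommended hardware

                    128-256 GB of RAM, 12 disks per machine, CPU  

3     Notes on dependencies

            Impala depends on the Hive metastore, so Hive must be configured first. When configuring Hive, it is best to run the metastore as a standalone service. The Hive metastore is installed as follows: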

1    Install a MySQL or PostgreSQL database. Start the database if it is not started after installation. 
2    Download the MySQL connector or the PostgreSQL connector and place it in the /usr/share/java/ directory. 
3    Use the appropriate command line tool for your database to create the metastore database. 
4    Use the appropriate command line tool for your database to grant privileges for the metastore database to the hive user. 
5    Modify hive-site.xml to include information matching your particular database: its URL, user name, and password. You will copy the hive-site.xml file to the Impala Configuration Directory later in the Impala installation process.
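
As a rough illustration of step 5, the sketch below prints the four JDO connection properties that hive-site.xml needs for a MySQL-backed metastore. The host, database name, user, and password are placeholders passed in by the caller, the function names are hypothetical, and the values are emitted without XML escaping, so treat it as a starting point rather than a finished generator.

```shell
#!/bin/sh
# Print one <property> element. Values are written as-is (no XML escaping).
emit_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Print the metastore connection properties for a MySQL database.
# Arguments (all placeholders): db host, db name, user name, password.
print_metastore_properties() {
    db_host="$1"; db_name="$2"; db_user="$3"; db_pass="$4"
    emit_property javax.jdo.option.ConnectionURL "jdbc:mysql://${db_host}/${db_name}"
    emit_property javax.jdo.option.ConnectionDriverName "com.mysql.jdbc.Driver"
    emit_property javax.jdo.option.ConnectionUserName "$db_user"
    emit_property javax.jdo.option.ConnectionPassword "$db_pass"
}

# Example (placeholder values); paste the output inside <configuration>:
#   print_metastore_properties dbhost.example.com metastore hive secret
```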

4     Installing Impala without Cloudera Manager

        1     Install Hive, because Impala relies on the Hive metastore

                See: Hive installation

        2     Configure hive-site.xml

                See: Hive installation

        3     Add the Impala package source

 

deb [arch=amd64] http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib 
deb-src http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala precise-impala1 contrib

        4     Install the packages on Ubuntu 12.04

 

apt-get install impala             # Binaries for daemons 
apt-get install impala-server      # Service start/stop script 
apt-get install impala-state-store # Service start/stop script 
apt-get install impala-catalog     # Service start/stop script 
Note: Cloudera recommends that you not install Impala on any HDFS NameNode.

        5     Copy hive-site.xml, core-site.xml, hdfs-site.xml, and hbase-site.xml to the Impala configuration directory, which defaults to /etc/impala/conf
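
The copy step can be sketched as a small helper. The source directories in the usage comment (/etc/hive/conf, /etc/hadoop/conf, /etc/hbase/conf) are assumed locations that may differ on your cluster, and the function name is made up for this example.

```shell
#!/bin/sh
# Copy the four client configuration files into the Impala configuration
# directory. Arguments: Hive conf dir, Hadoop conf dir, HBase conf dir,
# and optionally the Impala conf dir (defaults to /etc/impala/conf).
copy_impala_conf() {
    hive_conf="$1"; hadoop_conf="$2"; hbase_conf="$3"
    impala_conf="${4:-/etc/impala/conf}"
    mkdir -p "$impala_conf" || return 1
    cp "$hive_conf/hive-site.xml" \
       "$hadoop_conf/core-site.xml" \
       "$hadoop_conf/hdfs-site.xml" \
       "$hbase_conf/hbase-site.xml" \
       "$impala_conf/"
}

# Example with assumed source locations:
#   copy_impala_conf /etc/hive/conf /etc/hadoop/conf /etc/hbase/conf
```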

        6     Install the Impala client

 

apt-get install impala-shell

5     Post-Installation Configuration for Impala

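        1     Notes

                    Short-circuit reads must be enabled no matter how you install Impala. If you do not install Impala with Cloudera Manager, you must also enable block location tracking.

        2     Mandatory: Short-Circuit Reads 

                    Enabling short-circuit reads lets Impala read local data directly from the file system. This removes the round trip through the DataNode and improves efficiency, and it also minimizes the number of data copies. Short-circuit reads require the libhadoop.so library and are not supported on versions earlier than CDH 4.1. 

                    1     On every Impala node, edit the hdfs-site.xml configuration file 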

 

<property> 
    <name>dfs.client.read.shortcircuit</name> 
    <value>true</value> 
</property> 


<property> 
    <name>dfs.domain.socket.path</name> 
    <value>/var/run/hadoop-hdfs/dn._PORT</value> 
</property> 


<property> 
    <name>dfs.client.file-block-storage-locations.timeout.millis</name> 
    <value>10000</value> 
</property>

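                    2     Make sure the group of the /var/run/hadoop-hdfs/ directory is root

                    3     Copy core-site.xml and hdfs-site.xml to the Impala configuration directory. If you are also going to enable block location tracking, you can skip copying the files and restarting the DataNodes for now, and do both once after all the changes have been made.

                    4     Restart the services on all DataNodes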
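
Before restarting anything, it can help to confirm that the edited file really carries the setting. The helper below is only a sketch: it assumes each <name> line is followed by its <value> line, as in the snippet above, and does not parse XML in general. The function name is hypothetical.

```shell
#!/bin/sh
# Succeed (exit 0) if the named property is set to "true" in the given file.
# Assumes <name> and <value> sit on separate, consecutive lines.
has_true_property() {
    file="$1"; name="$2"
    awk -v want="<name>${name}</name>" '
        index($0, want)                          { seen = 1; next }
        seen && index($0, "<value>true</value>") { found = 1 }
        seen && /<value>/                        { seen = 0 }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Example:
#   has_true_property /etc/impala/conf/hdfs-site.xml dfs.client.read.shortcircuit \
#       || echo "short-circuit reads are not enabled" >&2
```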

            3     Mandatory: Block Location Tracking

                        Enabling block location metadata lets Impala know where data blocks sit on disk, so it can make better use of the underlying disks. 

                    1     Add the following to the hdfs-site.xml file on every DataNode

 

<property> 
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name> 
  <value>true</value> 
</property>

                    2     Copy core-site.xml and hdfs-site.xml to the Impala configuration directory, /etc/impala/conf

                    3     Restart all DataNodes

             4     Configuring Impala to Work with JDBC

                        1     Configuring the JDBC Port

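                                By default, JDBC 2.0 uses port 21050, and the Impala server listens for JDBC connections on that same port, so make sure the port is available. You can also configure a different port through a parameter. 

                        2     Enabling Impala JDBC Support on Client Systems

                                Any client that communicates with Impala over JDBC needs a zip file containing the JARs. Download this file to each client machine. 

                                        1     Download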

 

wget https://downloads.cloudera.com/impala-jdbc/impala-jdbc-0.5-2.zip

                                        2     Extract the zip into a suitable directory

                                                    For example, the /opt/jars/ directory 

                                        3     Configure the client program so that it can load these JARs

 

export CLASSPATH="/opt/jars/*:$CLASSPATH"

                                        4     Establishing JDBC Connections

 

jdbc:hive2://myhost.example.com:21050/;auth=noSasl
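
If the port has been changed from the default, the connection string has to change with it. A minimal sketch of a helper that assembles the URL in the form shown above follows; the function name is invented for this example, and the host is whatever machine runs impalad.

```shell
#!/bin/sh
# Print a JDBC URL for an unsecured Impala connection.
# Arguments: host, optional port (defaults to 21050).
impala_jdbc_url() {
    host="$1"
    port="${2:-21050}"
    printf 'jdbc:hive2://%s:%s/;auth=noSasl\n' "$host" "$port"
}

# Example:
#   impala_jdbc_url myhost.example.com
```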

6     Starting Impala from the Command Line

        1     Start the statestore service, which helps Impala distribute work efficiently

 

service impala-state-store start

        2     Start the catalog service

 

service impala-catalog start

        3     Start the Impala service on one or more DataNodes

 

service impala-server start
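
Since different hosts run different roles, the three commands above can be wrapped in a per-host helper. This is a sketch under assumptions: the role names (statestore, catalog, server) and the DRY_RUN variable are made up for illustration, and services are started in the order the caller lists them.

```shell
#!/bin/sh
# Start the Impala services for the roles given as arguments, in the order
# given. With DRY_RUN=1 the commands are printed instead of executed.
start_impala_services() {
    for role in "$@"; do
        case "$role" in
            statestore) svc=impala-state-store ;;
            catalog)    svc=impala-catalog ;;
            server)     svc=impala-server ;;
            *) echo "unknown role: $role" >&2; return 1 ;;
        esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "service $svc start"
        else
            service "$svc" start || return 1
        fi
    done
}

# Example on the statestore/catalog host, then on each DataNode:
#   start_impala_services statestore catalog
#   start_impala_services server
```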

7     Modifying Impala Startup Options

        1     Configuring Impala Startup Options through the Command Line

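                    When the Impala server, statestore, and catalog services start, they take their parameter values from /etc/default/impala. Cloudera recommends running the statestore on a dedicated machine, not on the same machine as an impalad daemon. Cloudera also suggests that the catalog service can run on the same machine as the statestore.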

         2      Checking the Values of Impala Configuration Options 
http://impala_hostname:25000/varz (impalad) 
http://impala_hostname:25010/varz (statestored) 
http://impala_hostname:25020/varz (catalogd)
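
To fetch these pages from a terminal rather than a browser, something like the following may be convenient. It is a sketch only: the function name is hypothetical, and the ports are the ones listed above, so a cluster started with non-default web ports would need different values.

```shell
#!/bin/sh
# Print the /varz URL for a daemon on a host.
# Arguments: host, daemon name (impalad, statestored, or catalogd).
impala_varz_url() {
    host="$1"
    case "$2" in
        impalad)     port=25000 ;;
        statestored) port=25010 ;;
        catalogd)    port=25020 ;;
        *) echo "unknown daemon: $2" >&2; return 1 ;;
    esac
    printf 'http://%s:%s/varz\n' "$host" "$port"
}

# Example (requires curl and a running daemon):
#   curl -s "$(impala_varz_url impala_hostname impalad)"
```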

 

 




