Faster Sorting in Hadoop, Explained

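By default, Hadoop sorts keys as follows:
it reads two instances of the key type from a stream, using the key type's readFields() method to parse the bytes, and then calls compareTo() on the two resulting objects.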

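A faster sort is possible: the order of two keys can often be decided by inspecting the byte stream alone, without parsing the data it contains.
Consider comparing two strings. If the characters are read in order, the ordering is known at the first position where they differ.
Even when every byte has to be read, there is still no need to instantiate the objects themselves.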
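
To make the idea concrete, here is a minimal sketch in plain Java of comparing two byte ranges without building any objects. The class name RawByteCompare is hypothetical and this is not Hadoop's own code; it only illustrates deciding the order at the first differing byte, with bytes treated as unsigned values.

```java
import java.nio.charset.StandardCharsets;

public class RawByteCompare {
    /** Compares two byte ranges lexicographically, treating bytes as unsigned. */
    public static int compareBytes(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int end1 = s1 + l1;
        int end2 = s2 + l2;
        for (int i = s1, j = s2; i < end1 && j < end2; i++, j++) {
            int a = b1[i] & 0xff;
            int b = b2[j] & 0xff;
            if (a != b) {
                return a - b; // the first differing byte decides the order
            }
        }
        return l1 - l2; // one range is a prefix of the other: shorter sorts first
    }

    public static void main(String[] args) {
        byte[] x = "hadoop".getBytes(StandardCharsets.UTF_8);
        byte[] y = "hadron".getBytes(StandardCharsets.UTF_8);
        // The loop stops at index 3 ('o' vs 'r'); later bytes are never read.
        System.out.println(compareBytes(x, 0, x.length, y, 0, y.length) < 0); // prints true
    }
}
```

No String is ever constructed here: the comparison works on the raw bytes and stops as soon as the answer is known.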

To support this fast sorting mechanism, have the comparator for your data type extend the WritableComparator class.

Then override the following method:

public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)

The default implementation is in org.apache.hadoop.io.WritableComparator.

The corresponding method:

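		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
		  try {
		    // deserialize the first key from its byte range
		    buffer.reset(b1, s1, l1);
		    key1.readFields(buffer);
		    // deserialize the second key from its byte range
		    buffer.reset(b2, s2, l2);
		    key2.readFields(buffer);
		  } catch (IOException e) {
		    throw new RuntimeException(e);
		  }
		  // compare the two fully constructed objects
		  return compare(key1, key2);
		}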

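How it works:
once each object has been deserialized from its own byte stream, the two objects are compared directly.
Both objects must be fully constructed, and both must be deserialized before the comparison can take place.
The Text class overrides this method to allow an incremental comparison.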

Code:

		/** A WritableComparator optimized for Text keys. */
		public static class Comparator extends WritableComparator {
		  public Comparator() {
		    super(Text.class);
		  }

		  public int compare(byte[] b1, int s1, int l1,
		                     byte[] b2, int s2, int l2) {
		    int n1 = WritableUtils.decodeVIntSize(b1[s1]);
		    int n2 = WritableUtils.decodeVIntSize(b2[s2]);
		    return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
		  }
		}

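Notes on the code:
A Text object is serialized by first writing its length field to the byte stream, followed by the string encoded as UTF-8.

The decodeVIntSize method determines how many bytes are occupied by the integer that holds that length. The comparator skips those bytes and compares the bytes of the UTF-8 string itself, using the compareBytes method.
As soon as a differing byte is found, the result is returned and the remaining bytes are never examined.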
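
The layout described above can be sketched in plain Java. This is a simplified illustration, not Hadoop's implementation: the class LengthPrefixedCompare and its serialize method are hypothetical, and the sketch assumes a fixed one-byte length prefix (strings of at most 127 UTF-8 bytes) where the real comparator asks decodeVIntSize for the prefix size.

```java
import java.nio.charset.StandardCharsets;

public class LengthPrefixedCompare {
    /** Hypothetical serializer: one length byte, then the UTF-8 bytes. */
    public static byte[] serialize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch handles at most 127 bytes");
        }
        byte[] out = new byte[utf8.length + 1];
        out[0] = (byte) utf8.length;
        System.arraycopy(utf8, 0, out, 1, utf8.length);
        return out;
    }

    /** Compares two serialized records by skipping the length prefix. */
    public static int compare(byte[] b1, int s1, int l1,
                              byte[] b2, int s2, int l2) {
        int n1 = 1; // prefix size; always 1 in this sketch (assumption)
        int n2 = 1;
        return compareBytes(b1, s1 + n1, l1 - n1, b2, s2 + n2, l2 - n2);
    }

    /** Unsigned lexicographic comparison that stops at the first difference. */
    static int compareBytes(byte[] b1, int s1, int l1,
                            byte[] b2, int s2, int l2) {
        int end1 = s1 + l1;
        int end2 = s2 + l2;
        for (int i = s1, j = s2; i < end1 && j < end2; i++, j++) {
            int a = b1[i] & 0xff;
            int b = b2[j] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }

    public static void main(String[] args) {
        byte[] a = serialize("zebra");    // length byte 5
        byte[] b = serialize("aardvark"); // length byte 8
        // Skipping the prefix compares 'z' with 'a', so "zebra" sorts after "aardvark".
        System.out.println(compare(a, 0, a.length, b, 0, b.length) > 0); // prints true
    }
}
```

The example also shows why the prefix has to be skipped: comparing the raw records from their first byte would compare the lengths 5 and 8 and wrongly place "zebra" first.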

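Note that you do not need to specify this comparator explicitly in a Hadoop program.
Registering it is enough, and Hadoop will then use it automatically, for example: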

		static {
		  // register this comparator
		  WritableComparator.define(Text.class, new Comparator());
		}
