Apache Spark 2.0.0 Released with API Updates
Date: 2016-08-02 22:05  Source: linux.it.net.cn  Author: IT
Apache Spark 2.0.0 has been released. Apache Spark is an open-source cluster computing framework similar to Hadoop, but with some useful differences that make it perform better on certain workloads: Spark keeps distributed datasets in memory, so in addition to interactive queries it can also optimize iterative workloads.
This release focuses on API updates, SQL 2003 support, R UDF support, and performance improvements. Around 300 developers contributed some 2,500 patches.
API updates in Apache Spark 2.0.0:
- Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
- SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility (see the sketch after this list).
- A new, streamlined configuration API for SparkSession
- Simpler, more performant accumulator API
- A new, improved Aggregator API for typed aggregation in Datasets
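As a minimal illustration (not taken from the release notes), the Scala sketch below, runnable in spark-shell, shows the new SparkSession entry point, its configuration API, and the fact that in Scala a DataFrame is just Dataset[Row]; the Person case class and the people.json path are assumptions made for this example.

import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Long)

// SparkSession replaces SQLContext/HiveContext as the single entry point.
val spark = SparkSession.builder()
  .appName("spark-2.0-api-sketch")
  .master("local[*]")                              // local mode, assumed here
  .config("spark.sql.shuffle.partitions", "4")     // streamlined config API
  .getOrCreate()

import spark.implicits._

// In Scala, DataFrame is simply an alias for Dataset[Row];
// as[Person] turns it into a typed Dataset.
val df = spark.read.json("people.json")            // hypothetical input file
val people = df.as[Person]

people.filter(_.age > 21).show()

spark.stop()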
SQL updates in Apache Spark 2.0.0:
- A native SQL parser that supports both ANSI-SQL as well as Hive QL
- Native DDL command implementations
- Subquery support (see the sketch after this list), including:
  - Uncorrelated Scalar Subqueries
  - Correlated Scalar Subqueries
  - NOT IN predicate Subqueries (in WHERE/HAVING clauses)
  - IN predicate subqueries (in WHERE/HAVING clauses)
  - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
- View canonicalization support
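As a rough illustration of the subquery support in the new parser (the orders and customers tables and their columns are assumptions for this sketch, not part of the release notes), queries like the following can be run through spark.sql, e.g. from spark-shell:

// Assumes an existing SparkSession `spark` and two registered temp views,
// `orders(order_id, customer_id, amount)` and `customers(id, name)`.

// Uncorrelated scalar subquery: compare each row against one aggregated value.
spark.sql("""
  SELECT order_id, amount
  FROM orders
  WHERE amount > (SELECT AVG(amount) FROM orders)
""").show()

// IN predicate subquery in a WHERE clause.
spark.sql("""
  SELECT name
  FROM customers
  WHERE id IN (SELECT customer_id FROM orders WHERE amount > 100)
""").show()

// NOT EXISTS predicate subquery (correlated).
spark.sql("""
  SELECT c.name
  FROM customers c
  WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").show()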
New features:
- Native CSV data source, based on Databricks' spark-csv module
- Off-heap memory management for both caching and runtime execution
- Hive style bucketing support
- Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch (see the sketch after this list).
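A minimal sketch of the built-in CSV reader and the sketch-based statistics, assuming an existing SparkSession `spark` and a hypothetical sales.csv file with price and product_id columns:

// Built-in CSV data source (no external spark-csv package needed in 2.0).
val sales = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("sales.csv")                                // hypothetical input file

// Approximate median and 95th percentile of `price`, within 1% relative error.
val quantiles = sales.stat.approxQuantile("price", Array(0.5, 0.95), 0.01)
println(quantiles.mkString(", "))

// Bloom filter over `product_id`: ~1M expected items, 3% false-positive rate.
val bloom = sales.stat.bloomFilter("product_id", 1000000L, 0.03)
println(bloom.mightContain("SKU-123"))

// Count-min sketch over `product_id`: relative error 1%, confidence 95%, seed 42.
val cms = sales.stat.countMinSketch("product_id", 0.01, 0.95, 42)
println(cms.estimateCount("SKU-123"))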
Performance enhancements:
- Substantial (2 - 10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole stage code generation (see the sketch after this list).
- Improved Parquet scan throughput through vectorization
- Improved ORC performance
- Many improvements in the Catalyst query optimizer for common workloads
- Improved window function performance via native implementations for all window functions
- Automatic file coalescing for native data sources
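Whole-stage code generation can be observed directly in the physical plan; in the rough sketch below (assuming an existing SparkSession `spark`), operators fused by whole-stage codegen show up marked with an asterisk in the explain() output:

// Sum a million ids; the range, projection and aggregation get fused
// into generated code by whole-stage codegen.
val agg = spark.range(0, 1000000).selectExpr("sum(id) AS total")

// Physical plan: whole-stage-codegen'd operators appear prefixed with '*'.
agg.explain()

agg.show()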
For more details, see the release notes.
Download: http://spark.apache.org/downloads.html