5. 李呈祥 - Apache Spark: Latest Technical Progress and 3.0+ Outlook (21 pages).pdf

Uploaded by: 懒人 | ID: 83829 | 2022-07-20 | 21 pages | 2.02 MB

Apache Spark: Latest Technical Progress and 3.0+ Outlook
李呈祥 (司麟)
Senior Technical Expert, Computing Platform Business Unit, Alibaba

Agenda
- 3.0
- Spark on Cloud
- Data Warehouse Enhancement
- Spark + AI

A Unified Analytics Engine for Large-scale Data Processing
- Easy-to-use API
- Rich Ecosystem Support
- Efficient Engine

Data Warehouse Enhancement

Delta Lake
- ACID Transactions
- Scalable Metadata Handling
- Time Travel (data versioning)
- Open Format
- Unified Batch and Streaming Source and Sink
- Schema Enforcement
- Coming soon: Audit History, Full DML Support, Expectations

Data Source V2
- Unified API for batch and streaming
- Flexible API for high performance implementation
- Flexible API for metadata management
- Target 3.0

Runtime Optimization

Adaptive Execution
- Dynamically optimize the execution plan at runtime based on the statistics of the previous stage.
- Self tuning the number of reducers
- Adaptive join strategy
- Automatic skew join handling

EMR Runtime Filter
- Filter the big table with runtime statistics of the join key.
- Support both partitioned tables and normal tables.

EMR Spark Relational Cache
Users may analyze data in certain access patterns:
- Regularly join 2 tables?
- Regularly aggregate by certain fields?
- Regularly filter by certain fields?
Data organization: partition, bucket, sort, file index, zorder
Data pre-computation: pre-filter, denormalization, pre-aggregation
Make data adaptive to compute, so Spark computes faster.

EMR Spark Relational Cache
- Easy to build and maintain
- Transparent to user

    CREATE VIEW emp_flat AS
    SELECT * FROM employee, address WHERE e_addrId = a_addrId;

    CACHE TABLE emp_flat
    USING parquet
    PARTITIONED BY (e_ob_date)

User query:

    SELECT * FROM employee, address
    WHERE e_addrId = a_addrId AND a_cityName = 'ShangHai'

[Figure: User Query -> Spark Optimizer (with Cached Meta: emp_flat) -> optimized plan]

Spark on Cloud

Storage and Computing Disaggregation
Why disaggregate storage and computing:
- Pay as you go.
- Scale independently of each other.
- More rel

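This report on Apache Spark's technical progress focuses on version 3.0 and the directions beyond it. 李呈祥 (司麟), a senior technical expert at Alibaba, outlines Spark's enhancements in cloud computing, data warehousing, and AI for large-scale data processing. Spark 3.0+ will optimize execution plans at runtime: it dynamically adjusts the number of reducers, adopts adaptive join strategies, and handles data skew. Delta Lake features include ACID transactions, scalable metadata handling, time travel (data versioning), an open format, and unified batch and streaming sources and sinks. Upcoming features include audit history and full DML support. The Spark optimizer generates optimized execution plans, while JindoFS fills the gap between object storage and compute frameworks by providing a file system API and metadata management. On the cloud side, Spark will support a remote shuffle service, improving the elasticity of storage and compute. In addition, Spark 3.0+ will support dynamic resource allocation and Kerberos authentication, and plans compatibility with Hadoop 3.x and Hive 2.3. Scala 2.12 is also fully supported. Project Hydrogen aims to unify Spark into a single AI processing pipeline, optimizing AI jobs through barrier execution mode and accelerator-aware scheduling. Spark's ML libraries support GPUs and other accelerators, such as FPGAs, through optimized data exchange. These advances aim to make Spark computation faster while lowering cost and improving reliability.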
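The adaptive execution ideas named in the slides (self tuning the number of reducers, automatic skew join handling) can be illustrated with a small standalone sketch. This is not Spark's implementation: the function names, the `target_size` parameter, and the `factor` threshold are hypothetical, chosen only for illustration. It assumes the per-partition shuffle sizes observed after the previous stage are already available as a list of byte counts.

```python
from statistics import median
from typing import List


def coalesce_partitions(sizes: List[int], target_size: int) -> List[List[int]]:
    """Group contiguous shuffle partitions so each group stays near target_size.

    sizes[i] is the observed size of shuffle partition i. Each returned group
    lists the partition indices that one reducer would read.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current = []
            current_size = 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


def skewed_partitions(sizes: List[int], factor: float = 5.0) -> List[int]:
    """Return indices of partitions larger than factor times the median size."""
    if not sizes:
        return []
    threshold = factor * median(sizes)
    return [i for i, size in enumerate(sizes) if size > threshold]


if __name__ == "__main__":
    observed = [10, 20, 30, 100, 5, 5, 5]
    # Seven shuffle partitions end up being read by three reducers.
    print(coalesce_partitions(observed, target_size=64))
    # The oversized partition is flagged as a candidate for skew handling.
    print(skewed_partitions(observed))
```

With the sample sizes above, the small neighbouring partitions are merged while the oversized one gets a reducer of its own and is also flagged as skewed. A real engine would then split the flagged partition across several tasks rather than only report it.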