Spark RAPIDS ML: GPU Accelerated Distributed ML in Spark Clusters
Jinfeng Li and Erik Ordentlich
NVIDIA
June 2024
© 2024 Databricks Inc. All rights reserved

Scaling Apache Spark With GPUs
RAPIDS Accelerator for Apache Spark

[Figure: timeline from 2000 to 2030 (Hadoop, Spark, Spark GPU), contrasting Spark 2.0 on CPUs with GPU Accelerated Spark 3.x against the growth in requirement for data processing]

Key Spark 3 innovations
- Columnar processing support in the Catalyst query optimizer allows efficient GPU acceleration
- GPU-aware scheduling of executors with a specified number of GPUs, and control over how many GPUs each task gets

NVIDIA RAPIDS Accelerator
Key technologies for GPU acceleration

[Figure: stack diagram]
- DATA ANALYTICS APPLICATIONS AND AI/ML PIPELINES
- APACHE SPARK PLATFORM (Amazon EMR, Google Cloud Dataproc)
- ACCELERATED BATCH DATA PROCESSING: Spark SQL and DataFrames, via the RAPIDS Accelerator for Apache Spark
- ACCELERATED SPARK MACHINE LEARNING: MLlib, via the RAPIDS Accelerator for Apache Spark ML
- RAPIDS
- GPU-ACCELERATED INFRASTRUCTURE

No Query Changes
- Add the jar to the classpath and set the spark.plugins config
- No change to SQL and DataFrame code
- Compatible with PySpark, SparkR, Java, Scala and other DataFrame-based APIs
- Seamless fallback to CPU for unsupported operations

RAPIDS Accelerator for Apache Spark

    spark.sql("""
        SELECT o_order_priority, count(*) AS order_count
        FROM orders
        WHERE o_orderdate >= DATE '1993-07-01'
          AND o_orderdate < DATE '1993-07-01' + interval '3' month
          AND EXISTS (SELECT * FROM lineitem
                      WHERE l_orderkey = o_orderkey
                        AND l_commitdate ...

    df.join(df2, "name").select(df.weight, df2.height)

RAPIDS Accelerator for Apache Spark ML

    pyspark.ml.clustering.KMeans().fit(df)

Package Import Change
- Compatible with pyspark.ml DataFrame APIs
- Requires no application code change
- Package import change

    from pyspark.ml.clustering import KMeans
    k