No-Code Change in Your Python UDF for Arrow Optimization
Hyukjin Kwon
June 10, 2025

Introduction

What I'm presenting today
- Run custom logic
- Flexible data processing and the ability to use external Python libraries
- Widely used in the PySpark ecosystem
- However, Python UDFs are slow, and traditional Pandas UDFs are hard to learn

Optimization using Apache Arrow
- Can be turned on or off easily with environment settings and a new parameter
- Several times faster performance
- No need to change existing code: just enable one environment setting
- (Databricks-specific) Integration with Photon is in progress

Agenda
- What is a Python UDF?
- Arrow-optimized Python UDF
- Benchmark results!

Python UDF

What is a Python UDF?
- User Defined Function
- Write data processing logic directly in Python code within PySpark

    from pyspark.sql.functions import udf

    @udf
    def my_upper(s):
        return s.upper()

    df.select(my_upper(df.name))

Why is it widely used?
- Enables interoperability with other systems using external libraries not supported by PySpark
- Existing logic that used to run on a single node can be reused and scaled out across a distributed system

    from pyspark.sql.functions import udf

    @udf
    def my_existing_func(s):
        import thirdparty
        return thirdparty.compute(s)

    df.select(my_existing_func(df.name))

Integrate with existing working logic and external libraries

Why is it widely used?
- PySpark's built-in expressions make it difficult to represent complex conditions or custom logic
- Python UDFs can solve these cases much more easily

    import re
    from pyspark.sql.functions import udf

    @udf
    def clean_username(email):
        # Extracts only the username from an email address
        # and filters out special characters or digits
        username = email.split("@")[0]
        return re.sub(r"[^a-zA-Z]", "", username)

    df.select(clean_username(df.email))

Complex logic hard to express with SQL or the standard DataFrame API

But the performance is...

Spark Executor
Python work