Hudi_Presto: An Attempt on the News Break Data Platform, by 关立胜 (Lisheng Guan). PDF slide deck, 32 pages.
Lisheng GUAN, March 2023

Fast Ingestion, Query upon Unified Schema
A modern data platform try at NewsBreak

NewsBreak

Data Arch at NewsBreak

Pipeline at NewsBreak

Legacy CDH to AWS
Hours 15min
9s p95

Hudi at NewsBreak
1. Multi Sink
2. Join first then Sink

Hudi at NewsBreak: Performance

Hudi at NewsBreak: Refinement
Join for late data
Extra upsert

Hudi at NewsBreak: Metrics
10m source-limit
10GB source-limit
50 BN written/mo
30 TB written/mo
3-10 min sync interval

Hudi at NewsBreak: Details
1. Hudi 0.10.1 on EMR 5.36, from 0.9 (2022) and 0.7 (2021)
2. Default gzip is sufficient, 30% better than SNAPPY
3. DeltaStreamer, low code
   MoR
   Backport features: Protobuf schema support
   Customize payload class: partial update
   Customize transformer class: filtering and basic metrics
   FileBasedSchemaProvider + ProtoClassBasedSchemaProvider
   JsonKafkaSource + JsonDFSSource + HoodieIncrSource
4. HMS, and Presto/Spark

Hudi at NewsBreak: Tips
1. Set record.size.estimate explicitly (especially [...] = max query execution time
4. Time-window schedule vs. fixed-size-window schedule

Fast Ingestion with Hudi

Fast Query with Presto

Unified Schema Registry

Presto event stream

Another tiny example
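
The slides mention a customized payload class for partial update but do not show it. The following is only a minimal sketch of one plausible merge rule, written in plain Python rather than against Hudi's Java payload API. The function name `partial_update`, the sample records, and the rule that a null incoming field leaves the stored value unchanged are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]


def partial_update(stored: Optional[Record], incoming: Record) -> Record:
    """Merge `incoming` into `stored` (hypothetical helper, not a Hudi API).

    Assumed rule: a field that is None in `incoming` means "not provided",
    so the stored value is kept; any non-None field overwrites it.
    """
    if stored is None:
        # First write for this key: nothing to merge with.
        return dict(incoming)
    merged = dict(stored)
    for field, value in incoming.items():
        if value is not None:
            merged[field] = value
    return merged


# Hypothetical records: a late event carries only the updated counter.
stored = {"doc_id": "d1", "title": "Hello", "clicks": 3}
late = {"doc_id": "d1", "title": None, "clicks": 5}
merged = partial_update(stored, late)
print(merged)
```

Under this assumed rule, the late record updates `clicks` while the stored `title` survives, which is the behavior a partial-update payload is meant to give when late data carries only some of the columns.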