RAPID PYSPARK IMPLEMENTATION ON TIME SERIES BIG DATA
Megha Rajam Rao | Gary Garcia Molina
June 12th, 2024
©2024 Databricks Inc. All rights reserved

Data Science and Machine Learning (Advanced)
Breakout Session
Rapid PySpark custom processing on time series Big data in Databricks

Is Big data processing time consuming?

Solution: Scalable, Efficient (Fast), Consistent

Agenda
Introduction
Dataset
Clusters
Methods
Results
Conclusion

INTRODUCTION

BACKGROUND | METHOD | SOLUTION

OVERVIEW
Goal: To quantify weight changes and their association with sleep using Sleep Number Smart beds equipped with force sensors.
Problem: Raw readings were noisy due to user movements, necessitating denoising by cleaning each rolling window of the big data.
Methodology: An entropy measure, calculated with both Pandas and PySpark implementations, was used to clean and denoise the dataset.
Experimentation: Different configurations of single-node and multi-node clusters in Databricks were tested on datasets with 10 to 50 million datapoints to evaluate performance.
Result: