2024 Databricks Inc. All rights reserved

SIMPLIFYING DATA INGESTION FOR LLMs WITH UNSTRUCTURED AND DATABRICKS

Chris Maddock, Head of Product Marketing, unstructured.io
Colton Peltier, Staff Data Scientist, Databricks

Why use unstructured data?

Unstructured data is a goldmine
90% of enterprise data is unstructured*
- Powerpoints
- Webpages
- Videos
- Meeting notes
- Internal documents
- Emails
- Codebases
- Audio
- Images
*Source: IDC White Paper, Sponsored by Box Inc., “Untapped Value: What Every Executive Needs to Know About Unstructured Data,” Doc. US51128223, August 2023

RAG
unstructured data use case #1
- Embed unstructured data into a vector store
- Retrieve relevant unstructured context given a query
- Use unstructured data sources to augment LLM generation of responses
- Allows LLMs to respond with up-to-date info or context-specific language

Embedding Models
unstructured data use case #2
- Improve retrieval by fine-tuning sentence-transformer on internal documents
- Training of re-ranker models
- Paraphrase mining
- Text clustering

Training LLMs
- Instruction Fine Tune (IFT) an LLM to adapt to your use case(s)
- Update language understanding of model with new information or to a new domain with continued pre-training (CPT)
- Train an LLM from scratch on custom dataset with pre-training (PT)
unstructured data us