Table of Contents

Title Page
Executive Summary
Chapter 1 - Introduction
  1.1 Rationale
  1.2 Target outcomes
  1.3 Ground rules
Chapter 2 - Pilot participants and use cases
  2.1 Participant profile
  2.2 Use cases
  2.3 Patterns of LLM usage
Chapter 3 - Risk Assessment and Test Design
  3.1 Risk Assessment
  3.2 Metrics
  3.3 Testing approach: Test datasets
  3.4 Testing approach: Evaluators
Chapter 4 - Test Implementation
  4.1 Test Environment
  4.2 Test data and effort
  4.3 Implementation challenges
Chapter 5 - Lessons learnt
  5.1 Test what matters
  5.2 Don't expect test data to be fit for purpose
  Guest Blog: Learning from self-driving cars: Simulation Testing
  Guest Blog: Synthetic Data for Adversarial Testing
  5.3 Look under the hood
  5.4 Use LLMs as judges, but with skill and caution
  Guest Blog: LLM-as-a-judge: Pros and Cons
  5.5 Keep your human SMEs close!
  Guest Blog: LLMs can't read your mind
Chapter 6 - What's next?

Executive Summary

From Model Safety to Application Reliability

As Generative AI ("GenAI") transitions from personal productivity tools and consumer-facing chatbots into real-world environments like hospitals, airports and banks, it faces a higher bar on quality and confidence.

01 Risk assessments depend heavily on the context of the use case, e.g., there is lower tolerance for error in a clinical application than in a customer service chatbot.

02 Given the higher complexity involved in integrating foundation models with existing data sources, processes and systems, there are more potential points of failure.

However, much of the current work around AI testing focuses on the safety of foundation models, rather than the reliability of end-to-end applications. The Global AI Assurance Pilot was an attempt to address this gap: not through academic research, but by building upon real-life experiences of practitioners