AutoFeedback: Scaling Human Feedback with Custom Evaluation Models
Arjun Bansal, Log10.io
June 13, 2024
© 2024 Databricks Inc. All rights reserved

Outline
1. Problem: challenges in task-specific evaluation of LLM applications
2. Solution: AutoFeedback
3. Results
4. Deployment architecture
5. Next steps / free trial

Measuring and improving LLM accuracy today is hard
- Out-of-the-box accuracy
- Human review is the gold standard, but time-consuming and expensive
- AI-based approaches are biased and inaccurate ("LLM Evaluators Recognize and Favor Their Own Generations", Panickssery et al., 2024): models
prefer their own outputs, and also exhibit positional bias and verbosity bias (https://huggingface.co/blog/open-llm-leaderboard-rlhf).

Challenges → Solution: accurate, unbiased evaluation
- LLM as a judge
- Log10 AutoFeedback

AutoFeedback: scale human review of LLM output with custom AI models

Dataset
- TL;DR dataset (Volske et al., 2017; Stiennon et al., 2020): Reddit summaries
- Summary grading task: axes such as coherence, accuracy, coverage, and overall, each scored on a 1-7 range, plus a qualitative comment with the reviewer's reasoning
- Training superset: subset of 5,521 examples
- Test: a different subset of 100 examples
- Detailed rubric

Rubric
You are an evaluator of summaries of articles on reddit. You are tasked with grading the summaries for accuracy, coherence, coverage and overall.

Coherence
For this axis, answer the question "how coherent is the summary on its own?" A summary is coherent if, when read by itself, it's easy to understand and free of English errors. A summary
is not coherent if it's difficult to understand what the summary is trying to say. Generally, it's more important that the summary is understandable than that it is free of grammar errors.

Rubric:
Score of 1: The summary is impossible to understand.
Score of 4: The summary has mistakes or confusing phrasing that
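A rubric like this can be wired into an LLM-as-a-judge pipeline by templating it into a grading prompt and parsing a structured reply. The sketch below is a minimal illustration, not Log10's actual implementation: the prompt builder, axis names, and JSON reply format are assumptions, and the model call is stubbed out with a fabricated reply.

```python
import json

# Grading axes from the TL;DR summary-grading task (scored 1-7).
AXES = ["accuracy", "coherence", "coverage", "overall"]

# Abbreviated rubric preamble; a real system would include the full
# per-axis rubric text shown above.
RUBRIC = (
    "You are an evaluator of summaries of articles on reddit. "
    "You are tasked with grading the summaries for accuracy, coherence, "
    "coverage and overall, each on a 1-7 scale."
)


def build_grading_prompt(article: str, summary: str) -> str:
    """Assemble a judge prompt from the rubric, article, and summary.

    Asking for a JSON reply makes the scores machine-parseable.
    """
    return (
        f"{RUBRIC}\n\n"
        f"Article:\n{article}\n\n"
        f"Summary:\n{summary}\n\n"
        "Reply with a JSON object containing integer fields "
        f"{AXES} (each 1-7) and a string field 'comment'."
    )


def parse_grade(model_reply: str) -> dict:
    """Parse the judge's JSON reply and validate that scores are in range."""
    grade = json.loads(model_reply)
    for axis in AXES:
        score = int(grade[axis])
        if not 1 <= score <= 7:
            raise ValueError(f"{axis} score {score} is outside the 1-7 range")
        grade[axis] = score
    return grade


# Usage with a stubbed reply (a real system would send the prompt to an
# LLM; this reply is fabricated purely for illustration).
prompt = build_grading_prompt("Long reddit post ...", "TL;DR: ...")
stub_reply = (
    '{"accuracy": 6, "coherence": 7, "coverage": 5, "overall": 6, '
    '"comment": "Clear and mostly complete."}'
)
grade = parse_grade(stub_reply)
print(grade["overall"])  # prints 6
```

Validating the parsed scores (range check, integer coercion) matters in practice, since judge models occasionally return malformed or out-of-range values.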