US AISI¹ and UK AISI² Joint Pre-Deployment Test: OpenAI o1

December 2024

¹ US AI Safety Institute, National Institute of Standards and Technology
² UK AI Safety Institute, Department for Science, Innovation and Technology

Contents

1 Introduction
  1.1 Disclaimer
    1.1.1 Limitations to Results
2 Methodology
  2.1 Pre-deployment Evaluation
  2.2 Evaluated Models
  2.3 Agent Design
  2.4 Task Iterations and Cost
  2.5 Presenting Uncertainty
  2.6 Model-Sampling Parameters

I Cyber Capabilities Evaluations

3 US Cyber Capability Evaluation Methodology
  3.1 Cybench Dataset
  3.2 Agent Methodology and Scoring
  3.3 Transcript Review
4 US AISI Cyber Evaluation Results
  4.1 Average Success Rates
  4.2 Per-Task Results
  4.3 Messages to Solve
5 Opportunities for Future Work on US AISI Cyber Evaluations
6 UK AISI Cyber Evaluation Methodology
  6.1 Agent Methodology and Scoring
7 UK AISI Cyber Evaluation Results
  7.1 Vulnerability Discovery and Exploitation
  7.2 Network Operations
  7.3 OS Environments
  7.4 Cyber Attack Planning and Execution
8 Opportunities for Future Work on UK AISI Cyber Evaluations

II Biological Capabilities Evaluations

9 US AISI Biological Evaluation Methodology
  9.1 LAB-Bench Dataset
  9.2 Tool Use
  9.3 Scoring
10 US AISI Biological Evaluation Results
  10.1 Primary Performance Measurements
  10.2 Tool Use Ablations
  10.3 Results with Abstention
  10.4 Free response answer choice configuration
11 Opportunities for Future Work on US AISI Biological Capabilities Evaluations

III Software and AI Development Evaluations

12 US AISI Software and AI Development Evaluation Methodology
  12.1 MLAgentBench Dataset
  12.2 Agent Methodology
  12.3 Scoring
13 US AISI Software and AI Development Evaluation Results
  13.1 Average Normalized Score
  13.2 Per-Task Results
14 Opportunities for Further Work on US AISI Software and AI Development Evaluations