Pentagon Partners with Scale AI for AI Safety


01 July, 2024

In the realm of advanced computing, artificial intelligence (AI) continues to break new ground, offering innovative solutions across various sectors—including the military. In a significant step forward, the Pentagon’s Chief Digital and Artificial Intelligence Office (CDAO) has embarked on an ambitious project to develop a robust framework for testing and evaluating large language models (LLMs), with potential applications that could be game-changing for military operations.

As the latest AI tools evolve, so does the complexity of ensuring their reliability in critical situations. This is where the partnership between the CDAO and technology specialists comes into play. The framework resulting from this partnership aims to facilitate the safe deployment of AI by measuring model performance, providing instantaneous feedback to military personnel, and crafting specialized public sector evaluation sets. These sets will be pivotal in scrutinizing AI models slated for military support tasks, such as sifting through data from after-action reports.
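The article does not spell out what such an evaluation set actually looks like. As a rough sketch only, a single record in a specialized public sector evaluation set might resemble the following; the field names and the placeholder reference response are hypothetical, not part of the CDAO framework.

```python
# Hypothetical illustration of one record in a public sector evaluation set.
# Field names and values are assumptions, not the CDAO's actual schema.
evaluation_set = [
    {
        "task": "after_action_report_summary",
        "prompt": "Summarize the key findings of the attached after-action report.",
        "reference_response": "placeholder reviewed response meeting DoD standards",
        "review_status": "approved",  # responses pass multiple layers of review
    },
    # ...additional records covering other military support tasks
]
```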

Generative AI technologies, which encompass large language models, can autonomously generate text, software code, images, and other media from prompts given by users. Although these technologies hold immense promise, they also present unique and serious challenges due to their inherent complexity. Recognizing this, Pentagon leaders established Task Force Lima within the CDAO’s Algorithmic Warfare Directorate. The task force’s mission is to expedite the understanding, evaluation, and integration of generative AI into the Department of Defense (DoD).

Historically, the DoD has relied on meticulous test-and-evaluation (T&E) processes to certify that systems and technologies can perform safely and dependably prior to full deployment. However, universally accepted AI safety standards and policies remain to be codified. Adding to the challenge is the intricate nature of language and meaning within large language models, making T&E for generative AI an even more complicated endeavor.

Take, for instance, the process of evaluating a computer vision algorithm designed to differentiate between images of dogs, cats, and unrelated objects. An expert might train this algorithm with a myriad of animal pictures, as well as images of unrelated objects. To accurately assess the algorithm, the expert would also curate a separate dataset—a “ground truth”—against which the algorithm’s performance can be measured.
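To make the analogy concrete, here is a minimal sketch of scoring a classifier against a ground-truth holdout set. The evaluate_classifier function and the dummy data are illustrative only and are not drawn from any DoD system.

```python
# A minimal sketch of the "ground truth" evaluation idea described above.
from typing import Callable, List, Tuple

def evaluate_classifier(
    predict: Callable[[bytes], str],       # hypothetical model: image bytes -> label
    holdout: List[Tuple[bytes, str]],      # curated ground-truth set: (image, true label)
) -> float:
    """Return accuracy of the classifier on a held-out, labeled dataset."""
    correct = sum(1 for image, truth in holdout if predict(image) == truth)
    return correct / len(holdout) if holdout else 0.0

# Example: a dummy model that always answers "dog" scores poorly on a balanced set.
dummy_holdout = [(b"img1", "dog"), (b"img2", "cat"), (b"img3", "other")]
print(evaluate_classifier(lambda img: "dog", dummy_holdout))  # ~0.33
```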

Adapting a comparable strategy to large language models, experts from Scale AI will curate “holdout datasets” that include responses from DoD insiders, facilitating multiple layers of review and ensuring the responses meet military standards. Furthermore, as datasets pertaining to world knowledge, truthfulness, and other domains are refined, specialists can evaluate the performance of existing LLMs against them. The evaluation process will be designed to be iterative, fostering continuous improvement.
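The specific scoring methodology has not been made public. As an assumption-laden illustration, the sketch below compares a model's answers to reviewed reference responses using a crude token-overlap score, one common way to approximate agreement; the HoldoutExample structure, token_f1 metric, and query_llm callable are all hypothetical.

```python
# Illustrative only: one way to score an LLM against a reviewed holdout dataset.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HoldoutExample:
    prompt: str       # e.g., a question about an after-action report
    reference: str    # reviewed response meeting military standards

def token_f1(candidate: str, reference: str) -> float:
    """Crude token-overlap F1 between a model answer and the reference."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate_llm(query_llm: Callable[[str], str], holdout: List[HoldoutExample]) -> float:
    """Average overlap score across a non-empty holdout set; rerun after each model revision."""
    return sum(token_f1(query_llm(ex.prompt), ex.reference) for ex in holdout) / len(holdout)
```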

An essential component of the testing will involve establishing model cards: succinct documents that illuminate the optimal contexts for deploying different machine learning models and offer insights into assessing their performance. Automation is expected to play a role in developing these model cards, allowing baseline assessments of newer models as they are proposed and predicting their areas of strength and potential weakness.
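The article does not define the card format. The sketch below assumes a simple structure and shows how baseline benchmark results could be folded into a card automatically; the field names, the 0.8 and 0.5 thresholds, and generate_model_card are illustrative choices, not the CDAO's format.

```python
# A hedged sketch of an automatically generated "model card".
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelCard:
    model_name: str
    intended_uses: List[str]        # contexts where deployment appears appropriate
    out_of_scope_uses: List[str]    # contexts to avoid
    baseline_scores: Dict[str, float] = field(default_factory=dict)  # benchmark -> score

def generate_model_card(model_name: str, results: Dict[str, float]) -> ModelCard:
    """Automate the baseline-assessment portion of a card from benchmark results."""
    strengths = [name for name, score in results.items() if score >= 0.8]
    weaknesses = [name for name, score in results.items() if score < 0.5]
    return ModelCard(
        model_name=model_name,
        intended_uses=[f"tasks similar to: {', '.join(strengths) or 'none identified'}"],
        out_of_scope_uses=[f"tasks similar to: {', '.join(weaknesses) or 'none identified'}"],
        baseline_scores=results,
    )
```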

The process culminates in the goal of equipping AI models to alert DoD officials if their performance begins to deviate from tested domains. This work supports the DoD’s objective of maturing its T&E policies to suit generative AI, supplementing quantitative benchmarking with qualitative user feedback, and identifying AI models suitable for military applications using DoD-specific terminology and knowledge bases. The ultimate aim is to bolster the robustness and resilience of AI systems in classified and secure environments, paving the way for the adoption of LLM technology.
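How such alerting would be implemented is not described. One plausible shape, sketched below under stated assumptions, is a monitor that compares a rolling performance score against the baseline established during T&E and raises an alert when it drifts below tolerance; the class name, thresholds, and notify hook are hypothetical.

```python
# Illustrative drift alert: flag when live performance drifts below the tested baseline.
from collections import deque
from typing import Callable

class PerformanceMonitor:
    def __init__(self, baseline: float, tolerance: float = 0.1, window: int = 100,
                 notify: Callable[[str], None] = print):
        self.baseline = baseline      # score measured during T&E
        self.tolerance = tolerance    # acceptable drop before alerting
        self.scores = deque(maxlen=window)
        self.notify = notify          # in a real system, route to the responsible officials

    def record(self, score: float) -> None:
        """Record one live evaluation score and alert if the rolling average drifts too low."""
        self.scores.append(score)
        rolling = sum(self.scores) / len(self.scores)
        if rolling < self.baseline - self.tolerance:
            self.notify(f"Alert: rolling score {rolling:.2f} is below the tested baseline "
                        f"{self.baseline:.2f}; the model may be outside its evaluated domain.")
```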

By testing and evaluating generative AI, the DoD gains a nuanced understanding of the technology’s capabilities and limits, thereby informing responsible deployment strategies. Partnerships like these underscore the ambition to harness groundbreaking LLMs in service of national defense, marking a significant advancement in technological application within the secure confines of military operations. With such endeavors, AI-generated text, images, and other media are set to transform the landscape of military technology, heralding a new era of innovation and application.