Patronus AI Research: Language Model Struggles in Finance Industry
02 July, 2024
The integration of Artificial Intelligence (AI) in the financial sector has promised to revolutionize the way institutions operate, providing faster and more efficient services. However, recent assessments have demonstrated that the actual deployment of AI technologies in the finance industry is facing notable challenges, particularly with large language models (LLMs) and their capabilities in processing complex financial documents.
While consumer AI tools such as text and video generators are making strides, problems with accuracy and reliability in financial applications raise concerns. A recent study by Patronus AI, a startup specializing in AI testing, sheds light on these difficulties. The company developed a benchmark called FinanceBench, comprising more than 10,000 questions sourced from Securities and Exchange Commission (SEC) filings of major publicly traded companies, and tested several AI models against it.
The hope is that AI can quickly extract significant data from SEC filings and analyze their dense narrative, which could give businesses a competitive edge. So far, however, the models are proving less reliable than anticipated.
OpenAI’s GPT-4-Turbo fell short of what the industry requires. Even in the “long context” configuration, where it was given nearly the entire relevant document, the model answered only 79% of the questions correctly. That level of accuracy is concerning for a sector where precision is paramount.
These errors are compounded by the nondeterministic nature of LLMs, which are not guaranteed to return the same output for the same input. That unpredictability calls for more rigorous testing to ensure AI-powered applications are reliable.
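To illustrate why nondeterminism complicates testing, here is a minimal Python sketch of a repeated-run consistency check. It is not Patronus AI's method. The `ask` callables are hypothetical stand-ins for a real model call, and the answer strings are placeholders rather than real data.

```python
import random
from collections import Counter


def consistency_rate(ask, question, runs=5):
    """Return the fraction of runs that agree with the most common answer."""
    answers = [ask(question) for _ in range(runs)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / runs


def stable_model(question):
    # Hypothetical deterministic model: always returns the same answer.
    return "answer A"


def make_flaky_model(seed):
    # Hypothetical nondeterministic model: sometimes returns a different answer.
    rng = random.Random(seed)

    def ask(question):
        return rng.choice(["answer A", "answer A", "answer A", "answer B"])

    return ask


if __name__ == "__main__":
    question = "placeholder question about a filing"
    print(consistency_rate(stable_model, question))
    print(consistency_rate(make_flaky_model(seed=0), question, runs=20))
```

A rate below 1.0 means the same question produced different answers across runs, so a single passing test says little about how the model will behave in production.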
Patronus AI’s co-founders, who previously worked on AI-related problems at Meta, note that evaluation of AI models has been predominantly manual. With Patronus AI, which is backed by Lightspeed Venture Partners, they aim to automate the testing process so that users can be confident a model will not produce surprising or erroneous results.
The challenges are perhaps best illustrated by Microsoft’s initial use of Bing Chat, which runs on OpenAI’s GPT, to summarize an earnings press release. Observers quickly found inaccuracies and invented figures in the chatbot’s responses.
Patronus AI tested several models and configurations, including Meta’s Llama 2, Anthropic’s Claude 2, and OpenAI’s GPT-4 series. In the “closed book” test, where the model had no access to SEC documents, GPT-4-Turbo answered only 14 of 150 questions correctly. With “long context,” where nearly the whole relevant document was supplied, results improved but remained flawed, with GPT-4-Turbo reaching 79% accuracy.
Llama 2 produced correct answers only 19% of the time, while Claude 2 fared better, with a 75% success rate under “long context.” Even so, these figures underscore how limited current models are at handling financial information reliably.
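The closed-book result is reported as a count and the other results as percentages, so it helps to put them on the same scale. This short Python sketch does the conversion. The function name `accuracy` is illustrative, and only the 14-of-150 figure comes from the reported results.

```python
def accuracy(correct, total):
    """Return the share of correctly answered questions as a fraction."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


if __name__ == "__main__":
    # GPT-4-Turbo, closed book: 14 correct out of 150 questions.
    closed_book = accuracy(14, 150)
    print(f"closed book: {closed_book:.1%}")  # closed book: 9.3%
```

On that scale, the closed-book score of roughly 9% sits far below the 79% reported for the same model with long context.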
As the industry advances, the potential for AI tools in finance continues to grow, from image generators that may help visualize financial data to tools that might streamline various financial processes. But the sector’s strict accuracy requirements mean the technology must evolve to meet its high standards.
For AI to fully realize its potential in the financial industry, developers and researchers must ensure that these models consistently provide precise, verified information. That will likely require more sophisticated algorithms, better training data, and rigorous testing protocols.
The findings highlighted by Patronus AI’s founders point to a future where rigorous testing becomes the standard, ensuring financial AI applications are robust and reliable. While AI holds significant promise for transforming finance, considerable progress is still needed before these systems can be fully trusted to meet the industry’s demands.