OpenAI, the company behind ChatGPT, has introduced a benchmark called SimpleQA to evaluate how accurately AI models answer factual questions. Recent results underscore that even top-tier models still produce many incorrect responses.
The benchmark highlights how difficult it is to deliver consistently accurate information: models often state incorrect answers with high confidence, which raises serious concerns about their reliability.
SimpleQA development
The SimpleQA benchmark is OpenAI’s test of an AI model’s ability to answer short, fact-based questions with verifiable answers. OpenAI notes that factuality remains a hard problem, in part because measuring the accuracy of long, open-ended responses is difficult.
Unlike more complex, multi-fact queries, SimpleQA focuses on straightforward fact-checking with concise questions, a narrower scope that makes factuality easier to assess. To ensure answer quality, OpenAI had two independent AI trainers build a set of 4,326 questions spanning topics such as science, politics, culture, and technology; only questions on which both trainers agreed on the answer made it into the final benchmark. A third reviewer then re-checked a sample of 1,000 items, confirming an internal accuracy rate of about 94.4%.
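In code, that agreement filter amounts to keeping only the question-answer pairs on which both trainers converge. A minimal sketch, assuming a hypothetical data layout (OpenAI has not published its exact pipeline):

```python
def normalize(answer: str) -> str:
    """Crude normalization so trivially different spellings still match."""
    return " ".join(answer.strip().lower().split())

def build_benchmark(candidates: list[dict]) -> list[dict]:
    """Keep only questions where both trainers gave the same answer."""
    return [
        {"question": c["question"], "reference_answer": c["trainer_1_answer"]}
        for c in candidates
        if normalize(c["trainer_1_answer"]) == normalize(c["trainer_2_answer"])
    ]

candidates = [
    {"question": "In what year was the Eiffel Tower completed?",
     "trainer_1_answer": "1889", "trainer_2_answer": "1889"},                  # kept
    {"question": "Who coined the term 'black hole'?",
     "trainer_1_answer": "John Wheeler", "trainer_2_answer": "Robert Dicke"},  # dropped
]
print(build_benchmark(candidates))
```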
In evaluating responses, SimpleQA labels each model’s answer as “correct,” “incorrect,” or “not answered.” The “not answered” category specifically gauges a model’s capacity to recognize its own limitations and refrain from offering a potentially incorrect response.
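Conceptually, the grading step maps each response onto one of those three labels. The sketch below uses a naive substring match purely for illustration; the actual benchmark grades with a prompted model, so it can accept paraphrases of the reference answer:

```python
from enum import Enum

class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ANSWERED = "not answered"

# Phrases treated as abstentions rather than wrong answers (illustrative list).
ABSTENTIONS = ("i don't know", "i am not sure", "i'm not sure", "cannot answer")

def grade(response: str, reference: str) -> Grade:
    text = response.strip().lower()
    if not text or any(phrase in text for phrase in ABSTENTIONS):
        return Grade.NOT_ANSWERED
    # Naive check: does the response contain the reference answer verbatim?
    return Grade.CORRECT if reference.strip().lower() in text else Grade.INCORRECT

print(grade("I'm not sure.", "1889"))               # Grade.NOT_ANSWERED
print(grade("It was completed in 1889.", "1889"))   # Grade.CORRECT
```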
Results with OpenAI models
OpenAI tested various versions of its models using the SimpleQA benchmark, revealing notable accuracy differences based on model size and structure.
Smaller models, like GPT-4o-mini and o1-mini, had lower accuracy rates, as expected given their more limited stored knowledge. However, all models displayed a high rate of incorrect responses, highlighting ongoing challenges with accuracy.
Here are the results for each model tested:
- GPT-4o-mini: 8.6% correct answers, 0.9% unanswered, and 90.5% incorrect.
- o1-mini: 8.1% correct answers, 28.5% unanswered, and 63.4% incorrect.
- GPT-4o: 38.2% correct answers, 1.0% unanswered, and 60.8% incorrect.
- o1-preview: 42.7% correct answers, 9.2% unanswered, and 48% incorrect (the top performer).
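A complementary way to read these figures is accuracy over only the questions a model actually attempted; OpenAI reports a similar ratio as “correct given attempted.” Computing it from the percentages above (taking o1-preview’s incorrect rate as 48.0%):

```python
# Accuracy restricted to attempted questions, from the percentages listed above.
results = {
    "GPT-4o-mini": (8.6, 0.9, 90.5),   # (correct, unanswered, incorrect)
    "o1-mini":     (8.1, 28.5, 63.4),
    "GPT-4o":      (38.2, 1.0, 60.8),
    "o1-preview":  (42.7, 9.2, 48.0),  # incorrect rate rounded as reported
}

for model, (correct, unanswered, incorrect) in results.items():
    attempted = correct + incorrect
    print(f"{model}: {correct / attempted:.1%} correct when it answers "
          f"(abstained on {unanswered}% of questions)")
```

Even on this more forgiving measure, the best model is right less than half the time it chooses to answer, and the o1 models’ higher abstention rates translate into noticeably better accuracy on the questions they do attempt.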
OpenAI observed that o1-mini and o1-preview, which are designed to spend more time reasoning before they respond, declined to answer more often than GPT-4o-mini and GPT-4o. This behavior suggests that models with stronger reasoning abilities are better equipped to recognize when they lack sufficient information, reducing the likelihood of generating inaccurate or “hallucinated” content.
This approach aims to improve the reliability of AI-generated responses by minimizing incorrect information when uncertainty is detected.
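OpenAI has not disclosed how its models decide when to abstain, but a common pattern for selective answering is a confidence threshold: respond only when some internal confidence signal clears a bar, and otherwise return “not answered.” A toy sketch, not OpenAI’s method; the ToyModel class and both of its methods are hypothetical stand-ins:

```python
import random

CONFIDENCE_THRESHOLD = 0.7  # arbitrary; raising it trades coverage for accuracy

class ToyModel:
    """Hypothetical stand-in for a real model; neither method is a real API."""
    def generate(self, question: str) -> str:
        return "1889"
    def estimate_confidence(self, question: str, draft: str) -> float:
        return random.random()  # a real system would use a calibrated signal

def answer_or_abstain(model: ToyModel, question: str) -> str:
    draft = model.generate(question)
    if model.estimate_confidence(question, draft) < CONFIDENCE_THRESHOLD:
        return "not answered"  # abstain rather than risk a hallucination
    return draft

print(answer_or_abstain(ToyModel(), "In what year was the Eiffel Tower completed?"))
```

The trade-off is visible in the numbers above: the higher a model sets the bar, the fewer questions it answers, but the more often it is right when it does.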
Far from factual accuracy
The results from the SimpleQA benchmark underscore the persistent hurdles OpenAI and other companies face in enhancing the factual reliability of generative AI models. One of the key challenges remains the models’ tendency to confidently deliver incorrect answers, which poses a significant obstacle for use in fields where accuracy and reliability are paramount.
An encouraging step forward, however, is the models’ emerging ability to recognize when they lack sufficient information, opting not to respond in such cases. Yet, this capability remains limited across many versions, indicating that there’s still considerable progress to be made.
OpenAI noted that, while recent advancements are promising, achieving complete factual accuracy is still a distant objective. The company emphasizes that future improvements should aim to expand the breadth of the models’ knowledge and focus on refining their ability to critically evaluate information for accuracy before responding. This dual approach could lead to more reliable performance and mitigate risks associated with misinformation in AI-generated content.