Databricks, a leading data and AI company, is transforming how its customers use advanced retrieval-augmented generation (RAG) systems for documentation, coding assistance, and bug fixing. To maintain the high standards its customers expect, Databricks needed a reliable, customizable, and data-driven way to evaluate and select foundation models for its RAG systems.
Databricks implemented an "LLM as a judge" setup, where large language models (LLMs) evaluate the outputs of other models. However, the subjective nature of evaluating model responses and the complexities of gathering and processing feedback at scale posed significant challenges. To address these issues, Databricks partnered with SuperAnnotate, an industry-leading AI data platform, to create a highly customized evaluation process that enhanced their LLMs' performance while maintaining quality and consistency.
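The core mechanism can be sketched in a few lines of code. The snippet below is a minimal, hypothetical illustration of an LLM-judge call against an OpenAI-compatible chat API; the rubric text, scoring scale, and function names are placeholders rather than Databricks' actual prompts or tooling.

```python
# Minimal, hypothetical sketch of an "LLM as a judge" call.
# The rubric, model name, and scoring scale are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_RUBRIC = """You are grading an assistant's answer to a documentation question.
Score the answer from 1 (unusable) to 5 (fully correct and grounded) and explain briefly.
Consider factual correctness, grounding in the retrieved context, and completeness."""

def judge_answer(question: str, retrieved_context: str, answer: str) -> str:
    """Ask a judge model to grade one RAG answer against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4",  # judge model; could be swapped for a cheaper model
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {
                "role": "user",
                "content": (
                    f"Question:\n{question}\n\n"
                    f"Retrieved context:\n{retrieved_context}\n\n"
                    f"Answer to grade:\n{answer}"
                ),
            },
        ],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content
```

Keeping the rubric in the system prompt and pinning the temperature to zero are common ways to make repeated gradings more consistent.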
“Before deploying LLM as a judge in your use case, ensure it's grounded by human feedback. SuperAnnotate has been an invaluable partner in creating the golden datasets needed to benchmark our models and in enabling us to craft high-quality prompts and instructions for our LLM Judges.”
— Quinn, Senior Software Engineer - ML Training
The challenge
There were two main challenges in building a highly efficient "LLM as a judge" system:
- Subjectivity in evaluation: Evaluating an LLM is inherently subjective, as people often have different opinions on what makes a response 'good.' This variation can result in inconsistent evaluations and make it challenging to craft prompts that align LLM assessments with human judgment.
- Feedback collection difficulties: Collecting and processing a ground truth dataset to benchmark the LLM judge setup was time-consuming and resource-intensive. The process required a customized technology stack and extensive manual effort to clean and review the data, which slowed down the development cycle.
The solution: Partnering with SuperAnnotate
To overcome these challenges, Databricks leveraged SuperAnnotate’s highly customizable data annotation platform and expertise in managing large-scale AI data projects. The collaboration focused on building a solution tailored to Databricks’ specific requirements, resulting in a scalable and efficient evaluation process:
- Standardizing evaluation with a grading rubric: SuperAnnotate worked closely with Databricks to develop a comprehensive, customized grading rubric. The framework included illustrative examples and dynamic annotation rules to keep grading consistent across evaluators, and SuperAnnotate's experience running model evaluation projects was crucial in refining Databricks' grading rules and model prompts for reliable evaluations.
- Purpose-built tooling for RAG evaluation: Leveraging SuperAnnotate's customizable GenAI tool, Databricks implemented an intuitive evaluation system designed to handle the complexities of RAG models. This tailored solution allowed evaluators to efficiently assess model responses and retrieved context against the detailed grading rubric, ensuring high accuracy and quality.
- Efficient feedback collection and data processing: SuperAnnotate's QA tools, automation, and its service team's proficiency in managing annotation projects enabled Databricks to collect and process the necessary data quickly. The collaboration ensured rigorous testing of the agreement between human evaluators and the model (see the sketch after this list), leading to more accurate and reliable evaluations.
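One concrete way to test that agreement is an inter-rater statistic such as Cohen's kappa computed over a human-graded golden set. The sketch below is a self-contained illustration with hypothetical grade labels, not Databricks' benchmarking code.

```python
# Illustrative sketch: measuring agreement between human graders and an LLM judge
# with Cohen's kappa. Labels and data are hypothetical.
from collections import Counter

def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(human) == len(judge) and human
    n = len(human)

    # Observed agreement: fraction of items where both raters gave the same grade.
    p_observed = sum(h == j for h, j in zip(human, judge)) / n

    # Expected agreement by chance, from each rater's marginal label frequencies.
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    p_expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n) for label in labels
    )

    if p_expected == 1.0:  # both raters always use one label; agreement is trivial
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical grades on a small golden set (e.g., "pass"/"fail" per response).
human_grades = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge_grades = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"Cohen's kappa: {cohen_kappa(human_grades, judge_grades):.2f}")
```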
Results
With the standardized rubric, the more cost-effective GPT-3.5, prompted with few-shot examples, matched GPT-4's level of agreement with human evaluators, making it a viable option for evaluations. Switching to this model can result in a tenfold reduction in evaluation costs and a threefold increase in speed.
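As a rough illustration of how few-shot learning fits in, a handful of human-graded examples from the golden dataset can be prepended to the judge prompt so a smaller model applies the rubric consistently. The helper and example fields below are hypothetical, not Databricks' production prompts.

```python
# Hypothetical sketch: assembling a few-shot judge prompt from human-graded
# golden-set examples so a cheaper judge model applies the rubric consistently.

def build_few_shot_messages(rubric: str, graded_examples: list[dict], answer: str) -> list[dict]:
    """Return chat messages: rubric, then graded examples, then the answer to grade."""
    messages = [{"role": "system", "content": rubric}]
    for ex in graded_examples:
        messages.append({"role": "user", "content": f"Answer to grade:\n{ex['answer']}"})
        messages.append(
            {"role": "assistant", "content": f"Grade: {ex['grade']}\nReason: {ex['reason']}"}
        )
    messages.append({"role": "user", "content": f"Answer to grade:\n{answer}"})
    return messages

# Example usage with hypothetical golden-set grades:
examples = [
    {"answer": "Use spark.read.json(path) to load JSON files.", "grade": 5,
     "reason": "Correct and grounded in the docs."},
    {"answer": "JSON files cannot be read by Spark.", "grade": 1,
     "reason": "Factually wrong."},
]
messages = build_few_shot_messages(
    "Grade the answer from 1 to 5.", examples, "Call spark.read.json on the directory."
)
```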
Conclusion
SuperAnnotate proved to be an invaluable partner for Databricks, enabling the successful deployment of a highly effective "LLM as a judge" system. By standardizing the evaluation process and streamlining feedback collection, SuperAnnotate helped Databricks achieve significant cost savings and evaluate models more efficiently.