How GumGum fine-tunes LMs with SuperAnnotate and Databricks

GumGum, a leader in Contextual Intelligence, excels in digital advertising by leveraging AI to ensure Brand Safety and precise contextual targeting. To achieve this, GumGum trains advanced language models (LMs) to classify text, images, audio, and video according to specific criteria. A key to its success lies in its partnerships with SuperAnnotate and Databricks, which accelerate its data annotation and model fine-tuning processes.

"By using SuperAnnotate’s data labeling tools alongside Databricks’ extensive data processing capabilities, we’ve really streamlined how we create and refine our training datasets. It cuts out the need for constant manual data transfers and increases our efficiency. This means we can deliver sharp and contextually relevant advertising to our clients with greater precision." Iris Fu, GumGum Dir of AI Engineering

Challenges in AI-Driven Advertising

In the dynamic digital advertising landscape, GumGum faces several key challenges:

Data Quality: High-quality, accurately labeled data is critical for model performance. Even slight inaccuracies can degrade model efficacy, making precise data annotation essential.
Iterative Updates: The process of continuously refining data and models necessitates efficient workflows.
Adapting to Change: GumGum must continually refine its models to stay relevant as the digital landscape evolves. This requires an agile approach to updating taxonomies and retraining models.

How SuperAnnotate and Databricks address these challenges

SuperAnnotate accelerates dataset creation by leveraging pre-labeling with large language models (LLMs) and human review, integrating with GumGum's model training infrastructure on Databricks through the SuperAnnotate SDK. In addition to Databricks and SuperAnnotate, GumGum uses AWS in its data operations, providing secure, scalable storage via S3 that integrates closely with SuperAnnotate. This overall pipeline enhances data quality and streamlines workflows, ensuring that GumGum’s AI models are accurate and state-of-the-art.

SuperAnnotate for Pre-Labeling and Data Curation

SuperAnnotate plays a crucial role in GumGum’s workflow by enabling Model-in-the-Loop Labeling. In this process, LLMs produce pre-annotations, which human annotators then review and correct.

Pre-Labeling with LLMs: By using LLMs to pre-label data, GumGum significantly reduces noise and enhances the accuracy of their annotations. This pre-labeling step is crucial for improving data quality before the final annotation stage.
Explore Tool for Data Curation: SuperAnnotate’s Explore tool enables GumGum’s data scientists to monitor annotation distributions and conduct spot checks, further ensuring the accuracy and consistency of the labeled data.
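
The model-in-the-loop flow described above can be sketched roughly as follows. This is a minimal illustration, not GumGum's or SuperAnnotate's actual implementation. `prelabel_with_llm` is a hypothetical stand-in for an LLM call (here a trivial keyword rule), the label names and the 0.8 threshold are assumptions, and routing only low-confidence items to reviewers first is one possible policy rather than something the case study describes. No SuperAnnotate SDK calls are used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PreLabel:
    text: str
    label: str
    confidence: float


def prelabel_with_llm(text: str) -> PreLabel:
    """Hypothetical stand-in for an LLM pre-labeling call (keyword rule)."""
    lowered = text.lower()
    if "violence" in lowered:
        return PreLabel(text, "unsafe", 0.95)
    if "fight" in lowered:
        return PreLabel(text, "unsafe", 0.55)
    return PreLabel(text, "safe", 0.90)


def route_for_review(items, threshold=0.8):
    """Split pre-labels into confident ones and ones to prioritize for review."""
    confident = [p for p in items if p.confidence >= threshold]
    review = [p for p in items if p.confidence < threshold]
    return confident, review


def label_distribution(items):
    """Count labels, a rough analogue of monitoring annotation distributions."""
    return Counter(p.label for p in items)


texts = [
    "Graphic violence in the city center",
    "Pillow fight world record attempt",
    "Ten easy pasta recipes",
]
prelabels = [prelabel_with_llm(t) for t in texts]
confident, review = route_for_review(prelabels)
distribution = label_distribution(prelabels)

print("prioritized for human review:", [p.text for p in review])
print("label distribution:", dict(distribution))
```

In a real pipeline, the label counts would be compared against the expected class balance, and a skewed distribution would prompt a spot check of the affected class.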

Databricks for Data Processing and Model Fine-Tuning

Databricks is the central hub for GumGum’s data processing and model training efforts. The integration with SuperAnnotate enhances this setup by streamlining the transition from data annotation to model training and enabling Active Learning.

In-Depth Analytics and Workflow Integration: Using the SuperAnnotate Python SDK within Databricks, GumGum can perform detailed analytics on labeled data, manage the model-in-the-loop setup, and efficiently transfer annotated data. This reduces the complexity and time required for data preparation.
Efficient Model Fine-Tuning: Once the data is labeled, GumGum fine-tunes its language models within the Databricks environment. SuperAnnotate's Explore tool supports this step by letting data scientists iteratively improve annotations, which can then be fed directly into model training and evaluation.
Active Learning: GumGum identifies underperforming areas in the model and sends additional data to SuperAnnotate for targeted labeling, continuously improving model performance.
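
The active learning loop above could look something like the sketch below. It is an assumption-laden illustration, not GumGum's method: the class names, the 0.5 recall cutoff, the `pool` record layout, and both helper functions are hypothetical, and the step that would send the selected batch to SuperAnnotate is deliberately left out.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / number of true examples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def select_for_labeling(pool, weak_classes, budget):
    """Pick the least-confident pool items predicted as a weak class."""
    candidates = [item for item in pool if item["predicted"] in weak_classes]
    candidates.sort(key=lambda item: item["confidence"])
    return candidates[:budget]


# Validation results (made-up data for illustration only).
y_true = ["sports", "sports", "news", "news", "news", "food"]
y_pred = ["sports", "sports", "news", "food", "sports", "food"]
recall = per_class_recall(y_true, y_pred)
weak = {label for label, score in recall.items() if score < 0.5}

# Unlabeled pool with model predictions (also made up).
pool = [
    {"id": 1, "predicted": "news", "confidence": 0.62},
    {"id": 2, "predicted": "sports", "confidence": 0.97},
    {"id": 3, "predicted": "news", "confidence": 0.41},
    {"id": 4, "predicted": "news", "confidence": 0.88},
]
batch = select_for_labeling(pool, weak, budget=2)
print("weak classes:", weak)
print("ids to send for labeling:", [item["id"] for item in batch])
```

Filtering by predicted class is only one simple heuristic. It misses items the model wrongly assigns to other classes, so a production setup would likely combine it with uncertainty sampling across the whole pool.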

Results and Impact

The integration of SuperAnnotate and Databricks has significantly improved GumGum's operational efficiency.

Enhanced Data Quality: By leveraging LLMs for pre-labeling, GumGum has achieved a ten percentage point increase in the F1 score of the labeled data, resulting in higher-quality models.
Increased Efficiency: The streamlined workflows have reduced the time needed to prepare data for model training, allowing GumGum to iterate faster and maintain a competitive edge in the advertising industry.
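
For readers unfamiliar with the metric, the sketch below shows how an F1 score for labeled data can be computed against a gold reference, and what a percentage-point difference means (an absolute difference between two scores, not a relative change). The data is invented purely for illustration and does not reproduce GumGum's results; the label names are assumptions.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Gold reference labels and two hypothetical label sets (illustrative only).
gold = ["unsafe"] * 5 + ["safe"] * 5
labels_a = ["unsafe", "unsafe", "unsafe", "safe", "safe",
            "unsafe", "safe", "safe", "safe", "safe"]
labels_b = ["unsafe", "unsafe", "unsafe", "unsafe", "safe",
            "unsafe", "safe", "safe", "safe", "safe"]

f1_a = f1_score(gold, labels_a, positive="unsafe")
f1_b = f1_score(gold, labels_b, positive="unsafe")
point_gain = (f1_b - f1_a) * 100  # absolute difference in percentage points

print(f"F1 (set A): {f1_a:.3f}")
print(f"F1 (set B): {f1_b:.3f}")
print(f"difference: {point_gain:.1f} percentage points")
```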

Key takeaways

GumGum’s partnership with SuperAnnotate and Databricks has been instrumental in streamlining operational workflows for LLM fine-tuning. By integrating SuperAnnotate's advanced annotation tools with pre-labeling LLMs and leveraging Databricks’ robust processing capabilities, GumGum continues to excel in delivering contextually relevant and secure advertising solutions.
