Better Summaries: Building Domain-Specific Clustering¶
Series Overview: This is the second notebook in our three-part series on systematically analyzing and improving RAG systems. In the first notebook, we discovered query patterns but found limitations with generic summaries. Now we'll fix that.
Prerequisites: Complete the "1. Cluster Conversations" notebook first. You'll need the same dependencies and GOOGLE_API_KEY from the previous notebook. The custom summary model we build below also calls OpenAI through instructor, so make sure an OPENAI_API_KEY is available as well.
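Before running anything, it's worth confirming those keys are actually present in the environment. A minimal sanity check (the key names here are the ones this notebook relies on; adjust if your setup differs):

import os

# GOOGLE_API_KEY is used by Kura's default models from the previous notebook,
# OPENAI_API_KEY by the custom instructor-based summary model we build below.
for key in ("GOOGLE_API_KEY", "OPENAI_API_KEY"):
    print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")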
Why This Matters¶
The generic summaries from our initial clustering missed crucial details that would enable effective query understanding. When working with specialized domains like machine learning experiment tracking, generic descriptions like "user seeks information about tracking" fail to capture the specific W&B features, user goals, and pain points that matter for system improvement.
Custom summarization transforms vague descriptions into precise, actionable insights. Instead of "user requests assistance with tool integration," we can generate "user is configuring W&B Artifacts for model versioning in PyTorch workflows." This precision is critical for building clusters that truly reflect how users interact with your platform.
Domain-specific summaries enable us to:
- Capture exact features users are working with (Artifacts, Configs, Reports)
- Identify specific goals and pain points rather than generic categories
- Reveal usage patterns that generic summaries obscure
- Create foundations for more targeted system improvements
What You'll Learn¶
In this notebook, you'll discover how to:
Build Custom Summary Models
- Design specialized prompts that extract domain-specific information
- Implement length constraints for focused, consistent summaries
- Replace Kura's default summarization with your custom approach
Compare Summarization Approaches
- Analyze the limitations of generic vs. domain-specific summaries
- See how improved summaries change clustering outcomes
- Understand the impact of summary quality on cluster interpretability
Generate Enhanced Clusters
- Apply custom summaries to create more representative topic groups
- Configure clustering parameters for optimal domain-specific results
- Extract actionable insights about user behavior patterns
What You'll Discover¶
By the end of this notebook, you'll transform your seven generic clusters into three highly actionable categories: experiment optimization and resources (hyperparameter sweeps, visualization, multi-GPU training), project collaboration and integration (PyTorch/TensorFlow workflows, roles and permissions), and dataset versioning and audio data management. This dramatic improvement in cluster quality—from vague topics to specific, actionable user needs—will provide the foundation for building production classifiers in the next notebook.
The Power of Domain-Specific Clustering¶
While generic clustering tells you "what" users are asking about, domain-specific clustering reveals "why" and "how" they're struggling. This shift from surface-level topics to deep user intent understanding is what enables you to build targeted solutions rather than generic improvements.
By the end of this series, you'll have a complete framework for turning raw user queries into systematic, data-driven RAG improvements that address real user needs rather than perceived ones.
Creating a Custom Summary Model¶
To address the limitations we identified in our default summaries, we'll now implement our own custom summary model specific to Weights & Biases queries. By replacing the generic summarization approach with a domain-tailored solution, we can generate summaries that precisely capture the tools, features, and goals relevant to W&B users.
The WnBSummaryModel class we'll create extends Kura's base SummaryModel with a specialized prompt that instructs the model to:
- Identify specific W&B features mentioned in the query (e.g., Artifacts, Configs, Reports)
- Clearly state the problem the user is trying to solve
- Format responses concisely (20-25 words) to ensure summaries remain focused
This approach generates summaries that are not only more informative but also more consistent, making them ideal building blocks for meaningful clustering. Let's implement our custom model and see how it transforms our understanding of user query patterns.
Loading in Conversations¶
Let's start by loading our conversations and parsing them into a list of Conversation objects that Kura can work with.
from kura import CheckpointManager, Conversation
checkpoint_manager = CheckpointManager("./checkpoints", enabled=True)
conversations = checkpoint_manager.load_checkpoint("conversations.jsonl", Conversation)
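As a quick sanity check, we can look at how many conversations were loaded and peek at the opening message of one of them (a small sketch that only uses the chat_id and messages fields we rely on later in this notebook):

print(f"Loaded {len(conversations)} conversations")

# Peek at the first conversation's id and the start of its opening message
sample = conversations[0]
print(sample.chat_id)
print(sample.messages[0].content[:200])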
Let's now see what our default summaries look like.
from kura.summarisation import SummaryModel
from rich import print as rprint
summaries = await SummaryModel().summarise(conversations[:2])
for summary in summaries:
    rprint(summary)
Summarising 2 conversations: 100%|██████████| 2/2 [00:02<00:00, 1.03s/it]
ConversationSummary( summary='The user is seeking information on how to track machine learning experiments using a specific tool, including code examples and steps involved.', request="The user's overall request for the assistant is to provide guidance on experiment tracking in machine learning.", topic=None, languages=['english', 'python'], task='The task is to explain how to track machine learning experiments with code examples.', concerning_score=1, user_frustration=1, assistant_errors=None, chat_id='5e878c76-25c1-4bad-8cae-6a40ca4c8138', metadata={'conversation_turns': 1, 'query_id': '5e878c76-25c1-4bad-8cae-6a40ca4c8138'}, embedding=None )
ConversationSummary( summary='Bayesian optimization is a hyperparameter tuning technique that uses a surrogate function for informed search, contrasting with grid and random search methods.', request="The user's overall request for the assistant is to explain Bayesian optimization and its implementation for hyperparameter tuning.", topic=None, languages=['english', 'python'], task='The task is to provide information on Bayesian optimization and its application in hyperparameter tuning.', concerning_score=1, user_frustration=1, assistant_errors=None, chat_id='d7b77e8a-e86c-4953-bc9f-672618cdb751', metadata={'conversation_turns': 1, 'query_id': 'd7b77e8a-e86c-4953-bc9f-672618cdb751'}, embedding=None )
Looking at these default summaries, we can identify several key limitations that prevent them from being truly useful for clustering W&B-specific queries:
Problems with Default Summaries
Lack of Specificity: The first summary refers to "a specific tool" rather than explicitly naming Weights & Biases, missing the opportunity to highlight the domain context.
Missing Feature Details: Neither summary identifies which specific W&B features the users are interested in (experiment tracking, Bayesian optimization for hyperparameter tuning), which would be crucial for meaningful clustering.
These generic summaries would lead to clusters based primarily on query structure ("users asking for information") rather than meaningful W&B feature categories or user goals.
By defining our own summarization model, we can address these limitations and cluster our user queries based on the specific problems users face and the features they are trying to use.
Defining Our New Summary Model¶
Let's now define a new WnBSummaryModel which will address the shortcomings of the default summarization model.
We'll do so by overriding the summarise_conversation method so that our summaries become more precise and feature-focused. This better reflects how users interact with Weights & Biases and, in turn, translates into more representative clusters.
from kura.types import Conversation, ConversationSummary
from kura.summarisation import SummaryModel, GeneratedSummary
import instructor
class WnBSummaryModel(SummaryModel):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    async def summarise_conversation(
        self, conversation: Conversation
    ) -> ConversationSummary:
        # Keep the provider string and the model argument consistent
        client = instructor.from_provider("openai/gpt-4.1", async_client=True)
        async with self.semaphore:
            resp = await client.chat.completions.create(
                model="gpt-4.1",
                messages=[
                    {
                        "role": "user",
                        "content": """
Analyze the user's query and the retrieved Weights and Biases documentation to provide a focused summary.

In your response:

1. Identify the specific W&B features being used, such as:
   - Experiment tracking and logging
   - Hyperparameter optimization
   - Model registry and versioning
   - Artifact management
   - Reports and visualization
   - Multi-GPU/distributed training

2. Describe their concrete technical goal (e.g., "setting up experiment tracking across multiple GPUs" rather than just "using experiment tracking")

Format your response in 20-25 words following:

For clear technical goals:
"User needs help with [specific W&B feature] to [concrete technical goal], specifically [implementation detail/blocker]."

For general queries:
"User is asking about [W&B concept/feature] in the context of [relevant ML workflow/task]."

Reference the context below to identify the exact W&B functionality and technical requirements:

<context>
{{ context }}
</context>

Focus on technical specifics rather than general descriptions.
""",
                    },
                ],
                response_model=GeneratedSummary,
                context={"context": conversation.messages[0].content},
            )

            return ConversationSummary(
                chat_id=conversation.chat_id,
                summary=resp.summary,
                metadata={
                    "conversation_turns": len(conversation.messages),
                },
            )
We can now see the generated summaries by calling the summarise method below, on the same two conversations we summarized with the default model above.
summaries = await WnBSummaryModel().summarise(conversations[:2])
for summary in summaries:
    rprint(summary)
Summarising 2 conversations: 100%|██████████| 2/2 [00:02<00:00, 1.44s/it]
ConversationSummary( summary='User needs help with W&B experiment tracking to record hyperparameters, log training metrics, and store model artifacts for ML experiments.', request=None, topic=None, languages=None, task=None, concerning_score=None, user_frustration=None, assistant_errors=None, chat_id='5e878c76-25c1-4bad-8cae-6a40ca4c8138', metadata={'conversation_turns': 1}, embedding=None )
ConversationSummary( summary="User needs help with W&B's hyperparameter optimization feature to implement Bayesian optimization for tuning model hyperparameters, specifically setting up the search space, performance metric, and run limit.", request=None, topic=None, languages=None, task=None, concerning_score=None, user_frustration=None, assistant_errors=None, chat_id='d7b77e8a-e86c-4953-bc9f-672618cdb751', metadata={'conversation_turns': 1}, embedding=None )
Clustering with Enhanced Summaries¶
Now that we've developed a more domain-specific summarization approach tailored to the Weights & Biases ecosystem, we can apply these improved summaries to our clustering process.
Our custom WnBSummaryModel captures the specific features, workflows, and user intentions that were missing from the default summaries, providing a stronger foundation for meaningful topic discovery.
This will help reveal patterns in feature usage, common pain points, and documentation gaps that were obscured in the previous notebook's analysis. Let's see this in action below.
from kura import (
    summarise_conversations,
    generate_base_clusters_from_conversation_summaries,
    reduce_clusters_from_base_clusters,
    reduce_dimensionality_from_clusters,
    CheckpointManager
)
from kura.cluster import ClusterModel
from kura.meta_cluster import MetaClusterModel
from kura.dimensionality import HDBUMAP

async def analyze_conversations(conversations, checkpoint_manager):
    # Set up models
    summary_model = WnBSummaryModel()
    cluster_model = ClusterModel()
    meta_cluster_model = MetaClusterModel(max_clusters=4)
    dimensionality_model = HDBUMAP()

    # Run pipeline steps
    summaries = await summarise_conversations(
        conversations,
        model=summary_model,
        checkpoint_manager=checkpoint_manager
    )

    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries,
        model=cluster_model,
        checkpoint_manager=checkpoint_manager
    )

    reduced_clusters = await reduce_clusters_from_base_clusters(
        clusters,
        model=meta_cluster_model,
        checkpoint_manager=checkpoint_manager
    )

    projected = await reduce_dimensionality_from_clusters(
        reduced_clusters,
        model=dimensionality_model,
        checkpoint_manager=checkpoint_manager
    )

    return projected
checkpoint_manager = CheckpointManager("./checkpoints_2", enabled=True)
checkpoint_manager.save_checkpoint("conversations.jsonl", conversations)
clusters = await analyze_conversations(conversations, checkpoint_manager=checkpoint_manager)
# Get top-level clusters (those without parents)
parent_clusters = [cluster for cluster in clusters if cluster.parent_id is None]

# Format each cluster's info with name, description and number of chats
formatted_clusters = []
for parent in parent_clusters:
    # Add parent cluster info
    cluster_info = (
        f"[bold]({parent.id}) {parent.name}[/bold] : {parent.description} : {len(parent.chat_ids)}\n"
    )

    # Get and format child clusters
    child_clusters = [c for c in clusters if c.parent_id == parent.id]
    for child in child_clusters:
        cluster_info += f"\n • [bold]{child.name}[/bold] : {child.description} : {len(child.chat_ids)}"
        child_child_clusters = [c for c in clusters if c.parent_id == child.id]
        for child_child in child_child_clusters:
            if child_child.parent_id == child.id:
                cluster_info += f"\n + [bold]{child_child.name}[/bold] : {child_child.description} : {len(child_child.chat_ids)}"
    cluster_info += "\n\n"
    formatted_clusters.append(cluster_info)
    formatted_clusters.append("\n====\n")

# Join with newlines and print
rprint("\n\n".join(formatted_clusters))
(3943254bfb5a471385aeadcc745e478b) Manage audio files and dataset versioning : Users organized audio files for classification analysis and managed dataset versioning in W&B. They focused on tasks such as loading metadata and updating datasets for effective machine learning workflows. : 37

 • Organize audio files for classification analysis : The user prepared and structured audio data for classification tasks. This included tasks like loading, merging, and synchronizing metadata for effective analysis. : 2
   + Prepare audio data for classification analysis : The user processed and organized audio files for effective classification. This involved loading, merging DataFrames, and synchronizing metadata for analysis. : 2
 • Assist with dataset versioning in W&B : Users sought help with managing dataset versioning and artifacts in W&B. They focused on logging, tracking, and updating datasets for improved reproducibility in machine learning workflows. : 35
   + Help me manage dataset versioning in W&B : Users sought assistance with handling dataset versioning and artifacts in W&B. They specifically focused on logging, tracking, and updating datasets for better reproducibility in their machine learning workflows. : 35

====

(92d75e32975c4651a2ab46ccb8a83fd9) Optimize machine learning experiments and resources : Users explored methods to optimize and visualize machine learning experiments through Weights & Biases. They focused on hyperparameter tuning, multi-GPU training, and maximizing the tool's utilization for improved performance and efficiency. : 449

 • Optimize and visualize machine learning experiments with W&B : Users explored methods to log and visualize machine learning experiments using Weights & Biases. They focused on customizing visualizations, handling data formats, and enhancing tracking for better analysis. : 195
   + Explore data visualization with W&B Tables : Users investigated how to visualize diverse media types using wandb.Table. They aimed to manage and analyze different data formats effectively within the W&B Tables interface. : 4
   + Help me log and visualize experiments in W&B : Users requested guidance on visualizing and logging various aspects of machine learning experiments using Weights & Biases (W&B). They focused on customizing visualizations, tracking metrics, managing prompt configurations, and handling multi-class confusion matrices. : 71
   + Enhance machine learning experiment tracking with W&B : Users optimized and troubleshot experiment tracking with Weights & Biases, focusing on installation, configuration, and logging. They explored core features to improve ML workflows, ensuring better collaboration and data analysis. : 120
 • Assist with hyperparameter optimization using Weights & Biases : Users requested help with optimizing hyperparameters through Weights & Biases. They sought guidance on configuring sweeps, automating processes, and analyzing results for better performance. : 60
   + Help optimize hyperparameter sweeps using W&B : Users requested assistance in programmatically accessing and analyzing hyperparameter optimization results from W&B sweeps. They sought guidance on optimizing configurations for efficient parallel training across multiple GPUs and CPUs. : 23
   + Guide hyperparameter optimization using Weights & Biases : Users sought assistance with hyperparameter tuning using Weights & Biases and related tools. They requested support for configuring sweeps, automating processes, and troubleshooting to enhance performance. : 37
 • Assist with multi-GPU training and optimization : Users explored methods for effective multi-GPU distributed training using HuggingFace while seeking optimization of GPU resources during model training. They focused on job management, script arguments, and resource monitoring to enhance training efficiency. : 9
   + Guide multi-GPU distributed training with HuggingFace : Users requested help on setting up multi-GPU distributed training using HuggingFace Accelerate. They focused on launching jobs and managing script arguments across various hardware configurations. : 3
   + Optimize GPU resources for model training : Users sought assistance in optimizing GPU usage and memory during machine learning training. They emphasized the integration of W&B for monitoring and enhancing the efficiency of training processes. : 6
 • Assist in maximizing Weights & Biases utilization : Users sought guidance on using Weights & Biases to manage experiments and data effectively. They aimed to optimize their machine learning workflows through understanding its features, configurations, and best practices. : 185
   + Guide effective use of Weights & Biases : Users sought assistance in leveraging Weights & Biases for managing experiments and data effectively. They aimed to enhance their machine learning workflows by understanding features, configurations, and best practices related to experiment tracking, data manipulation, and artifact management. : 185

====

(126ad2b1c8054e7b93d0dc652841623c) Enhance machine learning project collaboration with W&B : Users integrated Weights & Biases into machine learning workflows while enhancing team collaboration features. They focused on optimizing tracking, management settings, and collaborative capabilities for effective project execution. : 74

 • Integrate W&B with machine learning frameworks : Users received help integrating W&B tracking into their machine learning workflows, specifically with PyTorch and TensorFlow. They also learned to manage W&B settings and SageMaker configurations for secure deployments and effective tracking. : 56
   + Integrate W&B tracking into ML workflows : Users received assistance in integrating W&B experiment tracking into their machine learning workflows using both PyTorch and TensorFlow. They learned to log metrics, hyperparameters, and artifacts effectively throughout the model training process. : 19
   + Assist with W&B and SageMaker setup tasks : Users needed help configuring sharing settings for W&B reports and managing API keys. They also sought assistance in setting up secure IAM roles for SageMaker to ensure safe deployment and access management. : 37
 • Optimize team collaboration and management in W&B : Users explored ways to enhance collaboration and management features in Weights & Biases. They focused on configuring roles, permissions, and various enterprise capabilities for improved project outcomes. : 18
   + Enhance team collaboration in Weights & Biases : Users sought to optimize collaborative features in Weights & Biases for better project management. They focused on configuring roles, permissions, and team settings to improve collaboration outcomes. : 10
   + Help with team management and enterprise features : Users sought assistance in managing team roles and permissions within W&B, including inquiries about enterprise features specific to W&B Server. They explored topics such as access control, secure storage, and distinctions in user management across membership tiers. : 8

====
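To see the split at a glance, we can tally conversations per top-level cluster from the same clusters list (a small sketch reusing the parent_clusters, name, and chat_ids fields from the code above):

total = sum(len(c.chat_ids) for c in parent_clusters)
for parent in sorted(parent_clusters, key=lambda c: len(c.chat_ids), reverse=True):
    # Print each top-level cluster's size and share of all conversations
    print(f"{parent.name}: {len(parent.chat_ids)} conversations ({len(parent.chat_ids) / total:.0%})")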
Conclusion¶
What You Learned¶
In this notebook, you learned how to create domain-specific summarization models that dramatically improve clustering quality. You discovered how to:
- Create custom summary models using specialized prompts tailored to your domain
- Replace generic descriptions with precise, feature-specific summaries
- Configure clustering parameters to achieve optimal grouping results
- Compare clustering outcomes between default and custom approaches
What We Accomplished¶
We built a custom WnBSummaryModel that addressed the key limitations from our initial clustering. By implementing domain-specific prompts that focus on W&B features and user intentions, we transformed our clustering results from generic topic groups into three highly actionable categories:
- Optimize machine learning experiments and resources (449 conversations) - The largest cluster covering users exploring experiment tracking, hyperparameter optimization, multi-GPU training, and maximizing W&B utilization for improved ML performance
- Enhance machine learning project collaboration with W&B (74 conversations) - Users integrating W&B with PyTorch/TensorFlow workflows and optimizing team collaboration features including roles, permissions, and enterprise capabilities
- Manage audio files and dataset versioning (37 conversations) - Users organizing audio data for classification analysis and managing dataset versioning workflows in W&B
This represents a significant upgrade from our previous clusters, providing much more specific and actionable information about user needs. The improved summaries eliminated the vagueness of descriptions like "user seeks information about tracking" and replaced them with precise insights about specific W&B workflows, optimization goals, and collaboration requirements.
Next: Building Production Classifiers¶
While our improved clustering gives us deep insights into historical query patterns, we need a way to act on these insights in real-time production environments. In the next notebook, "Classifiers", we'll bridge the gap between discovery and action by:
- Building production-ready classifiers using the instructor library that achieve 90.9% accuracy through systematic prompt engineering
- Creating automated labeling workflows with weak supervision to efficiently generate labeled datasets for training
- Focusing on three high-impact categories - artifacts (20% of queries), integrations (15%), and visualizations (14%) - that account for roughly 50% of all user conversations
- Applying classifiers at scale to understand true query distributions and identify exactly where to focus improvement efforts
This classifier will enable you to automatically categorize incoming queries in real-time, detect production drift when certain query types surge, and intelligently route questions to specialized retrieval systems. More importantly, you'll move from "we think users struggle with X" to "we know 20% of users need help with artifacts, 15% with integrations, and 14% with visualizations—and we can automatically detect and route these queries for specialized handling."