Better Summaries: Building Domain-Specific Clustering¶
Series Overview: This is the second notebook in our three-part series on systematically analyzing and improving RAG systems. In the first notebook, we discovered query patterns but found limitations with generic summaries. Now we'll fix that.
Prerequisites: Complete the "1. Cluster Conversations" notebook first. You'll need the same dependencies and OPENAI_API_KEY from the previous notebook.
# Install kura in Google Colab
!pip install kura
# Make sure you've set up your `OPENAI_API_KEY`
# os.environ['OPENAI_API_KEY'] = <your api key here>
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
# Create data directory and download dataset
DATA_DIRECTORY = './data'
CHECKPOINT_DIRECTORY = './checkpoints_2'
!mkdir -p {DATA_DIRECTORY}
!curl -o {DATA_DIRECTORY}/conversations.json https://usekura.xyz/assets/conversations.json
# Curl the Checkpoints
CHECKPOINT_DIRECTORY = './checkpoints_2'
os.makedirs(CHECKPOINT_DIRECTORY, exist_ok=True)
!curl -o {CHECKPOINT_DIRECTORY}/clusters.jsonl https://usekura.xyz/assets/notebooks/checkpoints_2/clusters.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/conversations.jsonl https://usekura.xyz/assets/notebooks/checkpoints_2/conversations.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/dimensionality.jsonl https://usekura.xyz/assets/notebooks/checkpoints_2/dimensionality.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/meta_clusters.jsonl https://usekura.xyz/assets/notebooks/checkpoints_2/meta_clusters.jsonl
!curl -o {CHECKPOINT_DIRECTORY}/summaries.jsonl https://usekura.xyz/assets/notebooks/checkpoints_2/summaries.jsonl
Reproducing Results: To reproduce the exact results from this notebook, the first cell downloads pre-computed checkpoints from our server. These checkpoints contain the intermediate results from each step of the clustering pipeline, allowing you to follow along without waiting for the computationally expensive embedding and clustering operations to complete.
To download our precomputed checkpoints, make sure you run the curl commands above so that the files land in the checkpoints_2 directory.
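If you'd like to confirm the downloads succeeded before continuing, a quick check along the lines of the sketch below (not part of the original pipeline; it only assumes the CHECKPOINT_DIRECTORY defined above) prints each expected checkpoint file and its size.
# Optional sanity check: confirm the precomputed checkpoint files were downloaded
import os
expected_files = [
    "conversations.jsonl",
    "summaries.jsonl",
    "clusters.jsonl",
    "meta_clusters.jsonl",
    "dimensionality.jsonl",
]
for name in expected_files:
    path = os.path.join(CHECKPOINT_DIRECTORY, name)
    size = os.path.getsize(path) if os.path.exists(path) else 0
    print(f"{name}: {'ok' if size > 0 else 'MISSING'} ({size} bytes)")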
Why This Matters¶
The generic summaries from our initial clustering missed crucial details that would enable effective query understanding. When working with specialized domains like machine learning experiment tracking, generic descriptions like "user seeks information about tracking" fail to capture the specific W&B features, user goals, and pain points that matter for system improvement.
Custom summarization transforms vague descriptions into precise, actionable insights. Instead of "user requests assistance with tool integration," we can generate "user is configuring W&B Artifacts for model versioning in PyTorch workflows." This precision is critical for building clusters that truly reflect how users interact with your platform.
Domain-specific summaries enable us to:
- Capture exact features users are working with (Artifacts, Configs, Reports)
- Identify specific goals and pain points rather than generic categories
- Reveal usage patterns that generic summaries obscure
- Create foundations for more targeted system improvements
What You'll Learn¶
In this notebook, you'll discover how to:
Extend Kura's summary functionality
- Design specialized prompts that extract domain-specific information
- Evaluate domain-focused approaches against generic ones
- Replace Kura's default summarization with your custom approach
Compare Summarization Approaches
- Analyze the limitations of generic vs. domain-specific summaries
- See how improved summaries change clustering outcomes
- Understand the impact of summary quality on cluster interpretability
Generate Enhanced Clusters
- Apply custom summaries to create more representative topic groups
- Configure clustering parameters for optimal domain-specific results
- Extract actionable insights about user behavior patterns
What You'll Discover¶
By the end of this notebook, you'll transform your seven generic clusters into three highly actionable categories: Artifacts, Visualisations, and Integrations. This dramatic improvement in cluster quality—from vague topics to specific, actionable user needs—will provide the foundation for building production classifiers in the next notebook.
The Power of Domain-Specific Clustering¶
While generic clustering tells you "what" users are asking about, domain-specific clustering reveals "why" and "how" they're struggling. This shift from surface-level topics to deep user intent understanding is what enables you to build targeted solutions rather than generic improvements.
By the end of this series, you'll have a complete framework for turning raw user queries into systematic, data-driven RAG improvements that address real user needs rather than perceived ones.
Creating Domain Specific Summaries¶
To address the limitations we identified in our default summaries, we'll now see how easy it is to replace the generic summarisation approach with a domain-specific prompt that precisely captures the tools, features, and goals relevant to W&B users.
With this new prompt, our goal is to:
- Identify specific W&B features mentioned in the query (e.g., Artifacts, Configs, Reports)
- Clearly state the problem the user is trying to solve
- Format responses concisely (25 words or less) to ensure summaries remain focused
This approach generates summaries that are not only more informative but also more consistent, making them ideal building blocks for meaningful clustering. Let's implement our custom model and see how it transforms our understanding of user query patterns.
Loading in Conversations¶
Let's start by loading our conversations and parsing them into a list of Conversation objects that Kura can work with.
import json
from kura.types import Message, Conversation
from datetime import datetime
def process_query_obj(obj: dict):
    return Conversation(
        chat_id=obj["query_id"],
        created_at=datetime.now(),
        messages=[
            Message(
                created_at=datetime.now(),
                role="user",
                content=f"""
User Query: {obj["query"]}
Retrieved Information : {obj["matching_document"]}
""",
            )
        ],
        metadata={"query_id": obj["query_id"]},
    )

with open(f"{DATA_DIRECTORY}/conversations.json") as f:
    conversations_raw = json.load(f)

conversations = [process_query_obj(obj) for obj in conversations_raw]
We'll then use the JSONL checkpoint manager to save our conversations to our checkpoint directory.
from kura.checkpoints import JSONLCheckpointManager
checkpoint_manager = JSONLCheckpointManager(CHECKPOINT_DIRECTORY, enabled=True)
checkpoint_manager.save_checkpoint("conversations", conversations)
conversations = checkpoint_manager.load_checkpoint("conversations", Conversation)
Let's now see what our default summaries look like.
from kura.summarisation import SummaryModel
from rich import print as rprint
summaries = await SummaryModel().summarise(conversations[:2])
for summary in summaries:
    rprint(summary)
Summarising 2 conversations: 100%|██████████| 2/2 [00:02<00:00, 1.03s/it]
ConversationSummary( summary='The user is seeking information on how to track machine learning experiments using a specific tool, including code examples and steps involved.', request="The user's overall request for the assistant is to provide guidance on experiment tracking in machine learning.", topic=None, languages=['english', 'python'], task='The task is to explain how to track machine learning experiments with code examples.', concerning_score=1, user_frustration=1, assistant_errors=None, chat_id='5e878c76-25c1-4bad-8cae-6a40ca4c8138', metadata={'conversation_turns': 1, 'query_id': '5e878c76-25c1-4bad-8cae-6a40ca4c8138'}, embedding=None )
ConversationSummary( summary='Bayesian optimization is a hyperparameter tuning technique that uses a surrogate function for informed search, contrasting with grid and random search methods.', request="The user's overall request for the assistant is to explain Bayesian optimization and its implementation for hyperparameter tuning.", topic=None, languages=['english', 'python'], task='The task is to provide information on Bayesian optimization and its application in hyperparameter tuning.', concerning_score=1, user_frustration=1, assistant_errors=None, chat_id='d7b77e8a-e86c-4953-bc9f-672618cdb751', metadata={'conversation_turns': 1, 'query_id': 'd7b77e8a-e86c-4953-bc9f-672618cdb751'}, embedding=None )
Looking at these default summaries, we can identify several key limitations that prevent them from being truly useful for clustering W&B-specific queries:
Problems with Default Summaries
Lack of Specificity: The first summary refers to "a specific tool" rather than explicitly naming Weights & Biases, missing the opportunity to highlight the domain context.
Missing Feature Details: Neither summary identifies which specific W&B features the users are interested in (experiment tracking, Bayesian optimization for hyperparameter tuning), which would be crucial for meaningful clustering.
These generic summaries would lead to clusters based primarily on query structure ("users asking for information") rather than meaningful W&B feature categories or user goals.
By defining our own summarisation model, we can address these limitations and cluster our user queries based on the specific problems they're solving and the features they're trying to use.
Defining Our New Summary Model¶
Notice how simple this is: we just need to modify the prompt and pass it to Kura's existing summarise_conversations function. This keeps the code easy to work with, since we configure behavior through parameters rather than building complex class hierarchies.
By customizing the prompt passed to summarise_conversations, our summaries become more precise and feature-focused. They better reflect how users interact with Weights & Biases, which in turn translates to more representative clusters.
from kura.summarisation import summarise_conversations
WNB_SUMMARY_PROMPT = """
Analyze the user's query and the retrieved Weights and Biases documentation to provide a focused summary.
In your response:
1. Identify the specific W&B features being used, such as:
- Experiment tracking and logging
- Hyperparameter optimization
- Model registry and versioning
- Artifact management
- Reports and visualization
- Multi-GPU/distributed training
2. Describe their concrete technical goal (e.g., "setting up experiment tracking across multiple GPUs" rather than just "using experiment tracking")
Format your response in 20-25 words following:
For clear technical goals:
"User needs help with [specific W&B feature] to [concrete technical goal], specifically [implementation detail/blocker]."
For general queries:
"User is asking about [W&B concept/feature] in the context of [relevant ML workflow/task]."
Reference the context below to identify the exact W&B functionality and technical requirements:
<messages>
{% for message in conversation.messages %}
<message>{{message.role}}: {{message.content}}</message>
{% endfor %}
</messages>
Focus on technical specifics rather than general descriptions.
Also make sure to extract
- **Languages**: What are the main languages of the conversation? Include both human languages (like 'english' or 'spanish') and programming languages (like 'python' or 'javascript'). Use lowercase full names.
- **Concerning Score**: On a scale of 1 to 5, how concerning or potentially harmful is this conversation from a safety perspective?
- 1: Not concerning at all. Completely benign conversation.
- 2: Slightly concerning, but likely harmless.
- 3: Moderately concerning. May warrant a closer look.
- 4: Very concerning. Likely needs review.
- 5: Extremely concerning. Immediate review needed.
- **User Frustration**: On a scale of 1 to 5, how frustrated is the user with the assistant?
- 1: Not frustrated at all. The user is happy with the assistant.
- 2: Slightly frustrated. The user is slightly annoyed with the assistant.
- 3: Moderately frustrated. The user is moderately annoyed with the assistant.
- 4: Very frustrated. The user is very annoyed with the assistant.
- 5: Extremely frustrated. The user is extremely annoyed with the assistant.
- **Assistant Errors**: What errors did the assistant make?
Example:
- "Responses were too long and verbose"
- "Misunderstood the user's intent or request"
- "Used wrong tool for the task"
- "Ignored user's stated preferences or constraints"
- "Provided outdated or incorrect information"
- "Failed to maintain conversation context"
"""
We can now see the generated summaries below by passing our new prompt into the summarise_conversations function. We'll use the same two conversations we summarised above.
from kura.summarisation import SummaryModel
from rich import print as rprint
summaries = await summarise_conversations(
    conversations=conversations[:2],
    prompt=WNB_SUMMARY_PROMPT,
    model=SummaryModel()
)

for summary in summaries:
    rprint(summary)
Summarising 2 conversations: 100%|██████████| 2/2 [00:02<00:00, 1.13s/it]
ConversationSummary( summary="User needs help with experiment tracking to set up logging of hyperparameters and metrics during model training, specifically using W&B's logging functions.", request="Analyze the user's query and the retrieved Weights and Biases documentation to provide a focused summary.", topic='machine learning', languages=['english', 'python'], task='summarize user query and documentation', concerning_score=1, user_frustration=1, assistant_errors=None, chat_id='5e878c76-25c1-4bad-8cae-6a40ca4c8138', metadata={'conversation_turns': 1, 'query_id': '5e878c76-25c1-4bad-8cae-6a40ca4c8138'}, embedding=None )
ConversationSummary( summary='User needs help with hyperparameter optimization to implement Bayesian optimization using Weights and Biases, specifically regarding the setup process and input requirements.', request="Analyze the user's query and the retrieved Weights and Biases documentation to provide a focused summary.", topic='machine learning', languages=['english', 'python'], task='summarize user query and documentation', concerning_score=1, user_frustration=1, assistant_errors=None, chat_id='d7b77e8a-e86c-4953-bc9f-672618cdb751', metadata={'conversation_turns': 1, 'query_id': 'd7b77e8a-e86c-4953-bc9f-672618cdb751'}, embedding=None )
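To make the contrast concrete, here's a small optional sketch that re-runs both summarisers on the same two conversations and prints only the summary text for each, matched by chat_id. It uses only the calls shown above, but the side-by-side comparison itself is our addition rather than part of the original notebook.
# Optional comparison sketch: default vs. domain-specific summaries for the same conversations
default_summaries = await SummaryModel().summarise(conversations[:2])
custom_summaries = await summarise_conversations(
    conversations=conversations[:2],
    prompt=WNB_SUMMARY_PROMPT,
    model=SummaryModel(),
)

default_by_id = {s.chat_id: s for s in default_summaries}
for custom in custom_summaries:
    default = default_by_id[custom.chat_id]
    print(f"chat {custom.chat_id}")
    print(f"  default: {default.summary}")
    print(f"  custom : {custom.summary}\n")
With the improved summaries in hand, we can now run the full clustering pipeline using the new prompt.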
from kura import (
generate_base_clusters_from_conversation_summaries,
reduce_clusters_from_base_clusters,
reduce_dimensionality_from_clusters,
CheckpointManager,
)
from kura.cluster import ClusterDescriptionModel
from kura.meta_cluster import MetaClusterModel
from kura.dimensionality import HDBUMAP
async def analyze_conversations(conversations, checkpoint_manager):
    # Set up models
    summary_model = SummaryModel()
    cluster_model = ClusterDescriptionModel()
    meta_cluster_model = MetaClusterModel(max_clusters=4)
    dimensionality_model = HDBUMAP()

    # Run pipeline steps
    summaries = await summarise_conversations(
        conversations, model=summary_model, checkpoint_manager=checkpoint_manager, prompt=WNB_SUMMARY_PROMPT
    )
    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
    )
    reduced_clusters = await reduce_clusters_from_base_clusters(
        clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
    )
    projected = await reduce_dimensionality_from_clusters(
        reduced_clusters,
        model=dimensionality_model,
        checkpoint_manager=checkpoint_manager,
    )
    return projected

checkpoint_manager = CheckpointManager(f"{CHECKPOINT_DIRECTORY}", enabled=True)
checkpoint_manager.save_checkpoint("conversations.jsonl", conversations)
clusters = await analyze_conversations(
    conversations, checkpoint_manager=checkpoint_manager
)
# Get top-level clusters (those without parents)
parent_clusters = [cluster for cluster in clusters if cluster.parent_id is None]
# Format each cluster's info with name, description and number of chats
formatted_clusters = []
for parent in parent_clusters:
    # Add parent cluster info
    cluster_info = f"[bold]({parent.id}) {parent.name}[/bold] : {parent.description} : {len(parent.chat_ids)}\n"

    # Get and format child clusters
    child_clusters = [c for c in clusters if c.parent_id == parent.id]
    for child in child_clusters:
        cluster_info += f"\n • [bold]{child.name}[/bold] : {child.description} : {len(child.chat_ids)}"
        child_child_clusters = [c for c in clusters if c.parent_id == child.id]
        for child_child in child_child_clusters:
            if child_child.parent_id == child.id:
                cluster_info += f"\n + [bold]{child_child.name}[/bold] : {child_child.description} : {len(child_child.chat_ids)}"

    cluster_info += "\n\n"
    formatted_clusters.append(cluster_info)
    formatted_clusters.append("\n====\n")
# Join with newlines and print
rprint("\n\n".join(formatted_clusters))
(3943254bfb5a471385aeadcc745e478b) Manage audio files and dataset versioning : Users organized audio files for classification analysis and managed dataset versioning in W&B. They focused on tasks such as loading metadata and updating datasets for effective machine learning workflows. : 37

 • Organize audio files for classification analysis : The user prepared and structured audio data for classification tasks. This included tasks like loading, merging, and synchronizing metadata for effective analysis. : 2
   + Prepare audio data for classification analysis : The user processed and organized audio files for effective classification. This involved loading, merging DataFrames, and synchronizing metadata for analysis. : 2
 • Assist with dataset versioning in W&B : Users sought help with managing dataset versioning and artifacts in W&B. They focused on logging, tracking, and updating datasets for improved reproducibility in machine learning workflows. : 35
   + Help me manage dataset versioning in W&B : Users sought assistance with handling dataset versioning and artifacts in W&B. They specifically focused on logging, tracking, and updating datasets for better reproducibility in their machine learning workflows. : 35

====

(92d75e32975c4651a2ab46ccb8a83fd9) Optimize machine learning experiments and resources : Users explored methods to optimize and visualize machine learning experiments through Weights & Biases. They focused on hyperparameter tuning, multi-GPU training, and maximizing the tool's utilization for improved performance and efficiency. : 449

 • Optimize and visualize machine learning experiments with W&B : Users explored methods to log and visualize machine learning experiments using Weights & Biases. They focused on customizing visualizations, handling data formats, and enhancing tracking for better analysis. : 195
   + Explore data visualization with W&B Tables : Users investigated how to visualize diverse media types using wandb.Table. They aimed to manage and analyze different data formats effectively within the W&B Tables interface. : 4
   + Help me log and visualize experiments in W&B : Users requested guidance on visualizing and logging various aspects of machine learning experiments using Weights & Biases (W&B). They focused on customizing visualizations, tracking metrics, managing prompt configurations, and handling multi-class confusion matrices. : 71
   + Enhance machine learning experiment tracking with W&B : Users optimized and troubleshot experiment tracking with Weights & Biases, focusing on installation, configuration, and logging. They explored core features to improve ML workflows, ensuring better collaboration and data analysis. : 120
 • Assist with hyperparameter optimization using Weights & Biases : Users requested help with optimizing hyperparameters through Weights & Biases. They sought guidance on configuring sweeps, automating processes, and analyzing results for better performance. : 60
   + Help optimize hyperparameter sweeps using W&B : Users requested assistance in programmatically accessing and analyzing hyperparameter optimization results from W&B sweeps. They sought guidance on optimizing configurations for efficient parallel training across multiple GPUs and CPUs. : 23
   + Guide hyperparameter optimization using Weights & Biases : Users sought assistance with hyperparameter tuning using Weights & Biases and related tools. They requested support for configuring sweeps, automating processes, and troubleshooting to enhance performance. : 37
 • Assist with multi-GPU training and optimization : Users explored methods for effective multi-GPU distributed training using HuggingFace while seeking optimization of GPU resources during model training. They focused on job management, script arguments, and resource monitoring to enhance training efficiency. : 9
   + Guide multi-GPU distributed training with HuggingFace : Users requested help on setting up multi-GPU distributed training using HuggingFace Accelerate. They focused on launching jobs and managing script arguments across various hardware configurations. : 3
   + Optimize GPU resources for model training : Users sought assistance in optimizing GPU usage and memory during machine learning training. They emphasized the integration of W&B for monitoring and enhancing the efficiency of training processes. : 6
 • Assist in maximizing Weights & Biases utilization : Users sought guidance on using Weights & Biases to manage experiments and data effectively. They aimed to optimize their machine learning workflows through understanding its features, configurations, and best practices. : 185
   + Guide effective use of Weights & Biases : Users sought assistance in leveraging Weights & Biases for managing experiments and data effectively. They aimed to enhance their machine learning workflows by understanding features, configurations, and best practices related to experiment tracking, data manipulation, and artifact management. : 185

====

(126ad2b1c8054e7b93d0dc652841623c) Enhance machine learning project collaboration with W&B : Users integrated Weights & Biases into machine learning workflows while enhancing team collaboration features. They focused on optimizing tracking, management settings, and collaborative capabilities for effective project execution. : 74

 • Integrate W&B with machine learning frameworks : Users received help integrating W&B tracking into their machine learning workflows, specifically with PyTorch and TensorFlow. They also learned to manage W&B settings and SageMaker configurations for secure deployments and effective tracking. : 56
   + Integrate W&B tracking into ML workflows : Users received assistance in integrating W&B experiment tracking into their machine learning workflows using both PyTorch and TensorFlow. They learned to log metrics, hyperparameters, and artifacts effectively throughout the model training process. : 19
   + Assist with W&B and SageMaker setup tasks : Users needed help configuring sharing settings for W&B reports and managing API keys. They also sought assistance in setting up secure IAM roles for SageMaker to ensure safe deployment and access management. : 37
 • Optimize team collaboration and management in W&B : Users explored ways to enhance collaboration and management features in Weights & Biases. They focused on configuring roles, permissions, and various enterprise capabilities for improved project outcomes. : 18
   + Enhance team collaboration in Weights & Biases : Users sought to optimize collaborative features in Weights & Biases for better project management. They focused on configuring roles, permissions, and team settings to improve collaboration outcomes. : 10
   + Help with team management and enterprise features : Users sought assistance in managing team roles and permissions within W&B, including inquiries about enterprise features specific to W&B Server. They explored topics such as access control, secure storage, and distinctions in user management across membership tiers. : 8

====
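Before drawing conclusions, it can help to spot-check a cluster against the raw queries behind it. The short sketch below (our addition, not part of the original notebook) joins a cluster's chat_ids back to the source data loaded earlier and prints a few example queries.
# Optional spot check: inspect a few raw queries behind a top-level cluster
query_by_id = {obj["query_id"]: obj["query"] for obj in conversations_raw}

target = next(c for c in clusters if c.parent_id is None)  # pick any top-level cluster
print(target.name)
for chat_id in list(target.chat_ids)[:5]:
    print(f"- {query_by_id.get(chat_id, '<query not found>')}")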
Conclusion¶
What You Learned¶
In this notebook, you learned how to create domain-specific summarization models that dramatically improve clustering quality. You discovered how to:
- Create custom summary models using specialized prompts tailored to your domain
- Replace generic descriptions with precise, feature-specific summaries
- Configure clustering parameters to achieve optimal grouping results
- Compare clustering outcomes between default and custom approaches
What We Accomplished¶
We addressed the key limitation of our initial clustering with a new domain-specific prompt focused on W&B features and user intentions. This transformed our clustering results from generic topic groups into highly actionable categories:
- Optimize machine learning experiments and resources (449 conversations) - The largest cluster covering users exploring experiment tracking, hyperparameter optimization, multi-GPU training, and maximizing W&B utilization for improved ML performance
- Enhance machine learning project collaboration with W&B (74 conversations) - Users integrating W&B with PyTorch/TensorFlow workflows and optimizing team collaboration features including roles, permissions, and enterprise capabilities
- Manage audio files and dataset versioning (37 conversations) - Users organizing audio data for classification analysis and managing dataset versioning workflows in W&B
This represents a significant upgrade from our previous clusters, providing much more specific and actionable information about user needs. The improved summaries eliminated the vagueness of descriptions like "user seeks information about tracking" and replaced them with precise insights about specific W&B workflows, optimization goals, and collaboration requirements.
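As a quick follow-up (a small sketch we've added, assuming the clusters variable returned by analyze_conversations above is still in scope), you can compute what share of all conversations each top-level cluster accounts for.
# Share of conversations per top-level cluster
top_level = [c for c in clusters if c.parent_id is None]
total = sum(len(c.chat_ids) for c in top_level)

for cluster in sorted(top_level, key=lambda c: len(c.chat_ids), reverse=True):
    share = 100 * len(cluster.chat_ids) / total
    print(f"{cluster.name}: {len(cluster.chat_ids)} conversations ({share:.0f}%)")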
Next: Building Production Classifiers¶
While our improved clustering gives us deep insights into historical query patterns, we need a way to act on these insights in real-time production environments. In the next notebook, "Classifiers", we'll bridge the gap between discovery and action by:
- Building production-ready classifiers using the instructor library to achieve ~90% accuracy through systematic prompt engineering
- Creating automated labeling workflows with weak supervision to efficiently generate labeled datasets for training
- Focusing on three high-impact categories - artifacts (20% of queries), integrations (15%), and visualizations (14%) - that account for almost 50% of all user conversations
- Applying classifiers at scale to understand true query distributions and identify exactly where to focus improvement efforts
This classifier will enable you to automatically categorize incoming queries in real-time, detect production drift when certain query types surge, and intelligently route questions to specialized retrieval systems. More importantly, you'll move from "we think users struggle with X" to "we know 20% of users need help with artifacts, 15% with integrations, and 14% with visualizations—and we can automatically detect and route these queries for specialized handling."