Building Production Query Classifiers for RAG Systems¶
Series Overview: This is the final notebook in our three-part series on systematically analyzing and improving RAG systems. We've discovered patterns and improved clustering—now we'll build production-ready classifiers to act on these insights.
Prerequisites: Complete both "1. Cluster Conversations" and "2. Better Summaries" notebooks first. You'll need the `instructor` and `instructor_classify` libraries installed, plus the labeled dataset from our clustering analysis.
# # Install kura in Google Colab
# !pip install kura
# !pip install instructor
# !pip install instructor git+https://github.com/jxnl/instructor-classify.git
# # Make sure you've set up your `OPENAI_API_KEY`
# # os.environ['OPENAI_API_KEY'] = <your api key here>
# import os
# from google.colab import userdata
# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
# Create data directory and download dataset
DATA_DIRECTORY = "./data"
CHECKPOINT_DIRECTORY = "./checkpoints"
!mkdir -p {DATA_DIRECTORY}
!curl -o {DATA_DIRECTORY}/conversations.json https://usekura.xyz/assets/conversations.json
!curl -o {DATA_DIRECTORY}/labels.jsonl https://usekura.xyz/assets/labels.jsonl
From Reactive to Proactive RAG Systems¶
Most RAG systems improve reactively—waiting for user complaints or noticing obvious failures. This series shows you how to build proactively improving systems that identify problems before users complain and prioritize fixes based on systematic analysis rather than the loudest feedback.
By the end of this notebook, you'll have moved from "we think users struggle with X" to "we know 20% of users need help with artifacts, 17% with integrations, and 11% with visualizations—and we can automatically detect and route these queries for specialized handling."
What You'll Learn¶
In this notebook, you'll discover how to:
Generate Weak Labels and Create a Golden Dataset
- Create an initial classifier using the instructor-classify framework
- Generate preliminary labels for your conversation dataset
- Use app.py to review and correct weak labels for a high-quality labeled dataset
Iteratively Improve Classification Accuracy
- Start with a simple baseline classifier
- Enhance performance with few-shot examples and system prompts
- Measure improvements using confusion matrices and accuracy metrics
Analyze Query Distribution in Your Dataset
- Apply your optimized classifier to the full dataset
- Understand the prevalence of different query types
- Identify high-impact areas for RAG system improvements
Rather than trying to replicate all the nuanced clusters from our topic modeling, we'll focus on three high-impact categories that emerged from our analysis, plus a catch-all:
- Artifacts - Questions about creating, versioning, and managing W&B artifacts
- Integrations - Questions about integrating W&B with specific libraries and frameworks
- Visualisations - Questions about creating charts, dashboards, and visual analysis
- Other - General queries that don't fit the specialized categories above
By the end of this notebook, you'll have moved from "we discovered these patterns exist" to "we can automatically detect and act on these patterns in production."
Defining Our Classifier¶
Our topic modeling revealed several distinct clusters of user queries, with three major topics accounting for the majority of questions:
- Users seeking help with experiment tracking and metrics logging
- Users trying to manage artifacts and data versioning
- Users needing assistance with integrations and deployment
In this notebook, we'll show how to build a classifier that can identify queries about creating, managing, and versioning Weights & Biases artifacts, questions about integrations, and questions about visualising data that's been logged.
- First, we'll define a simple classifier using `instructor-classify` that takes in a query and document pair and outputs a suggested category for it
- Then, we'll see a few examples of the `instructor-classify` library in action
- Lastly, we'll generate a set of initial weak labels using this simple classifier before exporting them to a file for manual annotation using our `app.py` file.

Let's get started with the `instructor-classify` library.
from instructor_classify.schema import LabelDefinition, ClassificationDefinition
artifact_label = LabelDefinition(
label="artifact",
description="This is a user query and document pair which is about creating, versioning and managing weights and biases artifacts.",
)
integrations_label = LabelDefinition(
label="integrations",
description="this is a user query and document pair which is concerned with how we can integrate weights and biases with specific libraries",
)
visualisation_label = LabelDefinition(
label="visualisation",
description="This is a user query and document pair which is concerned about how we can visualise the data that we've logged with weights and biases",
)
other_label = LabelDefinition(
label="other",
description="Use this label for other query types which don't belong to any of the other defined categories that you have been provided with",
)
classification_def = ClassificationDefinition(
system_message="You're an expert at classifying a user and document pair. Look closely at the user query and determine what the query is about and how the document helps answer it. Then classify it according to the label(s) above. Classify irrelevant ones as Other",
label_definitions=[
artifact_label,
other_label,
visualisation_label,
integrations_label,
],
)
This structure makes it easy to define multiple categories in a way that's clear to both humans and LLMs. It provides explicit definitions of what each category means, making it easier for the model to make accurate predictions.
We also support exporting this configuration to a `.yaml` format, which makes it easy to collaborate with domain experts.
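As a minimal sketch of what that export could look like, assuming `ClassificationDefinition` is a Pydantic model (check the instructor-classify docs for a built-in exporter), you could serialise the definition with PyYAML:

```python
# Sketch: export the classification definition to YAML for review by domain experts.
# Assumes ClassificationDefinition is a Pydantic model exposing `model_dump()`;
# the filename "classification_def.yaml" is just an example.
import yaml

with open("classification_def.yaml", "w") as f:
    yaml.safe_dump(classification_def.model_dump(), f, sort_keys=False)
```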
A Simple Example¶
Let's now see how `instructor-classify` works under the hood. We'll do so by passing in a couple of sample queries and seeing how our classifier handles these test cases.
import instructor
from instructor_classify.classify import Classifier
from openai import OpenAI
client = instructor.from_openai(OpenAI())
classifier = (
Classifier(classification_def).with_client(client).with_model("gpt-4.1-mini")
)
# Make a prediction
result = classifier.predict("How do I version a weights and biases artifact?")
print(f"Classification: {result}") # Should output "artifact";
result_2 = classifier.predict("What is the square root of 9?")
print(f"Classification: {result_2}") # Should output "not_artifact"
Classification: label='artifact'
Classification: label='other'
`instructor-classify` exposes a `batch_predict` function which parallelises this operation for us so that we can run evaluations efficiently over large datasets. Let's see it in action below with some test cases.
tests = [
"How do I version a weights and biases artifact?",
"What is the square root of 9?",
"How do I integrate weights and biases with pytorch?",
"What are some best practices when using wandb?",
"How can I visualise my training runs?",
]
labels = ["artifact", "other", "integrations", "other", "visualisation"]
results = classifier.batch_predict(tests)
for query, result, label in zip(tests, results, labels):
print(f"Query: {query}\nClassification: {result}\nExpected: {label}\n")
Query: How do I version a weights and biases artifact?
Classification: label='artifact'
Expected: artifact

Query: What is the square root of 9?
Classification: label='other'
Expected: other

Query: How do I integrate weights and biases with pytorch?
Classification: label='integrations'
Expected: integrations

Query: What are some best practices when using wandb?
Classification: label='other'
Expected: other

Query: How can I visualise my training runs?
Classification: label='visualisation'
Expected: visualisation
# Generate weak labels for the dataset with our baseline classifier,
# then export them to a JSONL file for manual review in app.py
import json
with open("./data/conversations.json") as f:
conversations_raw = json.load(f)
texts = [
{
"query": item["query"],
"matching_document": item["matching_document"],
"query_id": item["query_id"],
}
for item in conversations_raw
]
results = classifier.batch_predict(texts[:110])
with open("./data/generated.jsonl", "w+") as f:
for item, result in zip(conversations_raw, results):
f.write(
json.dumps(
{
"query": item["query"],
"matching_document": item["matching_document"],
"query_id": item["query_id"],
"labels": result.label,
}
)
+ "\n"
)
Evaluating Our Classifier¶
We've manually labelled a dataset of approximately 100 items with their respective labels ahead of time. If you'd like to label more items, we've provided an `app.py` file which you can download using the command below.
!curl -o ./app.py https://usekura.xyz/assets/app.py
We'll be splitting this into a test and validation split. We'll use the `validation` split to iterate on our prompt and experiment with different few-shot examples before using the `test` split to validate our classifier's performance.
We'll be using a 70-30 split with 70% of our data used for validation and 30% used for testing our final classifier.
import json
import random
with open("./data/labels.jsonl") as f:
conversations_labels = [json.loads(line) for line in f]
# Set random seed for reproducibility
random.seed(42)
# Shuffle the data
random.shuffle(conversations_labels)
# Calculate split index
split_idx = int(len(conversations_labels) * 0.7)
# Split into validation and test sets
val_set = conversations_labels[:split_idx]
test_set = conversations_labels[split_idx:]
print(f"Validation set size: {len(val_set)}")
print(f"Test set size: {len(test_set)}")
Validation set size: 77
Test set size: 33
Determining a baseline¶
Let's now calculate a baseline and see how well our initial classification model performs.
val_text = [
f"<query>: {item['query']}</query>\n <corpus>{item['matching_document']}</corpus>"
for item in val_set
]
val_labels = [item["label"] for item in val_set]
test_text = [
f"<query>{item['query']}</query>\n <corpus>{item['matching_document']}</corpus>"
for item in test_set
]
test_labels = [item["label"] for item in test_set]
Let's now define a function which runs the classifier on the validation set and the test set to see our initial starting point. We'll look at some of the failure cases and then iteratively improve our classifier.
from sklearn.metrics import confusion_matrix
from instructor_classify.classify import Classifier
import instructor
def predict_and_evaluate(classifier: Classifier, texts: list[str], labels: list[str]):
predictions = classifier.batch_predict(texts)
pred_labels = [p.label for p in predictions]
return {
"accuracy": sum(pred == label for pred, label in zip(pred_labels, labels))
/ len(predictions),
"queries": texts,
"labels": labels,
"predictions": pred_labels,
}
model_name = "gpt-4o-mini-2024-07-18"
client = instructor.from_provider("openai/gpt-4o-mini-2024-07-18")
classifier = Classifier(classification_def).with_client(client).with_model(model_name)
predictions = predict_and_evaluate(classifier, val_text, val_labels)
predictions["accuracy"]
0.5454545454545454
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Get unique labels
unique_labels = ["artifact", "other", "visualisation", "integrations"]
# Convert predictions and true labels to label indices
y_true = [unique_labels.index(label) for label in predictions["labels"]]
y_pred = [unique_labels.index(label) for label in predictions["predictions"]]
# Calculate single confusion matrix for all categories
conf_matrix = confusion_matrix(y_true, y_pred)
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
confusion_matrix=conf_matrix, display_labels=unique_labels
)
disp.plot(ax=ax)  # draw into the figure created above instead of a new one
plt.title("Confusion Matrix for All Categories")
plt.tight_layout()
plt.show()
[Confusion matrix plot for all categories]
Let's now see how it performs when we run it on our test set.
test_predictions = predict_and_evaluate(classifier, test_text, test_labels)
test_predictions["accuracy"]
0.6060606060606061
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Get unique labels
unique_labels = ["artifact", "other", "visualisation", "integrations"]
# Convert predictions and true labels to label indices
y_true = [unique_labels.index(label) for label in test_predictions["labels"]]
y_pred = [unique_labels.index(label) for label in test_predictions["predictions"]]
# Calculate single confusion matrix for all categories
conf_matrix = confusion_matrix(y_true, y_pred)
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
confusion_matrix=conf_matrix, display_labels=unique_labels
)
disp.plot(ax=ax)  # draw into the figure created above instead of a new one
plt.title("Confusion Matrix for All Categories")
plt.tight_layout()
plt.show()
[Confusion matrix plot for all categories]
Looking at Edge Cases¶
Let's now print out some of the errors that our model made in classifying our user queries
for prediction, label, query in zip(
test_predictions["predictions"],
test_predictions["labels"],
test_predictions["queries"],
):
if label != prediction:
print(f"Label: {label}")
print(f"Prediction: {prediction}")
print(f"Query:\n{query}")
print("=====")
Label: other Prediction: artifact Query: <query>Version control of datasets in machine learning projects</query> <corpus>## Version Control in Machine Learning ### Data Version Control * Data preprocessing (data content changes) such as data cleaning, outlier handling, filling of missing values, etc. * Feature engineering (data becomes "wider") such as aggregation features, label encoding, scaling, etc. * Dataset splits (data is partitioned) typically mean dividing your data into training, validation, and testing data. * Dataset update (data becomes "longer") when new data points are available.</corpus> ===== Label: other Prediction: artifact Query: <query>can you provide a bit more clarity on the difference between setting `resume` in `wandb.init` to `allow` vs. `auto`? I guess the difference has to do with whether the previous run crashed or not. I guess if the run didn't crash, `auto` may overwrite if there's matching `id`?</query> <corpus>| `resume` | (bool, str, optional) Sets the resuming behavior. Options: `"allow"`, `"must"`, `"never"`, `"auto"` or `None`. Defaults to `None`. Cases: - `None` (default): If the new run has the same ID as a previous run, this run overwrites that data. - `"auto"` (or `True`): if the previous run on this machine crashed, automatically resume it. Otherwise, start a new run. - `"allow"`: if id is set with `init(id="UNIQUE_ID")` or `WANDB_RUN_ID="UNIQUE_ID"` and it is identical to a previous run, wandb will automatically resume the run with that id. Otherwise, wandb will start a new run. - `"never"`: if id is set with `init(id="UNIQUE_ID")` or `WANDB_RUN_ID="UNIQUE_ID"` and it is identical to a previous run, wandb will crash. - `"must"`: if id is set with `init(id="UNIQUE_ID")` or `WANDB_RUN_ID="UNIQUE_ID"` and it is identical to a previous run, wandb will automatically resume the run with the id. Otherwise, wandb will crash. See our guide to resuming runs for more. | | `reinit` | (bool, optional) Allow multiple `wandb.init()` calls in the same process. (default: `False`) | | `magic` | (bool, dict, or str, optional) The bool controls whether we try to auto-instrument your script, capturing basic details of your run without you having to add more wandb code. (default: `False`) You can also pass a dict, json string, or yaml filename. | | `config_exclude_keys` | (list, optional) string keys to exclude from `wandb.config`. | | `config_include_keys` | (list, optional) string keys to include in `wandb.config`. |</corpus> ===== Label: other Prediction: artifact Query: <query>Weights & Biases features</query> <corpus>## Weights & Biases overview W&B is a platform that helps data scientists track their models, datasets, system information and more. With a few lines of code, you can start tracking everything about these features. It's free for personal use. Team use is normally a paid utility, but teams for academic purposes are free. You can use W&B with your favourite framework, like TensorFlow, Keras, PyTorch, Sklearn, fastai and many others. All tracking information is sent to a dedicated project page on the W&B UI, where you can open high quality visualizations, aggregate information and compare models or parameters. One of the advantages of remotely storing the experiment’s information is that it is easy to collaborate on the same project and share the results with your teammates. 
W&B provides 4 useful tools: * Dashboard: Experiment tracking * Artifacts: Dataset versioning, model versioning * Sweeps: Hyperparameter optimization * Reports: Save and share reproducible findings</corpus> ===== Label: other Prediction: artifact Query: <query>Tracking and comparing LLM experiments in Weights & Biases</query> <corpus>### Using Weights & Biases to track experiments Experimenting with prompts, function calling and response model schema is critical to get good results. As LLM Engineers, we will be methodical and use Weights & Biases to track our experiments. Here are a few things you should consider logging: 1. Save input and output pairs for later analysis 2. Save the JSON schema for the response\_model 3. Having snapshots of the model and data allow us to compare results over time, and as we make changes to the model we can see how the results change. This is particularly useful when we might want to blend a mix of synthetic and real data to evaluate our model. We will use the `wandb` library to track our experiments and save the results to a dashboard.</corpus> ===== Label: other Prediction: artifact Query: <query>Best practices for managing data artifacts in W&B</query> <corpus>## Storage If you are approaching or exceeding your storage limit, there are multiple paths forward to manage your data. The path that's best for you will depend on your account type and your current project setup. ## Manage storage consumption W&B offers different methods of optimizing your storage consumption: * Use reference artifacts to track files saved outside the W&B system, instead of uploading them to W&B storage. * Use an external cloud storage bucket for storage. *(Enterprise only)* ## Delete data You can also choose to delete data to remain under your storage limit. There are several ways to do this: * Delete data interactively with the app UI. * Set a TTL policy on Artifacts so they are automatically deleted.</corpus> ===== Label: other Prediction: artifact Query: <query>how to define W&B sweep in YAML</query> <corpus>## Add W&B to your code #### Training script with W&B Python SDK To create a W&B Sweep, we first create a YAML configuration file. The configuration file contains he hyperparameters we want the sweep to explore. In the proceeding example, the batch size (`batch_size`), epochs (`epochs`), and the learning rate (`lr`) hyperparameters are varied during each sweep. ``` # config.yaml program: train.py method: random name: sweep metric: goal: maximize name: val\_acc parameters: batch\_size: values: [16,32,64] lr: min: 0.0001 max: 0.1 epochs: values: [5, 10, 15] ``` For more information on how to create a W&B Sweep configuration, see Define sweep configuration. Note that you must provide the name of your Python script for the `program` key in your YAML file. Next, we add the following to the code example:</corpus> ===== Label: other Prediction: artifact Query: <query>How to configure a sweep in Weights & Biases without a YAML file?</query> <corpus>## Sweep configuration structure ### Basic structure Both sweep configuration format options (YAML and Python dictionary) utilize key-value pairs and nested structures. Use top-level keys within your sweep configuration to define qualities of your sweep search such as the name of the sweep (`name` key), the parameters to search through (`parameters` key), the methodology to search the parameter space (`method` key), and more. 
For example, the proceeding code snippets show the same sweep configuration defined within a YAML file and within a Python dictionary. Within the sweep configuration there are five top level keys specified: `program`, `name`, `method`, `metric` and `parameters`. {label: 'CLI', value: 'cli'}, {label: 'Python script or Jupyter notebook', value: 'script'}, ]}> Define a sweep in a Python dictionary data structure if you define training algorithm in a Python script or Jupyter notebook. The proceeding code snippet stores a sweep configuration in a variable named `sweep_configuration`:</corpus> ===== Label: other Prediction: integrations Query: <query>logging distributed training wandb</query> <corpus>## Experiments FAQ #### How can I use wandb with multiprocessing, e.g. distributed training? If your training program uses multiple processes you will need to structure your program to avoid making wandb method calls from processes where you did not run `wandb.init()`.\ \ There are several approaches to managing multiprocess training: 1. Call `wandb.init` in all your processes, using the group keyword argument to define a shared group. Each process will have its own wandb run and the UI will group the training processes together. 2. Call `wandb.init` from just one process and pass data to be logged over multiprocessing queues. :::info Check out the Distributed Training Guide for more detail on these two approaches, including code examples with Torch DDP. :::</corpus> ===== Label: other Prediction: artifact Query: <query>What does setting the 'resume' parameter to 'allow' do in wandb.init?</query> <corpus>## Resume Runs #### Resume Guidance ##### Automatic and controlled resuming Automatic resuming only works if the process is restarted on top of the same filesystem as the failed process. If you can't share a filesystem, we allow you to set the `WANDB_RUN_ID`: a globally unique string (per project) corresponding to a single run of your script. It must be no longer than 64 characters. All non-word characters will be converted to dashes. ``` # store this id to use it later when resuming id = wandb.util.generate\_id() wandb.init(id=id, resume="allow") # or via environment variables os.environ["WANDB\_RESUME"] = "allow" os.environ["WANDB\_RUN\_ID"] = wandb.util.generate\_id() wandb.init() ``` If you set `WANDB_RESUME` equal to `"allow"`, you can always set `WANDB_RUN_ID` to a unique string and restarts of the process will be handled automatically. If you set `WANDB_RESUME` equal to `"must"`, W&B will throw an error if the run to be resumed does not exist yet instead of auto-creating a new run.</corpus> ===== Label: other Prediction: artifact Query: <query>wandb.init() code saving</query> <corpus>--- ## displayed\_sidebar: default # Code Saving By default, we only save the latest git commit hash. You can turn on more code features to compare the code between your experiments dynamically in the UI. Starting with `wandb` version 0.8.28, we can save the code from your main training file where you call `wandb.init()`. This will get sync'd to the dashboard and show up in a tab on the run page, as well as the Code Comparer panel. Go to your settings page to enable code saving by default. ## Save Library Code When code saving is enabled, wandb will save the code from the file that called `wandb.init()`. 
To save additional library code, you have two options: * Call `wandb.run.log_code(".")` after calling `wandb.init()` * Pass a settings object to `wandb.init` with code\_dir set: `wandb.init(settings=wandb.Settings(code_dir="."))` ## Code Comparer ## Jupyter Session History ## Jupyter diffing</corpus> ===== Label: other Prediction: artifact Query: <query>How does wandb.save function and what are its use cases?</query> <corpus>## Save your machine learning model * Use wandb.save(filename). * Put a file in the wandb run directory, and it will get uploaded at the end of the run. If you want to sync files as they're being written, you can specify a filename or glob in wandb.save. Here's how you can do this in just a few lines of code. See [this colab](https://colab.research.google.com/drive/1pVlV6Ua4C695jVbLoG-wtc50wZ9OOjnC) for a complete example. ``` # "model.h5" is saved in wandb.run.dir & will be uploaded at the end of training model.save(os.path.join(wandb.run.dir, "model.h5")) # Save a model file manually from the current directory: wandb.save('model.h5') # Save all files that currently exist containing the substring "ckpt": wandb.save('../logs/*ckpt*') # Save any files starting with "checkpoint" as they're written to: wandb.save(os.path.join(wandb.run.dir, "checkpoint*")) ```</corpus> ===== Label: other Prediction: artifact Query: <query>model registry W&B</query> <corpus>## Register models ### Model Registry After logging a bunch of checkpoints across multiple runs during experimentation, now comes time to hand-off the best checkpoint to the next stage of the workflow (e.g. testing, deployment). The Model Registry is a central page that lives above individual W&B projects. It houses **Registered Models**, portfolios that store "links" to the valuable checkpoints living in individual W&B Projects. The model registry offers a centralized place to house the best checkpoints for all your model tasks. Any `model` artifact you log can be "linked" to a Registered Model. ### Creating **Registered Models** and Linking through the UI #### 1. Access your team's model registry by going the team page and selecting `Model Registry` #### 2. Create a new Registered Model. #### 3. Go to the artifacts tab of the project that holds all your model checkpoints #### 4. Click "Link to Registry" for the model artifact version you want. ### Creating Registered Models and Linking through the **API**</corpus> ===== Label: other Prediction: artifact Query: <query>best practices for tracking experiments in Weights & Biases</query> <corpus>## Create an Experiment ### Best Practices The following are some suggested guidelines to consider when you create experiments: 1. **Config**: Track hyperparameters, architecture, dataset, and anything else you'd like to use to reproduce your model. These will show up in columns— use config columns to group, sort, and filter runs dynamically in the app. 2. **Project**: A project is a set of experiments you can compare together. Each project gets a dedicated dashboard page, and you can easily turn on and off different groups of runs to compare different model versions. 3. **Notes**: A quick commit message to yourself. The note can be set from your script. You can edit notes at a later time on the Overview section of your project's dashboard on the W&B App. 4. **Tags**: Identify baseline runs and favorite runs. You can filter runs using tags. You can edit tags at a later time on the Overview section of your project's dashboard on the W&B App.</corpus> =====
When examining our confusion matrices in detail, we observe a consistent pattern of misclassification in the "other" category: our classifier frequently assigns these queries to one of our specialised categories.

Looking at the classification errors, we can identify two recurring patterns:
- Context Confusion: The model tends to ignore the user's specific question and instead gets distracted by the retrieved document. If a document contains specific details about, say, artifacts, the model assigns that label even when the user's question is a general one.
- Over-Eagerness: The model tends to prefer assigning specialised categories rather than the more general "other" category, even when evidence is limited. This results in false positives for our specialised categories.
To address these issues, we'll need to carefully craft our prompts to help the model better distinguish between general W&B functionality and specific feature categories.
By combining improved system prompts with strategically selected few-shot examples, we can guide the model to pay closer attention to the user's actual intent rather than being misled by terminology in the retrieved documents.
Our next steps will focus on implementing these improvements and measuring their impact on classification accuracy, particularly for the challenging "other" category where most of our errors occur.
Improving Our Classifier¶
Our baseline classifier achieved only about 55% accuracy on the validation set, and the confusion matrices revealed significant challenges with the "other" category.
To address these issues, we'll take a systematic approach to enhancement:
- Refining system prompts to provide clearer boundaries between categories and explicitly instruct the model on how to handle ambiguous cases
- Adding few-shot examples that demonstrate the correct handling of edge cases, particularly for general queries that mention specialized terms
Let's get started and see how to do so.
System Prompt¶
The first improvement we'll implement is a more precise system prompt. Our error analysis showed that the model frequently misclassifies general queries as specialized categories when the retrieved document mentions features like artifacts or visualisations.
By providing explicit instructions about how to prioritize the user's query over the retrieved document and establishing clearer category boundaries, we can help the model make more accurate distinctions. We'll also provide a clear description of what each category represents.
import instructor
from instructor_classify.classify import Classifier
from instructor_classify.schema import LabelDefinition, ClassificationDefinition
from openai import OpenAI
client = instructor.from_openai(OpenAI())
artifact_label = LabelDefinition(
label="artifact",
description="This is a user query about how to manage, version and track artifacts with weights and biases",
)
integrations_label = LabelDefinition(
label="integrations",
description="This is a user query about how to use weights and biases with specific software libraries or platforms",
)
visualisation_label = LabelDefinition(
label="visualisation",
description="This is a user query about how to use weights and biases to visualise and track the data that they have collected collected",
)
other_label = LabelDefinition(
label="other",
description="This should be used as a general label for any query that does not exactly fit into any of the other three categories above",
)
SYSTEM_PROMPT = """
You're going to be provided with a query and corpus. Look closely and understand what the query is about and how the document is relevant to the query. Make sure to only consider the parts that are relevant to answering the user's specific question.
Here are the different categories that you should consider. If you're unsure of a category, choose other.
1. Integrations : This should be for queries about connecting W&B with external libraries/frameworks
2. Visualisations : This should be for queries about creating custom charts, data visualisation tools, plotting using W&B
3. Artifact : This should be for queries that deal with W&B artifact's system for dataset/model versioning. This does not include general file saving, logging, general model/experiment tracking or storage of data
4. Others - This is a generic category for questions that aren't captured by the categories above. When in doubt, always default to "other" for general feature questions
"""
classification_def_w_system_prompt = ClassificationDefinition(
system_message=SYSTEM_PROMPT,
label_definitions=[
artifact_label,
other_label,
visualisation_label,
integrations_label,
],
)
client = instructor.from_openai(OpenAI())
classifier_v2 = (
Classifier(classification_def_w_system_prompt)
.with_client(client)
.with_model(model_name)
)
predictions_system_prompt = predict_and_evaluate(classifier_v2, val_text, val_labels)
predictions_system_prompt["accuracy"]
0.7792207792207793
test_predictions_system_prompt = predict_and_evaluate(
classifier_v2, test_text, test_labels
)
test_predictions_system_prompt["accuracy"]
0.8181818181818182
Let's now see how our model performs by using a confusion matrix
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Get unique labels
unique_labels = ["artifact", "other", "visualisation", "integrations"]
# Convert predictions and true labels to label indices
y_true = [unique_labels.index(label) for label in predictions_system_prompt["labels"]]
y_pred = [
unique_labels.index(label) for label in predictions_system_prompt["predictions"]
]
# Calculate single confusion matrix for all categories
conf_matrix = confusion_matrix(y_true, y_pred)
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
confusion_matrix=conf_matrix, display_labels=unique_labels
)
disp.plot(ax=ax)  # draw into the figure created above instead of a new one
plt.title("Confusion Matrix for All Categories")
plt.tight_layout()
plt.show()
[Confusion matrix plot for all categories]
With this new prompt, validation accuracy has improved from 54.5% to 77.9%. One major remaining issue is correctly classifying queries that belong to the `other` category. Let's visualise some of these queries.
for prediction, label, query in zip(
predictions_system_prompt["predictions"],
predictions_system_prompt["labels"],
predictions_system_prompt["queries"],
):
if label != prediction:
print(f"Label: {label}")
print(f"Prediction: {prediction}")
print(f"Query: {query}")
print("=====")
Label: other Prediction: integrations Query: <query>: Bayesian optimization</query> <corpus>## Methods for Automated Hyperparameter Optimization ### Bayesian Optimization Bayesian optimization is a hyperparameter tuning technique that uses a surrogate function to determine the next set of hyperparameters to evaluate. In contrast to grid search and random search, Bayesian optimization is an informed search method. ### Inputs * A set of hyperparameters you want to optimize * A continuous search space for each hyperparameter as a value range * A performance metric to optimize * Explicit number of runs: Because the search space is continuous, you must manually stop the search or define a maximum number of runs. The differences in grid search are highlighted in bold above. A popular way to implement Bayesian optimization in Python is to use BayesianOptimization from the [bayes_opt](https://github.com/fmfn/BayesianOptimization) library. Alternatively, as shown below, you can set up Bayesian optimization for hyperparameter tuning with W&B. ### Steps ### Output ### Advantages ### Disadvantages</corpus> ===== Label: other Prediction: visualisation Query: <query>: Examples of logging images in Wandb</query> <corpus>## Log Media & Objects ### Images ``` images = wandb.Image(image\_array, caption="Top: Output, Bottom: Input") wandb.log({"examples": images}) ``` We assume the image is gray scale if the last dimension is 1, RGB if it's 3, and RGBA if it's 4. If the array contains floats, we convert them to integers between `0` and `255`. If you want to normalize your images differently, you can specify the `mode` manually or just supply a `PIL.Image`, as described in the "Logging PIL Images" tab of this panel. For full control over the conversion of arrays to images, construct the `PIL.Image` yourself and provide it directly. ``` images = [PIL.Image.fromarray(image) for image in image\_array] wandb.log({"examples": [wandb.Image(image) for image in images]}) ``` For even more control, create images however you like, save them to disk, and provide a filepath. ``` im = PIL.fromarray(...) rgb\_im = im.convert("RGB") rgb\_im.save("myimage.jpg") wandb.log({"example": wandb.Image("myimage.jpg")}) ```</corpus> ===== Label: other Prediction: integrations Query: <query>: WandB API examples for sweeps</query> <corpus>"""A set of runs associated with a sweep. Examples: Instantiate with: ``` api = wandb.Api() sweep = api.sweep(path/to/sweep) ``` Attributes: runs: (`Runs`) list of runs id: (str) sweep id project: (str) name of project config: (str) dictionary of sweep configuration state: (str) the state of the sweep expected_run_count: (int) number of expected runs for the sweep """</corpus> ===== Label: other Prediction: integrations Query: <query>: How to structure Weights & Biases runs for hyperparameter tuning?</query> <corpus>## Whats Next? Hyperparameters with Sweeps We tried out two different hyperparameter settings by hand. You can use Weights & Biases Sweeps to automate hyperparameter testing and explore the space of possible models and optimization strategies. ## Check out Hyperparameter Optimization in TensorFlow uisng W&B Sweep $\rightarrow$ Running a hyperparameter sweep with Weights & Biases is very easy. There are just 3 simple steps: 1. **Define the sweep:** We do this by creating a dictionary or a YAML file that specifies the parameters to search through, the search strategy, the optimization metric et all. 2. **Initialize the sweep:** `sweep_id = wandb.sweep(sweep_config)` 3. 
**Run the sweep agent:** `wandb.agent(sweep_id, function=train)` And voila! That's all there is to running a hyperparameter sweep! In the notebook below, we'll walk through these 3 steps in more detail.</corpus> ===== Label: other Prediction: integrations Query: <query>: How to implement AWS IAM authentication in SageMaker training jobs?</query> <corpus>## Set up for SageMaker ### Prerequisites 1. **Setup SageMaker in your AWS account.** See the SageMaker Developer guide for more information. 2. **Create an Amazon ECR repository** to store images you want to execute on Amazon SageMaker. See the Amazon ECR documentation for more information. 3. **Create an Amazon S3 buckets** to store SageMaker inputs and outputs for your SageMaker training jobs. See the Amazon S3 documentation for more information. Make note of the S3 bucket URI and directory. 4. **Create IAM execution role.** The role used in the SageMaker training job requires the following permissions to work. These permissions allow for logging events, pulling from ECR, and interacting with input and output buckets. (Note: if you already have this role for SageMaker training jobs, you do not need to create it again.) IAM role policy ``` { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:PutMetricData", "logs:CreateLogStream", "logs:PutLogEvents", "logs:CreateLogGroup", "logs:DescribeLogStreams", "ecr:GetAuthorizationToken" ], "Resource": "\*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::<input-bucket>" ] }, { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::<input-bucket>/<object>", "arn:aws:s3:::<output-bucket>/<path>" ] }, { "Effect": "Allow", "Action": [ "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage" ], "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repo>" } ] } ```</corpus> ===== Label: other Prediction: integrations Query: <query>: What are the common issues when logging distributed training with wandb?</query> <corpus>## Train model with DDP The preceding image demonstrates the W&B App UI dashboard. On the sidebar we see two experiments. One labeled 'null' and a second (bound by a yellow box) called 'DPP'. If you expand the group (select the Group dropdown) you will see the W&B Runs that are associated to that experiment. ### Use W&B Service to avoid common distributed training issues. There are two common issues you might encounter when using W&B and distributed training: 1. **Hanging at the beginning of training** - A `wandb` process can hang if the `wandb` multiprocessing interferes with the multiprocessing from distributed training. 2. **Hanging at the end of training** - A training job might hang if the `wandb` process does not know when it needs to exit. Call the `wandb.finish()` API at the end of your Python script to tell W&B that the Run finished. The wandb.finish() API will finish uploading data and will cause W&B to exit. ### Enable W&B Service ### Example use cases for multiprocessing</corpus> ===== Label: other Prediction: integrations Query: <query>: Are there any best practices for using wandb in a distributed training environment?</query> <corpus>## Add wandb to Any Library #### Distributed Training For frameworks supporting distributed environments, you can adapt any of the following workflows: * Detect which is the “main” process and only use `wandb` there. Any required data coming from other processes must be routed to the main process first. 
(This workflow is encouraged). * Call `wandb` in every process and auto-group them by giving them all the same unique `group` name See Log Distributed Training Experiments for more details</corpus> ===== Label: other Prediction: integrations Query: <query>: How to use IAM roles with SageMaker for training job access control?</query> <corpus>## Set up for SageMaker ### Prerequisites 1. **Setup SageMaker in your AWS account.** See the SageMaker Developer guide for more information. 2. **Create an Amazon ECR repository** to store images you want to execute on Amazon SageMaker. See the Amazon ECR documentation for more information. 3. **Create an Amazon S3 buckets** to store SageMaker inputs and outputs for your SageMaker training jobs. See the Amazon S3 documentation for more information. Make note of the S3 bucket URI and directory. 4. **Create IAM execution role.** The role used in the SageMaker training job requires the following permissions to work. These permissions allow for logging events, pulling from ECR, and interacting with input and output buckets. (Note: if you already have this role for SageMaker training jobs, you do not need to create it again.) IAM role policy ``` { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:PutMetricData", "logs:CreateLogStream", "logs:PutLogEvents", "logs:CreateLogGroup", "logs:DescribeLogStreams", "ecr:GetAuthorizationToken" ], "Resource": "\*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::<input-bucket>" ] }, { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::<input-bucket>/<object>", "arn:aws:s3:::<output-bucket>/<path>" ] }, { "Effect": "Allow", "Action": [ "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage" ], "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repo>" } ] } ```</corpus> ===== Label: other Prediction: integrations Query: <query>: wandb setup</query> <corpus>## 🚀 Setup Start out by installing the experiment tracking library and setting up your free W&B account: 1. Install with `!pip install` 2. `import` the library into Python 3. `.login()` so you can log metrics to your projects If you've never used Weights & Biases before, the call to `login` will give you a link to sign up for an account. W&B is free to use for personal and academic projects! ``` !pip install wandb -Uq ``` ``` import wandb ``` ``` wandb.login() ```</corpus> ===== Label: other Prediction: integrations Query: <query>: Weights & Biases features for LLM developers</query> <corpus>**Weights & Biases Prompts** is a suite of LLMOps tools built for the development of LLM-powered applications. Use W&B Prompts to visualize and inspect the execution flow of your LLMs, analyze the inputs and outputs of your LLMs, view the intermediate results and securely store and manage your prompts and LLM chain configurations. 
#### 🪄 View Prompts In Action **In this notebook we will demostrate W&B Prompts:** * Using our 1-line LangChain integration * Using our Trace class when building your own LLM Pipelines See here for the full W&B Prompts documentation ## Installation ``` !pip install "wandb>=0.15.4" -qqq !pip install "langchain>=0.0.218" openai -qqq ``` ``` import langchain assert langchain.__version__ >= "0.0.218", "Please ensure you are using LangChain v0.0.188 or higher" ``` ## Setup This demo requires that you have an OpenAI key # W&B Prompts W&B Prompts consists of three main components: **Trace table**: Overview of the inputs and outputs of a chain.</corpus> ===== Label: other Prediction: artifact Query: <query>: log prompts wandb</query> <corpus>def log_index(vector_store_dir: str, run: "wandb.run"): """Log a vector store to wandb Args: vector_store_dir (str): The directory containing the vector store to log run (wandb.run): The wandb run to log the artifact to. """ index_artifact = wandb.Artifact(name="vector_store", type="search_index") index_artifact.add_dir(vector_store_dir) run.log_artifact(index_artifact) def log_prompt(prompt: dict, run: "wandb.run"): """Log a prompt to wandb Args: prompt (str): The prompt to log run (wandb.run): The wandb run to log the artifact to. """ prompt_artifact = wandb.Artifact(name="chat_prompt", type="prompt") with prompt_artifact.new_file("prompt.json") as f: f.write(json.dumps(prompt)) run.log_artifact(prompt_artifact)</corpus> ===== Label: other Prediction: artifact Query: <query>: W&B artifacts</query> <corpus>## Register models ### Log Data and Model Checkpoints as Artifacts W&B Artifacts allows you to track and version arbitrary serialized data (e.g. datasets, model checkpoints, evaluation results). When you create an artifact, you give it a name and a type, and that artifact is forever linked to the experimental system of record. If the underlying data changes, and you log that data asset again, W&B will automatically create new versions through checksummming its contents. W&B Artifacts can be thought of as a lightweight abstraction layer on top of shared unstructured file systems. ### Anatomy of an artifact The `Artifact` class will correspond to an entry in the W&B Artifact registry. The artifact has \* a name \* a type \* metadata \* description \* files, directory of files, or references Example usage: ``` run = wandb.init(project="my-project") artifact = wandb.Artifact(name="my\_artifact", type="data") artifact.add\_file("/path/to/my/file.txt") run.log\_artifact(artifact) run.finish() ```</corpus> ===== Label: other Prediction: integrations Query: <query>: building LLM-powered apps with W&B</query> <corpus>## Prompts for LLMs W&B Prompts is a suite of LLMOps tools built for the development of LLM-powered applications. Use W&B Prompts to visualize and inspect the execution flow of your LLMs, analyze the inputs and outputs of your LLMs, view the intermediate results and securely store and manage your prompts and LLM chain configurations. ## Use Cases W&B Prompts provides several solutions for building and monitoring LLM-based apps. Software developers, prompt engineers, ML practitioners, data scientists, and other stakeholders working with LLMs need cutting-edge tools to: * Explore and debug LLM chains and prompts with greater granularity. * Monitor and observe LLMs to better understand and evaluate performance, usage, and budgets. ## Products ### Traces W&B’s LLM tool is called *Traces*. 
**Traces** allow you to track and visualize the inputs and outputs, execution flow, model architecture, and any intermediate results of your LLM chains. ### Weave ### How it works ## Integrations</corpus> ===== Label: other Prediction: visualisation Query: <query>: Weights & Biases dashboard features</query> <corpus>## Tutorial ### Dashboards Now we can look at the results. The run we have executed is now shown on the left side, in our project, with the group and experiment names we listed. We have access to a lot of information that W&B has automatically recorded. We have several sections like: * Charts - contains information about losses, accuracy, etc. Also, it contains some examples from our data. * System - contains system load information: memory usage, CPU utilization, GPU temp, etc. This is very useful information because you can control the usage of your GPU and choose the optimal batch size. * Model - contains information about our model structure (graph). * Logs - include Keras default logging. * Files - contains all files that were created during the experiment, such as: config, best model, output logs, requirements, etc. The requirements file is very important because, in order to recreate a specific experiment, you need to install specific versions of the libraries.</corpus> ===== Label: other Prediction: integrations Query: <query>: Securing AWS SageMaker training jobs with IAM roles</query> <corpus>## Set up for SageMaker ### Prerequisites 1. **Setup SageMaker in your AWS account.** See the SageMaker Developer guide for more information. 2. **Create an Amazon ECR repository** to store images you want to execute on Amazon SageMaker. See the Amazon ECR documentation for more information. 3. **Create an Amazon S3 buckets** to store SageMaker inputs and outputs for your SageMaker training jobs. See the Amazon S3 documentation for more information. Make note of the S3 bucket URI and directory. 4. **Create IAM execution role.** The role used in the SageMaker training job requires the following permissions to work. These permissions allow for logging events, pulling from ECR, and interacting with input and output buckets. (Note: if you already have this role for SageMaker training jobs, you do not need to create it again.) IAM role policy ``` { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "cloudwatch:PutMetricData", "logs:CreateLogStream", "logs:PutLogEvents", "logs:CreateLogGroup", "logs:DescribeLogStreams", "ecr:GetAuthorizationToken" ], "Resource": "\*" }, { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::<input-bucket>" ] }, { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::<input-bucket>/<object>", "arn:aws:s3:::<output-bucket>/<path>" ] }, { "Effect": "Allow", "Action": [ "ecr:BatchCheckLayerAvailability", "ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage" ], "Resource": "arn:aws:ecr:<region>:<account-id>:repository/<repo>" } ] } ```</corpus> ===== Label: other Prediction: artifact Query: <query>: Weights & Biases model registry tutorial</query> <corpus>## Model registry ### How it works 1. **Log a model version**: In your training script, add a few lines of code to save the model files as an artifact to W&B. 2. **Compare performance**: Check live charts to compare the metrics and sample predictions from model training and validation. Identify which model version performed the best. 3. 
**Link to registry**: Bookmark the best model version by linking it to a registered model, either programmatically in Python or interactively in the W&B UI. The following code snippet demonstrates how to log and link a model to the Model Registry: ```python showLineNumbers import wandb import random # Start a new W&B run run = wandb.init(project="models\_quickstart") # Simulate logging model metrics run.log({"acc": random.random()}) # Create a simulated model file with open("my\_model.h5", "w") as f: f.write("Model: " + str(random.random())) # Log and link the model to the Model Registry run.link\_model(path="./my\_model.h5", registered\_model\_name="MNIST") run.finish() ```</corpus> ===== Label: artifact Prediction: other Query: <query>: experiment tracking</query> <corpus>## Track Experiments ### How it works Track a machine learning experiment with a few lines of code: 1. Create a W&B run. 2. Store a dictionary of hyperparameters, such as learning rate or model type, into your configuration (`wandb.config`). 3. Log metrics (`wandb.log()`) over time in a training loop, such as accuracy and loss. 4. Save outputs of a run, like the model weights or a table of predictions. The proceeding pseudocode demonstrates a common W&B Experiment tracking workflow: ```python showLineNumbers # 1. Start a W&B Run wandb.init(entity="", project="my-project-name") # 2. Save mode inputs and hyperparameters wandb.config.learning\_rate = 0.01 # Import model and data model, dataloader = get\_model(), get\_data() # Model training code goes here # 3. Log metrics over time to visualize performance wandb.log({"loss": loss}) # 4. Log an artifact to W&B wandb.log\_artifact(model) ```</corpus> =====
Few Shot Examples¶
Building on our improved system prompt, we'll now add few-shot examples to our classifier. Few-shot examples provide concrete demonstrations of how to handle tricky edge cases, teaching the model through specific instances rather than abstract rules. This approach is particularly effective for resolving the context confusion and over-eagerness issues we identified in our error analysis.
For each label category, we've carefully selected examples that illustrate:
- Clear positive cases that should be assigned to that category
- Negative cases that might seem related but actually belong elsewhere
For example, we include queries that were previously misclassified as integrations (e.g. those about AWS IAM) as positive examples for other, since these are about authorization rather than W&B integrations.
import instructor
from instructor_classify.schema import (
LabelDefinition,
ClassificationDefinition,
Examples,
)
from openai import OpenAI
client = instructor.from_openai(OpenAI())
artifact_label = LabelDefinition(
label="artifact",
description="This is a user query about how to manage, version and track artifacts with weights and biases",
examples=Examples(
examples_negative=["how do I use the model registry?"],
examples_positive=[
"how can I save versions of my code and models using artifacts across different runs?"
],
),
)
integrations_label = LabelDefinition(
label="integrations",
description="This is a user query about how to use weights and biases with specific software libraries or platforms",
examples=Examples(
examples_negative=[
"how do I run a hyperparameter sweep with W&B"
],
examples_positive=["How can I use w&B with langchain?"],
),
)
visualisation_label = LabelDefinition(
label="visualisation",
description="This is a user query about how to use weights and biases to visualise and track the data that they have collected collected",
)
other_label = LabelDefinition(
label="other",
description="This should be used as a general label for any query that does not exactly fit into any of the other three categories above",
examples=Examples(
examples_positive=[
"can I log images with weights and biases?",
"what optimization methods does W&B support (e.g., Bayesian, grid search)?",
"how do I run a hyperparameter sweep with W&B",
"How to implement AWS IAM authentication",
"distributed wandb usage"
],
),
)
classification_def_w_system_prompt_and_examples = ClassificationDefinition(
system_message=SYSTEM_PROMPT,
label_definitions=[
artifact_label,
other_label,
visualisation_label,
integrations_label,
],
)
classifier_v3 = (
Classifier(classification_def_w_system_prompt_and_examples)
.with_client(client)
.with_model(model_name)
)
predictions_system_prompt_and_examples = predict_and_evaluate(
classifier_v3, val_text, val_labels
)
predictions_system_prompt_and_examples["accuracy"]
0.8961038961038961
for prediction, label, query in zip(
predictions_system_prompt_and_examples["predictions"],
predictions_system_prompt_and_examples["labels"],
predictions_system_prompt_and_examples["queries"],
):
if label != prediction and prediction == "integrations":
print(f"Label: {label}")
print(f"Prediction: {prediction}")
print("## Query")
print(f"{query}")
print("=====")
Label: other Prediction: integrations ## Query <query>: Weights & Biases features for LLM developers</query> <corpus>**Weights & Biases Prompts** is a suite of LLMOps tools built for the development of LLM-powered applications. Use W&B Prompts to visualize and inspect the execution flow of your LLMs, analyze the inputs and outputs of your LLMs, view the intermediate results and securely store and manage your prompts and LLM chain configurations. #### 🪄 View Prompts In Action **In this notebook we will demostrate W&B Prompts:** * Using our 1-line LangChain integration * Using our Trace class when building your own LLM Pipelines See here for the full W&B Prompts documentation ## Installation ``` !pip install "wandb>=0.15.4" -qqq !pip install "langchain>=0.0.218" openai -qqq ``` ``` import langchain assert langchain.__version__ >= "0.0.218", "Please ensure you are using LangChain v0.0.188 or higher" ``` ## Setup This demo requires that you have an OpenAI key # W&B Prompts W&B Prompts consists of three main components: **Trace table**: Overview of the inputs and outputs of a chain.</corpus> ===== Label: other Prediction: integrations ## Query <query>: building LLM-powered apps with W&B</query> <corpus>## Prompts for LLMs W&B Prompts is a suite of LLMOps tools built for the development of LLM-powered applications. Use W&B Prompts to visualize and inspect the execution flow of your LLMs, analyze the inputs and outputs of your LLMs, view the intermediate results and securely store and manage your prompts and LLM chain configurations. ## Use Cases W&B Prompts provides several solutions for building and monitoring LLM-based apps. Software developers, prompt engineers, ML practitioners, data scientists, and other stakeholders working with LLMs need cutting-edge tools to: * Explore and debug LLM chains and prompts with greater granularity. * Monitor and observe LLMs to better understand and evaluate performance, usage, and budgets. ## Products ### Traces W&B’s LLM tool is called *Traces*. **Traces** allow you to track and visualize the inputs and outputs, execution flow, model architecture, and any intermediate results of your LLM chains. ### Weave ### How it works ## Integrations</corpus> =====
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Get unique labels
unique_labels = ["artifact", "other", "visualisation", "integrations"]
# Convert predictions and true labels to label indices
y_true = [
unique_labels.index(label)
for label in predictions_system_prompt_and_examples["labels"]
]
y_pred = [
unique_labels.index(label)
for label in predictions_system_prompt_and_examples["predictions"]
]
# Calculate single confusion matrix for all categories
conf_matrix = confusion_matrix(y_true, y_pred)
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
confusion_matrix=conf_matrix, display_labels=unique_labels
)
disp.plot(ax=ax)  # draw into the figure created above instead of a new one
plt.title("Confusion Matrix for All Categories")
plt.tight_layout()
plt.show()
[Confusion matrix plot for all categories]
Let's now see the performance of this classifier on the test set
predictions_system_prompt_and_examples = predict_and_evaluate(
classifier_v3, test_text, test_labels
)
predictions_system_prompt_and_examples["accuracy"]
0.8787878787878788
Performance Evolution: From Baseline to Optimized Classifier¶
Let's examine how systematic changes to our prompting strategy transformed our classifier's performance:
| Prompt | Baseline | System Only | System + Examples |
|---|---|---|---|
| Validation accuracy | 54.5% | 77.9% (+42.9% relative) | 89.6% (+64.3% relative) |
These improvements demonstrate the power of thoughtful prompt engineering. Adding a clear system prompt yielded a 42.9% relative improvement over the baseline. Further enhancing it with carefully selected few-shot examples brought the cumulative gain to 64.3% relative to the baseline, for a final validation accuracy of 89.6%. The trend held on our holdout test set, where we achieved 87.9% accuracy compared to the baseline's 60.6%.
This consistency between validation and test performance suggests our improvements are robust and generalizable.
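For reference, here's a small sketch of how the relative improvements above can be recomputed from the measured accuracies (the counts are inferred from the 77-item validation set and the accuracies reported earlier):

```python
# Relative improvement over the baseline on the 77-item validation set.
baseline = 42 / 77             # 54.5% — baseline classifier
system_only = 60 / 77          # 77.9% — improved system prompt
system_examples = 69 / 77      # 89.6% — system prompt + few-shot examples

for name, acc in [("System Only", system_only), ("System + Examples", system_examples)]:
    rel = (acc - baseline) / baseline
    print(f"{name}: {acc:.1%} absolute, {rel:+.1%} relative to baseline")
# System Only: 77.9% absolute, +42.9% relative to baseline
# System + Examples: 89.6% absolute, +64.3% relative to baseline
```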
Application¶
Now that we've built and validated a classifier with almost 90% accuracy, we can confidently apply it to our entire dataset to understand the true distribution of user queries. This isn't just an academic exercise - it's a powerful tool for product development.
with open("./data/conversations.json") as f:
conversations_full = json.load(f)
dataset_texts = [
f"<query>{item['query']}</query>\n<corpus>{item['matching_document']}</corpus>"
for item in conversations_full
]
dataset_labels = classifier_v3.batch_predict(dataset_texts)
from collections import Counter
Counter([item.label for item in dataset_labels])
Counter({'other': 292, 'artifact': 112, 'integrations': 94, 'visualisation': 62})
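The percentages cited in the conclusion below can be computed directly from these counts:

```python
# Convert raw label counts into the share of the full dataset.
label_counts = Counter(item.label for item in dataset_labels)
total = sum(label_counts.values())  # 560 conversations
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / total:.0%})")
# other: 292 (52%)
# artifact: 112 (20%)
# integrations: 94 (17%)
# visualisation: 62 (11%)
```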
Conclusion¶
In this final notebook, you built a production-ready query classifier through systematic prompt engineering, completing our three-part methodology for RAG system improvement.
Key Achievements¶
- 89.6% validation accuracy and 87.9% test accuracy, a 64.3% relative improvement over the baseline, achieved through systematic prompt engineering
- Clear query distribution insights: 20% artifacts (112/560), 17% integrations (94/560), 11% visualizations (62/560), 52% general queries (292/560) across 560 conversations
- Production-ready methodology: From discovery (clustering) → validation (summaries) → monitoring (classification) → targeted improvements
Production Impact¶
This classifier transforms reactive customer support into proactive system improvement. With nearly half of all queries (48%) falling into three specific categories, you can now focus engineering resources where they'll have maximum impact. Instead of guessing what users struggle with, you have concrete data showing that artifact workflows, framework integrations, and visualization tools are your highest-priority improvement areas.
What's Next¶
You now have the foundation for advanced RAG capabilities: specialized retrieval pipelines for each category, intelligent query routing to domain experts, real-time monitoring of user needs, and data-driven prioritization of features and documentation. The key insight from this series is that RAG improvement isn't about better models alone—it's about systematically understanding your users and building focused solutions for their actual needs.
This approach transforms vague user feedback into actionable intelligence, enabling teams to measure improvement impact with precision and build more effective systems based on real usage patterns rather than assumptions.