Evaluating Kura's Clustering Ability Using Synthetic Data
Over the weekend, I spent some time evaluating Kura's clustering ability on synthetic data. When tested against synthetically generated technical conversations, Kura identified base clusters that align with our original category distribution with over 95% accuracy, and it also discovered more nuanced groupings that reflect real-world technical divisions and use cases.
In this article, we'll walk through how we generated a diverse dataset of ~190 user conversations and then evaluated Kura's clustering against it. The findings demonstrate that language model-assisted clustering can identify natural conversation patterns while also validating synthetic data generation approaches.
Generating Synthetic Data
I constructed a dataset of 190 user conversations using a multi-step process; you can access the dataset on Hugging Face here. To keep the data diverse but controlled, we introduced variation at each level through a systematic approach that involved three steps.
After defining broad categories of user conversations to form an initial distribution, we then:
- Used a language model to generate more nuanced subcategories within each category. These subcategories were manually annotated to make sure that they were distinct and meaningful.
- Varied the conversation length, tone, and goal for a specific user to generate an initial user prompt.
- Used gemini-1.5-flash to generate conversations based on these prompts, further varying the number of turns and the length of user and assistant responses to create even more diverse conversations.
Let's see how this worked in practice.
Creating Specific Subcategories
I started with three broad conversation categories:
- Technical Development (42%)
- Data Analysis & Visualization (21%)
- Content Creation (37%).
The goal was to expand these into more specific technical subcategories through an iterative process. Using a language model, I generated initial subcategories and manually filtered them through a Streamlit app. These validated examples then served as context for generating more specific variations, helping the model understand what made a good subcategory while ensuring distinctness.
Here's how we implemented the generation process:
import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel


class Subcategory(BaseModel):
    name: str
    description: str


# An instructor-patched async client; this is what gets passed in as `client` below
client = instructor.from_openai(AsyncOpenAI())


async def generate_prompt_with_examples(client, sem, category, description, category_to_subcategories):
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; the original run doesn't specify which model was used here
            messages=[
                {
                    "role": "system",
                    # instructor renders the Jinja placeholders below from the `context` argument
                    "content": """
                    Generate a new specific subcategory given this category:

                    <category>
                    <name>{{ category }}</name>
                    <description>{{ description }}</description>
                    </category>

                    Add specific technical requirements:
                    - Required tools/software
                    - Language requirements
                    - Industry terminology

                    Reference but don't duplicate these examples:
                    <examples>
                    {% for example in examples %}
                    <example>
                    <name>{{ example.name }}</name>
                    <description>{{ example.description }}</description>
                    </example>
                    {% endfor %}
                    </examples>
                    """,
                }
            ],
            response_model=Subcategory,
            context={
                "examples": category_to_subcategories[category],
                "category": category,
                "description": description,
            },
        )
    return {
        "parent_category": category,
        "parent_description": description,
        "subcategory_name": resp.name,
        "subcategory_description": resp.description,
    }
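The manual filtering mentioned earlier happened in a Streamlit app. The original app isn't reproduced here, but a minimal review loop along these lines is enough; the file names and the accepted_subcategories.jsonl output path are illustrative:

import json

import streamlit as st

# Hypothetical review app: step through generated subcategories one at a time
# and append the accepted ones to a JSONL file of validated examples.
with open("generated_subcategories.jsonl") as f:
    candidates = [json.loads(line) for line in f]

idx = st.number_input("Candidate", min_value=0, max_value=len(candidates) - 1, step=1)
candidate = candidates[idx]

st.subheader(candidate["subcategory_name"])
st.write(candidate["subcategory_description"])
st.caption(f"Parent category: {candidate['parent_category']}")

if st.button("Accept"):
    with open("accepted_subcategories.jsonl", "a") as f:
        f.write(json.dumps(candidate) + "\n")
    st.success("Saved")

The accepted entries were then fed back into category_to_subcategories, so each round of generation had progressively better examples to imitate.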
For example, within Technical Development, we expanded into subcategories like:
- Python Django Backend Development: This subcategory focused on backend systems using Python and Django, including API design, database modeling, and performance tuning, requiring familiarity with RESTful principles and testing frameworks.
- Java Backend Security Development: This subcategory focused on secure backend systems using Java and Spring Boot, emphasizing RESTful API design, data persistence with PostgreSQL, and robust authentication/authorization mechanisms, requiring knowledge of software security best practices, including OWASP guidelines, and experience with unit and integration testing.
Content Creation similarly expanded into specific areas like:
- Marketing Content with SEO Focus: This subcategory focused on creating marketing content, including writing website copy, brochures, and social media posts, using persuasive language and SEO techniques.
- Technical Documentation with Markup Language: This subcategory focused on creating technical documentation, such as API references, user manuals, and troubleshooting guides, using Markdown and reStructuredText formats.
Each subcategory added technical depth through specific tools and technologies, required expertise levels and skills, and common use cases and challenges. This granular specification helped ensure generated conversations contained realistic technical discussions.
Generating Diverse User Prompts
After establishing our categories and subcategories, we generated synthetic conversation starters that felt natural and technically specific. We built a prompt generation system that varied key elements like user intent, length, and writing style to create realistic diversity.
We varied each parameter (length, style, and the user's goal) to simulate a broad range of realistic conversations. This ensures that the resulting prompts aren't too narrow or repetitive, making the test data more credible for evaluating the flexibility of the clustering.
import random

from pydantic import BaseModel


class UserPrompt(BaseModel):
    reasoning: str
    prompt: str


async def generate_prompts(client, sem, subcategory_name, subcategory_description):
    # Randomly vary the user's goal, message length, and writing style so the
    # generated prompts don't all read the same
    goal = random.choice([
        "answer a question",
        "explore different potential choices/options",
        "brainstorm ideas",
    ])
    length = random.choice(["short", "1-3 sentences", "1 paragraph", "concise and brief"])
    style = random.choice(["technical", "creative", "curt with the occasional spelling mistake"])

    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; the original run doesn't specify which model was used here
            messages=[
                {
                    "role": "system",
                    "content": """
                    Generate a message that a user might write to {{ goal }}.
                    This should be {{ length }} and {{ style }}. Include specific details
                    like products used and prior attempts.

                    <subcategory>
                    <name>{{ subcategory_name }}</name>
                    <description>{{ subcategory_description }}</description>
                    </subcategory>
                    """,
                }
            ],
            response_model=UserPrompt,
            context={
                "goal": goal,
                "length": length,
                "style": style,
                "subcategory_name": subcategory_name,
                "subcategory_description": subcategory_description,
            },
        )
    return {
        "prompt": resp.prompt,
        "subcategory": subcategory_name,
        "subcategory_description": subcategory_description,
    }
This generated a range of realistic user prompts. For instance, given the Flutter App Development subcategory, one generated prompt read:
I'm trying to build a mobile app using Flutter, but I'm having trouble figuring out the best way to manage the app's state. I've tried using Provider but it feels overly complicated for my needs. Are there any other state management solutions that would be better suited for a beginner?
We generated about 10 prompts per subcategory, varying elements like technical depth, prior attempts made, and specific technologies mentioned.
This approach leveraged the technical specificity we'd built into our subcategories - a prompt about Django backend development naturally included different details than one about frontend React components, creating organic variation in the synthetic conversations.
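Fanning out those ~10 prompts per subcategory was just a matter of running generate_prompts concurrently over every subcategory. The original driver isn't shown here, but a sketch along these lines (with an illustrative semaphore limit) captures the idea:

import asyncio

async def generate_all_prompts(client, subcategories, prompts_per_subcategory=10):
    # Cap concurrent API calls; the limit of 10 here is arbitrary
    sem = asyncio.Semaphore(10)
    tasks = [
        generate_prompts(client, sem, sub["subcategory_name"], sub["subcategory_description"])
        for sub in subcategories
        for _ in range(prompts_per_subcategory)
    ]
    return await asyncio.gather(*tasks)

# prompts = asyncio.run(generate_all_prompts(client, subcategories))

Here subcategories is the list of records returned by generate_prompt_with_examples earlier, which is why the dictionary keys line up.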
Simulating Multi-turn Conversations
Lastly, we simulated multi-turn conversations by varying the number of turns and the length of user and assistant responses. This created natural dialogue progressions while preserving domain-specific language and concepts.
This was done with the following function. Notably, we used the base google.generativeai library to generate the content instead of instructor, giving the model maximum flexibility over the responses and content it could produce.
import random


async def _generate_conversation(model, messages, current_role: str):
    # Vary the target length of each turn so responses don't look uniform
    desired_length = random.choice(["1-2 sentences", "short", "medium", "concise"])
    # _messages_to_xml (not shown) serialises the running history into an XML block
    xml_messages = _messages_to_xml(messages)

    if current_role == "assistant":
        # The assistant spoke last, so generate the user's follow-up turn
        prompt = f"""
        <prompt>
        Based off the following messages, generate a hypothetical user response that would be a good follow up to the previous message. The response should be {desired_length} and be consistent in terms of style, tone and content with the previous messages.
        </prompt>
        {xml_messages}
        """
        resp = await model.generate_content_async(prompt)
        return {"role": "user", "content": resp.text}
    else:
        # The user spoke last, so generate the assistant's reply
        prompt = f"""
        <prompt>
        You're an assistant that responds to user messages. Make sure that your latest response is a {desired_length} response that is consistent in terms of style, tone and content with the previous messages.
        </prompt>
        {xml_messages}
        """
        resp = await model.generate_content_async(prompt)
        return {"role": "assistant", "content": resp.text}
We then used this process to generate 190 synthetic conversations in total for evaluating Kura.
Kura's Clustering Performance
After applying Kura to the synthetic dataset, we observed two major benefits from its clustering capabilities. First, the base clusters closely aligned with our original categories, allowing us to validate that the core grouping was working. Second, Kura naturally discovered new patterns that offered additional insight into how technical conversations overlap.
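For context, the clustering step itself is short. The sketch below follows the basic usage from Kura's README at the time of writing (Kura, Conversation, and cluster_conversations); the load_synthetic_conversations helper is hypothetical and stands in for however you convert the generated message lists into Kura's Conversation objects:

import asyncio

from kura import Kura
from kura.types import Conversation

# Hypothetical helper: converts our generated message lists into Conversation objects
conversations: list[Conversation] = load_synthetic_conversations()

kura = Kura()
asyncio.run(kura.cluster_conversations(conversations))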
Base Category Reconstruction
At the foundational level, Kura reconstructed our original categories with nearly perfect accuracy:
| Category | Subcategory | Conversation Count |
| --- | --- | --- |
| Technical Development (82 Conversations) | Debug and design REST and Spring Boot APIs | 19 |
| | Develop Go microservices using gRPC, Protobuf, Docker, and Kubernetes | 10 |
| | Debug frontend and backend application errors | 5 |
| | Improve React TypeScript application architecture | 7 |
| | Improve Flutter mobile app development | 9 |
| | Troubleshoot CI/CD pipeline issues | 10 |
| | Optimize data pipelines using Spark, Kafka, and cloud services | 14 |
| | Optimize cloud data warehouse processes | 8 |
| Data Analysis (40 Conversations) | Debug R analysis and visualisation scripts | 8 |
| | Assist with data visualization and wrangling using R, SQL, and Tableau | 6 |
| | Analyze data using Pandas and Matplotlib | 10 |
| | Build financial models using spreadsheet software | 10 |
| | Visualise sales data using dashboards | 6 |
| Content Creation (68 Conversations) | Generate SEO-focused marketing content | 29 |
| | Generate YouTube video scripts | 13 |
| | Generate YouTube video scripts | 7 |
| | Structure white papers and case studies on technical topics | 8 |
| | Generate API documentation using reStructuredText and Sphinx | 6 |
| | Improve technical documentation formatting and cross-referencing | 5 |
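One simple way to quantify this alignment is cluster purity: let each recovered cluster vote for the original category that contributes most of its conversations, then count the fraction of conversations that match their cluster's vote. A quick sketch:

from collections import Counter

def cluster_purity(assignments: list[tuple[str, str]]) -> float:
    # assignments holds one (recovered_cluster, original_category) pair per conversation
    by_cluster: dict[str, Counter] = {}
    for cluster, category in assignments:
        by_cluster.setdefault(cluster, Counter())[category] += 1
    # Each cluster's majority category counts as "correct" for its members
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(assignments)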
This precise reconstruction validates Kura's ability to identify fundamental conversation patterns. However, the more interesting insights emerged when Kura began combining these base clusters into higher-level groupings.
Semantic Reorganization
When analyzing higher-level clusters, Kura discovered groupings that often made more semantic sense than our original categories.
| Category | Subcategory | Conversation Count |
| --- | --- | --- |
| Develop and debug software applications (36.32%) | Debug and develop REST APIs and Go microservices | 29 (15.26%) |
| | Debug and improve React TypeScript application | 12 (6.32%) |
| | Improve Flutter mobile app development | 9 (4.74%) |
| | Structure technical documentation using reStructuredText and Sphinx | 19 (10.00%) |
| Generate marketing and YouTube video content (25.79%) | Generate SEO-focused marketing content | 29 (15.26%) |
| | Generate YouTube video scripts | 20 (10.53%) |
| Troubleshoot CI/CD pipeline issues (5.26%) | Troubleshoot CI/CD pipeline issues | 10 (5.26%) |
| Optimize data pipelines and visualize data (27.37%) | Optimize data pipelines and cloud data warehouses | 22 (11.58%) |
| | Debug and visualize data using R, Pandas, Matplotlib, SQL, and Tableau | 30 (15.79%) |
| Build financial models in spreadsheet software (5.26%) | Build financial models in spreadsheet software | 10 (5.26%) |
We can see that in this case:
- Our original data analysis and visualisation conversations were split: 10 ended up in the financial models cluster, while the other 30 were grouped together with the data pipeline and cloud data warehouse questions that users had asked.
- Content Creation conversations that had centered around reStructuredText and Sphinx were grouped together with the software development conversations. This makes logical sense, since the content itself is technical.
These new semantic clusters are interesting and, in the context of an application, might even enable more targeted tooling. For instance, if we saw a large volume of users optimising data pipelines and visualising data, we might invest more time in building tools that help those users better visualise their data and get code snippets targeted to their specific use case.
Implications for Conversation Analysis
These results validate Kura's effectiveness in two important ways:
- Ground Truth Validation: Kura's ability to reconstruct our intended category distribution confirms its effectiveness at identifying fundamental conversation patterns. The high alignment with our synthetic data structure demonstrates robust clustering capabilities.
- Pattern Discovery: The emergence of meaningful subclusters suggests Kura can identify natural conversation groupings that go beyond simple categorization. This capability is particularly valuable for understanding how different technical domains interact and overlap in practice.
Conclusion
Our evaluation demonstrates that Kura can both validate synthetic data generation approaches and discover meaningful conversation patterns. The combination of accurate base category reconstruction with insightful pattern discovery suggests Kura is a powerful tool for analyzing technical conversations.
Carefully generated synthetic data offers a solid foundation for validating clustering algorithms, and it also reveals how natural conversation patterns emerge across domains. As we refine these techniques, we expect to surface even more nuanced patterns in technical communication.
If you found this article interesting, you can find the code that we used to generate the synthetic data here.