Benchmarking Kura
Kura is an open-source topic modeling library that automatically discovers and summarizes high-level themes from large collections of conversational data. By combining AI-powered summarization with advanced clustering techniques, Kura transforms thousands of raw conversations into actionable insights about what your users are actually discussing.
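Conceptually, that pipeline looks like the sketch below. To be clear, this is an illustration rather than Kura's actual code: the prompt, the embedding model (all-MiniLM-L6-v2), and plain k-means are stand-ins we chose for brevity, while Kura's own summarization prompts and clustering stages are more involved.

```python
import asyncio

from openai import AsyncOpenAI
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def summarize(conversation: str) -> str:
    # Step 1: reduce each raw conversation to a short task description.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the user's request in one sentence."},
            {"role": "user", "content": conversation},
        ],
    )
    return response.choices[0].message.content

async def summarize_and_cluster(conversations: list[str], n_clusters: int = 10):
    # Step 2: cluster embeddings of the summaries, so grouping reflects
    # user intent rather than the surface wording of each conversation.
    summaries = await asyncio.gather(*(summarize(c) for c in conversations))
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(summaries)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    return list(zip(summaries, labels))
```

Summarizing before clustering is the key design choice: raw conversations vary wildly in length and phrasing, while short summaries of the underlying request embed into a much cleaner space for grouping.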
We benchmarked Kura across three critical dimensions: processing performance, storage efficiency, and clustering quality. Our results show that Kura delivers production-ready performance with:
- Fast, predictable processing: 6,000 conversations analyzed with GPT-4o-mini in under 7 minutes for around $2 in token costs, using 20 concurrent tasks (see the back-of-envelope sketch after this list)
- Negligible storage overhead: a 440x compression ratio means even 100,000 conversations require only about 20MB of storage
- Accurate topic discovery: over 85% cluster alignment when validated on conversations known to share the same underlying topic
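These figures extrapolate cleanly because the dominant costs, LLM calls and checkpoint storage, grow with conversation count. The back-of-envelope sketch below uses only the numbers above; linear scaling at a fixed concurrency of 20 is our assumption, not a measured result.

```python
# Back-of-envelope extrapolation from the benchmark figures above.
# Assumption: time, cost, and storage all scale roughly linearly with
# conversation count at a fixed concurrency of 20 tasks.
BENCH_CONVERSATIONS = 6_000
BENCH_MINUTES = 7                        # wall-clock time with GPT-4o-mini
BENCH_COST_USD = 2.00                    # approximate token cost
BYTES_PER_CONVERSATION = 20e6 / 100_000  # 20 MB per 100,000 conversations -> ~200 B each
# At 440x compression, ~200 B stored implies ~88 KB of raw text per conversation.

def estimate(n: int) -> dict:
    scale = n / BENCH_CONVERSATIONS
    return {
        "minutes": round(BENCH_MINUTES * scale),
        "cost_usd": round(BENCH_COST_USD * scale, 2),
        "storage_mb": round(n * BYTES_PER_CONVERSATION / 1e6, 1),
    }

print(estimate(100_000))
# {'minutes': 117, 'cost_usd': 33.33, 'storage_mb': 20.0}
```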
In this article, we'll walk through our benchmark methodology and detailed findings, and show how you can apply these results to your own use cases.
Dataset Used
For this benchmark, we used the lmsys/mt_bench_human_judgments dataset from Hugging Face.
This dataset contains 3,000+ rows of human preferences between two model responses to identical questions, with each question sampled multiple times across different model pairs.
We generated two conversations per row, creating an evaluation dataset of 6,000+ conversations that tests clustering quality on identical questions paired with varying responses.
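Reproducing that expansion is a one-pass flat-map over the dataset. A minimal sketch, assuming the dataset's `human` split and its `conversation_a`/`conversation_b` columns (check the dataset card for the exact schema):

```python
from datasets import load_dataset

# Each row pairs two full conversations answering the same question.
rows = load_dataset("lmsys/mt_bench_human_judgments", split="human")

conversations = []
for row in rows:
    # One conversation per model response: identical question, varying answer.
    conversations.append(row["conversation_a"])
    conversations.append(row["conversation_b"])

print(f"{len(rows)} rows -> {len(conversations)} conversations")
```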
Note
If you're interested in the full dataset, we've uploaded the processed version we used to Hugging Face here. The full benchmarking scripts and generated datasets are also available here.