What is Kura?

Kura is a library that aims to help you make sense of user data. By using language models to iteratively summarise and cluster conversations, it provides a modular and flexible way for you to understand broad, high-level trends in your user base.

It's built on the same ideas as CLIO but is open-sourced so that you can try it on your own data. I've written a walkthrough of the code that you can read to understand the high-level ideas behind CLIO.

The work behind Kura is kindly sponsored by Improving RAG. If you're looking for a way to make sense of user data and improve your user-facing application, please check them out.

Why is it useful?

By combining traditional clustering techniques with language models, we can get a much better understanding of the underlying structure of your data. For instance, using Kura, we can identify clusters not by the surface content of the conversations but by the underlying user intent.
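
To make that distinction concrete, here's a minimal, self-contained sketch of the summarise-then-cluster idea. It is not Kura's actual pipeline: a keyword lookup stands in for the language-model summariser, and a plain dict stands in for real clustering, but it shows why grouping on summarised intent beats grouping on raw text:

```python
from collections import defaultdict

# Stub "summariser": in practice a language model would distil each
# conversation into a short intent summary. Here we fake it with keyword
# checks so the grouping step is easy to follow.
def summarise_intent(conversation: str) -> str:
    if "refund" in conversation or "charged" in conversation:
        return "resolve a billing issue"
    if "install" in conversation or "pip" in conversation:
        return "get help installing software"
    return "other"

conversations = [
    "I was charged twice, can I get a refund?",
    "pip install keeps failing on my machine",
    "How do I install this on Windows?",
    "Why was my card charged again?",
]

# Group by the summarised intent rather than the raw text: conversations
# with very different wording land in the same cluster.
clusters = defaultdict(list)
for convo in conversations:
    clusters[summarise_intent(convo)].append(convo)
```

Here `"I was charged twice"` and `"Why was my card charged again?"` share almost no vocabulary with each other's phrasing of the install questions, yet end up in the two intent clusters you'd want.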

Our API is designed to be modular and flexible so that you can easily extend it to fit your needs. Here's an example of how you can use it to cluster your own Claude conversation history.

Note

If you're using a different format for your messages, you can also just manually create a list of Conversation objects and pass them into the cluster_conversations method. This is useful if you're exporting conversations from a different source.

from kura import Kura
from asyncio import run
from kura.types import Conversation


kura = Kura()
conversations: list[Conversation] = Conversation.from_claude_conversation_dump(
    "conversations.json"
)
run(kura.cluster_conversations(conversations))

Roadmap

Kura is currently under active development and I'm working on adding more features to it. At a high level, I'm working on writing out more examples of how to swap out specific components of the pipeline to fit your needs, as well as improving support for different conversation formats.

Here's a rough roadmap of what I'm working on, contributions are welcome!

  • Implement a simple Kura clustering class
  • Implement a Kura CLI tool
  • Support hooks that can be run on individual conversations and clusters to extract metadata
  • Support heatmap visualisation
  • Support ChatGPT conversations
  • Show how to use Kura with other configurations, such as UMAP instead of KMeans earlier in the pipeline
  • Support more clients/conversation formats
  • Provide CLI flags to specify the clustering directory and the port

I've also recorded a technical deep dive into what Kura is and the ideas behind it if you'd rather watch than read.

Getting Started

Note

Kura uses the gemini-1.5-flash model by default, so you must set a GOOGLE_API_KEY environment variable in your shell to use the Google Gemini API. If you don't have one, you can get one here.

To get started with Kura, you'll need to install our python package and have a list of conversations to cluster.

pip install kura
# or, with uv
uv pip install kura

If you're using the Claude app, you can export your conversation history here and use the Conversation.from_claude_conversation_dump method to load them into Kura.

If you don't have a list of conversations on hand, we've also uploaded a sample dataset of 190+ conversations to Hugging Face. These were synthetically generated with Gemini, and we used them to validate Kura's clustering ability.

Kura ships with an automatic checkpointing system that saves the state of the clustering process to disk, so you can resume from where you left off without worrying about losing your clustering progress.
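
The general pattern behind checkpointing can be sketched as follows. This is an illustration of the idea, not Kura's internal implementation (the directory name and file layout here are invented): each expensive stage writes its result to disk, and a rerun loads completed stages instead of recomputing them.

```python
import json
import shutil
from pathlib import Path

# Hypothetical checkpoint directory for this demo, not Kura's actual layout.
CHECKPOINT_DIR = Path("kura_checkpoints_demo")
shutil.rmtree(CHECKPOINT_DIR, ignore_errors=True)  # start fresh for the demo

def run_stage(name: str, compute):
    """Return the cached result for a stage if it exists, else compute and save it."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    path = CHECKPOINT_DIR / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.write_text(json.dumps(result))
    return result

calls = {"count": 0}

def expensive_summaries():
    # Stand-in for a slow, paid language-model stage.
    calls["count"] += 1
    return ["summary-1", "summary-2"]

summaries = run_stage("summaries", expensive_summaries)        # computed and saved
summaries_again = run_stage("summaries", expensive_summaries)  # loaded from disk
```

The second call never invokes the expensive function, which is exactly why an interrupted clustering run can pick up where it left off.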

With your conversations on hand, there are two ways that you can run clustering with Kura.

CLI

We provide a simple CLI tool that runs Kura with some default settings and serves a React frontend for visualisation. To boot it up, simply run the following command:

kura

This will in turn start up a local FastAPI server that you can interact with to upload your data and visualise the clusters. It takes roughly ~1 minute for ~1,000 conversations with a semaphore of around 50 concurrent requests. If your provider allows higher concurrency, you can increase the semaphore to speed up the process.

> kura
INFO:     Started server process [41539]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
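
The concurrency note above boils down to bounding the number of in-flight model requests with a semaphore. Here's a minimal sketch of that pattern in asyncio, with a sleep standing in for the language-model call; the limit of 5 is just for the demo (the default mentioned above is around 50 concurrent requests):

```python
import asyncio

SEMAPHORE_LIMIT = 5  # demo value; raise it if your provider allows more concurrency

async def summarise(sem: asyncio.Semaphore, conversation_id: int) -> str:
    # The semaphore caps how many coroutines are inside this block at once,
    # which is what keeps you under the model provider's rate limits.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for a language-model API call
        return f"summary-{conversation_id}"

async def main() -> list[str]:
    sem = asyncio.Semaphore(SEMAPHORE_LIMIT)
    # All 20 tasks are scheduled at once, but at most 5 run concurrently.
    return await asyncio.gather(*(summarise(sem, i) for i in range(20)))

summaries = asyncio.run(main())
```

Raising the limit trades faster wall-clock time for more simultaneous requests against your quota.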

You can combine multiple conversation files into a single clustering run by uploading them all at once. We also provide options out of the box to modify parameters such as the maximum number of clusters and whether to rerun the clustering process.

Using the Python API

You can also use the Python API to do the same thing.

from kura import Kura
from asyncio import run
from kura.types import Conversation


kura = Kura()
conversations: list[Conversation] = Conversation.from_claude_conversation_dump(
    "conversations.json"
)
run(kura.cluster_conversations(conversations))

We assume here that you have a conversations.json file in your current working directory containing data in the format of the Claude conversation dump. You can see a guide on how to export your conversation history from the Claude app here.

Loading Custom Conversations

As mentioned above, if you're using a different format for your messages, you can also just manually create a list of Conversation objects and pass them into the cluster_conversations method. This is useful if you're exporting conversations from a different source.

Let's take the following example of a conversation:

conversations = [
    {
        "role": "user",
        "content": "Hello, how are you?"
    },
    {
        "role": "assistant",
        "content": "I'm fine, thank you!"
    }
]

We can then manually create a Conversation object from this and pass it into the cluster_conversations method.

from datetime import datetime
from uuid import uuid4

from kura.types import Conversation, Message

conversation = [
    Conversation(
        messages=[
            Message(
                created_at=str(datetime.now()),
                role=message["role"],
                content=message["content"],
            )
            for message in conversations
        ],
        id=str(uuid4()),
        created_at=datetime.now(),
    )
]

Once you've done so, you can then pass this list of conversations into the cluster_conversations method.

Note

To run clustering, you should have ~100 conversations on hand. With fewer, the clusters don't make much sense, since the language model will have a hard time generating meaningful clusters of user behaviour.

from asyncio import run

from kura import Kura
from kura.types import Conversation

kura = Kura()
conversations: list[Conversation] = Conversation.from_claude_conversation_dump(
    "conversations.json"
)
run(kura.cluster_conversations(conversations))