Conversations¶
Conversations are the fundamental data units in Kura's analysis pipeline. This document explains how conversations are structured, loaded, and processed.
Conversation Structure¶
In Kura, a conversation is represented by the Conversation
class from kura.types.conversation
:
from kura.types import Conversation, Message
from datetime import datetime
from uuid import uuid4
# Create a simple conversation
conversation = Conversation(
id=str(uuid4()),
created_at=datetime.now(),
messages=[
Message(
role="user",
content="Hello, can you help me with a Python question?",
created_at=str(datetime.now())
),
Message(
role="assistant",
content="Of course! What's your Python question?",
created_at=str(datetime.now())
),
Message(
role="user",
content="How do I read a file in Python?",
created_at=str(datetime.now())
),
Message(
role="assistant",
content="To read a file in Python, you can use the built-in open() function...",
created_at=str(datetime.now())
)
],
metadata={"source": "example", "category": "programming"}
)
Key Components¶
Each conversation contains:
- ID: A unique identifier for the conversation
- Created At: Timestamp for when the conversation was created
- Messages: A list of message objects, each with:
- Role: Either "user" or "assistant"
- Content: The text content of the message
- Created At: Timestamp for when the message was sent
- Metadata: Optional dictionary of additional information
Loading Conversations¶
Kura provides several methods for loading conversations from different sources:
From Claude Conversation Exports¶
from kura.types import Conversation
# Load from Claude export
conversations = Conversation.from_claude_conversation_dump("conversations.json")
From Hugging Face Datasets¶
from kura.types import Conversation
# Load from a Hugging Face dataset
conversations = Conversation.from_hf_dataset(
"ivanleomk/synthetic-gemini-conversations",
split="train"
)
Creating Custom Loaders¶
You can create custom loaders for other data sources by implementing functions that convert your data to Conversation
objects:
def load_from_custom_format(file_path):
# Load and parse your custom data format
data = your_parsing_function(file_path)
# Convert to Conversation objects
conversations = []
for entry in data:
messages = [
Message(
role=msg["speaker"],
content=msg["text"],
created_at=msg["timestamp"]
)
for msg in entry["messages"]
]
conversation = Conversation(
id=entry["id"],
created_at=entry["date"],
messages=messages,
metadata=entry.get("meta", {})
)
conversations.append(conversation)
return conversations
Conversation Processing¶
In the Kura pipeline, conversations go through several processing steps:
- Loading: Conversations are loaded from a source
- Summarization: Each conversation is summarized to capture its core intent
- Metadata Extraction: Optional metadata is extracted from the conversation content
- Embedding: Summaries are converted to vector embeddings
- Clustering: Similar conversations are grouped together
Working with Message Content¶
The content of messages can be in various formats, but should generally be text. HTML, Markdown, or other structured formats will be processed as-is, which may affect summarization quality.
When working with message content:
- Clean up any special formatting if needed
- Remove system messages if they don't contribute to the conversation topic
- Ensure message ordering is correct for proper context
Handling Metadata¶
Conversations can include metadata, which provides additional context:
# Add metadata when creating conversations
conversations = Conversation.from_hf_dataset(
"allenai/WildChat-nontoxic",
metadata_fn=lambda x: {
"model": x["model"],
"toxic": x["toxic"],
"redacted": x["redacted"],
}
)
This metadata can later be used to: - Filter conversations - Analyze patterns across different conversation attributes - Provide additional context for visualization
Next Steps¶
Now that you understand how conversations are structured in Kura, you can:
- Learn about the summarization process
- See how to load different data formats in the Quickstart Guide
- Explore configuration options in the Configuration Guide