Governance as Advantage

Genie: How Uber Organized Data for AI That Works

Thiago Victorino

At Uber, teams like Michelangelo maintained Slack channels for internal support. The volume of questions was enormous: 45,000 per month. Users waited a long time for answers, information was fragmented across wikis, an internal Stack Overflow, and docs, and the same questions were asked over and over.

The solution? Genie, a Gen AI on-call copilot that transformed internal support.

The Architectural Decision: RAG vs Fine-tuning

Uber chose RAG (Retrieval-Augmented Generation) over fine-tuning for practical reasons:

Why not fine-tuning:

  • Requires high-quality curated data
  • Needs diverse examples for the LLM to learn
  • Requires computational resources to update
  • Longer time-to-market

Why RAG:

  • Does not require diverse examples to start
  • Easy to update with new data
  • Reduced time-to-market
  • Responds based on real documentation

The challenges to solve with RAG included hallucinations, data security, and user experience.

Architecture: From Data to Response

Genie’s data flow follows the general shape of a RAG application, built on Apache Spark.

Data Ingestion

  • Sources: Internal wiki (Engwiki), internal Stack Overflow, requirements documents
  • Processing: Apache Spark for ETL at scale
  • Embeddings: OpenAI embedding model
  • Storage: Terrablob (blob storage) + Sia (internal vector DB)

Serving (Response)

  • Input: User question on Slack
  • Knowledge Service: Converts question to embedding, searches relevant chunks
  • LLM: Generates response using retrieved context
  • Output: Response with source URL + action buttons

ETL Pipeline with Apache Spark

The ingestion pipeline has 4 stages:

1. Data Prep: Fetches content from sources via APIs. Output: DataFrame with URL and content.

2. Embeddings Creation: Chunking with LangChain + embedding generation via OpenAI using PySpark UDFs.

3. Vector Pusher: Pushes embeddings to Terrablob; Spark jobs build and merge the index.

4. Vector DB Sync: Each vector DB leaf node syncs daily, downloading the base index from Terrablob.

Why Spark? It provides distributed processing for high volumes of documents, its UDFs make it straightforward to call the OpenAI API, it simplifies pipeline orchestration, and it integrates natively with blob storage.
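As a rough illustration, the Embeddings Creation stage can be sketched as a PySpark job. The paths, column names, chunk size, and model below are illustrative assumptions, not Uber’s actual configuration:

# Sketch of the Embeddings Creation stage (paths, columns, chunk size, and
# model are assumptions for illustration).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType, StringType

spark = SparkSession.builder.appName("genie-embeddings").getOrCreate()

# Output of the Data Prep stage: one row per document, with url and content.
docs = spark.read.parquet("/data/genie/docs")

@F.udf(returnType=ArrayType(StringType()))
def chunk_content(content):
    # Split each document into overlapping chunks suited for retrieval.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_text(content or "")

@F.udf(returnType=ArrayType(FloatType()))
def embed_chunk(chunk):
    # Call the OpenAI embedding model from inside the Spark executors.
    client = OpenAI()
    response = client.embeddings.create(model="text-embedding-ada-002", input=chunk)
    return response.data[0].embedding

chunks = (
    docs
    .withColumn("chunk", F.explode(chunk_content("content")))
    .withColumn("embedding", embed_chunk("chunk"))
    .select("url", "chunk", "embedding")
)

# Written out for the Vector Pusher stage to index and push to Terrablob.
chunks.write.mode("overwrite").parquet("/data/genie/embeddings")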

Knowledge Service: The Heart of Genie

The Knowledge Service is the backend that processes all queries. The flow:

  1. Receives the question via Slack
  2. Generates an embedding using the Ada embeddings model
  3. Searches for the most relevant chunks in the vector DB
  4. Sends a prompt with the retrieved context to the LLM

Integrated Cost Tracking: Each call passes a UUID through the context, allowing cost tracking by channel, team, or use case. Recommended practice: always implement cost tracking from day one.
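A minimal sketch of this serving flow, including the request UUID, might look like the following; the vector DB client, model names, and logging sink are stand-ins, since Sia’s internal API is not public:

# Sketch of the Knowledge Service flow with per-request cost tracking.
# vector_db.search() stands in for the internal Sia API; model names and the
# logging sink are illustrative assumptions.
import uuid
from openai import OpenAI

client = OpenAI()

def log_cost(request_id, channel, total_tokens):
    # Placeholder sink: in practice this feeds a cost/audit pipeline.
    print(request_id, channel, total_tokens)

def answer_question(question, channel, vector_db):
    request_id = str(uuid.uuid4())  # lets cost be attributed per channel, team, or use case

    # 1. Embed the question.
    embedding = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding

    # 2. Retrieve the most relevant chunks from the vector DB.
    chunks = vector_db.search(embedding, top_k=5)

    # 3. Send the prompt with retrieved context to the LLM.
    context = "\n\n".join(f"{c['text']}\nSource: {c['url']}" for c in chunks)
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer ONLY from the context and cite the source URLs."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
        user=request_id,  # identifier carried through for the audit log
    )

    # 4. Record token usage against the request UUID and Slack channel.
    log_cost(request_id, channel, completion.usage.total_tokens)
    return {"request_id": request_id, "answer": completion.choices[0].message.content}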

The Crucial Insight: Documentation Quality

“If documentation quality is bad, it doesn’t matter how good the LLM is - there’s no way to have good performance.”

Uber created a system to evaluate and improve document quality in the knowledge base. The system returns:

  • Evaluation score for each document
  • Explanation of the score
  • Actionable suggestions on how to improve
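A hedged sketch of what such an evaluator can look like, as a single LLM call that returns structured JSON; the rubric, model, and output fields are illustrative, not Uber’s internal system:

# Illustrative documentation quality evaluator; rubric, model, and fields are assumptions.
import json
from openai import OpenAI

client = OpenAI()

EVAL_PROMPT = """Evaluate the following internal document for use in a RAG knowledge base.
Return JSON with: score (0-10), explanation, and suggestions (a list of concrete improvements).

Document:
{document}"""

def evaluate_document(document):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EVAL_PROMPT.format(document=document)}],
    )
    return json.loads(response.choices[0].message.content)

# Example output:
# {"score": 4, "explanation": "Setup steps reference a deprecated CLI.",
#  "suggestions": ["Update the CLI commands", "Link the current runbook"]}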

Reducing Hallucinations

The main strategy was structuring prompts with sub-contexts and URLs:

Sub-context 1: [content]
Source: [URL]

Sub-context 2: [content]
Source: [URL]

Instruction: Respond ONLY using the sub-contexts above
and cite the source URL for each response.

Result: Each response includes the source URL, allowing user verification.
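A small helper can assemble that structure from retrieved chunks; the chunk fields here are assumed for illustration:

# Builds the sub-context + source URL prompt structure shown above.
# Assumption: each retrieved chunk is a dict with "text" and "url" keys.
def build_prompt(question, chunks):
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"Sub-context {i}: {chunk['text']}\nSource: {chunk['url']}")
    parts.append(
        "Instruction: Respond ONLY using the sub-contexts above "
        "and cite the source URL for each response."
    )
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)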

Other strategies:

  • Source curation: Only use sources widely available to engineers
  • Updated data: Daily pipeline ensures recent information
  • Verification against sources: Mechanisms to verify responses against authoritative sources

Integrated Feedback System

Users give feedback by clicking buttons on Genie’s response:

  • Resolved: The response completely solved the problem
  • Helpful: The response partially helped, but more is needed
  • Not Helpful: The response was wrong or irrelevant
  • Not Relevant: The user needs human help instead

Real-time feedback data makes it possible to quickly identify problems and adjust the system.
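As a rough sketch, the buttons can be attached to Genie’s Slack response with Block Kit; the action IDs are assumptions, and each click would be published to a real-time feedback stream:

# Illustrative Slack Block Kit payload with the four feedback buttons.
# The action_id values are assumptions; the click handlers would publish
# each event to a streaming system for real-time feedback.
feedback_blocks = [
    {"type": "section", "text": {"type": "mrkdwn", "text": "Was this answer useful?"}},
    {
        "type": "actions",
        "elements": [
            {"type": "button", "text": {"type": "plain_text", "text": "Resolved"},
             "action_id": "feedback_resolved"},
            {"type": "button", "text": {"type": "plain_text", "text": "Helpful"},
             "action_id": "feedback_helpful"},
            {"type": "button", "text": {"type": "plain_text", "text": "Not Helpful"},
             "action_id": "feedback_not_helpful"},
            {"type": "button", "text": {"type": "plain_text", "text": "Not Relevant"},
             "action_id": "feedback_not_relevant"},
        ],
    },
]
# Sent alongside the answer, e.g. slack_client.chat_postMessage(channel=..., blocks=feedback_blocks)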

LLM as Judge

To evaluate responses at scale, Uber uses LLM as a Judge. The LLM compares responses against gold standards or human preferences.

Metrics evaluated:

  • Hallucination rate
  • Response relevance
  • Context coverage
  • Any custom metric
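A hedged sketch of an LLM-as-judge call that grades an answer against the retrieved context and a gold-standard answer; the rubric, model, and output fields are illustrative:

# Illustrative LLM-as-judge evaluation; rubric, model, and output fields are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an internal support bot.

Question: {question}
Retrieved context: {context}
Gold answer: {gold_answer}
Bot answer: {answer}

Return JSON with:
- hallucination: true if the bot answer states anything not supported by the context
- relevance: 1-5, how well the answer addresses the question
- coverage: 1-5, how much of the gold answer is covered"""

def judge(question, context, gold_answer, answer):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, gold_answer=gold_answer, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)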

Results Since Launch

Since September 2023:

  • 154 Slack channels served
  • 70,000+ questions answered
  • 48.9% usefulness rate
  • 13,000 engineering hours saved

Considering an average engineer's salary, 13,000 saved hours represent significant value in recovered productivity.

6 Insights to Replicate

  1. RAG is faster for MVP: Doesn’t require curated data to start. Fine-tuning can come later.

  2. Doc quality matters more than the LLM: Continuously evaluate and improve documentation. Garbage in, garbage out.

  3. Feedback loop from day 1: Integrate feedback collection into the flow. Use streaming systems for real-time data.

  4. LLM as Judge for evaluation at scale: Allows measuring hallucinations and relevance without relying only on manual feedback.

  5. Cite sources in every response: Structure prompts with sub-contexts + URLs. Reduces hallucinations and increases trust.

  6. Track costs by UUID: Pass identifiers in each call for audit log. Enables cost optimization.

The Main Lesson

Genie demonstrates that AI that works in production doesn’t depend only on the most advanced model. It depends on:

  • Well-organized data
  • Quality documentation
  • Continuous feedback
  • Cost tracking
  • Source citation

Data infrastructure is the true competitive differentiator.


At Victorino Group, we help companies organize data and build AI agents with real results. If you want to implement AI that works, let’s talk.
