Governance as Advantage

Genie: How Uber Organized Data for AI That Works

Thiago Victorino

At Uber, teams like Michelangelo maintained Slack channels for internal support. The volume of questions was enormous: 45,000 per month. Users waited a long time for answers, information was fragmented across wikis, an internal Stack Overflow, and docs, and the same questions were asked over and over.

The solution? Genie, a Gen AI on-call copilot that transformed internal support.

The Architectural Decision: RAG vs Fine-tuning

Uber chose RAG (Retrieval-Augmented Generation) over fine-tuning for practical reasons:

Why not fine-tuning:

  • Requires high-quality curated data
  • Needs diverse examples for the LLM to learn
  • Requires computational resources to update
  • Longer time-to-market

Why RAG:

  • Does not require diverse examples to start
  • Easy to update with new data
  • Reduced time-to-market
  • Responds based on real documentation

The challenges to solve with RAG included hallucinations, data security, and user experience.

Architecture: From Data to Response

Genie’s data flow follows the general shape of a RAG application, built on Apache Spark.

Data Ingestion

  • Sources: Internal wiki (Engwiki), internal Stack Overflow, requirements documents
  • Processing: Apache Spark for ETL at scale
  • Embeddings: OpenAI embedding model
  • Storage: Terrablob (blob storage) + Sia (internal vector DB)

Serving (Response)

  • Input: User question on Slack
  • Knowledge Service: Converts question to embedding, searches relevant chunks
  • LLM: Generates response using retrieved context
  • Output: Response with source URL + action buttons

ETL Pipeline with Apache Spark

The ingestion pipeline has 4 stages:

1. Data Prep: Fetches content from sources via APIs. Output: DataFrame with URL and content.

2. Embeddings Creation: Chunking with LangChain + embedding generation via OpenAI using PySpark UDFs.

3. Vector Pusher: Pushes embeddings to Terrablob; Spark jobs build and merge the index.

4. Vector DB Sync: Each vector DB leaf node syncs daily, downloading the base index from Terrablob.

Why Spark? It provides distributed processing for high volumes of documents, its UDFs make it straightforward to call the OpenAI API, it simplifies pipeline orchestration, and it integrates natively with blob storage.
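As a rough illustration, the Embeddings Creation stage can be sketched as a PySpark job. The paths, column names, chunk size, and model below are illustrative assumptions, not Uber’s actual configuration:

# Sketch of the Embeddings Creation stage (paths, columns, chunk size, and
# model are assumptions for illustration).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType, StringType

spark = SparkSession.builder.appName("genie-embeddings").getOrCreate()

# Output of the Data Prep stage: one row per document, with url and content.
docs = spark.read.parquet("/data/genie/docs")

@F.udf(returnType=ArrayType(StringType()))
def chunk_content(content):
    # Split each document into overlapping chunks suited for retrieval.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_text(content or "")

@F.udf(returnType=ArrayType(FloatType()))
def embed_chunk(chunk):
    # Call the OpenAI embedding model from inside the Spark executors.
    client = OpenAI()
    response = client.embeddings.create(model="text-embedding-ada-002", input=chunk)
    return response.data[0].embedding

chunks = (
    docs
    .withColumn("chunk", F.explode(chunk_content("content")))
    .withColumn("embedding", embed_chunk("chunk"))
    .select("url", "chunk", "embedding")
)

# Written out for the Vector Pusher stage to index and push to Terrablob.
chunks.write.mode("overwrite").parquet("/data/genie/embeddings")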

Knowledge Service: The Heart of Genie

The Knowledge Service is the backend that processes all queries. The flow:

  1. Receives the question via Slack
  2. Generates an embedding using the Ada embeddings model
  3. Searches for the most relevant chunks in the vector DB
  4. Sends a prompt with the retrieved context to the LLM

Integrated Cost Tracking: Each call passes a UUID through the context, allowing cost tracking by channel, team, or use case. Recommended practice: always implement cost tracking from day one.
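A minimal sketch of this serving flow, including the request UUID, might look like the following; the vector DB client, model names, and logging sink are stand-ins, since Sia’s internal API is not public:

# Sketch of the Knowledge Service flow with per-request cost tracking.
# vector_db.search() stands in for the internal Sia API; model names and the
# logging sink are illustrative assumptions.
import uuid
from openai import OpenAI

client = OpenAI()

def log_cost(request_id, channel, total_tokens):
    # Placeholder sink: in practice this feeds a cost/audit pipeline.
    print(request_id, channel, total_tokens)

def answer_question(question, channel, vector_db):
    request_id = str(uuid.uuid4())  # lets cost be attributed per channel, team, or use case

    # 1. Embed the question.
    embedding = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding

    # 2. Retrieve the most relevant chunks from the vector DB.
    chunks = vector_db.search(embedding, top_k=5)

    # 3. Send the prompt with retrieved context to the LLM.
    context = "\n\n".join(f"{c['text']}\nSource: {c['url']}" for c in chunks)
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer ONLY from the context and cite the source URLs."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
        user=request_id,  # identifier carried through for the audit log
    )

    # 4. Record token usage against the request UUID and Slack channel.
    log_cost(request_id, channel, completion.usage.total_tokens)
    return {"request_id": request_id, "answer": completion.choices[0].message.content}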

The Crucial Insight: Documentation Quality

“If documentation quality is bad, it doesn’t matter how good the LLM is - there’s no way to have good performance.”

Uber created a system to evaluate and improve document quality in the knowledge base. The system returns:

  • Evaluation score for each document
  • Explanation of the score
  • Actionable suggestions on how to improve
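A hedged sketch of what such an evaluator can look like, as a single LLM call that returns structured JSON; the rubric, model, and output fields are illustrative, not Uber’s internal system:

# Illustrative documentation quality evaluator; rubric, model, and fields are assumptions.
import json
from openai import OpenAI

client = OpenAI()

EVAL_PROMPT = """Evaluate the following internal document for use in a RAG knowledge base.
Return JSON with: score (0-10), explanation, and suggestions (a list of concrete improvements).

Document:
{document}"""

def evaluate_document(document):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EVAL_PROMPT.format(document=document)}],
    )
    return json.loads(response.choices[0].message.content)

# Example output:
# {"score": 4, "explanation": "Setup steps reference a deprecated CLI.",
#  "suggestions": ["Update the CLI commands", "Link the current runbook"]}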

Reducing Hallucinations

The main strategy was structuring prompts with sub-contexts and URLs:

Sub-context 1: [content]
Source: [URL]

Sub-context 2: [content]
Source: [URL]

Instruction: Respond ONLY using the sub-contexts above
and cite the source URL for each response.

Result: Each response includes the source URL, allowing user verification.
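A small helper can assemble that structure from retrieved chunks; the chunk fields here are assumed for illustration:

# Builds the sub-context + source URL prompt structure shown above.
# Assumption: each retrieved chunk is a dict with "text" and "url" keys.
def build_prompt(question, chunks):
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(f"Sub-context {i}: {chunk['text']}\nSource: {chunk['url']}")
    parts.append(
        "Instruction: Respond ONLY using the sub-contexts above "
        "and cite the source URL for each response."
    )
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)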

Other strategies:

  • Source curation: Only use sources widely available to engineers
  • Updated data: Daily pipeline ensures recent information
  • Verification against sources: Mechanisms to verify responses against authoritative sources

Integrated Feedback System

Users give feedback by clicking buttons on Genie’s response:

  • Resolved: The response completely solved the problem
  • Helpful: The response partially helped, but more is needed
  • Not Helpful: The response was wrong or irrelevant
  • Not Relevant: The user needs human help instead

Real-time feedback data makes it possible to quickly identify problems and adjust the system.
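As a rough sketch, the buttons can be attached to Genie’s Slack response with Block Kit; the action IDs are assumptions, and each click would be published to a real-time feedback stream:

# Illustrative Slack Block Kit payload with the four feedback buttons.
# The action_id values are assumptions; the click handlers would publish
# each event to a streaming system for real-time feedback.
feedback_blocks = [
    {"type": "section", "text": {"type": "mrkdwn", "text": "Was this answer useful?"}},
    {
        "type": "actions",
        "elements": [
            {"type": "button", "text": {"type": "plain_text", "text": "Resolved"},
             "action_id": "feedback_resolved"},
            {"type": "button", "text": {"type": "plain_text", "text": "Helpful"},
             "action_id": "feedback_helpful"},
            {"type": "button", "text": {"type": "plain_text", "text": "Not Helpful"},
             "action_id": "feedback_not_helpful"},
            {"type": "button", "text": {"type": "plain_text", "text": "Not Relevant"},
             "action_id": "feedback_not_relevant"},
        ],
    },
]
# Sent alongside the answer, e.g. slack_client.chat_postMessage(channel=..., blocks=feedback_blocks)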

LLM as Judge

To evaluate responses at scale, Uber uses LLM as a Judge. The LLM compares responses against gold standards or human preferences.

Metrics evaluated:

  • Hallucination rate
  • Response relevance
  • Context coverage
  • Any custom metric
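A hedged sketch of an LLM-as-judge call that grades an answer against the retrieved context and a gold-standard answer; the rubric, model, and output fields are illustrative:

# Illustrative LLM-as-judge evaluation; rubric, model, and output fields are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an internal support bot.

Question: {question}
Retrieved context: {context}
Gold answer: {gold_answer}
Bot answer: {answer}

Return JSON with:
- hallucination: true if the bot answer states anything not supported by the context
- relevance: 1-5, how well the answer addresses the question
- coverage: 1-5, how much of the gold answer is covered"""

def judge(question, context, gold_answer, answer):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, gold_answer=gold_answer, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)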

Results Since Launch

Since September 2023:

  • 154 Slack channels served
  • 70,000+ questions answered
  • 48.9% usefulness rate
  • 13,000 engineering hours saved

Considering an average engineer's salary, 13,000 saved hours represent significant value in recovered productivity.

6 Insights to Replicate

  1. RAG is faster for MVP: Doesn’t require curated data to start. Fine-tuning can come later.

  2. Doc quality matters more than the LLM: Continuously evaluate and improve documentation. Garbage in, garbage out.

  3. Feedback loop from day 1: Integrate feedback collection into the flow. Use streaming systems for real-time data.

  4. LLM as Judge for evaluation at scale: Allows measuring hallucinations and relevance without relying only on manual feedback.

  5. Cite sources in every response: Structure prompts with sub-contexts + URLs. Reduces hallucinations and increases trust.

  6. Track costs by UUID: Pass identifiers in each call for audit log. Enables cost optimization.

The Main Lesson

Genie demonstrates that AI that works in production doesn’t depend only on the most advanced model. It depends on:

  • Well-organized data
  • Quality documentation
  • Continuous feedback
  • Cost tracking
  • Source citation

Data infrastructure is the true competitive differentiator.


At Victorino Group, we help companies organize data and build AI agents with real results. If you want to implement AI that works, let’s talk.
