Phoenix Teardown: How Arize Built an ML Observability Platform

title: "Phoenix Teardown: How Arize Built an ML Observability Platform"
date: "2026-05-16"
description: "A deep dive into Arize Phoenix's architecture: tracing, evaluation, and the design decisions behind an open-source ML observability tool."
tags: ["ML", "Observability", "Architecture"]

Phoenix is an open-source ML observability platform built by Arize AI. In this post I'll walk through its architecture, trace how requests flow through the system, and highlight the design decisions that make it work.

Why Phoenix?

Many ML teams struggle with observability once a model is deployed. Traditional APM tools such as Datadog and New Relic track latency, errors, and resource usage, but they weren't designed for failure modes specific to ML systems:

  • Distribution shift in input features over time
  • Model performance degradation that doesn't correlate with system metrics
  • Embedding drift that's invisible to standard monitoring

Phoenix addresses this gap with tracing and evaluation built for ML and LLM applications, so model behavior is recorded alongside the usual system metrics.

Architecture Overview

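At a high level, Phoenix consists of three layers: instrumentation inside your application, a collector that ingests and stores spans, and an evaluation framework that runs over the stored traces. Instrumentation starts with a single call that registers a tracer provider: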

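# Core tracing instrumentation: register an OpenTelemetry tracer provider
# that exports spans to a Phoenix collector.
from phoenix.otel import register

tracer_provider = register(
    project_name="my-llm-app",  # spans are grouped under this project
    endpoint="http://localhost:6006/v1/traces",  # trace endpoint of a local Phoenix server
)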

The collector ingests OpenTelemetry-compatible spans and stores them for analysis:

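// Simplified: OTel attribute values are primitives or arrays of primitives.
type AttributeValue = string | number | boolean | string[] | number[] | boolean[];

interface Span {
  traceId: string;        // shared by every span in one trace
  spanId: string;         // unique within the trace
  parentSpanId?: string;  // absent on the root span
  name: string;
  attributes: Record<string, AttributeValue>;
  startTime: number;
  endTime: number;
}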
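Because each span carries only its own ID and its parent's ID, the trace tree has to be rebuilt from a flat list. Here is a minimal sketch of that reconstruction in plain Python. The `Span` dataclass mirrors the interface above with snake_case names, the millisecond timestamps are an assumption, and `render_trace` is a hypothetical helper, not part of Phoenix.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """Python mirror of the Span interface (field names are illustrative)."""

    trace_id: str
    span_id: str
    name: str
    start_time: float  # assumed to be milliseconds
    end_time: float
    parent_span_id: Optional[str] = None
    attributes: dict[str, Any] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_time - self.start_time


def build_children_index(spans: list[Span]) -> dict[Optional[str], list[Span]]:
    """Group spans by parent ID; root spans are stored under the key None."""
    children: dict[Optional[str], list[Span]] = defaultdict(list)
    for span in sorted(spans, key=lambda s: s.start_time):
        children[span.parent_span_id].append(span)
    return children


def render_trace(spans: list[Span]) -> list[str]:
    """Return one indented line per span, walking depth-first from the roots.

    Spans whose parent is missing from the list are not rendered.
    """
    children = build_children_index(spans)
    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans can arrive in any order; the tree is recovered from the IDs alone.
spans = [
    Span("t1", "a", "rag_query", 0, 900),
    Span("t1", "c", "llm_call", 210, 880, parent_span_id="a"),
    Span("t1", "b", "retrieve", 10, 200, parent_span_id="a"),
]
trace_lines = render_trace(spans)
print("\n".join(trace_lines))
```

The design point is that spans are self-describing records: a collector can store them as they arrive and defer tree-building to query time.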

Evaluation Framework

Phoenix ships with built-in evaluators that run as post-processing on collected traces:

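from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

# LLM that acts as the judge; the model name is an example, not from the original post.
model = OpenAIModel(model="gpt-4o")

hallucination_eval = HallucinationEvaluator(model)
qa_eval = QAEvaluator(model)

# Run evals on a dataset of traces. trace_df is assumed to be a pandas DataFrame
# with the columns these evaluators read (input, output, and reference).
# run_evals returns one DataFrame per evaluator, in the order given.
results = run_evals(
    dataframe=trace_df,
    evaluators=[hallucination_eval, qa_eval],
    provide_explanation=True,
)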
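To make the post-processing pattern concrete, here is a minimal sketch of its shape, not Phoenix's implementation. Each evaluator maps one row to a label plus an explanation, and a runner returns one result list per evaluator. The keyword-matching evaluators and `toy_run_evals` are hypothetical stand-ins; the real evaluators prompt an LLM instead.

```python
from typing import Callable

Row = dict[str, str]
EvalResult = dict[str, str]
Evaluator = Callable[[Row], EvalResult]


def grounded_eval(row: Row) -> EvalResult:
    """Toy hallucination check: does the answer appear verbatim in the reference?"""
    grounded = row["output"].lower() in row["reference"].lower()
    return {
        "label": "factual" if grounded else "hallucinated",
        "explanation": f"answer {'found' if grounded else 'not found'} in reference",
    }


def answered_eval(row: Row) -> EvalResult:
    """Toy completeness check: did the model return any text at all?"""
    answered = bool(row["output"].strip())
    return {
        "label": "answered" if answered else "empty",
        "explanation": f"output has {len(row['output'].strip())} characters",
    }


def toy_run_evals(rows: list[Row], evaluators: list[Evaluator]) -> list[list[EvalResult]]:
    """Apply every evaluator to every row; return one result list per evaluator."""
    return [[evaluate(row) for row in rows] for evaluate in evaluators]


rows = [
    {
        "input": "Where is the Eiffel Tower?",
        "reference": "The Eiffel Tower is in Paris.",
        "output": "Paris",
    },
    {
        "input": "Who wrote Hamlet?",
        "reference": "Hamlet was written by William Shakespeare.",
        "output": "Christopher Marlowe",
    },
]

results = toy_run_evals(rows, [grounded_eval, answered_eval])

# Aggregate the first evaluator's labels into a single rate.
hallucinated = sum(r["label"] == "hallucinated" for r in results[0])
hallucination_rate = hallucinated / len(rows)
print(f"hallucination rate: {hallucination_rate:.0%}")
```

Since evaluation runs over stored rows rather than in the request path, evaluators can be added or re-run later without touching the instrumented application.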

Key Design Decisions

  1. OpenTelemetry-native: Rather than inventing a proprietary protocol, Phoenix builds on OTel standards, so any OTel-compatible exporter can send spans to it once pointed at the collector endpoint.

  2. Local-first: Phoenix runs as a lightweight local server during development, then scales to a hosted deployment in production. No cloud account is required to get started.

  3. Dataframe-centric: All data is accessible as pandas DataFrames, so traces and eval results fit into existing notebooks and ML workflows without a separate query layer.
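The dataframe-centric decision is easiest to see in how nested span records become table rows. Here is a small sketch, using only the standard library, that flattens nested dictionaries into dotted column names. The `flatten` helper and the sample attribute names are illustrative assumptions, not Phoenix's actual schema.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


# Illustrative span records; the attribute names are examples, not a schema.
spans = [
    {
        "name": "llm_call",
        "context": {"trace_id": "t1", "span_id": "c"},
        "attributes": {"llm": {"model_name": "gpt-4o", "token_count": {"total": 512}}},
    },
    {
        "name": "retrieve",
        "context": {"trace_id": "t1", "span_id": "b"},
        "attributes": {"retrieval": {"top_k": 4}},
    },
]

rows = [flatten(span) for span in spans]
columns = sorted({column for row in rows for column in row})

# Column-style query: total tokens across spans that report a token count.
total_tokens = sum(row.get("attributes.llm.token_count.total", 0) for row in rows)
print(columns)
print(total_tokens)
```

Rows in this shape can be passed directly to `pandas.DataFrame(rows)`; keys missing from a row become NaN in the resulting frame.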

What's Next

In a future post, I'll dig into how Phoenix handles embedding drift detection and the math behind its similarity search implementation.