Closed-Loop Multi-Vendor Network Ops

multivendor-ai-network-lab

A 26-device multivendor (Juniper / Arista / FRR) network operations lab driven by a Pydantic-AI orchestrator, eval harness, and immutable AI audit trail.


Overview

What it is

Multivendor AI Network Lab is a closed-loop NetOps reference implementation: two real labs — a containerlab CLOS EVPN-VXLAN fabric and a docker-compose FRR backbone — are driven by a Flask monolith whose core is a vendor-neutral driver abstraction layer.

It solves the problem of safely operating heterogeneous Juniper/Arista/FRR networks with AI, running an observe → diagnose → remediate → verify → document loop with RFC 6241 confirmed-commit safety and auto-rollback. The whole surface is exposed to AI agents through a 64-tool MCP server and to humans through a web UI and Telegram bot. It is built as a portfolio / demo lab, not a production system.

Key Features

Built for safe AI-driven change

Every capability is grounded in a real source module — from the change pipeline down to the immutable audit ledger.

Closed-loop change pipeline

POST /api/change/closed-loop chains six stages (Predict → Batfish → Apply (Health Gate) → Watch → Post-change diff → Intent verify) into one governed operation with auto-rollback. In the demo run, a clean change reached APPROVED in 12 s and an induced regression reached ROLLED_BACK in 6 s.

Health Gate confirmed-commit

RFC 6241 §8.4 confirmed-commit gate: every config change runs inside a watch window and auto-reverts if any signal degrades.

Vendor-neutral driver core

BaseNetworkDriver template method with a get_driver() factory maps each vendor (FRR/EOS/SRL/Junos/IOSXR) to a concrete driver and transport, returning a frozen, soft-failing DriverResult.

Pydantic-AI multi-agent orchestrator

Structured-output Routing / ACL / Incident agents route symptoms to diagnoses; degrades to deterministic offline mode without an Anthropic API key.

64-tool MCP server

A FastMCP surface exposes every capability so Claude Code, Cursor, and opencode can call the lab directly as agent tools.

GAIT immutable audit + auto-postmortem

Every AI action is appended to a JSONL ledger with token cost; the postmortem writer stitches GAIT + Health Gate + Remediation events into structured markdown incident reports.

NetBox SoT drift detection

Severity-tiered (critical/high/medium/low) comparison between the NetBox source of truth and the running lab across IP / AS / site / vendor / model / role / OS fields.

CLI RAG over 9,802 commands

Stdlib-only BM25 retrieval over a multivendor CLI corpus returns vendor-specific matches with citation deep-links — no embeddings or model download.
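To make the retrieval idea concrete, here is a minimal stdlib-only BM25 sketch. The four-command corpus, the `BM25Index` class, and the `k1`/`b` defaults are illustrative assumptions; they are not taken from src/cli_rag.py, and citation deep-links are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; the real 9,802-command corpus and its schema are not shown here.
CORPUS = [
    {"vendor": "junos", "cmd": "show bgp summary"},
    {"vendor": "eos", "cmd": "show ip bgp summary"},
    {"vendor": "frr", "cmd": "show ip ospf neighbor"},
    {"vendor": "junos", "cmd": "show interfaces terse"},
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d["cmd"]) for d in docs]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: in how many commands each term appears.
        self.df = Counter(term for toks in self.tokens for term in set(toks))

    def idf(self, term):
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(self, query_terms, i):
        tf = Counter(self.tokens[i])
        norm = self.k1 * (1 - self.b + self.b * len(self.tokens[i]) / self.avg_len)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query_terms
            if t in tf
        )

    def search(self, query, vendor=None, top_k=3):
        terms = tokenize(query)
        hits = [
            (self.score(terms, i), doc)
            for i, doc in enumerate(self.docs)
            if vendor is None or doc["vendor"] == vendor
        ]
        hits = [(s, d) for s, d in hits if s > 0]  # drop non-matching commands
        return sorted(hits, key=lambda h: h[0], reverse=True)[:top_k]


index = BM25Index(CORPUS)
results = index.search("bgp summary", vendor="junos")
```

Filtering by vendor after scoring keeps the IDF statistics global, so a term that is rare across all vendors still ranks highly inside a single vendor's results.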

Architecture

System context & components

Humans and AI agents drive the Flask closed-loop core, which in turn drives the two container-based labs, the Anthropic API, telemetry backends, and source-of-truth / verification systems. Inside the monolith every request resolves a device, picks a vendor driver, and runs commands through a transport.
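The device → driver → transport resolution can be sketched as a template method plus a factory. The names `BaseNetworkDriver`, `get_driver()`, and `DriverResult` come from the feature list; everything else here (the `run`/`translate` methods, the `COMMANDS` tables, and `fake_transport`) is a hypothetical stand-in, not the code in src/drivers/.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class DriverResult:
    """Immutable result; failures are reported in `error`, not raised."""
    ok: bool
    output: str = ""
    error: str = ""


class BaseNetworkDriver(ABC):
    def __init__(self, hostname, transport):
        self.hostname = hostname
        self.transport = transport  # callable(hostname, command) -> str

    def run(self, intent):
        """Template method: translate the intent, execute it, wrap the outcome."""
        try:
            command = self.translate(intent)
            output = self.transport(self.hostname, command)
            return DriverResult(ok=True, output=output)
        except Exception as exc:  # soft-fail: callers always get a DriverResult
            return DriverResult(ok=False, error=f"{type(exc).__name__}: {exc}")

    @abstractmethod
    def translate(self, intent):
        """Map a vendor-neutral intent name to a vendor CLI command."""


class FRRDriver(BaseNetworkDriver):
    COMMANDS = {"bgp_summary": "vtysh -c 'show bgp summary'"}

    def translate(self, intent):
        return self.COMMANDS[intent]


class JunosDriver(BaseNetworkDriver):
    COMMANDS = {"bgp_summary": "show bgp summary"}

    def translate(self, intent):
        return self.COMMANDS[intent]


_DRIVERS = {"frr": FRRDriver, "junos": JunosDriver}


def get_driver(vendor, hostname, transport):
    """Factory: pick the concrete driver class for a vendor string."""
    try:
        driver_cls = _DRIVERS[vendor.lower()]
    except KeyError:
        raise ValueError(f"unsupported vendor: {vendor}") from None
    return driver_cls(hostname, transport)


def fake_transport(hostname, command):
    """Stand-in for docker-exec or SSH; it only echoes what would be run."""
    return f"[{hostname}] {command}"


good = get_driver("FRR", "r1", fake_transport).run("bgp_summary")
bad = get_driver("junos", "vmx1", fake_transport).run("no_such_intent")
```

Here `good.ok` is true and `bad.ok` is false with the exception text in `bad.error`, so callers can branch on the result without wrapping every call in `try`/`except`.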

System Context

How humans and AI agents drive the Flask core, which drives the two labs, the Anthropic API, telemetry, and SoT/verification.

flowchart TB
    operator(["NOC Operator - web UI and Telegram"]):::actor
    agents(["Claude Code - via MCP, 64 tools"]):::actor
    system{{"multivendor-ai-network-lab - Flask :5757 closed-loop ops"}}:::core
    labs["Two Labs - CLOS EVPN-VXLAN and FRR backbone"]:::infra
    anthropic["Anthropic API - claude-haiku-4-5"]:::ai
    tsdb["InfluxDB 2.7 and Grafana 10.4"]:::data
    sot["NetBox and Batfish - SoT and verification"]:::data
    operator -->|"symptoms, changes"| system
    agents -->|"tool calls"| system
    system -->|"docker-exec, SSH"| labs
    system -->|"diagnose, judge"| anthropic
    system -->|"line protocol"| tsdb
    system -->|"drift, what-if"| sot
    labs -.->|"live state"| system
    classDef actor fill:#475569,stroke:#94a3b8,color:#f8fafc
    classDef core fill:#60a5fa,stroke:#93c5fd,color:#03121f
    classDef infra fill:#5eead4,stroke:#99f6e4,color:#03121f
    classDef ai fill:#a78bfa,stroke:#c4b5fd,color:#0b0614
    classDef data fill:#34d399,stroke:#6ee7b7,color:#03121f
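The "diagnose, judge" edge to the Anthropic API is optional: the orchestrator is described as degrading to a deterministic offline mode when no API key is present. A rough sketch of that fallback follows. The `Diagnosis` dataclass stands in for a Pydantic model, and `OFFLINE_RULES`, `diagnose()`, and the `llm_call` hook are hypothetical names, not the API of src/pydantic_ai_orchestrator.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    """Structured output; a dataclass stands in for a Pydantic model here."""
    agent: str        # "routing", "acl", or "incident"
    summary: str
    confidence: float
    mode: str         # "online" or "offline"


# Hypothetical keyword rules used only when no API key is available.
OFFLINE_RULES = [
    (("bgp", "ospf", "route"), "routing", "Check neighbor state and advertised routes."),
    (("acl", "denied", "blocked"), "acl", "Review the ACL entries on the traffic path."),
]


def diagnose(symptom, api_key=None, llm_call=None):
    """Route a symptom to an agent; fall back to rules without an API key."""
    if api_key and llm_call is not None:
        return llm_call(symptom)  # online path: caller supplies the model call
    text = symptom.lower()
    for keywords, agent, summary in OFFLINE_RULES:
        if any(word in text for word in keywords):
            return Diagnosis(agent, summary, confidence=0.5, mode="offline")
    return Diagnosis("incident", "No rule matched; escalate to a human.", 0.2, "offline")


routing = diagnose("BGP session to spine1 keeps flapping")
unknown = diagnose("fan tray alarm on leaf3")
```

Because the offline path is plain keyword matching, the same symptom always yields the same diagnosis, which keeps demos and tests reproducible without network access.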

Container & Component Map

The closed-loop, AI, and audit modules all layer on top of a single driver core resolving device → driver → transport.

flowchart TB
    subgraph FE["Front-ends"]
        mcp["MCP Server - FastMCP, 64 tools"]:::edge
        tg["Telegram Bot - async httpx"]:::edge
        ui["Static Demo UI"]:::edge
    end
    subgraph API["Flask Monolith :5757"]
        app["app.py - device-ops routes"]:::svc
        mvbp["mv_bp Blueprint - api mv routes"]:::svc
    end
    subgraph LOOP["Closed-Loop Modules"]
        predict["Predict and Blast Radius"]:::accent
        gate["Health Gate - confirmed-commit"]:::accent
        remed["Auto-Remediate and Postmortem"]:::accent
    end
    subgraph CORE["Driver Core"]
        driver["Driver Abstraction - factory, BaseDriver"]:::core
        transport["Transport - docker-exec, SSH"]:::core
    end
    orch["AI Orchestrator - Pydantic agents"]:::ai
    gait["GAIT Audit - append-only JSONL"]:::data
    tele["Telemetry Collector to InfluxDB"]:::data
    mcp --> mvbp
    tg --> mvbp
    ui --> app
    app --> mvbp
    mvbp --> predict --> gate
    mvbp --> orch
    gate --> driver
    remed --> gate
    driver --> transport
    orch --> gait
    gate --> gait
    tele --> driver
    classDef edge fill:#475569,stroke:#94a3b8,color:#f8fafc
    classDef svc fill:#60a5fa,stroke:#93c5fd,color:#03121f
    classDef accent fill:#a78bfa,stroke:#c4b5fd,color:#0b0614
    classDef core fill:#5eead4,stroke:#99f6e4,color:#03121f
    classDef data fill:#34d399,stroke:#6ee7b7,color:#03121f
    classDef ai fill:#a78bfa,stroke:#c4b5fd,color:#0b0614
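The GAIT Audit node in the map above is an append-only JSONL ledger guarded by a threading lock. The sketch below shows one way that could look; the `AuditLedger` class and its record fields are assumptions, not the schema used by src/gait_audit.py. It only ever opens the file in append mode and does not attempt tamper-evidence such as hash chaining.

```python
import json
import os
import tempfile
import threading
import time


class AuditLedger:
    """Append-only JSONL ledger; one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()

    def append(self, actor, action, tokens_in=0, tokens_out=0):
        record = {
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
        }
        line = json.dumps(record, sort_keys=True)
        # The lock keeps concurrent writers from interleaving partial lines.
        with self._lock:
            with open(self.path, "a", encoding="utf-8") as fh:
                fh.write(line + "\n")
        return record

    def read_all(self):
        with open(self.path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


# Temporary file so the sketch does not touch a real ledger.
ledger_path = os.path.join(tempfile.mkdtemp(), "gait.jsonl")
ledger = AuditLedger(ledger_path)
ledger.append("routing-agent", "diagnose bgp flap", tokens_in=120, tokens_out=45)
ledger.append("health-gate", "ROLLED_BACK")
records = ledger.read_all()
```

Keeping one record per line means the ledger can be tailed, grepped, or replayed without loading the whole file.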

How it works · Data flow

From raw CLI to a governed change

Fabric nodes (SR Linux, cEOS, FRR, Junos) expose raw CLI state. A collector polls each node over docker-exec, and drivers/parsers.py normalizes the output into a vendor-neutral schema. A parallel health snapshot fans out across the sections and merges them into one JSON document. Metrics flow to InfluxDB 2.7 and Grafana 10.4, and every AI action and gate decision is appended to the GAIT ledger with its token cost.
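The parallel health snapshot can be approximated with `concurrent.futures`. In this sketch the three collector functions return canned data and one fails on purpose; their names, the `COLLECTORS` mapping, and the `sections`/`errors` layout are assumptions, not the output format of src/health.py.

```python
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical collectors; real ones would run commands through a driver.
def collect_bgp(hostname):
    return {"established": 2, "total": 2}


def collect_interfaces(hostname):
    return {"up": 4, "down": 0}


def collect_ospf(hostname):
    raise RuntimeError("ospf not configured")


COLLECTORS = {
    "bgp": collect_bgp,
    "interfaces": collect_interfaces,
    "ospf": collect_ospf,
}


def health_snapshot(hostname, collectors=COLLECTORS):
    """Run every collector in its own thread and merge results into one dict."""
    doc = {"hostname": hostname, "sections": {}, "errors": {}}
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = {name: pool.submit(fn, hostname) for name, fn in collectors.items()}
    # Leaving the `with` block waits for all submitted collectors to finish.
    for name, future in futures.items():
        try:
            doc["sections"][name] = future.result()
        except Exception as exc:  # one failed section must not sink the snapshot
            doc["errors"][name] = str(exc)
    return doc


snapshot = health_snapshot("leaf1")
snapshot_json = json.dumps(snapshot, sort_keys=True)
```

Recording per-section errors instead of raising lets a partially reachable device still return a usable document.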

Closed-Loop Change Pipeline

A governed change is gated by blast radius, applied through the Health Gate under a confirmed-commit watch window, then confirmed or auto-rolled-back — every step appended to the GAIT ledger.

sequenceDiagram
    actor Op as Operator or Agent
    participant API as Flask change closed-loop
    participant Pred as Predict and Blast Radius
    participant Gate as Health Gate
    participant Dev as Device driver
    participant Gait as GAIT Audit
    Op->>API: submit change
    API->>Pred: simulate diff and cascade depth
    Pred-->>API: verdict APPROVE WARN or REJECT
    alt REJECT
        API-->>Op: blocked by blast radius
    else proceed
        API->>Gate: apply with confirmed-commit
        Gate->>Dev: edit PyEZ Junos or simulated FRR
        Gate->>Dev: watch BGP iface alerts
        alt signals clean
            Gate->>Dev: confirm commit
            Gate-->>Op: CONFIRMED
        else signals degrade
            Gate->>Dev: auto-rollback
            Gate-->>Op: ROLLED_BACK
        end
    end
    API->>Gait: append record plus token cost
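The branching in the sequence above can be sketched in a few lines. This covers only the predict, apply, watch, and rollback steps shown in the diagram; the cascade-depth thresholds, the `closed_loop_change()` signature, and the callback names are illustrative assumptions, not values from the lab.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeOutcome:
    status: str                      # BLOCKED, CONFIRMED, or ROLLED_BACK
    trail: list = field(default_factory=list)


def blast_radius_verdict(cascade_depth, warn_at=2, reject_at=4):
    """Hypothetical thresholds: deeper cascades get a harsher verdict."""
    if cascade_depth >= reject_at:
        return "REJECT"
    if cascade_depth >= warn_at:
        return "WARN"
    return "APPROVE"


def closed_loop_change(change, cascade_depth, apply_fn, signals_clean_fn, rollback_fn):
    trail = []
    verdict = blast_radius_verdict(cascade_depth)
    trail.append(("predict", verdict))
    if verdict == "REJECT":
        return ChangeOutcome("BLOCKED", trail)  # nothing touches the device

    apply_fn(change)
    trail.append(("apply", "ok"))

    if signals_clean_fn():
        trail.append(("watch", "clean"))
        return ChangeOutcome("CONFIRMED", trail)

    rollback_fn(change)
    trail.append(("watch", "degraded"))
    trail.append(("rollback", "ok"))
    return ChangeOutcome("ROLLED_BACK", trail)


applied, rolled_back = [], []
ok = closed_loop_change("set mtu 9000", 1, applied.append, lambda: True, rolled_back.append)
regress = closed_loop_change("shutdown peer", 2, applied.append, lambda: False, rolled_back.append)
blocked = closed_loop_change("delete bgp", 5, applied.append, lambda: True, rolled_back.append)
```

The `trail` list plays the role of the audit record: each stage appends its outcome, so the full decision path is available whichever branch is taken.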

Health Gate State Machine

The Health Gate models a change as a confirmed-commit lifecycle — snapshot, apply, watch, then either confirm or roll back. Auto-remediation runs its fixes through the same machine.

stateDiagram-v2
    [*] --> PreSnapshot
    PreSnapshot --> Applied: edit committed
    Applied --> Watching: start watch window
    Watching --> Confirmed: signals clean
    Watching --> RolledBack: signals degrade
    Confirmed --> [*]
    RolledBack --> [*]
    classDef good fill:#34d399,stroke:#6ee7b7,color:#03121f
    classDef bad fill:#a78bfa,stroke:#c4b5fd,color:#0b0614
    classDef work fill:#60a5fa,stroke:#93c5fd,color:#03121f
    class Confirmed good
    class RolledBack bad
    class Watching work
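The same lifecycle can be written as a small table-driven state machine. State names mirror the diagram; the event names, the `HealthGate` class, and the boolean health samples are assumptions made for this sketch, and no device is touched.

```python
from enum import Enum


class GateState(Enum):
    PRE_SNAPSHOT = "PreSnapshot"
    APPLIED = "Applied"
    WATCHING = "Watching"
    CONFIRMED = "Confirmed"
    ROLLED_BACK = "RolledBack"


# (current state, event) -> next state; anything else is illegal.
TRANSITIONS = {
    (GateState.PRE_SNAPSHOT, "edit_committed"): GateState.APPLIED,
    (GateState.APPLIED, "start_watch"): GateState.WATCHING,
    (GateState.WATCHING, "signals_clean"): GateState.CONFIRMED,
    (GateState.WATCHING, "signals_degrade"): GateState.ROLLED_BACK,
}


class HealthGate:
    def __init__(self):
        self.state = GateState.PRE_SNAPSHOT
        self.history = [self.state]

    def fire(self, event):
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            raise ValueError(f"illegal event {event!r} in state {self.state.value}")
        self.state = nxt
        self.history.append(nxt)
        return nxt

    def watch(self, samples):
        """Consume health samples (True = healthy); roll back on the first bad one."""
        for healthy in samples:
            if not healthy:
                return self.fire("signals_degrade")
        return self.fire("signals_clean")


def run_change(samples):
    gate = HealthGate()
    gate.fire("edit_committed")
    gate.fire("start_watch")
    gate.watch(samples)
    return gate


clean = run_change([True, True, True])
degraded = run_change([True, False, True])
```

Because illegal transitions raise, a caller cannot confirm a change that was never watched.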

Tech Stack

The real tooling

A Python / Flask backend, FastMCP agent surface, multi-vendor network operating systems, and an observability + verification plane.

Python 3.11+ · Flask 3.x · Flask Blueprints · gunicorn · FastMCP (mcp[cli]) · Anthropic SDK (claude-haiku-4-5) · Pydantic · junos-eznc (PyEZ) · netmiko · nornir · pynetbox · Batfish (digital twin) · containerlab · docker-compose · Nokia SR Linux · Arista cEOS / EOS · FRRouting · Junos · InfluxDB 2.7 · Grafana 10.4 · gnmic · python-telegram-bot · pytest

Components · Modules

Module responsibilities

Real source modules and directories mapped to what they own.

| Module / Dir | Responsibility | Stack |
| --- | --- | --- |
| src/app.py | Main Flask API and demo UI server on port 5757; legacy device-ops routes and static demo serving. | Flask · Python 3.11+ |
| src/multivendor_extensions.py | mv_bp Flask Blueprint exposing the /api/mv/* endpoints (orchestrator, intent verify, path trace, eval, GAIT, runbooks, CVE, translator, TOON). | Flask Blueprints · type hints |
| src/drivers/ | Vendor-neutral driver abstraction: BaseNetworkDriver template method, get_driver() factory, per-vendor subclasses (FRR/EOS/SRL/Junos/IOSXR), transports, parsers, frozen DriverResult. | Python ABC · dataclasses · ThreadPoolExecutor · Protocol |
| src/mcp_dcn_server.py | FastMCP server exposing lab capabilities as MCP tools so Claude Code / Cursor / opencode can call any capability. | FastMCP · mcp[cli] |
| src/pydantic_ai_orchestrator.py | Multi-agent orchestrator with structured outputs (Routing / ACL / Incident agents); degrades to offline rule-based mode without an API key. | Pydantic · Anthropic SDK (claude-haiku-4-5) |
| src/health_gate.py | RFC 6241 §8.4 confirmed-commit Health Gate: snapshot, apply, watch window, then confirm or auto-rollback on regression. | Python · junos-eznc (PyEZ) |
| src/health.py | Single-device health snapshot: parallel fan-out collecting BGP/OSPF/interfaces/routes/mem/CPU into one JSON doc via GET /api/health/<hostname>. | Python · concurrent.futures |
| src/netbox_sot.py | NetBox source-of-truth drift detector: severity-tiered (critical/high/medium/low) presence and field drift between SoT and observed state. | Python · pynetbox |
| src/auto_remediate.py · src/remediation.py | Auto-remediation proposal state machine: drift → AI proposes runbook → human approves → executes through the Health Gate. | Python · YAML runbooks |
| src/postmortem.py | Auto-postmortem writer: stitches GAIT + Health Gate + Remediation events into a structured markdown incident report with P1 detection. | Python · markdown |
| src/gait_audit.py | GAIT immutable append-only JSONL audit trail recording every AI action with token-in/token-out cost. | Python · append-only JSONL · threading lock |
| src/cli_rag.py | BM25 retrieval over a 9,802-command multivendor CLI corpus with citation deep-links; stdlib-only, no embeddings. | Python stdlib · BM25 |
| src/predict_engine.py · blast_radius.py · forecast_engine.py | Change prediction, blast-radius cascade-depth simulation, and predictive forecasting that feed the closed-loop change pipeline. | Python · optional Batfish digital twin |
| network-lab/ | Docker lab and sanitized device configs: docker-compose FRR container mesh (ports 2201-2210) plus 16 sanitized junos+eos configs and inventory.json. | docker-compose · FRRouting |
| demo/ | Static HTML/JS demo UI (index.html, phase3.js) and animated in-app architecture page. | HTML · JavaScript |
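As an illustration of the severity-tiered drift check owned by src/netbox_sot.py, the sketch below compares two plain dictionaries. The field-to-severity mapping, the finding layout, and the sample devices are assumptions; the real module reads from NetBox via pynetbox, which is not used here.

```python
# Hypothetical mapping from drifting field to severity tier.
SEVERITY_BY_FIELD = {
    "ip": "critical",
    "asn": "critical",
    "vendor": "high",
    "role": "high",
    "os": "medium",
    "model": "medium",
    "site": "low",
}
SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def detect_drift(sot, observed):
    """Return presence and field drift, most severe findings first."""
    findings = []
    for host, expected in sot.items():
        actual = observed.get(host)
        if actual is None:  # in the source of truth but not seen in the lab
            findings.append({"host": host, "field": "presence", "severity": "critical",
                             "expected": "present", "observed": "missing"})
            continue
        for name, want in expected.items():
            got = actual.get(name)
            if got != want:
                findings.append({"host": host, "field": name,
                                 "severity": SEVERITY_BY_FIELD.get(name, "low"),
                                 "expected": want, "observed": got})
    for host in sorted(observed.keys() - sot.keys()):  # running but undocumented
        findings.append({"host": host, "field": "presence", "severity": "high",
                         "expected": "absent", "observed": "present"})
    findings.sort(key=lambda f: (SEVERITY_ORDER.index(f["severity"]), f["host"], f["field"]))
    return findings


sot = {
    "leaf1": {"ip": "10.0.0.1", "asn": 65001, "os": "eos-4.30"},
    "spine1": {"ip": "10.0.0.254", "asn": 65000, "os": "junos-23.2"},
}
observed = {
    "leaf1": {"ip": "10.0.0.1", "asn": 65002, "os": "eos-4.31"},
    "rogue1": {"ip": "10.0.0.99", "asn": 65099, "os": "frr-9.1"},
}
findings = detect_drift(sot, observed)
```

Sorting by tier first means a dashboard or alert can simply take the head of the list.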

Quickstart

Run it locally

Bring up the FRR backbone lab, then start the Flask app and open the demo UI.

# clone the lab
$ git clone https://github.com/gesh75/multivendor-ai-network-lab.git
$ cd multivendor-ai-network-lab

# start the docker-compose FRR backbone
$ cd network-lab && docker-compose up -d

# create a venv and install deps
$ cd ../src && python3 -m venv venv && source venv/bin/activate
$ pip install -r requirements.txt

# launch the Flask app (port 5757) and open the demo UI
$ python3 app.py
$ open http://localhost:5757/demo/index.html

This is a portfolio / demo lab, not a production system. The README documents a production-migration roadmap — move state out of memory into Redis/RQ, adopt TextFSM parsers, split the Flask blueprint, add a CI gate, and wire in a secrets manager.