Becoming an AI/ML Platform Engineer: A Capability Map W...

I read a lot of job specs, partly to understand where the field is heading. A recent one stuck with me: a "Lead AI/ML Software Engineer" role building an agentic mission platform. The requirements were not a tidy list of frameworks. They were a stack of whole disciplines layered on top of each other: design and scale distributed systems, run them on Kubernetes, integrate and serve models, wire up agentic LLM workflows, move data through Kafka and Flink, and do all of it under real security and compliance constraints.

That is not one skill. It is six, and the senior version of the role is the ability to hold all six in your head while making a pragmatic tradeoff under ambiguity. Here is the honest map of what that capability actually is, the sources I would use to build each piece (the real ones, with links), and the kind of project that proves you can do it rather than just talk about it. I will point at two systems I have actually built where they fit, because "I shipped this" beats "I studied this" every time.

1. Agentic AI and LLM orchestration

This is the newest layer and the one most people fake. The skill is not "call an LLM." It is knowing when a problem needs an agent at all versus a fixed workflow, then building tool-use loops that an LLM can drive without going off the rails: clean tool schemas, the tool_use / tool_result cycle, retries and timeouts, and context management.

The single best starting point is Anthropic's Building Effective AI Agents, which lays out the distinction between workflows and agents and the five patterns (prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer). Pair it with the tool-use docs and the runnable Anthropic Cookbook agent patterns. When you need durable, stateful multi-agent graphs, LangGraph is the open-source reference.

I learned this one by building it. Apprised.news runs a multi-agent brief pipeline: a set of desk "personas" each read a live news corpus and produce structured briefs, with an orchestrator-and-workers shape, an evaluator pass, and a deterministic, zero-LLM indicator rail bolted alongside so the numbers never hallucinate. The lesson that does not show up in any tutorial: the hard part is not the prompt, it is the orchestration, retries, and knowing which work should not be an LLM call at all.

Prove it: build a multi-agent pipeline with a planner, parallel workers that call real tools, and an evaluator that fact-checks and merges, with a real tool registry and retry/timeout handling. Not one mega-prompt.

2. Distributed systems fundamentals

Everything above sits on distributed-systems basics: service decomposition, inter-service communication, replication and partitioning, consistency models, fault tolerance, and the SLO/error-budget discipline that keeps a system operable.

There is a canonical reading list here and it is short. Martin Kleppmann's Designing Data-Intensive Applications is the book; the first edition (2017) is the one to read today, with a second edition in the works. Then the freely published Google Site Reliability Engineering book and its hands-on companion, the SRE Workbook, for SLOs, error budgets, and operating real services.

Prove it: take any multi-service app you run, define explicit SLOs and an error budget, add structured logging plus tracing plus metrics, then induce a failure (kill a dependency) and write the postmortem. That single exercise demonstrates "fault tolerance and observability" and "debugging distributed systems" better than any cert.

3. Kubernetes-native platform engineering

The platform layer is what turns a pile of services into something a team can build on: Kubernetes itself, then the internal developer platform on top, the self-service infrastructure and "golden paths" that let other engineers ship safely without re-learning the whole stack.

Start with the Kubernetes concepts and architecture docs, then move up a level to the CNCF Platforms White Paper and the Platform Engineering Maturity Model, which define internal developer platforms and golden paths as a discipline. Backstage is the reference implementation of a software catalog plus templates.

Prove it: stand up a small cluster (k3s or kind) and create exactly one real golden path: a self-service template that provisions a new service with logging, health checks, and a deploy pipeline baked in, so "new service" is one command and is compliant by default.

4. MLOps and inference optimization

This is the bridge from "a model exists" to "a model serves traffic reliably and cheaply." Two halves: the lifecycle (CI/CD plus continuous training and drift monitoring) and the serving layer (autoscaling inference, batching, quantization, GPU utilization).

Google's MLOps maturity guide is the canonical framing (levels 0/1/2). Chip Huyen's Designing Machine Learning Systems covers production ML design and monitoring. For serving, learn KServe (Kubernetes-native model serving), vLLM (high-throughput LLM inference with PagedAttention and continuous batching), and NVIDIA Triton.

Prove it: serve a model on Kubernetes with KServe (or vLLM for an LLM), add an autoscaler, and publish a small benchmark: requests/sec and p50/p95/p99 latency before and after one optimization (batching or quantization). That is exactly the "inference optimization in latency-sensitive environments" the senior roles ask for.

5. Event-driven and streaming architectures

Mission and data-intensive platforms are event-driven. The skill is log-based streaming and its delivery and ordering semantics, plus stateful stream processing: windowing, watermarks, and exactly-once.

Learn Apache Kafka (including the Kafka Streams guide) and Apache Flink from their official docs, then read Kafka: The Definitive Guide (cite the 2nd edition, 2021) and Akidau, Chernyak, and Lax's Streaming Systems, which is the conceptual bible for watermarks and exactly-once.

I get a softer version of this lesson from running Corvus, the 800-plus-source intelligence aggregator that feeds Apprised. It is not Flink, but it is genuinely event-driven and distributed: background sync workers pull and fold hundreds of feeds on a schedule, a translation pass enriches non-English items, and downstream consumers read an exported corpus. The recurring lessons (handle late and duplicate data, make every stage idempotent, design for replay) are the same ones Kafka and Flink formalize.

Prove it: build an ingest pipeline where producers push events to Kafka, a Flink or Kafka Streams job does windowed aggregation with exactly-once, and results land in a sink. Document how you handle late data and replay.

6. Secure and compliant mission systems

The constraint that separates "I can build it" from "I can build it for a regulated environment." For government and defense work this means the DoD cloud impact levels (IL2 public/low, IL4 controlled unclassified information, IL5 higher-sensitivity CUI and unclassified national-security systems, IL6 classified up to SECRET), the NIST Risk Management Framework, and baseline security knowledge.

The authoritative sources are all official: the DoD Cyber Exchange cloud security pages (home of the Cloud Computing SRG), NIST SP 800-37 Rev. 2 (the RMF: Prepare, Categorize, Select, Implement, Assess, Authorize, Monitor), the NIST SP 800-53 Rev. 5 control catalog, and CompTIA Security+ (SY0-701) for the baseline cert these roles often require.

Prove it: you cannot host real controlled data, but you can show fluency. Write a short RMF-style system security plan stub for one of your own projects: categorize the system, select a handful of 800-53 controls, and note how your design implements each (encryption in transit and at rest, audit logging, least-privilege IAM). Pair it with passing Security+.

7. Tie it together: a reference architecture you actually built

The artifact that lands a lead-level role is not a certificate. It is a documented system of your own that fuses the layers above: an agentic or distributed AI platform with orchestration, model serving, an event stream, and real observability, written up with a diagram and honest tradeoffs. For patterns, read AWS's Agentic AI on AWS prescriptive guidance and the Google Cloud Architecture Center.

This is also where the rest of jwatte.com connects. The same cheap, fast infrastructure I use to ship these projects is the Cloudflare developer platform, and the auth layer that keeps a multi-tenant AI app secure is Clerk. The fastest way to start is a working AI dev environment, which is exactly what the AI terminal kickstart sets up in one script.

The honest part

You do not build all six at once, and nobody is equally strong in all of them. The senior signal is range plus depth in two or three, plus the judgment to simplify. Every domain above has a "prove it" project precisely because the field rewards shipped systems over consumed courses. Pick the layer you are weakest in, build the smallest real thing that exercises it, and write up what broke. Do that six times and you are not studying the role, you are doing it.

The meta-skill underneath all of it is shipping: building the smallest defensible version of something real, learning from production, and compounding. That is the same muscle behind every project on this site, and the argument of The $97 Launch — start small, ship, and let the work teach you.

Fact-check notes and sources

Agentic AI: Anthropic, Building Effective AI Agents; tool-use docs; Anthropic Cookbook agent patterns; LangGraph.
Distributed systems: Kleppmann, Designing Data-Intensive Applications (1st ed. 2017); Google SRE Book and SRE Workbook.
Kubernetes / platform engineering: Kubernetes docs; CNCF Platforms White Paper and Platform Engineering Maturity Model; Backstage.
MLOps / serving: Google MLOps guide; Huyen, Designing Machine Learning Systems; KServe, vLLM, NVIDIA Triton.
Streaming: Apache Kafka docs, Kafka Streams, Apache Flink docs; Kafka: The Definitive Guide 2nd ed. (2021); Streaming Systems.
Secure mission systems: DoD Cyber Exchange — Cloud Computing Security (CC SRG / impact levels); NIST SP 800-37 Rev. 2 (RMF); NIST SP 800-53 Rev. 5; CompTIA Security+ SY0-701.
Reference architectures: AWS Agentic AI prescriptive guidance; Google Cloud Architecture Center.

This post is informational and reflects my own experience and reading; it is not career or security-compliance advice, and I have no affiliation with the organizations or products linked. Book editions and product details are current as of mid-2026 and change over time.

How to Build the Skills Behind a Lead AI/ML Platform Engineer