NIST AI RMF in Practice: The Gap Between the Framework and the Evidence
The NIST AI Risk Management Framework has been the industry's reference point for AI governance for long enough now that most organizations I speak to can cite it. They can name the four functions — Govern, Map, Measure, Manage. They can walk you through the categories. Some have printed the framework and distributed it.
What they have not done, in most cases, is produce the evidence that they are actually operating against it.
What the framework is, and is not
The AI RMF is a framework, not a control set. It describes the capabilities a mature AI governance function should have. It does not prescribe the specific artifacts, log formats, or review procedures that constitute proof of those capabilities. That is deliberate — the framework is meant to be adaptable across sectors — but it is also the source of the implementation gap.
"Govern-1.2: The characteristics of trustworthy AI are integrated into organizational policies, processes, procedures, and practices" is a reasonable ask. It is also, for most organizations, not testable. A policy document exists. A slide deck was presented to the board. Was the characteristic actually integrated into a procedure? Produce the procedure. Produce the evidence of its execution against an actual AI system your organization deployed last quarter.
That is the wall most programs hit.
Where compliance breaks down
Three failure modes recur across the engagements we have seen:
Policy without workflow. The organization publishes a responsible AI policy. The policy is correct. Nothing in the engineering workflow that produces AI systems changes. The compliance function cannot produce evidence that any specific AI deployment was reviewed against the policy, because there is no mechanism that required the review to happen.
Review without record. AI deployments are reviewed — often carefully — by a committee that meets monthly. The review is real. The record of the review is an email thread, a Slack channel, and a few of the committee members' memories. An auditor asking to see the review of a specific system deployed eight months ago gets a reconstruction, not a record.
Measurement without metrics. The framework asks for measurement of AI system characteristics over time. The organization has a monitoring stack that captures operational metrics. It does not capture the AI-specific characteristics — drift, demographic performance, prompt/response pairs retained for review — because the monitoring stack was designed before the AI deployment and was not updated when the AI deployment went live.
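To make that third failure mode concrete, here is a minimal sketch in Python of the AI-specific record an updated monitoring stack could emit alongside its operational metrics. Every name here is hypothetical, not a real library's API; the point is that drift, cohort performance, and the prompt/response pair are captured per interaction, timestamped, and keyed to the system.

```python
# Minimal sketch (all names hypothetical): the AI-specific record an
# updated monitoring stack would emit alongside its operational metrics.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AIMonitoringRecord:
    system_id: str              # keyed to the specific AI deployment
    captured_at: str            # ISO-8601 timestamp, UTC
    drift_score: float          # e.g., a population-stability index on inputs
    cohort_performance: dict    # metric per demographic cohort
    prompt: str                 # retained verbatim for later review
    response: str

def capture(system_id: str, prompt: str, response: str,
            drift_score: float, cohort_performance: dict) -> AIMonitoringRecord:
    """Emit one structured record per interaction, as a byproduct of serving it."""
    return AIMonitoringRecord(
        system_id=system_id,
        captured_at=datetime.now(timezone.utc).isoformat(),
        drift_score=drift_score,
        cohort_performance=cohort_performance,
        prompt=prompt,
        response=response,
    )
```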
What actually produces evidence
Evidence that survives an audit tends to share a shape. It is produced as a byproduct of the workflow that does the work, not assembled after the fact. It is structured — queryable, exportable, timestamped — rather than narrative. It is traceable to a specific decision, a specific system, a specific reviewer. And it lives in a system of record that the compliance function does not have to beg engineering to access.
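As a sketch of that shape — assuming a SQLite system of record and hypothetical table and column names — the function that records a review decision is the same function that produces the evidence, so the record is structured, timestamped, and traceable by construction:

```python
# Sketch under stated assumptions: evidence written at the moment the
# decision is made, into a store the compliance function can query directly.
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("evidence.db")
db.execute("""CREATE TABLE IF NOT EXISTS review_evidence (
    system_id TEXT, reviewer TEXT, decision TEXT,
    rationale TEXT, decided_at TEXT)""")

def record_decision(system_id: str, reviewer: str,
                    decision: str, rationale: str) -> None:
    """Recording the decision and producing the evidence are one step."""
    db.execute(
        "INSERT INTO review_evidence VALUES (?, ?, ?, ?, ?)",
        (system_id, reviewer, decision, rationale,
         datetime.now(timezone.utc).isoformat()),
    )
    db.commit()

# Hypothetical usage: the committee's decision lands as a queryable row,
# not an email thread.
record_decision("fraud-model-v3", "j.alvarez", "approved",
                "Bias review passed; monitoring plan attached.")
```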
This is what we have built toward with the Novaprospect FedRAMP Management Engine on the compliance side and with our agent orchestration framework on the engineering side: a workflow where the AI RMF controls map onto specific artifacts produced during normal operation. The Map function is the Jira issue and the prompt file. The Measure function is the session log and the quality gate output. The Manage function is the pull request, the reviewer decision, and the merge record. The Govern function is the policy that required all of it, enforced by the tooling rather than by reminder emails.
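A rough sketch of that mapping expressed as tooling rather than policy prose — the artifact names mirror the examples above, and the gate itself is illustrative, not our actual product's API:

```python
# Illustrative gate (hypothetical, not a real tool's interface): each AI RMF
# function must have its artifacts on file before a deployment can close.
REQUIRED_ARTIFACTS = {
    "Govern":  ["policy_reference"],
    "Map":     ["jira_issue", "prompt_file"],
    "Measure": ["session_log", "quality_gate_output"],
    "Manage":  ["pull_request", "reviewer_decision", "merge_record"],
}

def missing_evidence(on_file: dict[str, list[str]]) -> dict[str, list[str]]:
    """Per function, the required artifact types not yet present."""
    gaps = {}
    for function, required in REQUIRED_ARTIFACTS.items():
        missing = [a for a in required if a not in on_file.get(function, [])]
        if missing:
            gaps[function] = missing
    return gaps

# The tooling blocks completion while missing_evidence(...) is non-empty —
# the enforcement that reminder emails never provided.
```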
It is not elegant. It is not a single dashboard. It is a set of deliberate choices about where evidence is produced and how it is preserved. But it is the difference between "we are aligned with the AI RMF" as an assertion and "we are aligned with the AI RMF" as a demonstrable fact.
The question to ask
If your organization has adopted the AI RMF, ask your team to produce, end-to-end, the evidence for a single AI system deployed in the last quarter. Pick any system. Ask for the authorization, the risk assessment, the measurement record, the review decisions, and the post-deployment monitoring output, tied together by identifiers that a third party could follow.
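In a system of record built the way described above, that pull is a query, not a project. A sketch, continuing the hypothetical SQLite store from earlier, with table names following the list in the previous paragraph:

```python
# Sketch: one identifier, followed through every stage. Table names are
# hypothetical and assume the evidence store sketched earlier.
import sqlite3

EVIDENCE_TABLES = ["authorization", "risk_assessment", "measurement",
                   "review_evidence", "monitoring_output"]

def pull_evidence(db: sqlite3.Connection, system_id: str) -> dict[str, list]:
    """Everything a third party needs, retrieved by system_id, not by memory."""
    return {
        table: db.execute(
            f"SELECT * FROM {table} WHERE system_id = ?", (system_id,)
        ).fetchall()
        for table in EVIDENCE_TABLES
    }
```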
If that exercise takes more than an hour, the framework is a document. If it produces the evidence cleanly, the framework is a practice.
Most organizations are still on the document side of that line. Closing the gap is not glamorous work, but it is the work that separates compliance from performance of compliance.