Code Mower Cloud

Quality and velocity for AI-assisted development

Sample Code Mower dashboard

A concrete preview of the kind of private team signal CodeMower.com is meant to show after opt-in metadata uploads.

Illustrative sample data

This page uses example numbers to show the product shape before sign-in. It is not a live cross-team cohort benchmark. Real dashboards use your team's own metadata first; cohort comparison becomes useful only as enough teams opt in.

The minimal OSS example for this same loop lives in examples/demo-calibration: one known-clean control, one known-blocked control, and a sample reviewer value report. This page shows the richer dashboard shape once a team has uploaded more history.

Dashboard areas in the signed-in product

CodeMower.com is organized around operating decisions, not raw upload logs. Each signed-in page answers a specific question and links back to the source metadata that made the recommendation possible.

Reviewer Value

Which AI reviewer/lens should I trust?

Ranks provider/lens pairs by useful signal, false positives, cost, latency, and current recommendation.

Repository Health

Where is my evidence strong or thin?

Shows coverage by repo, separates dogfood/history/calibration, and flags repos that still need reviewer events.

Reports & Evidence

What artifacts are backing the dashboard?

Surfaces uploaded report kinds, recent evidence rows, and provenance so teams can audit what the numbers mean.

Productivity & ROI

What is the modeled value of the loop?

Models parallel reviewer capacity, cost per useful signal, and net value without pretending one human has more hours.

What uploading metadata unlocks

Reviewer value

Which provider/lens pairs catch useful issues without noisy false positives on your own repos.

Cost and latency

How much each lane costs and how long it takes before you promote it into a stronger workflow.

Data provenance

Dogfood, imported history, and calibration evidence stay visibly separated so the dashboard does not overclaim.

Enable-next guidance

Practical recommendations for informational, selective, or merge-gating lanes as evidence accumulates.

Evidence classes stay separate

CodeMower.com keeps operational dogfood, imported history, and reviewer evidence visibly distinct so the dashboard can be useful without overstating what the data proves.

Current dogfood

steps

events

Use for ingestion health, recency, repo coverage, and whether dogfood workflows are running.

current repo metadata and provider inventory

Imported history

steps

events

Coverage and timeline context before dogfood was enabled.

sanitized GitHub Actions history

Reviewer evidence

steps

events

Use for reviewer and lens value once verdicts are tied to known-clean and known-blocked cases.

metadata-only reviewer verdict artifacts

Issue planning lineage

GitHub issue to delivery

codemower-ai/code-mower#269

Code Mower can carry metadata from a GitHub issue into a posted plan, work order, builder/reviewer run, PR, merge, and upload. The dashboard shows what is captured and what is still missing before claiming end-to-end delivery evidence.

4/7 steps captured

Issue

codemower-ai/code-mower#269

Posted plan

issue-269-plan.md

Work order

issue-269-work-order.md

PR

PR not linked yet

Reviewer checks

Reviewer checks not linked yet

Merge

Merge not linked yet

Upload

Metadata uploaded

Next missing evidence: PR.
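The "4/7 steps captured" summary and the "next missing evidence" hint above can be reproduced with a small sketch. The step order is taken from this sample; the dictionary layout and the `summarize_lineage` helper are hypothetical illustrations, not Code Mower's actual schema or API.

```python
from typing import Optional

# Step order as shown in the sample lineage; the data layout is an assumption.
LINEAGE_STEPS = [
    "Issue", "Posted plan", "Work order", "PR",
    "Reviewer checks", "Merge", "Upload",
]

def summarize_lineage(captured: dict) -> tuple:
    """Return a progress string and the first step that has no evidence yet."""
    done = [step for step in LINEAGE_STEPS if captured.get(step)]
    missing = [step for step in LINEAGE_STEPS if not captured.get(step)]
    progress = f"{len(done)}/{len(LINEAGE_STEPS)} steps captured"
    next_missing: Optional[str] = missing[0] if missing else None
    return progress, next_missing

# Values copied from the sample above; None marks a step that is not linked yet.
sample = {
    "Issue": "codemower-ai/code-mower#269",
    "Posted plan": "issue-269-plan.md",
    "Work order": "issue-269-work-order.md",
    "PR": None,
    "Reviewer checks": None,
    "Merge": None,
    "Upload": "Metadata uploaded",
}

progress, next_missing = summarize_lineage(sample)
print(progress)                                   # 4/7 steps captured
print(f"Next missing evidence: {next_missing}.")  # Next missing evidence: PR.
```

Reporting the first gap in step order, rather than every gap, matches how the sample names only the PR as the next missing evidence.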

Sample dashboard preview

This preview uses the same dashboard components as the signed-in team experience, populated with representative sample data inspired by the kind of reviewer effectiveness and productivity signal teams care about.

Cloud activity

Benchmark signal at a glance

Recent uploads, reviewer events, and source provenance for your team. The goal is to make dogfooding health visible before anyone has to read the raw tables.

Last 30 days

Uploads

Metadata bundles received for your teams.

Events

128

Structured reviewer and workflow events.

Repos

Distinct repositories seen in recent metadata.

Spend

$42.18

Recent reported reviewer spend.

Latency

84s

Average recent provider/runtime latency.

Activity trend

Upload and event volume over the last 14 days.

Uploads · Events

Latest day: 06-15

Latest uploads

Latest events

Source mix

Dogfood, calibration, and imported history stay separated so the signal is explainable.

146 events

code-mower-local: 54 · dogfood

lens-calibration-corpus: 40 · calibration

historical-backfill: 18 · history

github-actions: 16 · dogfood

Other sources: 18 · other

Legend: Dogfood, Calibration, Historical, Anonymous uploads
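As a rough illustration of keeping evidence classes separate, the sample source mix above can be totaled per class. The row layout and the `events_by_class` helper are assumptions made for this sketch; only the source names, counts, and class labels come from the sample.

```python
from collections import Counter

# (source, events, evidence class) rows copied from the sample source mix.
SOURCE_MIX = [
    ("code-mower-local", 54, "dogfood"),
    ("lens-calibration-corpus", 40, "calibration"),
    ("historical-backfill", 18, "history"),
    ("github-actions", 16, "dogfood"),
    ("Other sources", 18, "other"),
]

def events_by_class(rows):
    """Sum event counts per evidence class without mixing classes together."""
    totals = Counter()
    for _source, events, evidence_class in rows:
        totals[evidence_class] += events
    return dict(totals)

totals = events_by_class(SOURCE_MIX)
total_events = sum(totals.values())

print(totals)        # {'dogfood': 70, 'calibration': 40, 'history': 18, 'other': 18}
print(total_events)  # 146
```

The per-source rows add up to the 146 events shown in the sample, with two sources contributing to the dogfood class.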

Reviewer signal

Which lanes are earning trust?

These cards summarize useful signal, noisy lanes, spend, latency, and the next lane recommendation from your own uploaded metadata.

Best reviewer so far

Not enough data yet

Upload calibration events to identify which reviewer/lens has useful signal on your codebase.

Noisy lanes

No noisy lane yet

No recent lane has more false positives than useful signal.

Cost / latency

$42.18 / 84s

Recent reported provider spend and average runtime latency across uploaded events.

What to enable next

Promote codex-audit on risky PRs

Base Codex reviews have the best useful-rate in this sample with low false positives and acceptable latency.

Evidence maturity

How strong is this data?

Code Mower should not overclaim. This ladder shows which decisions are supported now, which data is only context, and what is still needed before treating provider/lens comparisons as benchmark-grade.

Step 1

Current operations

Ready

You can tell whether current metadata upload plumbing is alive.

41 uploads and 54 dogfood events in scope.

Next: Keep dogfood running from active repositories.

Step 2

Historical context

Ready

You can use imported history for activity and workflow context.

18 imported history events in scope.

Next: Keep history visually separated from reviewer accuracy evidence.

Step 3

Reviewer value

Ready

You can start comparing provider/lens signal in this scope.

66 reviewer runs, 74 calibration events, and 41 useful signals.

Next: Keep collecting known-clean and known-blocked cases before promoting lanes.

Step 4

Benchmark trust

Ready

Provider/model/version provenance is strong enough for early benchmark claims.

92% tool, 81% model, and 72% version coverage.

Next: Use this scope for early provider/lens comparisons, with sample-size caveats.

Productivity model

Parallel AI review capacity

Modeled capacity can exceed 24 hours per day because it is aggregate portfolio capacity across agents, reviewer lanes, and repos. It is not a claim that one human can work more than one wall-clock day.

Reviewer runs

Provider/lens runs represented in recent metadata.

Useful signals

Useful issues or decisions captured by reviewer events.

Capacity modeled

2d 5h

Conservative aggregate review and follow-up attention estimate.

Net value

$7,870

Modeled value minus reported provider spend in the recent window.

Assumptions

Reviewer run: 20 minutes of human review attention

Useful signal: 45 minutes of triage/fix follow-up

Loaded hourly rate: $150/hour

Reported provider spend: $42.18

Calculation

Review capacity: 66 runs x 20 min + 41 useful x 45 min = 2d 5h

Aggregate pace: 1.8h/day across parallel lanes

Gross modeled value: $7,913

Annualized run-rate: $94,444

This is a model, not accounting. v1 should let teams tune assumptions and should distinguish reviewer value, builder value, avoided defects, and cycle-time acceleration.
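The sample figures above can be checked with a short sketch of the model. The per-run and per-signal minutes, hourly rate, and spend come from the listed assumptions. The 30-day window and the "net value times twelve" annualization are inferences that happen to reproduce the sample numbers, not documented formulas, and the `Assumptions` and `model_value` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Assumptions:
    minutes_per_run: float = 20            # human review attention per reviewer run
    minutes_per_useful_signal: float = 45  # triage/fix follow-up per useful signal
    hourly_rate: float = 150.0             # loaded hourly rate in USD
    provider_spend: float = 42.18          # reported provider spend in USD
    window_days: int = 30                  # assumed reporting window (inferred)

def model_value(runs: int, useful: int, a: Assumptions) -> dict:
    """Compute modeled capacity and value from run and useful-signal counts."""
    hours = (runs * a.minutes_per_run + useful * a.minutes_per_useful_signal) / 60
    gross = hours * a.hourly_rate
    net = gross - a.provider_spend
    return {
        "capacity_hours": hours,
        "pace_hours_per_day": hours / a.window_days,
        "gross_value": gross,
        "net_value": net,
        "annualized": net * 12,  # assumption: twelve windows per year
    }

def format_capacity(hours: float) -> str:
    """Render hours as days and whole hours, e.g. 52.75 -> '2d 5h'."""
    total = round(hours)
    return f"{total // 24}d {total % 24}h"

result = model_value(runs=66, useful=41, a=Assumptions())
print(format_capacity(result["capacity_hours"]))  # 2d 5h
print(round(result["pace_hours_per_day"], 1))     # 1.8
print(round(result["gross_value"]))               # 7912 (the page rounds 7,912.50 up to $7,913)
print(round(result["net_value"]))                 # 7870
print(round(result["annualized"]))                # 94444
```

Keeping the assumptions in one structure is one way to let teams tune them, which the note above calls out as a v1 goal.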

Calibration cases

Known-clean and known-blocked PRs used to judge reviewer behavior.

Reviewer runs

128

Structured reviewer and lens runs across the calibration corpus.

30-day spend

$42

Reported API or subscription spend from local metadata.

Avg latency

84s

Average reviewer runtime for recent structured audits.

Best reviewer so far

codex-audit

Strong useful-rate with a low false-positive rate on the sample corpus.

Noisy lane

gemini-cli / operability

High false-positive rate means this lane should stay advisory until calibrated further.

What to enable next

Selective triggers

Run Codex on every candidate PR; trigger the Claude quality lens on backend/auth/data changes.

What this dashboard should help you decide

Who should review every risky PR?

Use codex-audit as the baseline lane.

It caught the known-blocked control and stayed quiet on the known-clean control.

Who should be selective?

Use Claude for matching backend/auth/data classes.

It added useful independent signal, but with higher cost and latency in this sample.

Who should stay informational?

Keep the experimental lens out of branch protection.

It was cheap, but it blocked a clean control and missed a blocked control.

Reviewer value table

Lane	Lens	Useful rate	False positives	Cost / run	Latency	Recommendation
codex-audit	base	82%	9%	$0.42	71s	Merge-gating eligible after one more clean cycle
claude-audit	context-driven-quality	74%	14%	$0.61	96s	Selective trigger for higher-risk PRs
gemini-cli	operability	38%	31%	$0.08	52s	Informational until more calibration data exists
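One way to read the table is through cost per useful signal, which the Productivity & ROI area says it models. The sketch below assumes "useful rate" means the share of runs that produce a useful signal, so cost per useful signal is cost per run divided by useful rate. That interpretation and the helper name are assumptions, not the product's documented formula.

```python
# (lane, lens, useful_rate, false_positive_rate, cost_per_run_usd, latency_s)
# Values copied from the sample reviewer value table.
LANES = [
    ("codex-audit", "base", 0.82, 0.09, 0.42, 71),
    ("claude-audit", "context-driven-quality", 0.74, 0.14, 0.61, 96),
    ("gemini-cli", "operability", 0.38, 0.31, 0.08, 52),
]

def cost_per_useful_signal(cost_per_run: float, useful_rate: float) -> float:
    """Expected spend to obtain one useful signal, under the stated assumption."""
    if useful_rate <= 0:
        raise ValueError("useful_rate must be positive")
    return cost_per_run / useful_rate

costs = {
    lane: round(cost_per_useful_signal(cost, useful), 2)
    for lane, _lens, useful, _fp, cost, _latency in LANES
}
print(costs)  # {'codex-audit': 0.51, 'claude-audit': 0.82, 'gemini-cli': 0.21}
```

Under this reading gemini-cli is the cheapest per useful signal, yet its 31% false-positive rate is why the table still keeps it informational: cost alone does not capture the triage burden of noisy findings.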

Why contribute metadata?

The local OSS tool already gives you private reports. Opt-in cloud sharing adds team history, dashboard rows, token management, export/delete controls, and a place to compare reviewer value over time without keeping every terminal artifact on someone's laptop. Cross-team comparisons come later, after enough teams opt in for the data to be honest.

Ready to try the local path? Follow the setup guide. Want the OSS source first? Open GitHub.