Skip to content

Scoring & Grades

Composite Quality Score

The quality score is a weighted average across 9 categories, on a 100-point scale. Computed by AuditResult.quality_score — returns None when no scored checks are present, and normalizes by the sum of present weights so filtered audits (e.g. category="lint") are not penalized for missing categories.

Category Tool Weight
Linting Ruff 15%
Type Safety mypy 15%
Complexity radon 15%
Testing pytest-cov 10%
Test Quality AST analysis 10%
Security Bandit 10%
Dependencies pip-audit + deptry 10%
Architecture AST analysis 10%
Practices AST analysis 5%
pie title Category Weights
    "Linting" : 15
    "Type Safety" : 15
    "Complexity" : 15
    "Testing" : 10
    "Test Quality" : 10
    "Security" : 10
    "Dependencies" : 10
    "Architecture" : 10
    "Practices" : 5

Each category produces a score from 0 to 100. The composite score is:

Text Only
score = lint × 0.15 + type × 0.15 + complexity × 0.15
      + testing × 0.10 + test_quality × 0.10
      + security × 0.10 + deps × 0.10
      + architecture × 0.10 + practices × 0.05

Why no Structure or Tooling categories?

Structure validation (project layout, pyproject.toml completeness) is handled by axm-init with dedicated checks. Tooling availability checks (ruff, mypy, uv on PATH) emit informational findings only. Both categories produce findings but are intentionally excluded from the composite score — axm-audit focuses on code quality.

Category Scoring

Lint Score

Text Only
score = max(0, 100 − issue_count × 2)

Per-category pass threshold: ≥ 90 (≤ 5 issues). The same threshold applies to the composite score — see Grading Scale.

Format Score

Text Only
score = max(0, 100 − unformatted_count × 5)

Per-category pass threshold: ≥ 90 (≤ 2 unformatted files).

Diff Size Score

Text Only
score = 100                    if lines ≤ ideal
score = 0                      if lines ≥ max
score = 100 − (lines − ideal) × 100 / (max − ideal)   otherwise

Defaults: ideal = 400, max = 1200. Configurable via pyproject.toml:

TOML
[tool.axm-audit]
diff_size_ideal = 400   # lines — perfect score ceiling
diff_size_max = 1200    # lines — zero score floor

Per-category pass threshold: ≥ 90 (≤ 480 lines with defaults).

Type Score

Text Only
score = max(0, 100 − error_count × 5)

Per-category pass threshold: ≥ 90 (≤ 2 errors).

Complexity Score

Text Only
score = max(0, 100 − high_complexity_count × 10)

High complexity = cyclomatic complexity ≥ 10. Per-category pass threshold: ≥ 90 (≤ 1 complex function).

Security Score

Average of two sub-scores:

  • Bandit: max(0, 100 − high_count × 15 − medium_count × 5) — vulnerability scanning
  • Hardcoded secrets: max(0, 100 − count × 25) — regex pattern detection

Per-category pass threshold: ≥ 90.

Dependencies Score

Average of two sub-scores:

  • pip-audit: max(0, 100 − vuln_count × 15) — known CVEs (env tools pip, setuptools, wheel, uv, pip-audit are excluded from the count)
  • deptry: max(0, 100 − issue_count × 10) — unused/missing deps

Per-category pass threshold: ≥ 90.

Testing Score

Text Only
score = coverage_percentage

Uses pytest-cov to measure line coverage. Per-category pass threshold: ≥ 90%.

Files whose basename equals __main__.py are excluded from the per-file gap list (they typically only host a python -m entry point and are not meaningfully unit-testable). The aggregate total_pct from pytest-cov is left untouched — this mirrors coverage.py's exclude_also convention of filtering reports rather than rewriting underlying totals. To exclude __main__.py from the aggregate as well, add [tool.coverage.run] omit = ["**/__main__.py"] in the package's pyproject.toml.

Test Quality Score

Average of four sub-scores, each penalising structural defects in the test suite:

  • Pyramid level: max(0, 100 − misplaced_count × P) — tests living at the wrong layer (tests/unit/ vs tests/integration/ vs tests/e2e/)
  • Tautology: max(0, 100 − tautological_count × P) — tests whose body cannot fail (e.g. assert True, asserting against the SUT's own output)
  • Private imports: max(0, 100 − private_count × P) — tests importing underscore-prefixed names instead of going through the public API
  • Duplicate tests: max(0, 100 − duplicate_pair_count × P) — tests with near-identical bodies clustered together

Where P is each rule's per-finding penalty defined in core/rules/test_quality/.

Architecture Score

Average of four sub-scores:

  • Circular imports: max(0, 100 − cycle_count × 20)
  • God classes: max(0, 100 − god_class_count × 15)
  • Coupling: max(0, 100 − N(modules > threshold) × 5) — fan-out exceeding 10 imports
  • Duplication: max(0, 100 − duplicate_pair_count × 10)

Practices Score

Average of four sub-scores:

  • Docstring coverage: int(coverage_pct × 100)
  • Bare excepts: max(0, 100 − count × 20)
  • Blocking I/O: max(0, 100 − count × 15) — detects time.sleep in async contexts and HTTP calls without timeout parameter
  • Test mirroring: max(0, 100 − missing_count × 15)

Grading Scale

Grade Score Meaning
A ≥ 90 Excellent — production-ready
B ≥ 80 Good — minor issues
C ≥ 70 Acceptable — needs attention
D ≥ 60 Poor — significant issues
F < 60 Failing — critical problems

Severity Levels

Each individual check carries a severity:

Severity Effect Example
error Blocks audit pass Missing pyproject.toml
warning Non-blocking High complexity function
info Informational only Docstring coverage stats

Type Safety

All results use Pydantic models (AuditResult, CheckResult, Severity) with extra = "forbid" for strict validation — safe for both human and agent consumption.