Power BI report visual monitoring
Long-form documentation: rendered output as an observability signal, how the open-source monitor works, and how to run it safely. Project: Power-bi-report-visual-monitoring.
Definition
Visual monitoring here means: open a published report in a controlled browser session, capture a rendered bitmap after the platform has executed the model (DAX, relationships, filters), compare it to a stored baseline, and persist a compact record (status, difference score, hashes, optional compressed delta). The observable is the user-visible state of the report, not an internal engine trace.
This is distinct from Usage Metrics (who consumed what) and from the Performance Analyzer (per-visual timing inside Desktop). Those tools answer different questions. Visual monitoring answers: “Did the rendered surface drift from what we agreed was correct?”
The approach is complementary to warehouse-level tests: you still want data tests where they matter, but the report layer adds layout, interactions, and measure semantics that are expensive to re-specify as pure SQL invariants for every tile.
Why it is effective
Many production defects surface first as layout, conditional formatting, filter interactions, or silently wrong measures while refresh and gateway health remain green. A pixel-oriented or perceptual-hash-oriented check catches regressions that are tedious to encode as invariant SQL on the warehouse, because the bug lives in the last mile of semantics inside the report layer.
A quadtree/MSE-style comparison (as in this codebase) trades absolute fidelity for bounded work: cost scales with image resolution, canvas structure, and thresholds, not with the cardinality of every fact table touched by the model.
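Here is a minimal quadtree/MSE diff in plain Python that makes the bounded-work idea concrete. It is an illustration, not the repository's domain service:

- Images are assumed to be equal-sized 2D lists of grayscale values.
- The parameters only borrow the names of MSE_THRESHOLD, MIN_BLOCK_SIZE and MAX_DEPTH.
- A block whose mean squared error is at or below the threshold is pruned whole.
- Otherwise the block is split into quadrants until a size or depth limit is reached.
- diff_percent is the flagged area as a share of the canvas.

```python
from typing import List, Tuple

Gray = List[List[int]]            # rows of 0..255 grayscale values (sketch assumption)
Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def block_mse(a: Gray, b: Gray, x: int, y: int, w: int, h: int) -> float:
    """Mean squared error between a and b over one rectangular block."""
    total = 0
    for row in range(y, y + h):
        ra, rb = a[row], b[row]
        for col in range(x, x + w):
            d = ra[col] - rb[col]
            total += d * d
    return total / (w * h)


def quadtree_diff(a: Gray, b: Gray, mse_threshold: float,
                  min_block_size: int, max_depth: int) -> List[Rect]:
    """Return non-overlapping rectangles whose MSE exceeds the threshold."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("images must have the same dimensions")
    flagged: List[Rect] = []

    def visit(x: int, y: int, w: int, h: int, depth: int) -> None:
        if w == 0 or h == 0:
            return
        if block_mse(a, b, x, y, w, h) <= mse_threshold:
            return  # block counts as equal: prune the whole subtree
        if depth >= max_depth or w <= min_block_size or h <= min_block_size:
            flagged.append((x, y, w, h))  # leaf: record the differing block
            return
        hw, hh = w // 2, h // 2
        visit(x, y, hw, hh, depth + 1)                    # top-left
        visit(x + hw, y, w - hw, hh, depth + 1)           # top-right
        visit(x, y + hh, hw, h - hh, depth + 1)           # bottom-left
        visit(x + hw, y + hh, w - hw, h - hh, depth + 1)  # bottom-right

    visit(0, 0, len(a[0]), len(a), 0)
    return flagged


def diff_percent(a: Gray, b: Gray, mse_threshold: float = 0.0,
                 min_block_size: int = 2, max_depth: int = 8) -> float:
    """Share of canvas area (0..100) covered by flagged blocks."""
    rects = quadtree_diff(a, b, mse_threshold, min_block_size, max_depth)
    area = sum(w * h for _, _, w, h in rects)
    return 100.0 * area / (len(a) * len(a[0]))
```

On an 8×8 canvas with one 2×2 patch changed, the default call returns 6.25. Raising mse_threshold to 5000 suppresses the patch entirely: the top-level block MSE is only about 4064, so the search never descends. That pruning is where the bounded cost comes from, and it is also why small edits can slip under a loose threshold.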
From an operations perspective, a scheduled visual check is also a contract on the external appearance of the service: if the screenshot changes, something in the chain (model, filters, theme, or host embedding) changed enough to matter for readers of that canvas.
Dashboard geometry and KPI sensitivity
Large operational dashboards often allocate a dominant share of canvas area to headline KPIs, trend cards, and variance callouts. Under typical design patterns, those visuals carry disproportionate business attention relative to peripheral navigation chrome.
Monitoring the rendered canvas therefore attaches high semantic leverage per check: a visible shift in a large KPI tile is likely to be noticed by humans even when underlying row-level aggregates would require multiple ad hoc queries to reconstruct the same gestalt.
That does not mean small visuals are unimportant; it means prioritization for monitoring capacity often starts with the largest, most-read tiles when you have a limited render budget.
Observability cost model (render + diff vs repeated warehouse probes)
Consider two families of validation strategy:
- Strategy A — run several heavy SQL (or DAX-equivalent) probes directly against the same warehouse that feeds the model, recreating fragments of each KPI definition. Each probe may scan large partitions, join dimensions, and recompute complex measures. Total latency and load often grow roughly with the number and depth of probes, especially when KPI logic is non-linear in data volume.
- Strategy B — let the report engine render once, then compute a bounded visual diff against a baseline and write a small log row (status, diff_percent, perceptual hash, duration). The expensive data movement has already been amortized by the BI platform; the incremental monitoring step is dominated by render stability and image comparison, whose cost is tied to image resolution and diff parameters rather than to issuing k separate full-model interrogations.
In many deployments, Strategy A with a moderate k of non-trivial reconciliation queries imposes higher cumulative warehouse pressure and tail latency than Strategy B at a fixed schedule, provided render time is acceptable for the SLA. This is a qualitative ordering statement, not a universal constant: a trivial single-query invariant can be cheaper than a long headless render. The useful claim is comparative under realistic KPI complexity, not “always faster than any SQL”.
Concrete millisecond figures belong in a controlled benchmark appendix once measured in your environment; they are not asserted here.
Architecture and data flow
Configuration sources
The monitor reads report definitions from reports.json (or the path in REPORTS_FILE): each entry carries id, name, url, interval, threshold, and enabled. Those values are not duplicated inside PostgreSQL; the database stores outcomes of checks only.
Runtime settings (PostgreSQL connection, screenshot size, diff policy, retry policy, Selenium wait, optional Basic auth for Power BI) come from environment variables loaded via Settings. See the repository README for the full variable list.
On-disk artifacts
Under the configured screenshots_dir (default ./Data), the tool maintains:
- Data/baselines/<report_id>_init_baseline.png — the first accepted full-page capture, used as the stable reference for XOR delta encoding.
- Data/changes/<report_id>/last_baseline.png — the rolling baseline image used for the next comparison.
- Data/changes/<report_id>/current_screenshot.png — the latest capture, before it is promoted to last_baseline.png when the check succeeds.
- Data/changes/<report_id>/current_screenshot_diff.png — optional diff overlay when diff rendering is enabled.
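The layout above is regular enough to derive every path from screenshots_dir and a report id. The sketch below does that with pathlib; ReportPaths, report_paths, ensure_dirs and baseline_exists are hypothetical names. baseline_exists encodes the rule the pipeline applies: a baseline only counts when the baselines row exists and both baseline images are on disk. The database lookup itself is left as a boolean input.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ReportPaths:
    init_baseline: Path   # stable reference for XOR deltas
    last_baseline: Path   # rolling image for the next comparison
    current: Path         # latest capture, promoted on success
    diff_overlay: Path    # written only when diff rendering is enabled


def report_paths(screenshots_dir: str, report_id: str) -> ReportPaths:
    """Map a report id to the on-disk layout (hypothetical helper)."""
    root = Path(screenshots_dir)
    changes = root / "changes" / report_id
    return ReportPaths(
        init_baseline=root / "baselines" / f"{report_id}_init_baseline.png",
        last_baseline=changes / "last_baseline.png",
        current=changes / "current_screenshot.png",
        diff_overlay=changes / "current_screenshot_diff.png",
    )


def ensure_dirs(paths: ReportPaths) -> None:
    """Create the baselines/ and changes/<report_id>/ directories if missing."""
    paths.init_baseline.parent.mkdir(parents=True, exist_ok=True)
    paths.last_baseline.parent.mkdir(parents=True, exist_ok=True)


def baseline_exists(paths: ReportPaths, has_db_row: bool) -> bool:
    """A baseline counts only if the DB row AND both baseline images exist."""
    return has_db_row and paths.init_baseline.is_file() and paths.last_baseline.is_file()
```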
Runtime pipeline (happy path)
For each scheduled run the CheckReport use case: ensures directories exist; determines whether a baseline already exists (row in baselines plus both baseline images on disk); calls Selenium to capture either the initial baseline or the current page; on first run copies the init image to last_baseline.png, computes a perceptual hash (dhash), upserts baselines, and writes a baseline_created row to monitoring_checks.
On subsequent runs it compares the new capture to last_baseline.png using the configured diff policy (quadtree / MSE path in the domain service), optionally builds a delta against the init baseline, updates metrics, sets changed when diff_percent > 0 else unchanged, replaces last_baseline.png with the new image on success, updates the hash in baselines, and inserts a row into monitoring_checks including gzip-compressed XOR delta bytes when produced.
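The delta step can be sketched with the standard library alone. The function names are hypothetical, and the repository may pack or store the payload differently. The sketch assumes the buffers are raw decoded pixels of identical size and mode, for example the bytes Pillow's Image.tobytes() returns. They are not PNG file bytes; XOR over compressed PNG data would not be meaningful.

```python
import gzip


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have identical length (same size and pixel mode)")
    n = len(a)
    # XOR as big integers, then restore the original length (keeps leading zero bytes).
    return (int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).to_bytes(n, "big")


def encode_delta(current_pixels: bytes, init_pixels: bytes) -> bytes:
    """gzip(current XOR init): unchanged pixels become zero bytes and compress well."""
    return gzip.compress(xor_bytes(current_pixels, init_pixels))


def decode_delta(delta: bytes, init_pixels: bytes) -> bytes:
    """Recover the capture: (current XOR init) XOR init == current."""
    return xor_bytes(gzip.decompress(delta), init_pixels)
```

Unchanged pixels XOR to zero. A mostly static canvas therefore yields long runs of zero bytes, which gzip compresses to a small fraction of the raw size. XOR-ing the decompressed delta with the init baseline restores the capture exactly.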
Persistence layer
PostgreSQL holds baselines (one row per report id) and the append-oriented log monitoring_checks. Views v_latest_checks and v_report_stats support dashboards and ad hoc queries. The schema is applied with python -m pbimonitor --init-db, which creates the schema if needed and executes schema.sql.
Scheduling and workers
An in-process scheduler dispatches enabled reports to worker threads with backpressure between runs. Failures in the worker that escape the use case are logged separately; normally errors are persisted as status = error rows so history stays coherent.
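Here is a minimal sketch of that split, assuming a thread pool with a fixed number of slots. Dispatcher and guarded are hypothetical names, and in-memory lists stand in for monitoring_checks. Backpressure here means a tick is skipped rather than queued when the same report is still running or every worker is busy. guarded models the normal path, where the use case turns exceptions into status = error records; anything that still escapes is only logged.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

log = logging.getLogger("pbimonitor.sketch")  # hypothetical logger name
Record = Dict[str, object]


def guarded(check: Callable[[str], Record]) -> Callable[[str], Record]:
    """Turn exceptions from a check into status='error' records (the normal path)."""
    def run(report_id: str) -> Record:
        try:
            return check(report_id)
        except Exception as exc:
            return {"report_id": report_id, "status": "error",
                    "diff_percent": None, "error": str(exc)}
    return run


class Dispatcher:
    """Bounded in-process dispatcher: skips a tick instead of queueing unboundedly."""

    def __init__(self, check: Callable[[str], Record], max_workers: int = 2):
        self._check = check
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_workers)
        self._lock = threading.Lock()
        self._in_flight: set = set()
        self.records: Dict[str, List[Record]] = {}  # stand-in for monitoring_checks

    def dispatch(self, report_id: str) -> bool:
        """Start a check unless this report is still running or all workers are busy."""
        with self._lock:
            if report_id in self._in_flight:
                return False
            if not self._slots.acquire(blocking=False):
                return False
            self._in_flight.add(report_id)
        self._pool.submit(self._work, report_id)
        return True

    def _work(self, report_id: str) -> None:
        try:
            record = self._check(report_id)
            with self._lock:
                self.records.setdefault(report_id, []).append(record)
        except Exception:
            # Failures that escape the use case are logged, not persisted.
            log.exception("worker failure for report %s", report_id)
        finally:
            with self._lock:
                self._in_flight.discard(report_id)
            self._slots.release()

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```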
Check statuses
Each attempt produces one row in monitoring_checks with a string status:
- baseline_created — first successful capture for that report: baseline images exist on disk, the baselines row is written, diff_percent is 0, and there is no delta payload.
- unchanged — baseline existed; the diff pipeline returned diff_percent == 0; the new screenshot becomes the next last_baseline.png.
- changed — baseline existed; diff_percent > 0, indicating a non-zero visual delta under the current policy, measured before the screenshot is promoted.
- error — any exception in the check path (Selenium, diff, storage). diff_percent and the screenshot hash are cleared in the stored record; error carries the message for triage.
The per-report threshold from reports.json is used for metrics (tracking how often the diff score sits below that threshold). It does not by itself flip a check between changed and unchanged; in code, that branch depends only on whether diff_percent is zero or positive.
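The split between status and threshold fits in a few lines. classify and ThresholdStats are hypothetical names. below_diff_threshold_rate is the metric name used in the tuning section. Its exact definition in the code (strict versus inclusive comparison, handling of zero diffs) is an assumption here.

```python
from dataclasses import dataclass


def classify(diff_percent: float) -> str:
    """Status branch for an existing baseline: zero vs positive diff area only."""
    return "changed" if diff_percent > 0 else "unchanged"


@dataclass
class ThresholdStats:
    """Per-report tally of how often diff_percent sits below the configured threshold."""
    checks: int = 0
    below: int = 0

    def observe(self, diff_percent: float, threshold: float) -> None:
        self.checks += 1
        if diff_percent < threshold:  # strict '<' is an assumption of this sketch
            self.below += 1

    @property
    def below_diff_threshold_rate(self) -> float:
        return self.below / self.checks if self.checks else 0.0
```

A diff of 2.5 against a threshold of 5 still classifies as changed. The threshold only moves the rate.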
Configuration overview
reports.json
Beyond URL and interval, pay attention to threshold (0–100) as a reporting knob for “small visual drift” tracking, and to start_time if you later align schedules with business hours. Disabled reports are skipped entirely when loading.
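Here is a loader sketch for the fields named above. It makes three assumptions that the project's actual file shape may not match:

- reports.json is a top-level JSON array.
- interval is numeric.
- A missing enabled flag means the report is enabled.

ReportConfig, parse_reports and load_reports are hypothetical names; REPORTS_FILE and the reports.json default come from this document.

```python
import json
import os
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class ReportConfig:
    id: str
    name: str
    url: str
    interval: int                    # numeric; the unit is whatever the project uses
    threshold: float                 # 0..100, used for metrics only
    enabled: bool = True
    start_time: Optional[str] = None


def parse_reports(text: str) -> List[ReportConfig]:
    """Parse a top-level JSON array of report entries, dropping disabled ones."""
    reports = []
    for entry in json.loads(text):
        if not entry.get("enabled", True):
            continue  # disabled reports are skipped entirely at load time
        threshold = float(entry["threshold"])
        if not 0 <= threshold <= 100:
            raise ValueError(f"threshold out of range for {entry['id']}: {threshold}")
        reports.append(ReportConfig(
            id=str(entry["id"]), name=str(entry["name"]), url=str(entry["url"]),
            interval=int(entry["interval"]), threshold=threshold,
            enabled=True, start_time=entry.get("start_time"),
        ))
    return reports


def load_reports() -> List[ReportConfig]:
    """Read reports.json, or the file named by REPORTS_FILE when it is set."""
    path = Path(os.environ.get("REPORTS_FILE", "reports.json"))
    return parse_reports(path.read_text(encoding="utf-8"))
```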
Environment highlights
- PG_* and PG_SCHEMA_ALLOWLIST — database target and safety rail for schema application.
- PAGE_LOAD_WAIT, SCREENSHOT_WIDTH, SCREENSHOT_HEIGHT — trade stability vs fidelity; oversized viewports slow rendering.
- MSE_THRESHOLD, MIN_BLOCK_SIZE, MAX_DEPTH, DIFF_ENABLED — control quadtree sensitivity and whether diff images are drawn.
- RETRY_* — backoff for flaky networks or transient auth.
- POWERBI_USERNAME / POWERBI_PASSWORD and POWERBI_AUTH_SERVER_WHITELIST — optional Basic auth and host allowlisting for embedded flows.
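Here is a sketch of reading these variables into an immutable settings object, including a fail-closed schema allowlist check. PG_SCHEMA, the comma-separated PG_SCHEMA_ALLOWLIST format, and every default value are placeholders chosen for the illustration, not the project's. Consult the README for the real names and defaults.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional

_TRUE = {"1", "true", "yes", "on"}


def _flag(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in _TRUE


@dataclass(frozen=True)
class Settings:
    page_load_wait: float
    screenshot_width: int
    screenshot_height: int
    mse_threshold: float
    min_block_size: int
    max_depth: int
    diff_enabled: bool
    pg_schema: str
    schema_allowlist: FrozenSet[str]

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "Settings":
        env = os.environ if env is None else env
        allow = frozenset(
            part.strip()
            for part in env.get("PG_SCHEMA_ALLOWLIST", "").split(",")
            if part.strip()
        )
        schema = env.get("PG_SCHEMA", "public")  # hypothetical variable name
        if schema not in allow:
            # Fail closed: an empty or non-matching allowlist blocks schema application.
            raise ValueError(f"schema {schema!r} is not in PG_SCHEMA_ALLOWLIST")
        # All defaults below are placeholders for the sketch, not the project's values.
        return cls(
            page_load_wait=float(env.get("PAGE_LOAD_WAIT", "10")),
            screenshot_width=int(env.get("SCREENSHOT_WIDTH", "1600")),
            screenshot_height=int(env.get("SCREENSHOT_HEIGHT", "900")),
            mse_threshold=float(env.get("MSE_THRESHOLD", "0")),
            min_block_size=int(env.get("MIN_BLOCK_SIZE", "8")),
            max_depth=int(env.get("MAX_DEPTH", "6")),
            diff_enabled=_flag(env.get("DIFF_ENABLED", "true")),
            pg_schema=schema,
            schema_allowlist=allow,
        )
```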
Sensitivity and tuning
When diffs are noisy
Animations, live clocks, rotating ads, or non-deterministic web fonts can create pixel churn without meaningful business change. Mitigations: lengthen wait, narrow viewport to the stable report region if your layout allows, increase MSE_THRESHOLD or MIN_BLOCK_SIZE cautiously, and verify that the report itself is not rendering timestamps into the canvas.
When changes are missed
Very small single-pixel shifts or low-contrast edits may fall under thresholds. Tighten MSE_THRESHOLD, reduce MIN_BLOCK_SIZE within performance limits, and confirm last_baseline.png is updating (so you are not comparing stale captures).
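One way to automate the last check is to compare the modification time of last_baseline.png with the report interval. This is a hypothetical helper, not part of the tool, and it assumes the interval has already been converted to seconds. Because the rolling baseline is only replaced on success, a stale file also surfaces a report that keeps ending in error.

```python
import os
import time
from typing import Optional


def last_modified(path: str) -> Optional[float]:
    """mtime of the file in epoch seconds, or None if it does not exist."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def is_stale(mtime: Optional[float], interval_seconds: float,
             now: Optional[float] = None, slack: float = 2.0) -> bool:
    """True when the rolling baseline has not been rewritten within slack * interval."""
    if mtime is None:
        return True  # no last_baseline.png at all
    now = time.time() if now is None else now
    return (now - mtime) > slack * interval_seconds
```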
Per-report threshold
Use threshold consistently with how you read below_diff_threshold_rate in metrics: it is a product-level knob for “nuisance vs signal” analytics, not a second hidden diff gate in the status transition.
Operations
Bootstrap sequence
- Prepare PostgreSQL credentials and the schema allowlist, then copy .env.example to .env.
- Install dependencies and set PYTHONPATH=src (or install the package in editable mode).
- python -m pbimonitor --init-db — applies the DDL.
- python -m pbimonitor --build — establishes baselines for all enabled reports.
- python -m pbimonitor --check <id> — smoke-tests a single report.
- python -m pbimonitor --start — runs the continuous scheduler loop.
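The one-shot steps can be scripted so they run in order and stop at the first failure. This sketch only shells out to the commands listed above. It assumes .env and the database credentials are already in place, and it omits --start because that command runs the scheduler loop indefinitely. bootstrap_commands and run_bootstrap are hypothetical names.

```python
import os
import subprocess
import sys
from typing import List, Optional


def bootstrap_commands(smoke_report_id: Optional[str] = None) -> List[List[str]]:
    """The documented one-shot steps, in order (--start excluded: it never returns)."""
    base = [sys.executable, "-m", "pbimonitor"]
    cmds = [base + ["--init-db"], base + ["--build"]]
    if smoke_report_id:
        cmds.append(base + ["--check", smoke_report_id])
    return cmds


def run_bootstrap(smoke_report_id: Optional[str] = None, src_dir: str = "src") -> None:
    """Run each step with src_dir prepended to PYTHONPATH, stopping at the first failure."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = src_dir if not existing else src_dir + os.pathsep + existing
    for cmd in bootstrap_commands(smoke_report_id):
        subprocess.run(cmd, check=True, env=env)  # raises CalledProcessError on failure
```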
Docker
docker compose up --build -d brings up the monitor process alongside PostgreSQL. An optional Redis service is reserved for future distributed queueing and is not required by the current in-process scheduler.
Operational prerequisites
The host running Chrome must reach the report URLs; integrated authentication flows may need allowlists or manual sign-in strategies beyond Basic auth. Keep secrets out of git; treat Data/ as sensitive cache because screenshots may contain confidential numbers.
Limitations and fit
Visual monitoring does not prove that numbers are arithmetically correct, only that the rendered page matches or diverges from a reference image under the chosen diff policy. Pair it with data quality checks where correctness is defined at row level.
Headless rendering consumes CPU and memory; dense schedules across many large viewports can contend with interactive users if collocated. Row-level security and per-user render differences mean baselines must be captured under representative credentials.
Features listed in the repository roadmap (user click scripts, OCR, external API probes, clustered deployment) are forward-looking until shipped; today’s scope is the visual detection pipeline described above.
Glossary
- Init baseline — first stable screenshot for XOR delta reference.
- Last baseline — rolling image used for the next comparison.
- diff_percent — percentage of canvas area flagged as different by the quadtree/MSE pipeline.
- dhash — perceptual hash of the baseline or current capture for quick fingerprinting.
- XOR delta — compressed binary difference between current pixels and the init baseline for forensic reconstruction.
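For reference, here is a dependency-free difference hash. It shrinks the image to (hash_size + 1) × hash_size, then emits one bit per horizontally adjacent pixel pair. The sketch assumes 2D grayscale lists and a nearest-neighbour resize. The project may use a library implementation with antialiased resizing and a different bit order, so hash values are not comparable across implementations.

```python
from typing import List

Gray = List[List[int]]  # rows of 0..255 grayscale values (sketch assumption)


def resize_nearest(img: Gray, out_w: int, out_h: int) -> Gray:
    """Nearest-neighbour resize; library dHash code usually uses an antialiased resize."""
    h, w = len(img), len(img[0])
    return [[img[(r * h) // out_h][(c * w) // out_w] for c in range(out_w)]
            for r in range(out_h)]


def dhash(img: Gray, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    small = resize_nearest(img, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for c in range(hash_size):
            # bit is 1 when the left pixel is brighter than its right neighbour
            bits = (bits << 1) | (1 if row[c] > row[c + 1] else 0)
    return bits


def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")
```

Near-identical captures give hashes only a few bits apart, so Hamming distance works as a cheap fingerprint comparison. It complements the quadtree diff rather than replacing it.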