Our Technology
Kowalski

Market Intelligence & Structural Monitoring Engine

Kowalski is the proprietary data intelligence system developed by RUC on Rails to acquire, structure and monitor publicly available information across the RUC ecosystem.

It powers the research, comparison and monitoring infrastructure behind RUC Hub and RUC Compare, transforming fragmented provider information into structured, queryable datasets.

Controlled Automation
AI-Assisted Parsing
Validation Workflows
Purpose-Built Architecture

System Architecture

End-to-End Data Pipeline

From source discovery through to structured API delivery, every stage is purpose-built for the complexities of RUC market data.

Discover: Source mapping
Extract: Parse & structure
Validate: Quality checks
Store: Structured data
Deliver: API & platforms

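
As an illustration only, the staged flow above can be sketched in Python as a chain of functions that each take and return a list of records. The stage functions, record fields and sample text are hypothetical and do not describe Kowalski's actual implementation; the Store and Deliver stages are left out for brevity.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]

def discover(_: List[Record]) -> List[Record]:
    # Source mapping: a real system would enumerate surfaces; hard-coded here.
    return [{"provider": "example_provider", "raw": "Admin fee: $12.50"}]

def extract(records: List[Record]) -> List[Record]:
    # Parse & structure: split "label: $amount" into typed fields.
    out = []
    for r in records:
        label, _, value = str(r["raw"]).partition(":")
        amount = float(value.strip().lstrip("$"))
        out.append({"provider": r["provider"], "field": label.strip(), "amount": amount})
    return out

def validate(records: List[Record]) -> List[Record]:
    # Quality checks: drop records with a negative amount.
    return [r for r in records if r["amount"] >= 0]

def run_pipeline(stages: Iterable[Stage]) -> List[Record]:
    # Feed the output of each stage into the next one.
    records: List[Record] = []
    for stage in stages:
        records = stage(records)
    return records

result = run_pipeline([discover, extract, validate])
print(result)
```

Keeping every stage behind the same signature means a stage can be swapped or tested in isolation, which is one plausible reason to structure a pipeline this way.
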
Why Purpose-Built

Engineered for the RUC Ecosystem

Generic scraping tools struggle in environments where data formats are inconsistent and unpredictable. Kowalski was engineered specifically to handle these conditions.

Variable Pricing Tables

Pricing structures differ across providers with no standard schema.

Shifting Formats

Layouts and page structures change without notice or versioning.

Embedded Documents

Key disclosures appear as embedded PDFs requiring specialised parsing.

Irregular Updates

Disclosures are updated at irregular intervals across providers.

The system maintains structured source profiles for each monitored provider, allowing extraction logic and validation rules to be tuned to real-world RUC publishing patterns rather than relying on fragile one-size-fits-all rules.

Source Intelligence

Source Discovery & Structural Mapping

Kowalski maintains a continuously updated structural model of monitored sources. This structural awareness enables efficient crawling, minimises unnecessary requests and improves resilience when layouts evolve.

source_profile.json

{
  "provider": "example_provider",
  "surfaces": [
    {
      "type": "pricing_table",
      "layout_sig": "v3.2.1",
      "refresh_cycle": "6h",
      "last_change": "2026-02-18T14:32Z"
    }
  ],
  "nav_depth": 3,
  "change_signals": true
}

Link hierarchies
Page classifications
Layout signatures
Update signals
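
A minimal Python sketch of how a profile like the one above might drive crawling decisions. It assumes that `refresh_cycle` uses `m`, `h` or `d` suffixes and that the last check time is tracked separately; the helper names `parse_cycle`, `is_refresh_due` and `layout_changed` are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

PROFILE_JSON = """
{
  "provider": "example_provider",
  "surfaces": [
    {"type": "pricing_table", "layout_sig": "v3.2.1",
     "refresh_cycle": "6h", "last_change": "2026-02-18T14:32Z"}
  ],
  "nav_depth": 3,
  "change_signals": true
}
"""

def parse_cycle(cycle: str) -> timedelta:
    """Convert a cycle such as "6h" into a timedelta (suffix format is assumed)."""
    units = {"m": "minutes", "h": "hours", "d": "days"}
    return timedelta(**{units[cycle[-1]]: int(cycle[:-1])})

def is_refresh_due(surface: dict, last_checked: datetime, now: datetime) -> bool:
    """True once a full refresh cycle has elapsed since the last check."""
    return now - last_checked >= parse_cycle(surface["refresh_cycle"])

def layout_changed(surface: dict, observed_sig: str) -> bool:
    """True when the observed layout signature differs from the stored one."""
    return surface["layout_sig"] != observed_sig

profile = json.loads(PROFILE_JSON)
surface = profile["surfaces"][0]
checked = datetime(2026, 2, 18, 14, 32, tzinfo=timezone.utc)

print(is_refresh_due(surface, checked, checked + timedelta(hours=7)))  # True
print(layout_changed(surface, "v3.2.2"))                               # True
```

Skipping surfaces that are not yet due is one simple way to avoid unnecessary requests, and a signature mismatch can be used to trigger re-mapping before extraction is attempted.
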
Core Engine

Proprietary Extraction Framework

Rather than relying solely on static selectors, the system evaluates structural context before applying extraction strategies. Extraction workflows are versioned and source-aware, enabling continuous refinement.

01

Deterministic Rules

Pattern-matched extraction for stable, well-structured content with consistent layouts.

  • Stable selectors
  • Fixed schemas
  • Consistent formats
02

Pattern-Based Logic

Flexible heuristics for semi-structured data where layouts shift but retain identifiable patterns.

  • Table recognition
  • Structural hints
  • Layout analysis
03

AI-Assisted Interpretation

Machine learning models for ambiguous tables, PDF documents and complex pricing structures.

  • PDF parsing
  • Ambiguous tables
  • Natural language
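
The three tiers can be pictured as an ordered set of strategies. The Python sketch below is a simplification: it tries each tier in turn and falls through when one returns nothing, whereas the system described here evaluates structural context first. All function names and regular expressions are hypothetical, and the AI-assisted tier is a placeholder that calls no model.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[str], Optional[float]]

def deterministic(text: str) -> Optional[float]:
    # Tier 1: stable layout with a fixed "price=<number>" token.
    m = re.fullmatch(r"price=(\d+(?:\.\d+)?)", text.strip())
    return float(m.group(1)) if m else None

def pattern_based(text: str) -> Optional[float]:
    # Tier 2: semi-structured text; take the first dollar amount found.
    m = re.search(r"\$\s*(\d+(?:\.\d+)?)", text)
    return float(m.group(1)) if m else None

def ai_assisted(text: str) -> Optional[float]:
    # Tier 3: placeholder only. A real tier would call a model here.
    return None

STRATEGIES: Sequence[Tuple[str, Extractor]] = (
    ("deterministic", deterministic),
    ("pattern_based", pattern_based),
    ("ai_assisted", ai_assisted),
)

def extract_price(text: str) -> Tuple[str, Optional[float]]:
    """Return the name of the first tier that yields a value, plus the value."""
    for name, strategy in STRATEGIES:
        value = strategy(text)
        if value is not None:
            return name, value
    return "unresolved", None

print(extract_price("price=76.00"))                  # ('deterministic', 76.0)
print(extract_price("Our fee is $ 12.50 per unit"))  # ('pattern_based', 12.5)
print(extract_price("Contact us for pricing"))       # ('unresolved', None)
```

Recording which tier produced a value gives later validation a signal about how much to trust it.
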
Quality Assurance

Validation & Quality Controls

Structured data passes through layered validation controls before appearing in public-facing tools. Where outputs fall outside expected tolerances, records are withheld or queued for review.

Schema Validation

Verified against defined RUC data models

Range & Consistency

Pricing values checked against expected tolerances

Cross-field Integrity

Related fields validated for logical consistency

Anomaly Detection

Structural outliers flagged for manual review

Exception Handling

Non-conforming records withheld before publication
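
To make the layering concrete, here is a small Python sketch that applies schema, range and cross-field checks to one record and withholds it if any check fails. The `PriceRecord` fields, the tolerance bounds and the `route` helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PriceRecord:
    provider: str
    unit_price: float
    units: int
    total: float

def validate(record: PriceRecord, min_price: float = 0.0,
             max_price: float = 1000.0) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Schema validation: required fields must be present and typed correctly.
    if not isinstance(record.provider, str) or not record.provider:
        problems.append("schema: provider is missing")
    # Range check: unit price must sit inside the expected tolerance band.
    if not (min_price <= record.unit_price <= max_price):
        problems.append("range: unit_price outside expected tolerance")
    # Cross-field integrity: the total must agree with unit_price * units.
    if abs(record.unit_price * record.units - record.total) > 0.01:
        problems.append("cross-field: total does not match unit_price * units")
    return problems

def route(record: PriceRecord) -> str:
    """Publish clean records; withhold anything with a failed check."""
    return "publish" if not validate(record) else "withhold"

print(route(PriceRecord("example_provider", 10.0, 3, 30.0)))  # publish
print(route(PriceRecord("example_provider", 10.0, 3, 35.0)))  # withhold
```

Returning every failed check, rather than stopping at the first, gives a reviewer the full picture when a record is queued.
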

Infrastructure

Distributed Processing & Orchestration

Kowalski operates through coordinated processing services designed for horizontal scaling as monitored sources expand.

Scheduled Crawling

Automated refresh cycles tuned to each provider's update frequency.

Parallel Ingestion

Concurrent processing across multiple providers with isolation guarantees.

Rate Management

Source-aware pacing and intelligent rate limiting to respect external systems.

Fault Isolation

Retry logic and error containment prevent cascading failures across services.

Operational Logging

Full traceability and monitoring across all acquisition and processing stages.

Horizontal Scaling

Architecture supports growth in monitored sources without re-architecture.
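
Retry logic and fault isolation can be sketched in a few lines of Python. The example below is sequential for clarity, not a model of the distributed services described above; `fetch_with_backoff` and `ingest_all` are hypothetical names, and the choice of `ConnectionError` as the retryable failure is an assumption.

```python
import time
from typing import Callable, Dict, Tuple

def fetch_with_backoff(fetch: Callable[[], str], max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry a fetch, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller contain the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")

def ingest_all(fetchers: Dict[str, Callable[[], str]],
               **kwargs) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run every provider; one provider's failure never stops the others."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for provider, fetch in fetchers.items():
        try:
            results[provider] = fetch_with_backoff(fetch, **kwargs)
        except ConnectionError as exc:
            errors[provider] = str(exc)
    return results, errors
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting, for example by supplying `delays.append` and inspecting the recorded delays afterwards.
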

Continuous Monitoring

Change Detection & Revision Tracking

Kowalski continuously monitors tracked surfaces for meaningful change. Change detection combines content comparison, structural awareness and document version tracking.

Revision history is preserved where available, supporting temporal comparison rather than simple snapshot replacement.

Pricing adjustments
Fee modifications
New disclosures
Layout changes
Live Event Stream

Provider A: Price update detected
Provider C: New disclosure document
Provider B: Fee schedule revised

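
One common way to implement content comparison with preserved revision history is to hash normalised content and append a new revision only when the hash changes. The Python sketch below assumes that approach; the `SurfaceHistory` class and its whitespace normalisation rule are hypothetical, and structural and document-version signals are not modelled.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SurfaceHistory:
    # Each revision is (timestamp, sha256 digest, normalised content).
    revisions: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, timestamp: str, content: str) -> bool:
        """Store a new revision if the content changed; return True if it did."""
        # Collapse whitespace so purely cosmetic differences are ignored.
        normalised = " ".join(content.split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if self.revisions and self.revisions[-1][1] == digest:
            return False
        self.revisions.append((timestamp, digest, normalised))
        return True

history = SurfaceHistory()
print(history.record("2026-02-18T14:32Z", "Fee: $10"))       # True
print(history.record("2026-02-19T14:32Z", "Fee:   $10\n"))   # False
print(history.record("2026-02-20T09:14Z", "Fee: $12"))       # True
print(len(history.revisions))                                # 2
```

Because earlier revisions are appended to rather than overwritten, any two points in time can be compared later.
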
Data Layer

Structured Storage & Data Models

Extracted information is normalised into standardised data models designed specifically for RUC market comparison. Each record retains source provenance and acquisition metadata for auditability.

Cross-Provider Comparison

Standardised schemas enable direct pricing and service comparisons.

Provider Profiles

Comprehensive structured profiles generated from aggregated public data.

Revision History

Temporal tracking enables historical comparison and trend analysis.

Search & Filter

Structured datasets power advanced search and filtering across tools.

API Delivery

Structured data available via API for platform integration where applicable.

Source Provenance

Every record retains full lineage back to its originating public source.
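
As a rough illustration of why a standardised model helps, the Python sketch below keeps the most recent record per provider and ranks providers by price. The `NormalisedPrice` fields, the sample values and the `latest_per_provider` helper are hypothetical; timestamps are assumed to be ISO 8601 strings in one consistent format, so that string comparison matches chronological order.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

@dataclass(frozen=True)
class NormalisedPrice:
    provider: str
    product: str
    price: float
    acquired_at: str   # ISO 8601, UTC (assumed format)
    source_url: str    # provenance retained on every record

def latest_per_provider(records: Iterable[NormalisedPrice],
                        product: str) -> List[NormalisedPrice]:
    """Keep each provider's newest record for a product, cheapest first."""
    latest: Dict[str, NormalisedPrice] = {}
    for r in records:
        if r.product != product:
            continue
        current = latest.get(r.provider)
        if current is None or r.acquired_at > current.acquired_at:
            latest[r.provider] = r
    return sorted(latest.values(), key=lambda r: r.price)

records = [
    NormalisedPrice("provider_a", "admin_fee", 12.5, "2026-02-18T14:32:00Z", "a.example/pricing"),
    NormalisedPrice("provider_a", "admin_fee", 11.0, "2026-02-20T09:14:00Z", "a.example/pricing"),
    NormalisedPrice("provider_b", "admin_fee", 11.5, "2026-02-19T08:00:00Z", "b.example/fees"),
]
ranking = latest_per_provider(records, "admin_fee")
print([(r.provider, r.price) for r in ranking])
# [('provider_a', 11.0), ('provider_b', 11.5)]
```
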

Traceability

Provenance & Audit Controls

Every published data point can be traced back to its originating public source, together with the acquisition time, workflow version, validation status and change events recorded for it.

Source Location: provider.example.co.nz/pricing
Acquisition Time: 2026-02-20T09:14:32.847Z
Workflow Version: extract_v4.7.2
Validation Status: PASSED
Change Events: price_update, schema_match

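
The fields above map naturally onto an immutable record type. The Python sketch below reuses the example values shown; the `Provenance` class and the rule that only `PASSED` records are publishable are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Tuple

@dataclass(frozen=True)
class Provenance:
    source_location: str
    acquisition_time: str            # ISO 8601 timestamp, kept as a string
    workflow_version: str
    validation_status: str
    change_events: Tuple[str, ...]

    def is_publishable(self) -> bool:
        # Assumed rule: only records that passed validation are published.
        return self.validation_status == "PASSED"

example = Provenance(
    source_location="provider.example.co.nz/pricing",
    acquisition_time="2026-02-20T09:14:32.847Z",
    workflow_version="extract_v4.7.2",
    validation_status="PASSED",
    change_events=("price_update", "schema_match"),
)

print(example.is_publishable())             # True
print(asdict(example)["workflow_version"])  # extract_v4.7.2
```

Freezing the dataclass prevents provenance from being altered after acquisition, which suits an audit trail.
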
Operating Principles

How We Operate

Kowalski is designed to comply with applicable New Zealand legislation. Responsible operation is built into the architecture, not bolted on after the fact.

Public Data Only

The system engages only with publicly accessible commercial information. It does not access areas that require authentication, collect personal data or circumvent access controls.

Respectful Access

Rate limiting, request pacing and automated backoff controls are built into every ingestion workflow. We assess and respect applicable access terms as part of our source approval process.

Facts, Not Content

Kowalski extracts factual commercial data: pricing figures, fee schedules, service parameters. It does not reproduce editorial or copyrighted content.

Data Minimisation

Only information relevant to defined analytical objectives is collected. Minimisation principles are applied throughout the acquisition pipeline.

Source Governance

Every monitored source goes through an internal approval process covering permissible data classes, monitoring frequency and publication review.

Default to Caution

Where any uncertainty exists regarding a source or data class, engagement is withheld until the position is clear. We err on the side of not acting.

Enabling Structured Transparency

The Intelligence Backbone of RUC on Rails

By combining purpose-built extraction logic, validation controls and continuous monitoring within a dedicated RUC-focused architecture, Kowalski converts fragmented public information into reliable, structured market visibility.