Constraint-First Computing
The standard approach treats the graph as the starting point and tries to learn constraints from data. SDC inverts this: constraints are compiled first, then algorithms operate on data whose formal properties are already known.
The source of truth in SDC is the compiled XSD schema—restriction-only complex types, embedded RDF annotations, XPath asserts on Clusters, and the full constraint facet model (enumerations, patterns, length bounds, numeric ranges, temporal restrictions, cardinality, precision). The SHACL shapes are a parallel compilation from the same model instances. The RDF triples are extracted from the schema. The graph is a derivative.
Any discussion of algorithms must start at the constraint layer, because that is where the formal guarantees live. The graph inherits those guarantees—it does not create them.
This inversion—compile constraints first, populate data under those constraints, then run algorithms on formally characterized data—is not incremental. It changes what is algorithmically possible. The sections below progress from the constraint layer through SHACL, embedded RDF semantics, and graph machine learning, with each layer building on the guarantees established by the one below it.
The Constraint Layer
These algorithms operate on the compiled XSD schemas and the restriction lattice itself—before any data populates the graph.
Constraint Satisfiability Analysis
Every SDC component compiles to a set of XSD restriction facets: minLength, maxLength, pattern, enumeration, minInclusive, maxInclusive, totalDigits, fractionDigits, plus Cluster-level XPath xsd:assert statements. These facets define a constraint satisfaction problem (CSP).
Algorithms from CSP theory apply directly:
- Satisfiability checking—Given a compiled schema, can any valid instance exist? For simple facets this is trivial (is minInclusive less than maxInclusive?), but Cluster asserts introduce cross-field XPath constraints that can create subtle impossibilities. Arc-consistency algorithms detect these conflicts at compile time, before any data is created.
- Constraint tightness analysis—How constrained is this schema relative to its reference model base type? A schema with minLength=5, maxLength=10, pattern=^[A-Z][a-z]+$ is measurably tighter than one with just maxLength=255. This metric quantifies schema quality.
- Redundancy detection—Does any facet add no additional constraint beyond what another already enforces? For example, minLength=3 is redundant if pattern=^\d{5}$ is also present.
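The two simplest checks above can be sketched in a few lines. This is a minimal illustration assuming a hypothetical dict representation of facets; a real implementation would operate on the compiled XSD and cover the full facet model plus Cluster asserts.

```python
import re

def facets_satisfiable(facets):
    """Can any value satisfy this facet set? (sketch over a dict stand-in)"""
    lo, hi = facets.get("minInclusive"), facets.get("maxInclusive")
    if lo is not None and hi is not None and lo > hi:
        return False  # empty numeric range
    mn, mx = facets.get("minLength"), facets.get("maxLength")
    if mn is not None and mx is not None and mn > mx:
        return False  # empty length range
    enum, pat = facets.get("enumeration"), facets.get("pattern")
    if enum is not None and pat is not None:
        # At least one enumeration member must also match the pattern.
        return any(re.fullmatch(pat, v) for v in enum)
    return True

def min_length_redundant(facets):
    """minLength adds nothing if the pattern already forces longer strings.

    Conservative check for fixed-length patterns like ^\\d{5}$ only."""
    pat, mn = facets.get("pattern"), facets.get("minLength")
    if pat is None or mn is None:
        return False
    m = re.fullmatch(r"\^?\\d\{(\d+)\}\$?", pat)
    return m is not None and int(m.group(1)) >= mn
```

A full arc-consistency pass would extend this pairwise reasoning across all facets and cross-field asserts; the structure of the check stays the same.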
The Restriction Lattice
Because SDC uses only xsd:restriction, never xsd:extension, the type hierarchy forms a mathematical lattice. Every type's value space is a proper subset of its parent's. This monotonic narrowing has algorithmic consequences:
- Subsumption checking—Given two schemas restricting the same base type, does every valid instance of A also satisfy B? This enables automatic compatibility analysis between schemas from different projects or domains.
- Greatest lower bound (GLB)—The loosest schema still tighter than both inputs. The schema intersection—useful for merging constraints from two sources describing the same concept.
- Least upper bound (LUB)—The tightest schema that admits all instances valid under either input. The schema union—useful for finding the common constraint core shared by related schemas.
- Lattice distance—A rigorous, non-probabilistic similarity measure between data definitions. Fundamentally different from fuzzy text matching on field names.
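For numeric-range facets, meet and join are directly computable. A sketch under that restriction, with dict-form schemas as a hypothetical stand-in for compiled XSD; patterns, enumerations, and lengths would need their own lattice operations.

```python
def subsumes(a, b):
    """True if every value admitted by facet set a is also admitted by b."""
    return (b.get("minInclusive", float("-inf")) <= a.get("minInclusive", float("-inf"))
            and a.get("maxInclusive", float("inf")) <= b.get("maxInclusive", float("inf")))

def glb(a, b):
    """Greatest lower bound: intersection of the two value spaces."""
    return {"minInclusive": max(a["minInclusive"], b["minInclusive"]),
            "maxInclusive": min(a["maxInclusive"], b["maxInclusive"])}

def lub(a, b):
    """Least upper bound: tightest range admitting both value spaces."""
    return {"minInclusive": min(a["minInclusive"], b["minInclusive"]),
            "maxInclusive": max(a["maxInclusive"], b["maxInclusive"])}
```

By construction, glb(a, b) subsumes into both inputs and both inputs subsume into lub(a, b), which is exactly the lattice property the monotonic restriction discipline guarantees.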
Application: Structural Catalog Search
When searching for components, lattice-aware search finds components whose constraint structure is similar—same numeric range, same units, same temporal resolution—even if the labels differ entirely. This is structural similarity, not lexical similarity.
Schema Differencing and Evolution
SDC schemas are never versioned. When a schema is superseded, the new schema's links field points backward to what it replaces. The old schema is immutable. This creates a directed acyclic graph of supersession.
- Constraint diff—Compute exactly which facets changed between a schema and its successor. The diff is computable because both schemas are formal restriction sets against the same reference model base type.
- Breaking change detection—A change is "breaking" if the successor's value space is not a superset of the predecessor's. Because the facet model is formal, this is decidable, not heuristic.
- Evolution trajectory—Track how constraints have tightened or loosened across a chain of supersessions. A leading indicator of schema maturity.
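The diff and breaking-change checks above reduce to set comparison over facets. A sketch, again using hypothetical dicts in place of compiled restriction sets, and checking the superset condition for numeric-range facets only:

```python
def constraint_diff(old, new):
    """Facet-level diff between a schema and its successor."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k)) for k in keys
            if old.get(k) != new.get(k)}

def is_breaking(old, new):
    """Breaking iff the successor's value space is not a superset of the
    predecessor's (numeric-range facets only in this sketch)."""
    return (new.get("minInclusive", float("-inf")) > old.get("minInclusive", float("-inf"))
            or new.get("maxInclusive", float("inf")) < old.get("maxInclusive", float("inf")))
```

Walking a supersession chain and applying is_breaking at each step yields the evolution trajectory directly.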
Formalized Failure: Exceptional Values
Real-world data is messy. IoT sensors fail slowly. Form fields get skipped. Measurement devices drift. A constraint model that rejects every non-conforming instance wholesale discards expensive, partially valid data and destroys the failure signal.
SDC handles this with Exceptional Values (EVs)—16 formal types derived from ISO 21090 Null Flavours and extended for real-world usage. When an element fails validation, the failing element is replaced by a specific EV type that records why the value is absent or invalid. The rest of the instance is preserved with full validity. The constraint model does not bend—the violation is formally recorded.
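The replacement step can be sketched as follows, assuming a dict-based element representation and using two EV codes (NI and INV) from the vocabulary; the real SDC pipeline operates on compiled XSD and validated XML, not Python dicts.

```python
def admit(value, facets):
    """Validate one element; on failure, substitute an EV record rather
    than rejecting the whole instance (illustrative sketch)."""
    if value is None:
        return {"ev": "NI"}  # No Information: value missing entirely
    lo = facets.get("minInclusive", float("-inf"))
    hi = facets.get("maxInclusive", float("inf"))
    if not (lo <= value <= hi):
        # Invalid: out of the permitted range; raw value preserved
        return {"ev": "INV", "raw": value}
    return value  # valid: passes through unchanged
```

The rest of the instance is untouched either way; only the failing element is swapped for its EV record.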
The EV vocabulary goes far beyond simple "null" flags. It distinguishes 16 structurally distinct failure modes:
| Category | Code | Meaning |
|---|---|---|
| Missing | NI | No Information — general default for missing, omitted, or incomplete data |
| | UNK | Unknown — a proper value is applicable but not known |
| | NASK | Not Asked — information was never sought |
| | ASKU | Asked but Unknown — sought but not found (e.g., patient didn't know) |
| | ASKR | Asked and Refused — sought but subject refused to provide |
| | NAV | Not Available — not available, reason unknown |
| | NA | Not Applicable — no proper value applies (e.g., cigarettes/day for a non-smoker) |
| Security | MSK | Masked — value exists but withheld for security or privacy reasons |
| Validity | INV | Invalid — value is not a member of the permitted data values |
| | OTH | Other — value falls outside the coding system |
| | UNC | Unencoded — raw source data, not properly encoded to constraints |
| Derived | DER | Derived — value must be calculated from provided information |
| Measurement | NINF | Negative Infinity — reading below the instrument's measurable range |
| | PINF | Positive Infinity — reading above the instrument's measurable range |
| | TRC | Trace — detected but too small to quantify |
| | QS | Sufficient Quantity — non-zero but unspecified; constitutes the bulk of the material |
Each EV type is itself a restriction of the abstract ExceptionalValueType, with a fixed ev-name string. Domain models can further restrict this set or add domain-specific EV subtypes for additional failure modes. The element name is prefixed with ev- for machine-sortable filtering.
This creates a third category between "valid" and "rejected"—formally quarantined—and opens algorithmic opportunities:
Constraint-Bounded Imputation
Standard imputation guesses missing values from statistical patterns. In SDC, the EV-tagged element sits inside a restriction lattice that defines the exact legal value space. A missing XdQuantity with minInclusive=95, maxInclusive=105 constrains the imputation to that range. A missing XdString with enumeration=["A","B","C","D"] reduces imputation to a four-way classification. The result: constraint-bounded imputation is more accurate because the search space is formally reduced before the statistical model starts.
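The bounding step itself is simple. A sketch assuming the statistical model has already produced an estimate (a number for quantities, a score map for enumerated strings); facet dicts again stand in for the compiled schema.

```python
def bounded_impute(estimate, facets):
    """Constrain a statistical estimate to the legal value space."""
    if "enumeration" in facets:
        # estimate is a {candidate: probability} map; only legal
        # enumeration members are considered.
        legal = {k: v for k, v in estimate.items()
                 if k in facets["enumeration"]}
        return max(legal, key=legal.get)
    # Numeric case: clamp into [minInclusive, maxInclusive].
    lo = facets.get("minInclusive", float("-inf"))
    hi = facets.get("maxInclusive", float("inf"))
    return min(max(estimate, lo), hi)
```

Clamping is the crudest possible bounding; a better model would condition the estimate on the range from the start, but even this post-hoc version can never emit an illegal value.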
Predictive Maintenance via EV Trends
A sequence of EVs on the same element across successive instances is a degradation signal. The specific EV code carries diagnostic information: a shift from NAV (sensor not responding) to PINF (reading above instrument range) suggests the sensor is still alive but physically failing—a different maintenance response than total communication loss. A temperature sensor shifting from 99.9% valid to 95% over a week, with failures moving from TRC (trace amounts) to INV (out of range), reveals a calibration drift pattern. When EVs appear on correlated elements within the same Cluster, the failure may be systemic. This is predictive maintenance that requires zero additional instrumentation—the constraint model generates the failure signal as a side effect of validation.
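A degradation monitor over an element's EV history can be sketched as a sliding-window summary. The thresholds and the list-of-results representation here are purely illustrative assumptions.

```python
from collections import Counter

def ev_trend(window):
    """Summarize one element's recent history: each entry is either a
    valid value or an EV code string ("NAV", "PINF", ...)."""
    evs = [x for x in window if isinstance(x, str)]
    rate = 1 - len(evs) / len(window)
    counts = Counter(evs)
    return {
        "valid_rate": rate,
        "dominant_ev": counts.most_common(1)[0][0] if evs else None,
        "degrading": rate < 0.99,  # illustrative alert threshold
    }
```

The dominant EV code is what carries the diagnostic payload: PINF dominating over NAV points at a live but out-of-range sensor rather than a dead link.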
Query-Time Quality Filtering
Because EVs are formal, machine-readable tags at the element level, downstream queries select their quality threshold: strict mode (fully validated data only), inclusive mode (partial data acceptable), or diagnostic mode (only failures). The graph knows its own quality at the element level. No external data quality layer needed.
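The three modes reduce to a single predicate over EV presence. A sketch using the 16 codes from the table above, with instances as dicts of element values; how EVs are actually encoded in extracted triples is not shown here.

```python
EV_CODES = {"NI", "UNK", "NASK", "ASKU", "ASKR", "NAV", "NA", "MSK",
            "INV", "OTH", "UNC", "DER", "NINF", "PINF", "TRC", "QS"}

def filter_instances(instances, mode="strict"):
    """Select instances by quality mode: strict (no EVs), diagnostic
    (only instances with EVs), or inclusive (everything)."""
    def has_ev(inst):
        return any(v in EV_CODES for v in inst.values())
    if mode == "strict":
        return [i for i in instances if not has_ev(i)]
    if mode == "diagnostic":
        return [i for i in instances if has_ev(i)]
    return list(instances)  # inclusive
```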
Component Reuse and Cross-Domain Identity
SDC components are identified by immutable CUID2s. A compiled xsd:complexType for "systolic blood pressure in mmHg"—with its specific restriction facets, semantic links, and embedded RDF—is a reusable artifact. The same component can appear in a cardiology model, an emergency department model, and a clinical trial model. It is the same artifact, referenced by identity.
- Deterministic cross-domain graph bridges—When two models share a reused component, the extracted graphs are connected by that shared CUID2 identity. Not probabilistically matched. Identical by construction.
- Reuse-weighted catalog metrics—A component used in 15 models across 4 domains has been validated against diverse real-world data. Reuse count is a quality signal.
- Interoperability scoring—The ratio of shared components to total components is a deterministic, precise metric computable from schema definitions alone, before any data exists.
- Transfer learning via shared components—Reused components provide anchor points that align embedding spaces across domains without explicit transfer learning setup.
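The interoperability score in the list above is a set ratio over component identifiers. A sketch with made-up placeholder strings standing in for real CUID2s:

```python
def interoperability_score(model_a, model_b):
    """Shared components over total distinct components (Jaccard ratio),
    computable from component identifier sets alone, before any data."""
    a, b = set(model_a), set(model_b)
    return len(a & b) / len(a | b)
```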
The SHACL Layer
SHACL shapes are compiled from the same model instances as the XSD, expressing constraints in RDF-native terms. This creates a second formal system for algorithms to exploit.
Shape Analysis and Query Generation
- Shape composition and decomposition—Decompose complex Cluster shapes into minimal independent constraint groups. Reveals which components can be validated independently versus which are coupled by cross-references.
- SHACL-to-SPARQL query generation—Mechanically transform shapes into SPARQL queries. Each sh:property path becomes a graph pattern; each constraint becomes a FILTER. This enables constraint checking inside systems that support SPARQL but not SHACL natively.
- Shape coverage testing—Given a shape and a graph, compute conformance rates, violation clustering, and shape tightening recommendations based on the conforming population.
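The shape-to-query transformation can be illustrated for a tiny subset (one property path, a numeric range), with a dict standing in for parsed SHACL. The generated query selects violating nodes; property and prefix names are hypothetical.

```python
def shape_to_sparql(shape):
    """Translate a one-path numeric-range shape fragment into a SPARQL
    query that finds nodes violating the constraint (sketch only)."""
    path = shape["path"]
    filters = []
    if "minInclusive" in shape:
        filters.append(f"?v < {shape['minInclusive']}")
    if "maxInclusive" in shape:
        filters.append(f"?v > {shape['maxInclusive']}")
    cond = " || ".join(filters)
    return (f"SELECT ?node WHERE {{ ?node {path} ?v . "
            f"FILTER ({cond}) }}")
```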
Schema-Level Semantics
The RDF triples embedded in xsd:appinfo are schema-level semantic annotations. Each component carries rdf:type, rdfs:label, rdfs:comment, and semantic links from W3C/BFO vocabularies pointing to ontology URIs.
Ontological Analysis and Similarity
- Ontological consistency checking—If a component uses predicate schema:hasUnit with object qudt:Kilogram, verify that the predicate's domain/range declarations are consistent with the component's type. An XdString annotated with qudt:Kilogram is semantically suspicious.
- Three-layer semantic similarity—Two components are similar if they share ontology links, restrict the same reference model base type, and have overlapping constraint spaces. This triple metric is rigorously computable from the schema, not statistically estimated from data.
- Reuse discovery via schema-level graph analysis—Community detection on the schema-level graph identifies natural component groupings. Graph traversal finds reuse candidates by ontological proximity, without text search.
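The three-layer metric can be sketched as an unweighted average of link overlap, base-type match, and constraint-range overlap. All field names and the equal weighting are illustrative assumptions, not the SDC specification.

```python
def semantic_similarity(a, b):
    """Three-layer similarity over dict-form components (sketch)."""
    # Layer 1: Jaccard overlap of ontology link sets.
    links = len(a["links"] & b["links"]) / len(a["links"] | b["links"])
    # Layer 2: same reference model base type or not.
    base = 1.0 if a["base_type"] == b["base_type"] else 0.0
    # Layer 3: fractional overlap of numeric constraint ranges.
    lo, hi = max(a["min"], b["min"]), min(a["max"], b["max"])
    span = max(a["max"], b["max"]) - min(a["min"], b["min"])
    overlap = max(0.0, hi - lo) / span if span else 1.0
    return (links + base + overlap) / 3
```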
Graph Machine Learning
Only after the constraint, SHACL, and semantic layers do we reach graph ML on instance data. The key difference: every node and edge was admitted by a formally verified constraint pipeline. The algorithms do not need to learn the "laws of physics"—those are compiled into the schema. They learn the "weather."
Knowledge Graph Embeddings
Standard embedding methods (TransE, ComplEx, DistMult, RotatE, TuckER) translate RDF triples into dense vectors. SDC's tight predicate vocabulary keeps the embedding space dense. A graph with 40 well-defined predicates produces vastly better embeddings than one with 4,000 ad-hoc predicates.
SDC enhancement: Use the constraint model as embedding initialization. Instead of random initialization, initialize entity embeddings based on their schema's position in the restriction lattice. Reused components share embeddings by identity—they are the same node, not similar nodes. This provides deterministic anchor points that align embedding spaces across domains.
Relational Graph Neural Networks
R-GCNs allocate a weight matrix per relation type. SDC's constrained predicate set prevents the parameter explosion that kills R-GCNs on undisciplined graphs. Heterogeneous Graph Transformers (HGTs) and Relational Graph Attention Networks (R-GATs) extend this with attention mechanisms.
SDC enhancement: Use SHACL shapes as attention priors. If the shape says a property is required (vs. optional), the attention mechanism weights that relation higher initially. The model still learns, but starts from a structurally informed position.
GraphRAG and Quality-Stratified ML
SDC's self-describing instances (root element sdc4:dm-{ct_id} + xsi:schemaLocation) mean every triple carries full provenance back to the compiled schema. GraphRAG over SDC data can cite not just "this fact came from this graph" but "this fact was validated against constraint X in schema Y."
Every graph algorithm can be run in quality-stratified mode by filtering on EV presence: train on clean subgraphs only, build separate failure embeddings, bias random walks to avoid or target EV-tagged nodes, and report confidence levels that include quality provenance in LLM citations.
The Research Horizon
These algorithms do not exist yet. They become possible only with formal constraints at the quality level SDC provides.
Constraint-Aware Neural Networks
Encode XSD restriction facets and SHACL shapes as loss function constraints, analogous to Physics-Informed Neural Networks (PINNs). The model is penalized for any prediction that violates a compiled constraint. Training can converge substantially faster because the constraint set eliminates entire regions of the parameter space.
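The penalty term itself is straightforward. A pure-Python sketch for a numeric-range facet; in real use this would be a differentiable term added to the training loss, and the weight is an arbitrary hyperparameter.

```python
def constraint_penalty(pred, facets, weight=10.0):
    """PINN-style loss term: quadratic penalty for predictions outside
    the compiled value space (illustrative, not differentiable-framework
    code)."""
    lo = facets.get("minInclusive", float("-inf"))
    hi = facets.get("maxInclusive", float("inf"))
    violation = max(0.0, lo - pred) + max(0.0, pred - hi)
    return weight * violation ** 2
```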
Cross-Domain Federated Learning
Component reuse means domain graphs share actual schema artifacts—same CUID2, same constraint model, same semantic links. Models trained on supply chain data and clinical data share a common semantic coordinate system anchored by reused components. Federated learning across domains without manual schema reconciliation.
Constrained LLM Decoding
The compiled XSD restriction model is mechanically translatable into a context-free grammar. During LLM inference, a grammar mask eliminates any next-token candidate that would violate the schema. The model is physically incapable of generating a payload that violates the constraints.
Libraries implementing grammar-constrained decoding already exist (Outlines, llama.cpp GBNF, Microsoft Guidance). What is missing is the compiler that translates an SDC XSD restriction set into the grammar format these engines consume. The XSD facet model maps cleanly to production rules in a context-free grammar.
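For the simplest facet, an enumeration, the compilation is a one-liner. A sketch whose output format is modeled on llama.cpp-style GBNF production rules; the rule name and the exact dialect details are assumptions.

```python
def enumeration_to_gbnf(name, members):
    """Compile an XSD enumeration facet into a GBNF-style production
    rule: each legal member becomes a quoted literal alternative."""
    alts = " | ".join(f'"{m}"' for m in members)
    return f"{name} ::= {alts}"
```

Numeric ranges, patterns, and length bounds need more elaborate rule generation, but each facet kind maps to a fixed production-rule template in the same way.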
Schema-Guided Synthetic Data
Generate synthetic training data by sampling from the constraint space defined by a schema. Instead of GANs learning the data distribution from examples, the schema defines the valid distribution. Every synthetic instance is guaranteed valid by construction. This solves the "not enough training data" problem for schema-constrained domains—particularly in air-gapped environments where real data cannot be exposed.
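Validity by construction follows directly from sampling inside the facet-defined space. A sketch for enumeration and numeric-range facets only, with uniform sampling as a deliberately naive placeholder for a realistic distribution.

```python
import random

def sample_valid(facets, rng):
    """Draw one value guaranteed to satisfy the facets (sketch)."""
    if "enumeration" in facets:
        return rng.choice(facets["enumeration"])
    # Uniform over the legal numeric range; a production generator
    # would sample from a more realistic distribution over this range.
    return rng.uniform(facets["minInclusive"], facets["maxInclusive"])
```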
Schema-Space Federated Query Routing
In a distributed enterprise, current approaches search actual data to find answers. Schema-space routing runs algorithms over compiled schemas—their embedded RDF annotations, BFO groundings, and constraint structures—to determine which systems hold relevant data without touching a single record. Query planning happens in the schema space; data access happens only at the final execution step, under full access controls.
The standard industry approach treats the graph as the starting point and tries to learn constraints from data. SDC inverts this: constraints are compiled first, the graph is populated under those constraints, and algorithms operate on data whose formal properties are already known. This inversion changes what is algorithmically possible.
Built on Formal Foundations
The algorithmic opportunities described here are consequences of SDC's theoretical architecture. See how zero-entropy semantics and two-level modeling create the formal substrate that makes constraint-first computing possible.
About Axius SDC
The Semantic Data Charter is developed by Axius SDC, Inc., an international team with 40+ years combined experience in semantic data and health informatics across the United States, Canada, and Brazil.