Constraint-First Computing
The standard approach treats the graph as the starting point and tries to learn constraints from data. SDC inverts this: constraints are compiled first, then algorithms operate on data whose formal properties are already known.
The source of truth in SDC is the compiled XSD schema—restriction-only complex types, embedded RDF annotations, XPath asserts on Clusters, and the full constraint facet model (enumerations, patterns, length bounds, numeric ranges, temporal restrictions, cardinality, precision). The SHACL shapes are a parallel compilation from the same model instances. The RDF triples are extracted from the schema. The graph is a derivative.
Any discussion of algorithms must start at the constraint layer, because that is where the formal guarantees live. The graph inherits those guarantees—it does not create them.
This inversion—compile constraints first, populate data under those constraints, then run algorithms on formally characterized data—is not incremental. It changes what is algorithmically possible. The sections below progress from the constraint layer through SHACL, embedded RDF semantics, and graph machine learning, with each layer building on the guarantees established by the one below it.
The Constraint Layer
These algorithms operate on the compiled XSD schemas and the restriction lattice itself—before any data populates the graph.
Constraint Satisfiability Analysis
Every SDC component compiles to a set of XSD restriction facets: minLength, maxLength, pattern, enumeration, minInclusive, maxInclusive, totalDigits, fractionDigits, plus Cluster-level XPath xsd:assert statements. These facets define a constraint satisfaction problem (CSP).
Algorithms from CSP theory apply directly:
- Satisfiability checking—Given a compiled schema, can any valid instance exist? For simple facets this is trivial (is minInclusive less than maxInclusive?), but Cluster asserts introduce cross-field XPath constraints that can create subtle impossibilities. Arc-consistency algorithms detect these conflicts at compile time, before any data is created.
- Constraint tightness analysis—How constrained is this schema relative to its reference model base type? A schema with minLength=5, maxLength=10, pattern=^[A-Z][a-z]+$ is measurably tighter than one with just maxLength=255. This metric quantifies schema quality.
- Redundancy detection—Does any facet add no additional constraint beyond what another already enforces? For example, minLength=3 is redundant if pattern=^\d{5}$ is also present.
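The two simplest checks above can be sketched in a few lines. This is a minimal illustration assuming a hypothetical dict representation of facets; a real implementation would operate on the compiled XSD and cover the full facet model plus Cluster asserts.

```python
import re

def facets_satisfiable(facets):
    """Can any value satisfy this facet set? (sketch over a dict stand-in)"""
    lo, hi = facets.get("minInclusive"), facets.get("maxInclusive")
    if lo is not None and hi is not None and lo > hi:
        return False  # empty numeric range
    mn, mx = facets.get("minLength"), facets.get("maxLength")
    if mn is not None and mx is not None and mn > mx:
        return False  # empty length range
    enum, pat = facets.get("enumeration"), facets.get("pattern")
    if enum is not None and pat is not None:
        # At least one enumeration member must also match the pattern.
        return any(re.fullmatch(pat, v) for v in enum)
    return True

def min_length_redundant(facets):
    """minLength adds nothing if the pattern already forces longer strings.

    Conservative check for fixed-length patterns like ^\\d{5}$ only."""
    pat, mn = facets.get("pattern"), facets.get("minLength")
    if pat is None or mn is None:
        return False
    m = re.fullmatch(r"\^?\\d\{(\d+)\}\$?", pat)
    return m is not None and int(m.group(1)) >= mn
```

A full arc-consistency pass would extend this pairwise reasoning across all facets and cross-field asserts; the structure of the check stays the same.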
The Restriction Lattice
Because SDC uses only xsd:restriction, never xsd:extension, the type hierarchy forms a mathematical lattice. Every type's value space is a proper subset of its parent's. This monotonic narrowing has algorithmic consequences:
- Subsumption checking—Given two schemas restricting the same base type, does every valid instance of A also satisfy B? This enables automatic compatibility analysis between schemas from different projects or domains.
- Greatest lower bound (GLB)—The loosest schema still tighter than both inputs. The schema intersection—useful for merging constraints from two sources describing the same concept.
- Least upper bound (LUB)—The tightest schema that admits all instances valid under either input. The schema union—useful for finding the common constraint core shared by related schemas.
- Lattice distance—A rigorous, non-probabilistic similarity measure between data definitions. Fundamentally different from fuzzy text matching on field names.
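For numeric-range facets, meet and join are directly computable. A sketch under that restriction, with dict-form schemas as a hypothetical stand-in for compiled XSD; patterns, enumerations, and lengths would need their own lattice operations.

```python
def subsumes(a, b):
    """True if every value admitted by facet set a is also admitted by b."""
    return (b.get("minInclusive", float("-inf")) <= a.get("minInclusive", float("-inf"))
            and a.get("maxInclusive", float("inf")) <= b.get("maxInclusive", float("inf")))

def glb(a, b):
    """Greatest lower bound: intersection of the two value spaces."""
    return {"minInclusive": max(a["minInclusive"], b["minInclusive"]),
            "maxInclusive": min(a["maxInclusive"], b["maxInclusive"])}

def lub(a, b):
    """Least upper bound: tightest range admitting both value spaces."""
    return {"minInclusive": min(a["minInclusive"], b["minInclusive"]),
            "maxInclusive": max(a["maxInclusive"], b["maxInclusive"])}
```

By construction, glb(a, b) subsumes into both inputs and both inputs subsume into lub(a, b), which is exactly the lattice property the monotonic restriction discipline guarantees.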
Application: Structural Catalog Search
When searching for components, lattice-aware search finds components whose constraint structure is similar—same numeric range, same units, same temporal resolution—even if the labels differ entirely. This is structural similarity, not lexical similarity.
Schema Differencing and Evolution
SDC schemas are never versioned. When a schema is superseded, the new schema's links field points backward to what it replaces. The old schema is immutable. This creates a directed acyclic graph of supersession.
- Constraint diff—Compute exactly which facets changed between a schema and its successor. The diff is computable because both schemas are formal restriction sets against the same reference model base type.
- Breaking change detection—A change is "breaking" if the successor's value space is not a superset of the predecessor's. Because the facet model is formal, this is decidable, not heuristic.
- Evolution trajectory—Track how constraints have tightened or loosened across a chain of supersessions. A leading indicator of schema maturity.
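The diff and breaking-change checks above reduce to set comparison over facets. A sketch, again using hypothetical dicts in place of compiled restriction sets, and checking the superset condition for numeric-range facets only:

```python
def constraint_diff(old, new):
    """Facet-level diff between a schema and its successor."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k)) for k in keys
            if old.get(k) != new.get(k)}

def is_breaking(old, new):
    """Breaking iff the successor's value space is not a superset of the
    predecessor's (numeric-range facets only in this sketch)."""
    return (new.get("minInclusive", float("-inf")) > old.get("minInclusive", float("-inf"))
            or new.get("maxInclusive", float("inf")) < old.get("maxInclusive", float("inf")))
```

Walking a supersession chain and applying is_breaking at each step yields the evolution trajectory directly.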
Formalized Failure: Exceptional Values
Real-world data is messy. IoT sensors fail slowly. Form fields get skipped. Measurement devices drift. A constraint model that rejects every non-conforming instance wholesale discards expensive, partially valid data and destroys the failure signal.
SDC handles this with Exceptional Values (EVs)—16 formal types derived from ISO 21090 Null Flavours and extended for real-world usage. When an element fails validation, the failing element is replaced by a specific EV type that records why the value is absent or invalid. The rest of the instance is preserved with full validity. The constraint model does not bend—the violation is formally recorded.
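The replacement step can be sketched as follows, assuming a dict-based element representation and using two EV codes (NI and INV) from the vocabulary; the real SDC pipeline operates on compiled XSD and validated XML, not Python dicts.

```python
def admit(value, facets):
    """Validate one element; on failure, substitute an EV record rather
    than rejecting the whole instance (illustrative sketch)."""
    if value is None:
        return {"ev": "NI"}  # No Information: value missing entirely
    lo = facets.get("minInclusive", float("-inf"))
    hi = facets.get("maxInclusive", float("inf"))
    if not (lo <= value <= hi):
        # Invalid: out of the permitted range; raw value preserved
        return {"ev": "INV", "raw": value}
    return value  # valid: passes through unchanged
```

The rest of the instance is untouched either way; only the failing element is swapped for its EV record.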
The EV vocabulary goes far beyond simple "null" flags. It distinguishes 16 structurally distinct failure modes:
| Category | Code | Meaning |
|---|---|---|
| Missing | NI | No Information — general default for missing, omitted, or incomplete data |
| | UNK | Unknown — a proper value is applicable but not known |
| | NASK | Not Asked — information was never sought |
| | ASKU | Asked but Unknown — sought but not found (e.g., patient didn't know) |
| | ASKR | Asked and Refused — sought but subject refused to provide |
| | NAV | Not Available — not available, reason unknown |
| | NA | Not Applicable — no proper value applies (e.g., cigarettes/day for a non-smoker) |
| Security | MSK | Masked — value exists but withheld for security or privacy reasons |
| Validity | INV | Invalid — value is not a member of the permitted data values |
| | OTH | Other — value falls outside the coding system |
| | UNC | Unencoded — raw source data, not properly encoded to constraints |
| Derived | DER | Derived — value must be calculated from provided information |
| Measurement | NINF | Negative Infinity — reading below the instrument's measurable range |
| | PINF | Positive Infinity — reading above the instrument's measurable range |
| | TRC | Trace — detected but too small to quantify |
| | QS | Sufficient Quantity — non-zero but unspecified; constitutes the bulk of the material |
Each EV type is itself a restriction of the abstract ExceptionalValueType, with a fixed ev-name string. Domain models can further restrict this set or add domain-specific EV subtypes for additional failure modes. The element name is prefixed with ev- for machine-sortable filtering.
This creates a third category between "valid" and "rejected"—formally quarantined—and opens algorithmic opportunities:
Constraint-Bounded Imputation
Standard imputation guesses missing values from statistical patterns. In SDC, the EV-tagged element sits inside a restriction lattice that defines the exact legal value space. A missing XdQuantity with minInclusive=95, maxInclusive=105 constrains the imputation to that range. A missing XdString with enumeration=["A","B","C","D"] reduces imputation to a four-way classification. The result: constraint-bounded imputation is more accurate because the search space is formally reduced before the statistical model starts.
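The bounding step itself is simple. A sketch assuming the statistical model has already produced an estimate (a number for quantities, a score map for enumerated strings); facet dicts again stand in for the compiled schema.

```python
def bounded_impute(estimate, facets):
    """Constrain a statistical estimate to the legal value space."""
    if "enumeration" in facets:
        # estimate is a {candidate: probability} map; only legal
        # enumeration members are considered.
        legal = {k: v for k, v in estimate.items()
                 if k in facets["enumeration"]}
        return max(legal, key=legal.get)
    # Numeric case: clamp into [minInclusive, maxInclusive].
    lo = facets.get("minInclusive", float("-inf"))
    hi = facets.get("maxInclusive", float("inf"))
    return min(max(estimate, lo), hi)
```

Clamping is the crudest possible bounding; a better model would condition the estimate on the range from the start, but even this post-hoc version can never emit an illegal value.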
Predictive Maintenance via EV Trends
A sequence of EVs on the same element across successive instances is a degradation signal. The specific EV code carries diagnostic information: a shift from NAV (sensor not responding) to PINF (reading above instrument range) suggests the sensor is still alive but physically failing—a different maintenance response than total communication loss. A temperature sensor shifting from 99.9% valid to 95% over a week, with failures moving from TRC (trace amounts) to INV (out of range), reveals a calibration drift pattern. When EVs appear on correlated elements within the same Cluster, the failure may be systemic. This is predictive maintenance that requires zero additional instrumentation—the constraint model generates the failure signal as a side effect of validation.
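A degradation monitor over an element's EV history can be sketched as a sliding-window summary. The thresholds and the list-of-results representation here are purely illustrative assumptions.

```python
from collections import Counter

def ev_trend(window):
    """Summarize one element's recent history: each entry is either a
    valid value or an EV code string ("NAV", "PINF", ...)."""
    evs = [x for x in window if isinstance(x, str)]
    rate = 1 - len(evs) / len(window)
    counts = Counter(evs)
    return {
        "valid_rate": rate,
        "dominant_ev": counts.most_common(1)[0][0] if evs else None,
        "degrading": rate < 0.99,  # illustrative alert threshold
    }
```

The dominant EV code is what carries the diagnostic payload: PINF dominating over NAV points at a live but out-of-range sensor rather than a dead link.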
Query-Time Quality Filtering
Because EVs are formal, machine-readable tags at the element level, downstream queries select their quality threshold: strict mode (fully validated data only), inclusive mode (partial data acceptable), or diagnostic mode (only failures). The graph knows its own quality at the element level. No external data quality layer needed.
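The three modes reduce to a single predicate over EV presence. A sketch using the 16 codes from the table above, with instances as dicts of element values; how EVs are actually encoded in extracted triples is not shown here.

```python
EV_CODES = {"NI", "UNK", "NASK", "ASKU", "ASKR", "NAV", "NA", "MSK",
            "INV", "OTH", "UNC", "DER", "NINF", "PINF", "TRC", "QS"}

def filter_instances(instances, mode="strict"):
    """Select instances by quality mode: strict (no EVs), diagnostic
    (only instances with EVs), or inclusive (everything)."""
    def has_ev(inst):
        return any(v in EV_CODES for v in inst.values())
    if mode == "strict":
        return [i for i in instances if not has_ev(i)]
    if mode == "diagnostic":
        return [i for i in instances if has_ev(i)]
    return list(instances)  # inclusive
```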
Component Reuse and Cross-Domain Identity
SDC components are identified by immutable CUID2s. A compiled xsd:complexType for "systolic blood pressure in mmHg"—with its specific restriction facets, semantic links, and embedded RDF—is a reusable artifact. The same component can appear in a cardiology model, an emergency department model, and a clinical trial model. It is the same artifact, referenced by identity.
- Deterministic cross-domain graph bridges—When two models share a reused component, the extracted graphs are connected by that shared CUID2 identity. Not probabilistically matched. Identical by construction.
- Reuse-weighted catalog metrics—A component used in 15 models across 4 domains has been validated against diverse real-world data. Reuse count is a quality signal.
- Interoperability scoring—The ratio of shared components to total components is a deterministic, precise metric computable from schema definitions alone, before any data exists.
- Transfer learning via shared components—Reused components provide anchor points that align embedding spaces across domains without explicit transfer learning setup.
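The interoperability score in the list above is a set ratio over component identifiers. A sketch with made-up placeholder strings standing in for real CUID2s:

```python
def interoperability_score(model_a, model_b):
    """Shared components over total distinct components (Jaccard ratio),
    computable from component identifier sets alone, before any data."""
    a, b = set(model_a), set(model_b)
    return len(a & b) / len(a | b)
```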
The SHACL Layer
SHACL shapes are compiled from the same model instances as the XSD, expressing constraints in RDF-native terms. This creates a second formal system for algorithms to exploit.
Shape Analysis and Query Generation
- Shape composition and decomposition—Decompose complex Cluster shapes into minimal independent constraint groups. Reveals which components can be validated independently versus which are coupled by cross-references.
- SHACL-to-SPARQL query generation—Mechanically transform shapes into SPARQL queries. Each sh:property path becomes a graph pattern; each constraint becomes a FILTER. This enables constraint checking inside systems that support SPARQL but not SHACL natively.
- Shape coverage testing—Given a shape and a graph, compute conformance rates, violation clustering, and shape tightening recommendations based on the conforming population.
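The shape-to-query transformation can be illustrated for a tiny subset (one property path, a numeric range), with a dict standing in for parsed SHACL. The generated query selects violating nodes; property and prefix names are hypothetical.

```python
def shape_to_sparql(shape):
    """Translate a one-path numeric-range shape fragment into a SPARQL
    query that finds nodes violating the constraint (sketch only)."""
    path = shape["path"]
    filters = []
    if "minInclusive" in shape:
        filters.append(f"?v < {shape['minInclusive']}")
    if "maxInclusive" in shape:
        filters.append(f"?v > {shape['maxInclusive']}")
    cond = " || ".join(filters)
    return (f"SELECT ?node WHERE {{ ?node {path} ?v . "
            f"FILTER ({cond}) }}")
```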
Schema-Level Semantics
The RDF triples embedded in xsd:appinfo are schema-level semantic annotations. Each component carries rdf:type, rdfs:label, rdfs:comment, and semantic links from W3C/BFO vocabularies pointing to ontology URIs.
Ontological Analysis and Similarity
- Ontological consistency checking—If a component uses predicate schema:hasUnit with object qudt:Kilogram, verify that the predicate's domain/range declarations are consistent with the component's type. An XdString annotated with qudt:Kilogram is semantically suspicious.
- Three-layer semantic similarity—Two components are similar if they share ontology links, restrict the same reference model base type, and have overlapping constraint spaces. This triple metric is rigorously computable from the schema, not statistically estimated from data.
- Reuse discovery via schema-level graph analysis—Community detection on the schema-level graph identifies natural component groupings. Graph traversal finds reuse candidates by ontological proximity, without text search.
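The three-layer metric can be sketched as an unweighted average of link overlap, base-type match, and constraint-range overlap. All field names and the equal weighting are illustrative assumptions, not the SDC specification.

```python
def semantic_similarity(a, b):
    """Three-layer similarity over dict-form components (sketch)."""
    # Layer 1: Jaccard overlap of ontology link sets.
    links = len(a["links"] & b["links"]) / len(a["links"] | b["links"])
    # Layer 2: same reference model base type or not.
    base = 1.0 if a["base_type"] == b["base_type"] else 0.0
    # Layer 3: fractional overlap of numeric constraint ranges.
    lo, hi = max(a["min"], b["min"]), min(a["max"], b["max"])
    span = max(a["max"], b["max"]) - min(a["min"], b["min"])
    overlap = max(0.0, hi - lo) / span if span else 1.0
    return (links + base + overlap) / 3
```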
Graph Machine Learning
Only after the constraint, SHACL, and semantic layers do we reach graph ML on instance data. The key difference: every node and edge was admitted by a formally verified constraint pipeline. The algorithms do not need to learn the "laws of physics"—those are compiled into the schema. They learn the "weather."
Knowledge Graph Embeddings
Standard embedding methods (TransE, ComplEx, DistMult, RotatE, TuckER) translate RDF triples into dense vectors. SDC's tight predicate vocabulary keeps the embedding space dense. A graph with 40 well-defined predicates produces vastly better embeddings than one with 4,000 ad-hoc predicates.
SDC enhancement: Use the constraint model as embedding initialization. Instead of random initialization, initialize entity embeddings based on their schema's position in the restriction lattice. Reused components share embeddings by identity—they are the same node, not similar nodes. This provides deterministic anchor points that align embedding spaces across domains.
Relational Graph Neural Networks
R-GCNs allocate a weight matrix per relation type. SDC's constrained predicate set prevents the parameter explosion that kills R-GCNs on undisciplined graphs. Heterogeneous Graph Transformers (HGTs) and Relational Graph Attention Networks (R-GATs) extend this with attention mechanisms.
SDC enhancement: Use SHACL shapes as attention priors. If the shape says a property is required (vs. optional), the attention mechanism weights that relation higher initially. The model still learns, but starts from a structurally informed position.
GraphRAG and Quality-Stratified ML
SDC's self-describing instances (root element sdc4:dm-{ct_id} + xsi:schemaLocation) mean every triple carries full provenance back to the compiled schema. GraphRAG over SDC data can cite not just "this fact came from this graph" but "this fact was validated against constraint X in schema Y."
Every graph algorithm can be run in quality-stratified mode by filtering on EV presence: train on clean subgraphs only, build separate failure embeddings, bias random walks to avoid or target EV-tagged nodes, and report confidence levels that include quality provenance in LLM citations.
The Research Horizon
These algorithms do not exist yet. They become possible only with formal constraints at the quality level SDC provides.
Constraint-Aware Neural Networks
Encode XSD restriction facets and SHACL shapes as loss function constraints, analogous to Physics-Informed Neural Networks (PINNs). The model is penalized for any prediction that violates a compiled constraint. Training can converge substantially faster because the constraint set eliminates entire regions of the parameter space.
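The penalty term itself is straightforward. A pure-Python sketch for a numeric-range facet; in real use this would be a differentiable term added to the training loss, and the weight is an arbitrary hyperparameter.

```python
def constraint_penalty(pred, facets, weight=10.0):
    """PINN-style loss term: quadratic penalty for predictions outside
    the compiled value space (illustrative, not differentiable-framework
    code)."""
    lo = facets.get("minInclusive", float("-inf"))
    hi = facets.get("maxInclusive", float("inf"))
    violation = max(0.0, lo - pred) + max(0.0, pred - hi)
    return weight * violation ** 2
```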
Cross-Domain Federated Learning
Component reuse means domain graphs share actual schema artifacts—same CUID2, same constraint model, same semantic links. Models trained on supply chain data and clinical data share a common semantic coordinate system anchored by reused components. Federated learning across domains without manual schema reconciliation.
Constrained LLM Decoding
The compiled XSD restriction model is mechanically translatable into a context-free grammar. During LLM inference, a grammar mask eliminates any next-token candidate that would violate the schema. The model is physically incapable of generating a payload that violates the constraints.
Libraries implementing grammar-constrained decoding already exist (Outlines, llama.cpp GBNF, Microsoft Guidance). What is missing is the compiler that translates an SDC XSD restriction set into the grammar format these engines consume. The XSD facet model maps cleanly to production rules in a context-free grammar.
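For the simplest facet, an enumeration, the compilation is a one-liner. A sketch whose output format is modeled on llama.cpp-style GBNF production rules; the rule name and the exact dialect details are assumptions.

```python
def enumeration_to_gbnf(name, members):
    """Compile an XSD enumeration facet into a GBNF-style production
    rule: each legal member becomes a quoted literal alternative."""
    alts = " | ".join(f'"{m}"' for m in members)
    return f"{name} ::= {alts}"
```

Numeric ranges, patterns, and length bounds need more elaborate rule generation, but each facet kind maps to a fixed production-rule template in the same way.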
Schema-Guided Synthetic Data
Generate synthetic training data by sampling from the constraint space defined by a schema. Instead of GANs learning the data distribution from examples, the schema defines the valid distribution. Every synthetic instance is guaranteed valid by construction. This solves the "not enough training data" problem for schema-constrained domains—particularly in air-gapped environments where real data cannot be exposed.
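Validity by construction follows directly from sampling inside the facet-defined space. A sketch for enumeration and numeric-range facets only, with uniform sampling as a deliberately naive placeholder for a realistic distribution.

```python
import random

def sample_valid(facets, rng):
    """Draw one value guaranteed to satisfy the facets (sketch)."""
    if "enumeration" in facets:
        return rng.choice(facets["enumeration"])
    # Uniform over the legal numeric range; a production generator
    # would sample from a more realistic distribution over this range.
    return rng.uniform(facets["minInclusive"], facets["maxInclusive"])
```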
Schema-Space Federated Query Routing
In a distributed enterprise, current approaches search actual data to find answers. Schema-space routing runs algorithms over compiled schemas—their embedded RDF annotations, BFO groundings, and constraint structures—to determine which systems hold relevant data without touching a single record. Query planning happens in the schema space; data access happens only at the final execution step, under full access controls.
The standard industry approach treats the graph as the starting point and tries to learn constraints from data. SDC inverts this: constraints are compiled first, the graph is populated under those constraints, and algorithms operate on data whose formal properties are already known. This inversion changes what is algorithmically possible.
Built on Formal Foundations
The algorithmic opportunities described here are consequences of SDC's theoretical architecture. See how zero-entropy semantics and two-level modeling create the formal substrate that makes constraint-first computing possible.
About Axius SDC
The Semantic Data Charter is developed by Axius SDC, Inc., an international team with 40+ years combined experience in semantic data and health informatics across the United States, Canada, and Brazil.