Fidelity

cifflow.fidelity.check

Fidelity comparison for CIF sources.

check_fidelity compares two CIF sources (file paths or pre-parsed CifFile objects) by ingesting both into in-memory SQLite databases and comparing the resulting data at the row level.
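The row-level idea can be illustrated with a minimal sketch that uses only the standard-library sqlite3 module. The table name, columns, and helper functions below are hypothetical and are not part of cifflow; the real ingestion step builds its tables from the schema.

```python
import sqlite3

def load_rows(rows):
    """Create an in-memory database holding one illustrative table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE atom_site (label TEXT, fract_x TEXT)")
    conn.executemany("INSERT INTO atom_site VALUES (?, ?)", rows)
    return conn

def row_set(conn):
    """Return the table contents as a set of tuples, so row order is ignored."""
    return set(conn.execute("SELECT label, fract_x FROM atom_site"))

# Two hypothetical sources: same rows in a different order, one value changed.
conn_a = load_rows([("C1", "0.25"), ("O1", "0.50")])
conn_b = load_rows([("O1", "0.50"), ("C1", "0.26")])

only_in_a = row_set(conn_a) - row_set(conn_b)
only_in_b = row_set(conn_b) - row_set(conn_a)
print(sorted(only_in_a))
print(sorted(only_in_b))
```

Rows present on one side only are what a row-level comparison would surface as mismatches; rows that match are never reported, whatever their order in the source file.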

Known limitations

ValueType for structured tables

ValueType is not stored for structured table columns; only the raw string value is persisted. ValueType fidelity for schema-known tags is therefore not checkable. For _cif_fallback, value_type is stored and compared directly.

SU fidelity in _cif_fallback

For structured tables, SU columns are normalised with Decimal.normalize() so that 0.001 and 0.0010 compare equal. For _cif_fallback, SU values are embedded in the full value(su) string (e.g. 3.992(1)) and are compared as raw strings. Equivalent SU representations such as 3.992(1) and 3.9920(10) will compare as unequal.
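The difference between the two comparison modes can be reproduced with the standard-library decimal module. The split_value_su helper below is a hypothetical illustration of how equivalent value(su) strings could be reconciled by a caller; it is not part of cifflow and handles only plain decimal notation without exponents.

```python
import re
from decimal import Decimal

# Structured-table style: normalising strips trailing zeros, so both forms agree.
same_su = str(Decimal("0.001").normalize()) == str(Decimal("0.0010").normalize())

# _cif_fallback style: the full value(su) strings are compared as-is.
raw_equal = "3.992(1)" == "3.9920(10)"

_VALUE_SU = re.compile(r"^([+-]?\d*\.?\d+)\((\d+)\)$")

def split_value_su(text):
    """Hypothetical helper: split 'value(su)' into normalised Decimals."""
    match = _VALUE_SU.match(text)
    if match is None:
        raise ValueError(f"not a value(su) string: {text!r}")
    value = Decimal(match.group(1))
    # The SU digits apply to the last decimal places of the value.
    su = Decimal(match.group(2)).scaleb(value.as_tuple().exponent)
    return value.normalize(), su.normalize()

print(same_su, raw_equal)
print(split_value_su("3.992(1)") == split_value_su("3.9920(10)"))
```

A caller who needs SU-aware comparison of fallback values could post-process reported mismatches along these lines, at the cost of treating differences in stated precision as insignificant.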

Default-filled values (_cif_synthetic)

Values filled from enumeration_default during ingestion are excluded from comparison. An explicit value in one source and a default-filled value in the other will therefore produce a "row_content" mismatch even when the two values are identical. (_cif_synthetic is specced but not yet implemented in the ingestion layer; this step is a no-op until it is.)

version parameter

The version parameter is not yet propagated to the parser as a fallback default. Version detection uses the file magic line; files without a magic line are parsed as CIF 1.1 regardless of the version argument.

UUID-keyed tables

When comparing sources where one uses natural primary keys and another uses generated UUID keys (e.g. ALL_BLOCKS output merging multiple CIF blocks), all PK columns of UUID-keyed tables and all FK columns pointing to those tables are stripped from the row representation in both connections. This allows content comparison without key-structure comparison.
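The effect of stripping key columns can be sketched as follows. The rows, column names, and the row_multiset helper are hypothetical and only illustrate the idea; the actual implementation works on database connections and derives the columns to drop from the schema.

```python
import uuid
from collections import Counter

def row_multiset(rows, drop=frozenset()):
    """Represent rows as a multiset of (column, value) tuples, minus dropped columns."""
    return Counter(
        tuple(sorted((col, val) for col, val in row.items() if col not in drop))
        for row in rows
    )

# Source A uses natural keys; source B uses generated UUID keys.
rows_a = [
    {"id": "C1", "label": "C1", "fract_x": "0.25"},
    {"id": "O1", "label": "O1", "fract_x": "0.50"},
]
rows_b = [
    {"id": str(uuid.uuid4()), "label": "C1", "fract_x": "0.25"},
    {"id": str(uuid.uuid4()), "label": "O1", "fract_x": "0.50"},
]

with_keys_equal = row_multiset(rows_a) == row_multiset(rows_b)
without_keys_equal = (
    row_multiset(rows_a, drop={"id"}) == row_multiset(rows_b, drop={"id"})
)
print(with_keys_equal, without_keys_equal)
```

Using a multiset rather than a set keeps duplicate rows significant once the key column that distinguished them has been removed.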

FidelityReport dataclass

Result of a check_fidelity call.

Attributes:

Name Type Description
passed bool

True when no mismatches were found.

mismatches list[FidelityMismatch]

Ordered list of all FidelityMismatch objects found.

Source code in src/cifflow/fidelity/check.py
@dataclass
class FidelityReport:
    """Result of a :func:`check_fidelity` call.

    Attributes
    ----------
    passed
        ``True`` when no mismatches were found.
    mismatches
        Ordered list of all :class:`FidelityMismatch` objects found.
    """

    passed: bool
    mismatches: list[FidelityMismatch]

FidelityMismatch dataclass

A single semantic difference found between two CIF sources.

Attributes:

Name Type Description
kind str

Machine-readable category (e.g. 'missing_block', 'value_mismatch').

source Literal['a', 'b', 'both']

Which source the mismatch is tied to: 'a', 'b', or 'both'.

description str

Human-readable explanation of the difference.

Source code in src/cifflow/fidelity/check.py
@dataclass
class FidelityMismatch:
    """A single semantic difference found between two CIF sources.

    Attributes
    ----------
    kind
        Machine-readable category (e.g. ``'missing_block'``, ``'value_mismatch'``).
    source
        Which source the mismatch is tied to: ``'a'``, ``'b'``, or ``'both'``.
    description
        Human-readable explanation of the difference.
    """

    kind: str
    source: Literal['a', 'b', 'both']
    description: str
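As a sketch of how a caller might consume these objects, the example below groups mismatches by kind. The two dataclasses are local stand-ins that mirror the definitions shown on this page so the snippet runs without cifflow installed; group_by_kind is a hypothetical helper, and the sample descriptions are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal

@dataclass
class FidelityMismatch:  # stand-in mirroring the dataclass documented above
    kind: str
    source: Literal['a', 'b', 'both']
    description: str

@dataclass
class FidelityReport:  # stand-in mirroring the dataclass documented above
    passed: bool
    mismatches: list[FidelityMismatch]

def group_by_kind(report):
    """Hypothetical helper: bucket mismatches by their machine-readable kind."""
    grouped = defaultdict(list)
    for mismatch in report.mismatches:
        grouped[mismatch.kind].append(mismatch)
    return dict(grouped)

report = FidelityReport(
    passed=False,
    mismatches=[
        FidelityMismatch('missing_block', 'b', 'block present in A only'),
        FidelityMismatch('value_mismatch', 'both', 'values differ'),
        FidelityMismatch('value_mismatch', 'both', 'values differ again'),
    ],
)

grouped = group_by_kind(report)
for kind, items in grouped.items():
    print(kind, len(items))
```

Because kind is a plain string, grouping or filtering on it needs no knowledge of the full set of categories the library may emit.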

check_fidelity(source_a, source_b, schema=None, *, version=CifVersion.CIF_2_0, report_file=None)

Compare two CIF sources for semantic equivalence.

Parameters:

Name Type Description Default
source_a 'str | pathlib.Path | CifFile'

First CIF source to compare. May be a file path (str or pathlib.Path) or a pre-parsed CifFile object.

required
source_b 'str | pathlib.Path | CifFile'

Second CIF source to compare. Same accepted types as source_a.

required
schema 'str | pathlib.Path | SchemaSpec | dict | None'

Schema to use for ingestion. None compares only _cif_fallback. Accepts SchemaSpec, .json cache path, or .dic DDLm dictionary path.

None
version CifVersion

Fallback CIF version for files without a magic line. Default CIF_2_0.

CIF_2_0
report_file 'str | pathlib.Path | None'

Optional path for a human-readable text report. If provided, the report is written (UTF-8) before returning, regardless of pass/fail.

None

Returns:

Type Description
FidelityReport

The comparison result. Parse and ingestion errors are captured in the report as mismatches; the function never raises for data errors. Schema loading failures propagate directly.

Source code in src/cifflow/fidelity/check.py
def check_fidelity(
    source_a: 'str | pathlib.Path | CifFile',
    source_b: 'str | pathlib.Path | CifFile',
    schema: 'str | pathlib.Path | SchemaSpec | dict | None' = None,
    *,
    version: CifVersion = CifVersion.CIF_2_0,
    report_file: 'str | pathlib.Path | None' = None,
) -> FidelityReport:
    """Compare two CIF sources for semantic equivalence.

    Parameters
    ----------
    source_a
        First CIF source to compare.  May be a file path (``str`` or
        ``pathlib.Path``) or a pre-parsed ``CifFile`` object.
    source_b
        Second CIF source to compare.  Same accepted types as *source_a*.
    schema
        Schema to use for ingestion.  ``None`` compares only
        ``_cif_fallback``.  Accepts ``SchemaSpec``, ``.json`` cache path, or
        ``.dic`` DDLm dictionary path.
    version
        Fallback CIF version for files without a magic line.  Default
        ``CIF_2_0``.
    report_file
        Optional path for a human-readable text report.  If provided, the
        report is written (UTF-8) before returning, regardless of pass/fail.

    Returns
    -------
    FidelityReport
        Parse and ingestion errors are captured in the report; never raises
        for data errors.  Schema loading failures propagate directly.
    """
    mismatches: list[FidelityMismatch] = []

    def _label(src: object) -> str:
        if isinstance(src, CifFile):
            return 'CifFile object'
        return str(src)

    label_a = _label(source_a)
    label_b = _label(source_b)

    def _finish(ms: list[FidelityMismatch]) -> FidelityReport:
        rep = FidelityReport(passed=len(ms) == 0, mismatches=ms)
        if report_file is not None:
            pathlib.Path(report_file).write_text(
                _format_report(rep, label_a, label_b, schema_spec), encoding='utf-8'
            )
        return rep

    # Schema loading — propagates on failure (programming error)
    schema_spec = _load_schema(schema)

    # --- Step 1: load and parse sources ---
    cif_a, parse_errors_a = _load_source(source_a, version)
    for e in parse_errors_a:
        loc = f' at line {e.line}' if e.line else ''
        mismatches.append(FidelityMismatch(
            kind='parse_error', source='a',
            description=f'{e.error_type} error in A{loc}: {e.message}',
        ))

    cif_b, parse_errors_b = _load_source(source_b, version)
    for e in parse_errors_b:
        loc = f' at line {e.line}' if e.line else ''
        mismatches.append(FidelityMismatch(
            kind='parse_error', source='b',
            description=f'{e.error_type} error in B{loc}: {e.message}',
        ))

    if any(m.kind == 'parse_error' for m in mismatches):
        return _finish(mismatches)

    # --- Step 1 (continued): ingest ---
    conn_a = sqlite3.connect(':memory:')
    conn_b = sqlite3.connect(':memory:')

    ingest_ok_a = True
    ingest_ok_b = True

    try:
        conn_a, errors_a = ingest(cif_a, conn_a, schema=schema_spec)
        for msg in errors_a:
            mismatches.append(FidelityMismatch(
                kind='ingest_error', source='a', description=msg,
            ))
    except Exception as exc:
        ingest_ok_a = False
        mismatches.append(FidelityMismatch(
            kind='ingest_error', source='a', description=str(exc),
        ))

    try:
        conn_b, errors_b = ingest(cif_b, conn_b, schema=schema_spec)
        for msg in errors_b:
            mismatches.append(FidelityMismatch(
                kind='ingest_error', source='b', description=msg,
            ))
    except Exception as exc:
        ingest_ok_b = False
        mismatches.append(FidelityMismatch(
            kind='ingest_error', source='b', description=str(exc),
        ))

    if not ingest_ok_a or not ingest_ok_b:
        return _finish(mismatches)

    # --- Step 2: detect UUID-keyed tables ---
    if schema_spec is not None:
        uuid_tbls = _uuid_pk_tables(conn_a, conn_b, schema_spec)
        uuid_fk_cols = _fk_to_uuid_cols(schema_spec, uuid_tbls)
    else:
        uuid_tbls = frozenset()
        uuid_fk_cols = {}

    # --- Step 3: compare structured tables ---
    if schema_spec is not None:
        mismatches.extend(
            _compare_structured(conn_a, conn_b, schema_spec, uuid_tbls, uuid_fk_cols)
        )

    # --- Step 4: compare _cif_fallback ---
    mismatches.extend(_compare_fallback(conn_a, conn_b))

    # --- Step 5: schema mismatch detection ---
    if schema_spec is not None:
        mismatches.extend(_compare_schema_mismatch(conn_a, conn_b, schema_spec))

    return _finish(mismatches)
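A minimal usage sketch follows, assuming check_fidelity is importable from cifflow.fidelity.check as the module path on this page suggests. The run_check wrapper and its check argument are hypothetical; the argument exists only so the wrapper can be exercised with a stand-in when cifflow is not installed, as done at the bottom of the snippet.

```python
import sys
from types import SimpleNamespace

def run_check(path_a, path_b, check=None, out=sys.stdout):
    """Hypothetical wrapper: print each mismatch and return a shell-style exit code."""
    if check is None:
        # Assumed import path, taken from the module name shown on this page.
        from cifflow.fidelity.check import check_fidelity as check
    report = check(path_a, path_b)
    for mismatch in report.mismatches:
        print(f"[{mismatch.kind}] ({mismatch.source}) {mismatch.description}", file=out)
    return 0 if report.passed else 1

# Stand-in for demonstration only; it imitates the shape of a FidelityReport.
def _fake_check(source_a, source_b):
    return SimpleNamespace(
        passed=False,
        mismatches=[
            SimpleNamespace(kind='parse_error', source='a',
                            description='illustrative message'),
        ],
    )

exit_code = run_check('a.cif', 'b.cif', check=_fake_check)
print(exit_code)
```

Since parse and ingestion problems are reported as mismatches rather than raised, a wrapper like this only needs to guard against schema loading failures if a schema is passed.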