Data Model#

Documentation Verified Last checked: 2025-11-25 Reviewer: Christof Buchbender

ops-db does not store the data files themselves; it tracks metadata about where each file exists. This is essential for managing data distributed across multiple sites and storage types.

File Hierarchy#

Data files are organized in a hierarchy from individual files to transfer bundles:

graph TB
    RDF[RawDataFile]
    RDP[RawDataPackage]
    DTP[DataTransferPackage]

    RDF -->|bundled into| RDP
    RDP -->|bundled into| DTP

    style RDF fill:#e1f5ff
    style RDP fill:#fff4e1
    style DTP fill:#ffe1f5
    

RawDataFile#

RawDataFile represents an individual data file produced by an instrument module. It is identified by a UUID because files are created by telescope systems that may be offline, so an identifier must be assignable without contacting the central database. See also Operations Database API (ops-db-api).
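As a minimal sketch of why a UUID fits this situation: a random UUID can be generated locally on the telescope side, with no database sequence or network round trip. The class and field names below are hypothetical and are not the actual ops-db model.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class RawDataFileRecord:
    """Hypothetical stand-in for a RawDataFile row; not the real ops-db model."""

    name: str
    # uuid4 is random, so no central sequence or network access is needed.
    id: uuid.UUID = field(default_factory=uuid.uuid4)


# Two records created independently (for example while offline) get distinct IDs.
a = RawDataFileRecord(name="scan_0001.fits")
b = RawDataFileRecord(name="scan_0002.fits")
```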

For complete attribute details, see RawDataFile.

RawDataPackage#

RawDataPackage is a collection of related RawDataFile objects bundled as a tar archive. Files are grouped per ExecutedObsUnit and InstrumentModule. Archiving thousands of small files individually is inefficient, so packaging consolidates them into manageable units of closely related files that will have to be processed together. The archive preserves directory structure and metadata, so unpacking it restores the original directory layout.
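The packaging idea can be illustrated with Python's standard tarfile module. This is only a sketch under assumptions: the function names and the example file layout are hypothetical, and the real packaging code may work differently.

```python
import tarfile
import tempfile
from pathlib import Path


def build_package(source_dir: Path, archive_path: Path) -> None:
    """Pack every file under source_dir into a tar archive, keeping relative paths."""
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # A relative arcname means unpacking recreates the same tree.
                # Note: empty directories are not stored by this sketch.
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())


def unpack_package(archive_path: Path, target_dir: Path) -> None:
    """Restore the packaged directory tree under target_dir."""
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(target_dir)


# Round trip in a temporary directory (file names are made up for the demo).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "src"
    (src / "obs_unit_1" / "module_a").mkdir(parents=True)
    (src / "obs_unit_1" / "module_a" / "scan_0001.dat").write_text("payload")
    build_package(src, root / "package.tar")
    unpack_package(root / "package.tar", root / "restored")
    restored = sorted(
        p.relative_to(root / "restored").as_posix()
        for p in (root / "restored").rglob("*")
        if p.is_file()
    )
```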

State Meanings:
  • WAITING (yellow hourglass in UI): Only exists in primary location

  • TRANSFERRING (blue circle): Part of an active DataTransferPackage

  • ARCHIVED (green checkmark): Successfully stored in long-term archive

  • FAILED (red cross): Transfer or archive failed

For complete attribute details, see RawDataPackage.

RawDataPackageMetadata#

RawDataPackageMetadata stores additional metadata used to generate IVOA-compatible metadata. Holding it in a separate model keeps the core RawDataPackage model clean while allowing flexible metadata storage.

For complete attribute details, see RawDataPackageMetadata.

DataTransferPackage#

DataTransferPackage bundles multiple RawDataPackage objects for network transfer, so that many packages are moved in fewer transfer operations. For long-distance transfers, the optimal package size is in the range of 10-50 TB. One DataTransferPackage can have multiple DataTransfer records, because the same bundle can be sent to multiple destinations.
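One way to picture the bundling is a simple greedy grouping by size. The sketch below is an illustration only: the function name, the package identifiers, and the 20 TB cap are assumptions, and the actual selection logic lives in the data transfer system and may differ.

```python
from typing import List, Tuple

TB = 10**12  # decimal terabyte, in bytes


def bundle_packages(
    packages: List[Tuple[str, int]], max_bundle_bytes: int
) -> List[List[str]]:
    """Group (package_id, size_bytes) pairs into bundles, in order, greedily."""
    bundles: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for package_id, size in packages:
        # Close the current bundle if this package would push it over the cap.
        if current and current_size + size > max_bundle_bytes:
            bundles.append(current)
            current, current_size = [], 0
        current.append(package_id)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


packages = [
    ("rdp-1", 6 * TB),
    ("rdp-2", 5 * TB),
    ("rdp-3", 4 * TB),
    ("rdp-4", 12 * TB),
]
bundles = bundle_packages(packages, max_bundle_bytes=20 * TB)
```

With these made-up sizes, the first three packages (15 TB in total) share one bundle and the fourth starts a new one. A single package larger than the cap would simply get a bundle of its own.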

For complete attribute details, see DataTransferPackage.

Physical Copy Tracking#

The Physical Copy System#

PhysicalCopy tracks where each file or package physically exists across all storage locations. Data can exist in multiple places simultaneously (buffer, archive, staging area), and tracking every copy is what makes safe deletion and staged unpacking possible. PhysicalCopy is polymorphic, with the subclasses RawDataFilePhysicalCopy, RawDataPackagePhysicalCopy, and DataTransferPackagePhysicalCopy. Each subclass has a full_path property that constructs the actual filesystem or S3 path.
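The polymorphic layout can be sketched with plain dataclasses. Everything beyond the class names and the full_path property is an assumption made for illustration: the DataLocation fields, the relative path attributes, and the way the path is joined are hypothetical, not the real ops-db implementation.

```python
from dataclasses import dataclass


@dataclass
class DataLocation:
    """Hypothetical storage location with a root directory or bucket prefix."""

    name: str
    root: str


def _join(root: str, relative: str) -> str:
    # Plain string join so that both "/dir" and "s3://bucket" style roots work.
    return f"{root.rstrip('/')}/{relative.lstrip('/')}"


@dataclass
class PhysicalCopy:
    """Base class: one copy of one item at one location."""

    location: DataLocation

    @property
    def full_path(self) -> str:
        raise NotImplementedError


@dataclass
class RawDataFilePhysicalCopy(PhysicalCopy):
    file_relative_path: str = ""  # assumed attribute name

    @property
    def full_path(self) -> str:
        return _join(self.location.root, self.file_relative_path)


@dataclass
class RawDataPackagePhysicalCopy(PhysicalCopy):
    package_relative_path: str = ""  # assumed attribute name

    @property
    def full_path(self) -> str:
        return _join(self.location.root, self.package_relative_path)


buffer = DataLocation(name="Chile Buffer", root="/buffer/chile")
archive = DataLocation(name="Cologne Archive", root="s3://archive-bucket/raw/")

package_copy = RawDataPackagePhysicalCopy(buffer, "2025/pkg_0001.tar")
file_copy = RawDataFilePhysicalCopy(archive, "2025/scan_0001.fits")
```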

For complete attribute details, see PhysicalCopy and its subclasses.

PhysicalCopyStatus Enum#

PhysicalCopyStatus tracks the lifecycle state of a physical copy:

PhysicalCopyStatus Values#

Status                Meaning
PRESENT               File exists and is available
STAGED                Package unpacked, original archive removed to save space
DELETION_POSSIBLE     Eligible for cleanup (exists in other locations)
DELETION_PENDING      Scheduled for removal
DELETION_SCHEDULED    Cleanup task queued
DELETION_IN_PROGRESS  Currently being deleted
DELETION_FAILED       Deletion attempt failed
DELETED               Successfully removed

For complete enum details, see PhysicalCopyStatus.
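For orientation, the values listed above could be written as a Python enum along the following lines. The member names come from the table; the string values and the helper function are assumptions and may not match the real PhysicalCopyStatus definition.

```python
from enum import Enum


class PhysicalCopyStatus(Enum):
    # String values are assumed; only the member names come from the table.
    PRESENT = "present"
    STAGED = "staged"
    DELETION_POSSIBLE = "deletion_possible"
    DELETION_PENDING = "deletion_pending"
    DELETION_SCHEDULED = "deletion_scheduled"
    DELETION_IN_PROGRESS = "deletion_in_progress"
    DELETION_FAILED = "deletion_failed"
    DELETED = "deleted"


def is_in_deletion_workflow(status: PhysicalCopyStatus) -> bool:
    """Hypothetical helper: True for every DELETION_* value, and for DELETED."""
    return status.name.startswith("DELETION_") or status is PhysicalCopyStatus.DELETED
```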

Physical Copy Relationships#

graph TB
    RDP[RawDataPackage]
    PC1[PhysicalCopy<br/>at Location 1]
    PC2[PhysicalCopy<br/>at Location 2]
    PC3[PhysicalCopy<br/>at Location 3]
    DL1[DataLocation 1<br/>Chile Buffer]
    DL2[DataLocation 2<br/>Cologne Archive]
    DL3[DataLocation 3<br/>Processing]

    RDP -->|has| PC1
    RDP -->|has| PC2
    RDP -->|has| PC3
    PC1 -->|at| DL1
    PC2 -->|at| DL2
    PC3 -->|at| DL3

    style RDP fill:#e1f5ff
    style PC1 fill:#fff4e1
    style PC2 fill:#fff4e1
    style PC3 fill:#fff4e1
    
Example: A RawDataPackage might have 3 physical copies:
  • One PRESENT at Chile buffer

  • One PRESENT at Cologne archive

  • One STAGED at processing location (unpacked, archive removed)
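Using this example, a safe-deletion check might look like the sketch below. It assumes that only a PRESENT copy at a different location counts as a safe replacement, and that a STAGED copy does not; both the rule and the names are illustrative assumptions rather than the actual ops-db logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Copy:
    """Hypothetical, simplified view of a physical copy."""

    location: str
    status: str


def may_delete(target: Copy, all_copies: List[Copy]) -> bool:
    """True if at least one other PRESENT copy exists at a different location."""
    return any(
        c is not target
        and c.location != target.location
        and c.status == "PRESENT"
        for c in all_copies
    )


chile = Copy("Chile Buffer", "PRESENT")
cologne = Copy("Cologne Archive", "PRESENT")
processing = Copy("Processing", "STAGED")

# The Chile copy may go because the Cologne archive still holds a PRESENT copy.
can_delete_chile = may_delete(chile, [chile, cologne, processing])

# Without the archive copy, only a STAGED copy remains, so deletion is refused.
can_delete_without_archive = may_delete(chile, [chile, processing])
```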

Status and State Management#

Status Enum#

Status is used for operations (transfer, archive, staging):

Status Values#

Status       Meaning
PENDING      Queued but not started
SCHEDULED    Assigned to worker
IN_PROGRESS  Currently executing
COMPLETED    Finished successfully
FAILED       Failed and won’t retry

For complete enum details, see Status.
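The table suggests a linear progression with two terminal outcomes. The sketch below encodes that reading as a small transition map. The allowed transitions and the string values are assumptions inferred from the meanings above, not a documented rule of ops-db.

```python
from enum import Enum


class Status(Enum):
    # String values are assumed; only the member names come from the table.
    PENDING = "pending"
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions: COMPLETED and FAILED are treated as terminal.
ALLOWED_TRANSITIONS = {
    Status.PENDING: {Status.SCHEDULED},
    Status.SCHEDULED: {Status.IN_PROGRESS},
    Status.IN_PROGRESS: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not in the assumed map."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {new.name}")
    return new
```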

PackageState Enum#

PackageState is used for data lifecycle:

PackageState Values#

State         Meaning
WAITING       Only in primary location
TRANSFERRING  Being transferred
ARCHIVED      In long-term archive
FAILED        Operation failed

For complete enum details, see PackageState.

Why This Structure?#

Separation of File and Package Levels

Allows tracking at two granularities:

  • File-level: For detailed provenance and access

  • Package-level: For efficient transfer and storage management

Physical Copy Tracking

Enables:

  • Knowing exactly where each copy exists

  • Safe deletion (only delete if copies exist elsewhere)

  • Staged unpacking (remove archive after extraction to save space)

Status/State Separation

Allows:

  • Tracking operational progress (status)

  • Tracking data lifecycle state (state)

  • Retry logic and failure handling

Relationship to Data Transfer#

The Data Transfer System package orchestrates moving data based on these records. ops-db only tracks metadata; data-transfer performs the actual file operations.

For detailed data flow documentation, see the Data Transfer System documentation.