Data Model#
ops-db doesn’t store actual data files; it tracks metadata about where files exist. This location metadata is what makes distributed data management across multiple sites and storage types possible.
File Hierarchy#
Data files are organized in a hierarchy from individual files to transfer bundles:
graph TB
RDF[RawDataFile]
RDP[RawDataPackage]
DTP[DataTransferPackage]
RDF -->|bundled into| RDP
RDP -->|bundled into| DTP
style RDF fill:#e1f5ff
style RDP fill:#fff4e1
style DTP fill:#ffe1f5
RawDataFile#
RawDataFile represents an individual data file produced
by an instrument module. It is identified by a UUID because files are created by telescope
systems that may be offline, so an identifier has to be assigned without contacting the
central database. See also Operations Database API (ops-db-api).
For complete attribute details, see RawDataFile.
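As a minimal sketch of why a UUID suits offline creation: a random UUID can be generated locally with no coordination. The `RawDataFileRecord` class and its field names below are hypothetical stand-ins for illustration, not the actual ops-db model.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class RawDataFileRecord:
    """Hypothetical stand-in for a RawDataFile row; field names are assumptions."""

    name: str
    # uuid4 is random, so it can be generated without asking a central database.
    id: uuid.UUID = field(default_factory=uuid.uuid4)


# Two records created independently, e.g. on a telescope system that is offline.
first = RawDataFileRecord(name="scan_0001.fits")
second = RawDataFileRecord(name="scan_0002.fits")
print(first.id, second.id)
```

An auto-incrementing integer key, by contrast, would need a round trip to the database that owns the counter.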
RawDataPackage#
RawDataPackage is a bundled collection of related
RawDataFile objects packaged as a tar archive. A
RawDataPackage groups files per ExecutedObsUnit and InstrumentModule. Thousands of
small files are inefficient to archive. Packaging in this way consolidates them into
manageable units of closely related files that will have to be processed together. The
packaging preserves directory structure and metadata, so the original directory
structure is restored when the data is unpacked.
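The tar-based packaging described above can be illustrated with Python's standard `tarfile` module. This is only a sketch of the idea; the helper names `pack_directory` and `unpack_archive` and the example file layout are assumptions, not the project's actual packaging code.

```python
import tarfile
import tempfile
from pathlib import Path


def pack_directory(src_dir: Path, archive_path: Path) -> None:
    """Write everything under src_dir into a tar archive, with paths relative to src_dir."""
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src_dir.rglob("*")):
            # recursive=False: each file and directory is added exactly once.
            tar.add(path, arcname=str(path.relative_to(src_dir)), recursive=False)


def unpack_archive(archive_path: Path, dest_dir: Path) -> None:
    """Extract the archive; the relative layout stored in the tar is recreated."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(dest_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "obs_unit"
    (src / "module_a" / "scan1").mkdir(parents=True)
    (src / "module_a" / "scan1" / "data.bin").write_bytes(b"\x00\x01")
    (src / "module_a" / "notes.txt").write_text("example")

    archive = root / "package.tar"
    pack_directory(src, archive)
    restored = root / "restored"
    unpack_archive(archive, restored)

    original_layout = sorted(str(p.relative_to(src)) for p in src.rglob("*"))
    restored_layout = sorted(str(p.relative_to(restored)) for p in restored.rglob("*"))
    print(original_layout == restored_layout)
```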
State meanings:
- WAITING (yellow hourglass in UI): Only exists in primary location
- TRANSFERRING (blue circle): Part of an active DataTransferPackage
- ARCHIVED (green checkmark): Successfully stored in long-term archive
- FAILED (red cross): Transfer or archive failed
For complete attribute details, see RawDataPackage.
RawDataPackageMetadata#
RawDataPackageMetadata stores additional metadata for
IVOA-compatible metadata generation. Storing it separately keeps the core
RawDataPackage model clean while allowing flexible metadata storage.
For complete attribute details, see RawDataPackageMetadata.
DataTransferPackage#
DataTransferPackage bundles multiple
RawDataPackage objects for efficient network transfer.
Bundling improves transfer efficiency because many packages are moved in fewer transfer
operations. For long-distance transfers, the optimal package size is in the range of
10-50 TB. One DataTransferPackage can have multiple
DataTransfer records (the same bundle sent to multiple destinations).
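One way to picture the bundling is a greedy grouping by size, sketched below. The `Package` and `TransferBundle` classes, the `bundle_packages` helper, and all sizes are made up for illustration; the actual grouping strategy is not described here.

```python
from dataclasses import dataclass, field

TB = 10**12


@dataclass
class Package:
    """Hypothetical stand-in for a RawDataPackage."""

    name: str
    size_bytes: int


@dataclass
class TransferBundle:
    """Hypothetical stand-in for a DataTransferPackage."""

    packages: list = field(default_factory=list)
    # One entry per destination mirrors "multiple DataTransfer records".
    destinations: list = field(default_factory=list)

    @property
    def size_bytes(self) -> int:
        return sum(p.size_bytes for p in self.packages)


def bundle_packages(packages, target_bytes):
    """Greedy grouping: start a new bundle when the next package would exceed the target."""
    bundles = []
    current = TransferBundle()
    for pkg in packages:
        if current.packages and current.size_bytes + pkg.size_bytes > target_bytes:
            bundles.append(current)
            current = TransferBundle()
        current.packages.append(pkg)
    if current.packages:
        bundles.append(current)
    return bundles


# Illustrative numbers only: seven 4 TB packages with a 10 TB target.
packages = [Package(f"rdp_{i}", 4 * TB) for i in range(7)]
bundles = bundle_packages(packages, target_bytes=10 * TB)

# The same bundle can be sent to more than one destination.
bundles[0].destinations = ["Cologne Archive", "Processing"]
print([len(b.packages) for b in bundles])
```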
For complete attribute details, see DataTransferPackage.
Physical Copy Tracking#
The Physical Copy System#
PhysicalCopy tracks where each file/package physically
exists across all storage locations. Data can exist in multiple places simultaneously
(buffer, archive, staging area), enabling safe deletion and staged unpacking. It is
polymorphic with subclasses:
RawDataFilePhysicalCopy,
RawDataPackagePhysicalCopy, and
DataTransferPackagePhysicalCopy. Each subclass has a
full_path property that constructs the actual filesystem/S3 path.
For complete attribute details, see PhysicalCopy and its
subclasses.
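The polymorphic layout can be sketched with plain dataclasses. The fields `base_path`, `relative_path`, and `archive_name`, as well as the path layout, are assumptions for illustration; only the class names and the `full_path` property come from the description above. `DataTransferPackagePhysicalCopy` would follow the same pattern and is omitted.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class DataLocation:
    name: str
    base_path: str  # hypothetical field: filesystem root or S3 prefix


@dataclass
class PhysicalCopy:
    location: DataLocation
    status: str = "PRESENT"

    @property
    def full_path(self) -> str:
        raise NotImplementedError("subclasses build their own path")


@dataclass
class RawDataFilePhysicalCopy(PhysicalCopy):
    relative_path: str = ""  # hypothetical field

    @property
    def full_path(self) -> str:
        return posixpath.join(self.location.base_path, self.relative_path)


@dataclass
class RawDataPackagePhysicalCopy(PhysicalCopy):
    archive_name: str = ""  # hypothetical field

    @property
    def full_path(self) -> str:
        return posixpath.join(self.location.base_path, "packages", self.archive_name)


buffer = DataLocation("Chile Buffer", "/buffer/raw")
archive = DataLocation("Cologne Archive", "s3://example-archive")
copies = [
    RawDataFilePhysicalCopy(location=buffer, relative_path="module_a/scan1.fits"),
    RawDataPackagePhysicalCopy(location=archive, archive_name="rdp_0001.tar"),
]
# Callers can ask any copy for its path without knowing the concrete subclass.
paths = [copy.full_path for copy in copies]
print(paths)
```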
PhysicalCopyStatus Enum#
PhysicalCopyStatus tracks the lifecycle state of a
physical copy:
| Status | Meaning |
|---|---|
| PRESENT | File exists and is available |
| STAGED | Package unpacked, original archive removed to save space |
| DELETION_POSSIBLE | Eligible for cleanup (exists in other locations) |
| DELETION_PENDING | Scheduled for removal |
| DELETION_SCHEDULED | Cleanup task queued |
| DELETION_IN_PROGRESS | Currently being deleted |
| DELETION_FAILED | Deletion attempt failed |
| DELETED | Successfully removed |
For complete enum details, see PhysicalCopyStatus.
Physical Copy Relationships#
graph TB
RDP[RawDataPackage]
PC1[PhysicalCopy<br/>at Location 1]
PC2[PhysicalCopy<br/>at Location 2]
PC3[PhysicalCopy<br/>at Location 3]
DL1[DataLocation 1<br/>Chile Buffer]
DL2[DataLocation 2<br/>Cologne Archive]
DL3[DataLocation 3<br/>Processing]
RDP -->|has| PC1
RDP -->|has| PC2
RDP -->|has| PC3
PC1 -->|at| DL1
PC2 -->|at| DL2
PC3 -->|at| DL3
style RDP fill:#e1f5ff
style PC1 fill:#fff4e1
style PC2 fill:#fff4e1
style PC3 fill:#fff4e1
Example: A RawDataPackage might have 3 physical copies:
- One PRESENT at Chile buffer
- One PRESENT at Cologne archive
- One STAGED at processing location (unpacked, archive removed)
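The three-copy example can be used to sketch a safe-deletion check. The rule below (a copy may be deleted only if another PRESENT copy exists at a different location, with STAGED copies not counting as a backup) is an assumption for illustration, as are the `Copy` class, the `is_safe_to_delete` helper, and the enum values.

```python
from dataclasses import dataclass
from enum import Enum


class PhysicalCopyStatus(Enum):
    # Member names follow the status table; the string values are assumptions.
    PRESENT = "present"
    STAGED = "staged"
    DELETION_POSSIBLE = "deletion_possible"
    DELETION_PENDING = "deletion_pending"
    DELETION_SCHEDULED = "deletion_scheduled"
    DELETION_IN_PROGRESS = "deletion_in_progress"
    DELETION_FAILED = "deletion_failed"
    DELETED = "deleted"


@dataclass
class Copy:
    """Hypothetical, simplified physical copy record."""

    location: str
    status: PhysicalCopyStatus


def is_safe_to_delete(candidate: Copy, all_copies: list) -> bool:
    """Assumed rule: another intact (PRESENT) copy must exist at a different location."""
    return any(
        other.location != candidate.location
        and other.status is PhysicalCopyStatus.PRESENT
        for other in all_copies
    )


copies = [
    Copy("Chile Buffer", PhysicalCopyStatus.PRESENT),
    Copy("Cologne Archive", PhysicalCopyStatus.PRESENT),
    Copy("Processing", PhysicalCopyStatus.STAGED),
]
buffer_deletable = is_safe_to_delete(copies[0], copies)

# Without the archive copy, the buffer copy is the only intact one.
not_yet_archived = [copies[0], copies[2]]
buffer_deletable_early = is_safe_to_delete(not_yet_archived[0], not_yet_archived)
print(buffer_deletable, buffer_deletable_early)
```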
Status and State Management#
Status Enum#
Status is used for operations (transfer, archive, staging):
| Status | Meaning |
|---|---|
| PENDING | Queued but not started |
| SCHEDULED | Assigned to worker |
| IN_PROGRESS | Currently executing |
| COMPLETED | Finished successfully |
| FAILED | Failed and won’t retry |
For complete enum details, see Status.
PackageState Enum#
PackageState is used for data lifecycle:
| State | Meaning |
|---|---|
| WAITING | Only in primary location |
| TRANSFERRING | Being transferred |
| ARCHIVED | In long-term archive |
| FAILED | Operation failed |
For complete enum details, see PackageState.
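To make the difference between the two enums concrete, the sketch below keeps an operation's Status separate from a package's PackageState and derives one from the other. The mapping in `next_package_state` and the operation names "transfer" and "archive" are assumptions for illustration; the real transition rules are not documented here.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    SCHEDULED = "scheduled"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


class PackageState(Enum):
    WAITING = "waiting"
    TRANSFERRING = "transferring"
    ARCHIVED = "archived"
    FAILED = "failed"


def next_package_state(current: PackageState, operation: str, status: Status) -> PackageState:
    """Hypothetical mapping from an operation's status to the package's lifecycle state."""
    if status is Status.FAILED:
        return PackageState.FAILED
    if operation == "transfer" and status is Status.IN_PROGRESS:
        return PackageState.TRANSFERRING
    if operation == "archive" and status is Status.COMPLETED:
        return PackageState.ARCHIVED
    # Queued or scheduled work does not change where the data actually is.
    return current


state = PackageState.WAITING
history = []
for operation, status in [
    ("transfer", Status.PENDING),
    ("transfer", Status.IN_PROGRESS),
    ("archive", Status.COMPLETED),
]:
    state = next_package_state(state, operation, status)
    history.append(state)
print([s.name for s in history])
```

Keeping the two enums apart means several operations, each with its own Status, can act on one package while the package carries a single lifecycle state.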
Why This Structure?#
Separation of File and Package Levels
Allows tracking at two granularities:
- File-level: For detailed provenance and access
- Package-level: For efficient transfer and storage management
Physical Copy Tracking
Enables:
- Knowing exactly where each copy exists
- Safe deletion (only delete if copies exist elsewhere)
- Staged unpacking (remove archive after extraction to save space)
Status/State Separation
Allows:
- Tracking operational progress (status)
- Tracking data lifecycle state (state)
- Retry logic and failure handling
Relationship to Data Transfer#
The Data Transfer System package orchestrates moving data based on these records. ops-db only tracks metadata; data-transfer performs the actual file operations.
For detailed data flow documentation, see the Data Transfer System documentation.