Transfer Model#
ops-db tracks data movement operations but doesn’t perform them. The Data Transfer System package reads these records to orchestrate actual file transfers. This section covers the “how do we route and track transfers” infrastructure.
Routing Infrastructure#
DataTransferRoute#
DataTransferRoute defines how data should flow between
sites. Routes are defined at site level, with optional location-level overrides. This
decouples transfer logic from hardcoded paths and allows dynamic routing based on network
conditions.
Example: “ccat_to_cologne” route: origin_site=CCAT, destination_site=Cologne, method=bbcp
RouteType Enum: DIRECT (skip destination buffer), RELAY (route through intermediate site), CUSTOM (location-to-location override).
For complete attribute details, see DataTransferRoute.
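A minimal sketch of what such a route record carries, using plain Python. The field names are illustrative, not the actual ops-db schema; the enum values come from the list above:

```python
from dataclasses import dataclass
from enum import Enum


class RouteType(Enum):
    DIRECT = "direct"   # skip the destination buffer
    RELAY = "relay"     # route through an intermediate site
    CUSTOM = "custom"   # location-to-location override


@dataclass
class Route:
    # Field names are illustrative; see the DataTransferRoute reference
    # for the actual attributes.
    name: str
    origin_site: str
    destination_site: str
    method: str                          # e.g. "bbcp"
    route_type: RouteType = RouteType.DIRECT


# The example route from the text, expressed with this sketch:
ccat_to_cologne = Route(
    name="ccat_to_cologne",
    origin_site="CCAT",
    destination_site="Cologne",
    method="bbcp",
)
```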
Transfer Packages#
DataTransferPackage#
DataTransferPackage bundles multiple
RawDataPackage objects for efficient network transfer.
Bundling many packages into fewer transfer operations improves network efficiency. For
long-distance transfers, optimal package sizes lie in the range of 10-50 TB. One
DataTransferPackage can have multiple
DataTransfer records (the same bundle sent to multiple destinations).
For complete attribute details, see DataTransferPackage.
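As an illustration of the bundling idea, the sketch below greedily groups RawDataPackage sizes into bundles near a target size inside the 10-50 TB range. The function and the target value are hypothetical, not the Data Transfer System's actual packing logic:

```python
TARGET_BUNDLE_BYTES = 30 * 10**12  # ~30 TB, an assumed target inside the 10-50 TB range


def bundle_packages(package_sizes: list[int]) -> list[list[int]]:
    """Greedily group package sizes (in bytes) into bundles near the target size."""
    bundles: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(package_sizes, reverse=True):
        if current and current_size + size > TARGET_BUNDLE_BYTES:
            bundles.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bundles.append(current)
    return bundles


# Example: twenty 5 TB packages collapse into four transfer bundles.
print(len(bundle_packages([5 * 10**12] * 20)))  # prints 4 (bundles of 6, 6, 6, and 2)
```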
Transfer Operations#
DataTransfer#
DataTransfer records a specific transfer operation from
origin to destination. It separates the transfer (moving the archive) from unpacking
(extracting it at the destination), which makes it possible to distinguish transfer
failures from unpacking failures and to retry unpacking without re-transferring.
For complete attribute details, see DataTransfer.
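One way to picture this separation is two independent status fields on the transfer record, as in the sketch below. The status values and method names are assumptions, not the DataTransfer schema:

```python
from dataclasses import dataclass
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class TransferState:
    """Illustrative stand-in for the two tracked steps of a DataTransfer."""
    transfer_status: StepStatus = StepStatus.PENDING   # moving the archive
    unpack_status: StepStatus = StepStatus.PENDING     # extracting at the destination

    def needs_retransfer(self) -> bool:
        # Only a failed network transfer requires moving the archive again.
        return self.transfer_status is StepStatus.FAILED

    def needs_unpack_retry(self) -> bool:
        # A failed unpack can be retried in place, without re-transferring.
        return (self.transfer_status is StepStatus.SUCCESS
                and self.unpack_status is StepStatus.FAILED)
```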
DataTransferLog#
DataTransferLog provides lightweight log entries with
references to detailed log files. This avoids storing large log text in the database: only
the log file path is stored. Full command outputs are kept in files, and detailed metrics
are stored in InfluxDB.
For complete attribute details, see DataTransferLog.
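A sketch of what such a lightweight entry might hold: a short summary plus a pointer to the full log file, never the log text itself. Field names and example values are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TransferLogEntry:
    """Illustrative stand-in for a DataTransferLog record."""
    transfer_id: int
    timestamp: datetime
    summary: str          # short, human-readable status line
    log_file_path: str    # full command output lives on disk, not in the database


entry = TransferLogEntry(
    transfer_id=42,
    timestamp=datetime.now(timezone.utc),
    summary="bbcp exited with status 0",
    log_file_path="/var/log/transfers/transfer_42.log",
)
```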
Transfer Flow#
sequenceDiagram
participant Source
participant Buffer1
participant DTP as DataTransferPackage
participant DT as DataTransfer
participant Buffer2
participant Archive
Source->>Buffer1: Package RawDataPackages
Buffer1->>DTP: Create DataTransferPackage
DTP->>DT: Create DataTransfer
DT->>Buffer2: Transfer archive
DT->>Buffer2: Unpack RawDataPackages
Buffer2->>Archive: Archive RawDataPackages
Archive Operations#
LongTermArchiveTransfer#
LongTermArchiveTransfer tracks the transfer of
RawDataPackage objects to permanent archive storage.
It is separate from DataTransfer because this transfer happens
within a single Site, between
DataLocation records of type BUFFER and LONG_TERM_ARCHIVE,
rather than between the BUFFER locations of two different sites as
DataTransfer does.
For complete attribute details, see LongTermArchiveTransfer.
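The distinction can be expressed as a simple check on the two endpoints, as in the sketch below. The BUFFER and LONG_TERM_ARCHIVE types come from the text; the classes and function are illustrative:

```python
from dataclasses import dataclass
from enum import Enum


class LocationType(Enum):
    BUFFER = "buffer"
    LONG_TERM_ARCHIVE = "long_term_archive"


@dataclass
class Location:
    site: str
    location_type: LocationType


def is_archive_transfer(origin: Location, destination: Location) -> bool:
    """True for a within-site BUFFER -> LONG_TERM_ARCHIVE move (the
    LongTermArchiveTransfer case); site-to-site buffer moves are handled
    by DataTransfer instead. Illustrative only."""
    return (origin.site == destination.site
            and origin.location_type is LocationType.BUFFER
            and destination.location_type is LocationType.LONG_TERM_ARCHIVE)
```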
StagingJob#
StagingJob makes archived data available for scientific
processing. It downloads data from the long-term archive, unpacks it, and creates file
access records. Multiple packages can be staged together for efficiency. Staging is
on-demand (a scientist requests the data), unlike archiving, which is fire-and-forget.
For complete attribute details, see StagingJob.
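A sketch of the on-demand staging idea: the scientist's request names several packages, and one job downloads, unpacks, and registers file access for all of them. All names here are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class StagingRequest:
    """Illustrative stand-in for a StagingJob staging several packages at once."""
    requested_by: str
    package_ids: list[int] = field(default_factory=list)

    def run(self, download, unpack, register_access) -> None:
        # The three callables stand in for the real staging steps:
        # long-term archive -> local disk -> file access records.
        for package_id in self.package_ids:
            local_path = download(package_id)     # fetch archive from long-term storage
            files = unpack(local_path)            # extract RawDataPackage contents
            register_access(package_id, files)    # record where the files now live
```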
Archive and Staging Flow#
sequenceDiagram
participant Buffer
participant Archive
participant LTA as LongTermArchiveTransfer
participant Staging
participant Processing
Buffer->>LTA: Create archive transfer
LTA->>Archive: Transfer RawDataPackage
Archive->>Staging: Scientist requests data
Staging->>Processing: Stage packages
Processing->>Archive: Store results
Why This Structure?#
Separation of Routing and Execution
Allows:
Updating routes without affecting transfer history
Multiple transfer strategies
Dynamic routing based on conditions
DataTransferPackage Bundling
Optimizes network usage:
Reduces overhead of many small transfers
Enables resumable transfers
Optimal package sizes (10-50 TB for long-distance transfers)
Separate Transfer and Unpacking Tracking
Allows:
Distinguishing transfer failures from unpacking failures
Retrying unpacking without re-transferring
Better error diagnosis
LongTermArchiveTransfer and StagingJob Separation
Different workflows:
Archive is fire-and-forget (automatic)
Staging is on-demand (scientist requests)
Retry and Failure Handling#
DataTransfer: Retries are handled by the data-transfer package with exponential backoff. Failed transfers can be manually retried.
LongTermArchiveTransfer: After 3 attempts, the requires_intervention property
returns True, indicating manual intervention is needed.
StagingJob: Retries follow the same pattern as DataTransfer, but staging failures are less critical (the data is still in the archive).
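A sketch of the retry behaviour described above: exponential backoff between attempts, and a requires_intervention check once three attempts have failed. The three-attempt threshold and the property name come from the text; the backoff parameters and function names are assumptions:

```python
import time

MAX_ARCHIVE_ATTEMPTS = 3     # from the text: manual intervention after 3 attempts
BASE_DELAY_SECONDS = 60      # assumed backoff parameters
BACKOFF_FACTOR = 2


def requires_intervention(attempt_count: int) -> bool:
    """Mirrors the requires_intervention behaviour described above."""
    return attempt_count >= MAX_ARCHIVE_ATTEMPTS


def retry_with_backoff(operation, max_attempts: int = MAX_ARCHIVE_ATTEMPTS) -> bool:
    """Run `operation` until it succeeds, waiting exponentially longer each time."""
    for attempt in range(max_attempts):
        if operation():
            return True
        if attempt < max_attempts - 1:
            time.sleep(BASE_DELAY_SECONDS * BACKOFF_FACTOR ** attempt)
    return False
```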
Integration with Data Transfer Package#
The Data Transfer System Python package implements the actual transfer operations:
Reads these records to determine what needs to be transferred
Performs the actual file transfers using BBCP, S3, or other methods
Updates status and physical copy records as transfers complete
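A minimal sketch of such an orchestration loop; the db and transfer_backend interfaces are hypothetical stand-ins for the package's real API:

```python
def run_pending_transfers(db, transfer_backend) -> None:
    """Illustrative loop: read pending DataTransfer records, move the archive
    with the route's method (bbcp, S3, ...), and write status and physical-copy
    updates back to ops-db."""
    for transfer in db.pending_transfers():
        try:
            transfer_backend.move(transfer)       # the actual file movement
            db.mark_transferred(transfer)         # update transfer status
            db.record_physical_copy(transfer)     # register the new copy
        except Exception:
            db.mark_failed(transfer)              # retried later with backoff
```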
For detailed workflow documentation, see the Data Transfer System documentation.