Transfer Model#

Documentation Verified. Last checked: 2025-11-25. Reviewer: Christof Buchbender.

ops-db tracks data movement operations but doesn’t perform them. The Data Transfer System package reads these records to orchestrate actual file transfers. This section covers the “how do we route and track transfers” infrastructure.

Routing Infrastructure#

DataTransferRoute#

DataTransferRoute defines how data should flow between sites. Routes are defined at site level, with optional location-level overrides. This decouples transfer logic from hardcoded paths and allows dynamic routing based on network conditions.

Example: “ccat_to_cologne” route: origin_site=CCAT, destination_site=Cologne, method=bbcp

RouteType Enum: DIRECT (skip destination buffer), RELAY (route through intermediate site), CUSTOM (location-to-location override).
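
For illustration, here is how a route definition might look as a plain Python structure; aside from origin_site, destination_site, method, and the RouteType values listed above, the names below are assumptions and not the actual ops-db schema.

    from dataclasses import dataclass
    from enum import Enum


    class RouteType(Enum):
        DIRECT = "direct"    # skip the destination buffer
        RELAY = "relay"      # route through an intermediate site
        CUSTOM = "custom"    # location-to-location override


    @dataclass
    class RouteSketch:
        """Illustrative stand-in for a DataTransferRoute record."""
        name: str
        origin_site: str
        destination_site: str
        method: str
        route_type: RouteType = RouteType.DIRECT


    # The "ccat_to_cologne" example from above as a record
    ccat_to_cologne = RouteSketch(
        name="ccat_to_cologne",
        origin_site="CCAT",
        destination_site="Cologne",
        method="bbcp",
    )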

For complete attribute details, see DataTransferRoute.

Transfer Packages#

DataTransferPackage#

DataTransferPackage bundles multiple RawDataPackage objects for efficient network transfer: many packages become fewer transfer operations, which reduces per-transfer overhead. For long-distance transfers, optimal package sizes are in the range of 10-50 TB. One DataTransferPackage can have multiple DataTransfer records (the same bundle sent to multiple destinations).
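
As a minimal sketch of the bundling idea, the function below accumulates RawDataPackage entries until a bundle reaches a target size inside the 10-50 TB window; the helper name, the size threshold, and the (name, size) tuples are illustrative assumptions, not the actual packaging logic of the Data Transfer System package.

    TB = 10**12  # decimal terabyte
    TARGET_PACKAGE_SIZE = 30 * TB  # assumed target inside the 10-50 TB window


    def bundle_raw_data_packages(raw_packages):
        """Group (name, size_bytes) pairs into bundles of roughly TARGET_PACKAGE_SIZE."""
        bundles, current, current_size = [], [], 0
        for name, size in raw_packages:
            if current and current_size + size > TARGET_PACKAGE_SIZE:
                bundles.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            bundles.append(current)
        return bundles


    # Example: 100 raw packages of 1 TB each end up in four bundles instead of
    # 100 individual transfer operations.
    bundles = bundle_raw_data_packages([(f"rdp_{i:04d}", 1 * TB) for i in range(100)])
    print(len(bundles))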

For complete attribute details, see DataTransferPackage.

Transfer Operations#

DataTransfer#

DataTransfer records a specific transfer operation from origin to destination. It separates the transfer (moving the archive) from unpacking (extracting at the destination), which makes it possible to distinguish transfer failures from unpacking failures and to retry unpacking without re-transferring.
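
The sketch below shows why the split matters in practice: separate status fields for the transfer step and the unpacking step allow retrying unpacking alone. The status values and the retry_unpacking helper are assumptions for illustration, not the actual DataTransfer attributes.

    from dataclasses import dataclass


    @dataclass
    class TransferSketch:
        """Illustrative stand-in for a DataTransfer record."""
        archive_path: str
        transfer_status: str = "pending"  # e.g. pending / done / failed
        unpack_status: str = "pending"


    def retry_unpacking(record: TransferSketch) -> None:
        """Retry only the unpack step; the archive already sits at the destination."""
        if record.transfer_status != "done":
            raise RuntimeError("nothing to unpack: the transfer itself has not completed")
        # ... unpack the archive into the destination buffer ...
        record.unpack_status = "done"


    # A failed unpack can be retried without moving the (potentially tens of TB)
    # archive across the network again.
    t = TransferSketch("buffer2/bundle_0042.tar", transfer_status="done", unpack_status="failed")
    retry_unpacking(t)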

For complete attribute details, see DataTransfer.

DataTransferLog#

DataTransferLog provides lightweight log entries with references to detailed log files. This avoids storing large log text in the database: only the path is stored. Full command outputs are kept in files, and detailed metrics are stored in InfluxDB.
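
A small sketch of the idea, assuming hypothetical field and helper names: the verbose command output goes into a file, and the log entry keeps only a short summary plus the path to that file.

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from pathlib import Path


    @dataclass
    class LogEntrySketch:
        """Illustrative stand-in for a DataTransferLog record."""
        timestamp: datetime
        summary: str
        log_file: str  # path to the full command output, not the output itself


    def record_transfer_log(summary: str, full_output: str, log_dir: Path) -> LogEntrySketch:
        """Write the verbose output to disk and keep only its path in the entry."""
        log_dir.mkdir(parents=True, exist_ok=True)
        now = datetime.now(timezone.utc)
        log_file = log_dir / f"transfer_{now:%Y%m%dT%H%M%S}.log"
        log_file.write_text(full_output)
        return LogEntrySketch(timestamp=now, summary=summary, log_file=str(log_file))


    entry = record_transfer_log("bbcp exited with code 0", "... full bbcp output ...", Path("transfer_logs"))
    print(entry.summary, entry.log_file)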

For complete attribute details, see DataTransferLog.

Transfer Flow#

    sequenceDiagram
    participant Source
    participant Buffer1
    participant DTP as DataTransferPackage
    participant DT as DataTransfer
    participant Buffer2
    participant Archive

    Source->>Buffer1: Package RawDataPackages
    Buffer1->>DTP: Create DataTransferPackage
    DTP->>DT: Create DataTransfer
    DT->>Buffer2: Transfer archive
    DT->>Buffer2: Unpack RawDataPackages
    Buffer2->>Archive: Archive RawDataPackages
    

Archive Operations#

LongTermArchiveTransfer#

LongTermArchiveTransfer tracks the transfer of RawDataPackage objects to permanent archive storage. It is separate from DataTransfer because this transfer happens within a single Site, between a DataLocation of type BUFFER and one of type LONG_TERM_ARCHIVE, rather than between the buffers of two different sites as DataTransfer does.
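
The distinction can be pictured as a constraint check, sketched here with assumed argument names; the real model enforces this differently.

    from enum import Enum


    class LocationType(Enum):
        BUFFER = "buffer"
        LONG_TERM_ARCHIVE = "long_term_archive"


    def validate_archive_transfer(origin_site, destination_site, origin_type, destination_type):
        """Illustrative constraints for a LongTermArchiveTransfer-style operation."""
        if origin_site != destination_site:
            raise ValueError("archive transfers stay within one site; use DataTransfer between sites")
        if origin_type is not LocationType.BUFFER or destination_type is not LocationType.LONG_TERM_ARCHIVE:
            raise ValueError("expected a BUFFER -> LONG_TERM_ARCHIVE move within the site")


    # Within Cologne: buffer to long-term archive is fine.
    validate_archive_transfer("Cologne", "Cologne", LocationType.BUFFER, LocationType.LONG_TERM_ARCHIVE)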

For complete attribute details, see LongTermArchiveTransfer.

StagingJob#

StagingJob makes archived data available for scientific processing: it downloads packages from the long-term archive, unpacks them, and creates file access records. Multiple packages can be staged together for efficiency. Staging is on-demand (a scientist requests data), unlike archiving, which is fire-and-forget.
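
A sketch of those staging steps, with hypothetical paths and helper names; the actual download and unpack commands are implemented in the Data Transfer System package.

    def stage_packages(package_names):
        """Illustrative staging of several packages in one job: download, unpack, register."""
        staged = []
        for name in package_names:
            archive_copy = f"/archive/{name}.tar"   # 1. locate the copy in the long-term archive
            work_dir = f"/staging/{name}"           # 2. unpack it into the staging area
            print(f"downloading {archive_copy} and unpacking into {work_dir}")
            staged.append({"package": name, "path": work_dir})  # 3. file access record
        return staged


    # Several packages staged together in one on-demand request
    stage_packages(["rdp_0001", "rdp_0002", "rdp_0003"])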

For complete attribute details, see StagingJob.

Archive and Staging Flow#

    sequenceDiagram
    participant Buffer
    participant Archive
    participant LTA as LongTermArchiveTransfer
    participant Staging
    participant Processing

    Buffer->>LTA: Create archive transfer
    LTA->>Archive: Transfer RawDataPackage
    Archive->>Staging: Scientist requests data
    Staging->>Processing: Stage packages
    Processing->>Archive: Store results
    

Why This Structure?#

Separation of Routing and Execution

Allows:

  • Updating routes without affecting transfer history

  • Multiple transfer strategies

  • Dynamic routing based on conditions

DataTransferPackage Bundling

Optimizes network usage:

  • Reduces overhead of many small transfers

  • Enables resumable transfers

  • Optimal package sizes (10-50 TB for long-distance transfers)

Separate Transfer and Unpacking Tracking

Allows:

  • Distinguishing transfer failures from unpacking failures

  • Retrying unpacking without re-transferring

  • Better error diagnosis

LongTermArchiveTransfer and StagingJob Separation

Different workflows:

  • Archive is fire-and-forget (automatic)

  • Staging is on-demand (scientist requests)

Retry and Failure Handling#

DataTransfer: Retries are handled by the data-transfer package with exponential backoff. Failed transfers can be manually retried.
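
A minimal sketch of exponential backoff as described above; the attempt limit, delays, and function names are illustrative, not the data-transfer package's actual settings.

    import time


    def transfer_with_backoff(run_transfer, max_attempts=5, base_delay=60.0):
        """Retry a transfer callable with exponentially growing delays."""
        for attempt in range(max_attempts):
            try:
                return run_transfer()
            except Exception as exc:
                if attempt == max_attempts - 1:
                    raise  # leave the record failed; a manual retry is still possible
                delay = base_delay * 2 ** attempt  # 60 s, 120 s, 240 s, ...
                print(f"transfer failed ({exc}); retrying in {delay:.0f} s")
                time.sleep(delay)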

LongTermArchiveTransfer: After 3 attempts, the requires_intervention property returns True, indicating manual intervention is needed.
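
The property can be pictured roughly as follows; the attempt counter name is an assumption.

    class ArchiveTransferSketch:
        """Illustrative subset of LongTermArchiveTransfer's retry bookkeeping."""

        MAX_ATTEMPTS = 3

        def __init__(self):
            self.failed_attempts = 0  # assumed counter, incremented on each failed attempt

        @property
        def requires_intervention(self) -> bool:
            # After 3 failed attempts an operator has to step in.
            return self.failed_attempts >= self.MAX_ATTEMPTS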

StagingJob: Retries follow the same pattern as DataTransfer, but staging failures are less critical (data is still in archive).

Integration with Data Transfer Package#

The Data Transfer System Python package implements the actual transfer operations (see the sketch after this list):

  1. Reads these records to determine what needs to be transferred

  2. Performs the actual file transfers using BBCP, S3, or other methods

  3. Updates status and physical copy records as transfers complete
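
A sketch of that read-transfer-update loop, where every helper (pending_transfers, run_transfer, mark_done, register_copy) is an assumed name rather than part of the actual API:

    def process_pending_transfers(pending_transfers, run_transfer, mark_done, register_copy):
        """Illustrative loop over the three steps listed above.

        pending_transfers: transfer records read from ops-db
        run_transfer:      performs the actual bbcp/S3 transfer
        mark_done:         updates the transfer's status
        register_copy:     records the new physical copy at the destination
        """
        for transfer in pending_transfers:  # 1. read what needs to be transferred
            run_transfer(transfer)          # 2. move the data (bbcp, S3, ...)
            mark_done(transfer)             # 3. update status ...
            register_copy(transfer)         #    ... and physical copy records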

For detailed workflow documentation, see the Data Transfer System documentation.