Data Lifecycle Management#
The Data Transfer System implements intelligent data lifecycle policies that balance storage efficiency with data safety. This document explains how data moves through its lifecycle and when deletion operations occur.
Lifecycle Philosophy#
Safety First
Data is never deleted from source locations until:
Verified to exist in at least one long-term archive
Checksums validated at destination
Archive marked as ARCHIVED in database
Storage Efficiency
Temporary copies are cleaned up based on:
Buffer location status and disk pressure
Completion of transfers and unpacking operations
Processing job completion and retention policies
Data Flow
Data progresses through stages with different retention policies:
SOURCE (Raw Files)
↓ Package into tar
BUFFER (Temporary at source site)
↓ Transfer between sites
BUFFER (Temporary at LTA site)
↓ Unpack and move to permanent storage
LONG_TERM_ARCHIVE (Permanent)
↓ Stage for processing (optional)
PROCESSING (Temporary)
↓ Cleanup after analysis completes
[Deleted from temporary locations]
See also
Pipeline Architecture - Complete data flow through the system
Monitoring & Failure Recovery - Buffer monitoring and alerting
Data States#
PhysicalCopyStatus#
ccat_ops_db.models.PhysicalCopyStatus is the state of data at each physical location. The primary states are:
PRESENT - File exists and is accessible at this location
STAGED - Package unpacked (used in PROCESSING locations), archive deleted to save space
DELETION_SCHEDULED - Marked for deletion, task queued
DELETION_IN_PROGRESS - Currently being deleted
DELETION_POSSIBLE - (RawDataFiles only) Parent package deleted, eligible for conditional deletion
DELETED - Successfully removed from this location
DELETION_FAILED - Deletion attempt failed
State Transitions#
For normal deletions:
PRESENT → DELETION_SCHEDULED → DELETION_IN_PROGRESS → DELETED
For failed deletions:
DELETION_IN_PROGRESS → DELETION_FAILED (marked for retry)
For RawDataFiles in SOURCE/BUFFER locations:
PRESENT → DELETION_POSSIBLE → DELETION_SCHEDULED → ...
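These transitions are plain status updates on the physical-copy rows, mirroring the pattern used by the bulk-deletion code shown later in this document. A minimal sketch (the helper name is illustrative, not part of the module):

from ccat_ops_db import models

def schedule_copy_deletion(session, physical_copy):
    """Illustrative: advance a PRESENT copy one step along the normal deletion path."""
    if physical_copy.status == models.PhysicalCopyStatus.PRESENT:
        physical_copy.status = models.PhysicalCopyStatus.DELETION_SCHEDULED
        session.flush()
        # A worker task later sets DELETION_IN_PROGRESS and finally DELETED (or DELETION_FAILED).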
PackageState#
ccat_ops_db.models.PackageState is the high-level state of packages in the pipeline:
WAITING - At source location only, not yet transferred
TRANSFERRING - Part of active DataTransferPackage
ARCHIVED - Safe in long-term archive storage
FAILED - Transfer or archive operation failed
Safety Rule: Data at SOURCE locations is only eligible for deletion when the package state is ARCHIVED.
Deletion Manager#
The ccat_data_transfer.deletion_manager module implements all cleanup policies and runs continuously to process
eligible data.
Main Entry Point#
- ccat_data_transfer.deletion_manager.delete_data_packages(verbose=False)
Main entry point for deletion operations.
This is the main orchestration function that coordinates all deletion operations:
def delete_data_packages(verbose=False):
    """Main entry point for deletion operations."""
    logger.debug("###### Starting Deletion Manager ######")
    delete_data_transfer_packages(verbose)
    delete_raw_data_packages_bulk(verbose)
    delete_processing_raw_data_files(verbose)
    delete_staged_raw_data_files_from_processing(
        verbose
    )  # New function for staged files

    # Process DELETION_POSSIBLE files across all locations
    db = DatabaseConnection()
    session, _ = db.get_connection()
    try:
        # Get all active locations that might have DELETION_POSSIBLE files
        locations = (
            session.query(models.DataLocation)
            .filter(
                models.DataLocation.active == True,  # noqa: E712
                models.DataLocation.location_type.in_(
                    [
                        models.LocationType.SOURCE,
                        models.LocationType.BUFFER,
                    ]
                ),
            )
            .all()
        )
        for location in locations:
            try:
                process_deletion_possible_raw_data_files(session, location)
                session.commit()
            except Exception as e:
                logger.error(
                    f"Error processing DELETION_POSSIBLE files for location {location.name}: {str(e)}"
                )
                session.rollback()
                continue
    except Exception as e:
        logger.error("Error processing DELETION_POSSIBLE files", error=str(e))
        session.rollback()
    finally:
        logger.debug("###### End Deletion Manager ######")
        session.close()
The deletion manager cycles through the following operations:
Delete DataTransferPackages from buffers
Delete RawDataPackages from SOURCE and LTA buffers
Delete individual RawDataFiles from processing locations
Delete staged (unpacked) files from processing locations
Process RawDataFiles marked as DELETION_POSSIBLE
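In the running service this entry point is simply invoked in a loop. A minimal sketch of that loop, assuming ccat_data_transfer_settings is importable from ccat_data_transfer.config.config (the ccat_data_transfer deletion-manager command described under Manual Deletion wraps the same call):

import time

from ccat_data_transfer.config.config import ccat_data_transfer_settings  # import path assumed
from ccat_data_transfer.deletion_manager import delete_data_packages

while True:
    delete_data_packages(verbose=False)
    time.sleep(ccat_data_transfer_settings.DELETION_MANAGER_SLEEP_TIME)  # 5 seconds by default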
Deletion Decision Logic#
The system uses specific conditions to determine when data can be safely deleted from each location type.
RawDataPackages#
From SOURCE Site Buffers#
ccat_data_transfer.deletion_manager.can_delete_raw_data_package_from_source_buffer()
A RawDataPackage can be deleted from SOURCE site buffers when:
Location is of type BUFFER at a SOURCE site
Package exists in at least one LONG_TERM_ARCHIVE location (not just the LTA site buffer)
Physical copy at LTA has status PRESENT
Side Effect: When a RawDataPackage is deleted from SOURCE, all associated RawDataFiles are marked as DELETION_POSSIBLE.
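The heart of the first two conditions is a query for a PRESENT copy in an actual LONG_TERM_ARCHIVE location (not merely the BUFFER at the LTA site). A minimal sketch of that check, combined with the ARCHIVED safety rule from above (the helper name and the state attribute are illustrative):

from ccat_ops_db import models

def raw_data_package_safe_to_delete_from_source(session, raw_data_package) -> bool:
    """Illustrative check: package ARCHIVED and a PRESENT copy in a LONG_TERM_ARCHIVE location."""
    if raw_data_package.state != models.PackageState.ARCHIVED:  # attribute name assumed
        return False
    lta_copies = (
        session.query(models.RawDataPackagePhysicalCopy)
        .join(models.RawDataPackagePhysicalCopy.data_location)
        .filter(
            models.RawDataPackagePhysicalCopy.raw_data_package_id == raw_data_package.id,
            models.RawDataPackagePhysicalCopy.status == models.PhysicalCopyStatus.PRESENT,
            models.DataLocation.location_type == models.LocationType.LONG_TERM_ARCHIVE,
        )
        .count()
    )
    return lta_copies > 0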
From LTA Site Buffers#
ccat_data_transfer.deletion_manager.can_delete_raw_data_package_from_lta_buffer()
A RawDataPackage can be deleted from LTA site buffers when:
Location is of type BUFFER at an LTA site
Package exists in the actual DataLocation with type LONG_TERM_ARCHIVE at the same site
Physical copy at LTA has status PRESENT
Never Deleted From#
DataLocation with type LONG_TERM_ARCHIVE - These provide permanent storage and data is never automatically deleted
Implementation#
ccat_data_transfer.deletion_manager.delete_raw_data_packages_bulk()
Bulk deletion implementation for RawDataPackages:
def delete_raw_data_packages_bulk(verbose=False):
    """Bulk deletion of raw data packages and their associated files from source locations.

    This function finds raw data packages that have been fully archived in LTA and can be
    safely deleted from source locations. It schedules bulk deletion tasks for both the
    packages and their associated raw data files, taking into account that SOURCE and BUFFER
    locations can be on different computers.
    """
    if verbose:
        logger.setLevel(logging.DEBUG)

    logger.info("Starting bulk raw data package deletion")
    db = DatabaseConnection()
    session, _ = db.get_connection()

    try:
        # Find deletable packages grouped by location
        deletable_packages_by_location = find_deletable_raw_data_packages_by_location(
            session
        )
        logger.info(
            f"Found {len(deletable_packages_by_location)} locations with deletable packages"
        )

        total_packages = sum(
            len(packages) for packages in deletable_packages_by_location.values()
        )
        logger.info(
            f"Processing {total_packages} raw data packages for bulk deletion across {len(deletable_packages_by_location)} locations"
        )

        if total_packages == 0:
            return

        # Process each location separately
        for location, packages in deletable_packages_by_location.items():
            try:
                # Get physical copies for packages in this location
                package_ids = [p.id for p in packages]
                physical_copies = (
                    session.query(models.RawDataPackagePhysicalCopy)
                    .with_for_update()
                    .filter(
                        models.RawDataPackagePhysicalCopy.raw_data_package_id.in_(
                            package_ids
                        ),
                        models.RawDataPackagePhysicalCopy.data_location_id
                        == location.id,
                        models.RawDataPackagePhysicalCopy.status
                        == models.PhysicalCopyStatus.PRESENT,
                    )
                    .all()
                )

                if not physical_copies:
                    logger.warning(
                        "No pending physical copies found for bulk deletion",
                        location_name=location.name,
                        package_count=len(packages),
                    )
                    continue

                # For SOURCE locations, mark associated raw data files as DELETION_POSSIBLE
                if location.location_type == models.LocationType.SOURCE:
                    for package in packages:
                        mark_raw_data_files_for_deletion(session, package, location)

                # Mark all copies as scheduled for deletion
                physical_copy_ids = [pc.id for pc in physical_copies]
                for pc in physical_copies:
                    pc.status = models.PhysicalCopyStatus.DELETION_SCHEDULED
                session.flush()

                # Schedule bulk package deletion
                queue_name = route_task_by_location(OperationType.DELETION, location)
                delete_bulk_raw_data_packages.apply_async(
                    args=[physical_copy_ids],
                    kwargs={"queue_name": queue_name},
                    queue=queue_name,
                )

                logger.info(
                    "Scheduled bulk raw data package deletion",
                    location_name=location.name,
                    package_count=len(packages),
                    physical_copy_count=len(physical_copies),
                    queue=queue_name,
                )

                # Schedule bulk file deletion for each package
                schedule_bulk_file_deletions(session, packages, location)

                # Process any files marked as DELETION_POSSIBLE
                process_deletion_possible_raw_data_files(session, location)

                # Commit after each successful location to avoid holding locks
                session.commit()
                redis_.publish(
                    "transfer:overview",
                    json.dumps(
                        {
                            "type": "bulk_raw_data_package_deletion_scheduled",
                            "data": {
                                "location_name": location.name,
                                "package_count": len(packages),
                                "physical_copy_count": len(physical_copies),
                            },
                        }
                    ),
                )

            except Exception as inner_e:
                logger.error(
                    f"Error processing bulk deletion for location {location.name}: {str(inner_e)}"
                )
                session.rollback()
                continue

    except Exception as e:
        logger.error(
            "Error during bulk raw data package deletion process", error=str(e)
        )
        session.rollback()
    finally:
        session.close()
DataTransferPackages#
DataTransferPackages are temporary containers that exist only during the transfer process.
From SOURCE Site Buffers#
ccat_data_transfer.deletion_manager.can_delete_data_transfer_package_from_source_buffer()
A DataTransferPackage can be deleted from SOURCE site buffers when:
Location is of type BUFFER at a SOURCE site
Has a completed DataTransfer to at least one LTA site
Transfer has unpack_status of COMPLETED
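A rough sketch of how such a check might look; the data_transfers relationship and the unpack_status value are assumptions based on the description above, not the documented model:

from ccat_ops_db import models

def dtp_deletable_from_source_buffer(data_transfer_package, location) -> bool:
    """Illustrative only: at least one transfer to an LTA site must be completed and unpacked."""
    if location.location_type != models.LocationType.BUFFER:  # site check omitted for brevity
        return False
    return any(
        transfer.unpack_status == "COMPLETED"  # assumed representation of COMPLETED
        for transfer in data_transfer_package.data_transfers  # assumed relationship name
    )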
From LTA Site Buffers#
ccat_data_transfer.deletion_manager.can_delete_data_transfer_package_from_lta_buffer()
A DataTransferPackage can be deleted from LTA site buffers when:
Location is of type BUFFER at an LTA site
Package has been successfully transferred and unpacked at ALL other LTA site buffers
Uses round-robin routing logic to determine expected destinations
Never Stored In#
DataLocation with type LONG_TERM_ARCHIVE - DataTransferPackages are unpacked at LTA site buffers; only the extracted RawDataPackages are moved to LTA storage
Implementation#
ccat_data_transfer.deletion_manager.delete_data_transfer_packages()
RawDataFiles#
RawDataFiles follow a two-stage deletion process to handle the large number of individual files efficiently.
Stage 1: Marking as DELETION_POSSIBLE#
When a parent RawDataPackage is deleted from SOURCE, all associated RawDataFiles are marked as DELETION_POSSIBLE:
ccat_data_transfer.deletion_manager.mark_raw_data_files_for_deletion()
This uses bulk database updates to avoid performance issues:
def mark_raw_data_files_for_deletion(
    session: Session,
    raw_data_package: models.RawDataPackage,
    source_location: models.DataLocation,
) -> None:
    """
    When RawDataPackage is deleted from SOURCE, mark associated RawDataFiles as DELETION_POSSIBLE.
    Uses bulk update to avoid looping through potentially massive PhysicalCopies.
    """
    # Bulk update all RawDataFile PhysicalCopies at this source location
    updated_count = (
        session.query(models.RawDataFilePhysicalCopy)
        .filter(
            models.RawDataFilePhysicalCopy.data_location_id == source_location.id,
            models.RawDataFilePhysicalCopy.raw_data_file_id.in_(
                session.query(models.RawDataFile.id).filter(
                    models.RawDataFile.raw_data_package_id == raw_data_package.id
                )
            ),
            models.RawDataFilePhysicalCopy.status == models.PhysicalCopyStatus.PRESENT,
        )
        .update(
            {
                models.RawDataFilePhysicalCopy.status: models.PhysicalCopyStatus.DELETION_POSSIBLE
            },
            synchronize_session=False,
        )
    )

    logger.info(
        f"Marked {updated_count} RawDataFile PhysicalCopies as DELETION_POSSIBLE",
        raw_data_package_id=raw_data_package.id,
        location_id=source_location.id,
    )
Stage 2: Conditional Deletion#
Files marked as DELETION_POSSIBLE are processed based on retention policies and
buffer status:
ccat_data_transfer.deletion_manager.process_deletion_possible_raw_data_files()
The system considers:
Retention period compliance
Buffer disk usage and pressure
Location-specific rules
Access patterns
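The exact policy lives in process_deletion_possible_raw_data_files(); the sketch below only illustrates how the retention period and the buffer-status helper (shown later under Buffer Status Integration) could combine. The created_at attribute and the OR-combination are assumptions, not the documented behavior:

from datetime import datetime, timedelta, timezone

from ccat_data_transfer.deletion_manager import (
    get_buffer_status_for_location,
    should_delete_based_on_buffer_status,
)

def deletion_possible_copy_is_deletable(physical_copy, location, retention_minutes: int) -> bool:
    """Illustrative policy: delete after the retention period, or earlier under disk pressure."""
    age = datetime.now(timezone.utc) - physical_copy.created_at  # created_at assumed tz-aware
    past_retention = age > timedelta(minutes=retention_minutes)
    buffer_status = get_buffer_status_for_location(location.name)
    under_pressure = should_delete_based_on_buffer_status(location, buffer_status)
    return past_retention or under_pressure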
Processing Location Cleanup#
RawDataFiles in PROCESSING locations follow different rules based on staging job status.
PRESENT Files (Active Jobs)#
ccat_data_transfer.deletion_manager.delete_processing_raw_data_files()
Files in PROCESSING locations are deleted when:
No active StagingJob references them
All staging jobs using these files have active=False
ccat_data_transfer.deletion_manager.find_deletable_processing_raw_data_files()
STAGED Files (Completed Jobs)#
ccat_data_transfer.deletion_manager.delete_staged_raw_data_files_from_processing()
After staging jobs complete, unpacked files are cleaned up:
Finds RawDataPackages with status STAGED in PROCESSING locations
Verifies all staging jobs for these packages have active=False
Schedules bulk deletion of individual RawDataFiles
ccat_data_transfer.deletion_manager.find_deletable_staged_raw_data_files_by_location()
def find_deletable_staged_raw_data_files_by_location(
    session: Session,
) -> Dict[models.DataLocation, List[models.RawDataFilePhysicalCopy]]:
    """Find RawDataFilePhysicalCopy objects in PROCESSING locations that can be deleted, grouped by location.

    A file can be deleted if:
    1. It's in a PROCESSING location
    2. It's part of a RawDataPackage that has been staged (STAGED status)
    3. All staging jobs for that package are completed (active=False)

    Returns:
        Dictionary mapping DataLocation to list of deletable RawDataFilePhysicalCopy objects
    """
    # First, find all STAGED RawDataPackages in PROCESSING locations
    staged_packages = (
        session.query(models.RawDataPackagePhysicalCopy)
        .join(models.RawDataPackagePhysicalCopy.data_location)
        .filter(
            models.RawDataPackagePhysicalCopy.status
            == models.PhysicalCopyStatus.STAGED,
            models.DataLocation.location_type == models.LocationType.PROCESSING,
        )
        .options(
            joinedload(models.RawDataPackagePhysicalCopy.raw_data_package),
            joinedload(models.RawDataPackagePhysicalCopy.data_location),
        )
        .all()
    )

    logger.info(
        f"Found {len(staged_packages)} STAGED RawDataPackages in processing locations"
    )

    deletable_copies_by_location = {}

    for package_physical_copy in staged_packages:
        raw_data_package = package_physical_copy.raw_data_package
        processing_location = package_physical_copy.data_location

        # Check if all staging jobs for this package are completed (active=False)
        active_staging_jobs = (
            session.query(models.StagingJob)
            .join(models.StagingJob.raw_data_packages)
            .filter(
                models.StagingJob.raw_data_packages.any(id=raw_data_package.id),
                models.StagingJob.active == True,  # noqa: E712
            )
            .count()
        )

        if active_staging_jobs > 0:
            logger.debug(
                "Package has active staging jobs, skipping deletion",
                package_id=raw_data_package.id,
                active_jobs=active_staging_jobs,
                location_name=processing_location.name,
            )
            continue

        # All staging jobs are completed, so we can delete the RawDataFiles
        # Find all RawDataFile physical copies for this package in this processing location
        file_physical_copies = (
            session.query(models.RawDataFilePhysicalCopy)
            .join(models.RawDataFilePhysicalCopy.raw_data_file)
            .filter(
                models.RawDataFilePhysicalCopy.data_location_id
                == processing_location.id,
                models.RawDataFilePhysicalCopy.status
                == models.PhysicalCopyStatus.PRESENT,
                models.RawDataFile.raw_data_package_id == raw_data_package.id,
            )
            .options(
                joinedload(models.RawDataFilePhysicalCopy.raw_data_file),
                joinedload(models.RawDataFilePhysicalCopy.data_location),
            )
            .all()
        )

        logger.info(
            f"Found {len(file_physical_copies)} RawDataFiles to delete for package {raw_data_package.id}",
            package_id=raw_data_package.id,
            location_name=processing_location.name,
            file_count=len(file_physical_copies),
        )

        # Group by location
        if processing_location not in deletable_copies_by_location:
            deletable_copies_by_location[processing_location] = []
        deletable_copies_by_location[processing_location].extend(file_physical_copies)

    total_files = sum(len(files) for files in deletable_copies_by_location.values())
    logger.info(
        f"Total RawDataFiles marked for deletion from processing: {total_files} across {len(deletable_copies_by_location)} locations"
    )
    return deletable_copies_by_location
Deletion Decision Matrix#
The following table summarizes when data is eligible for deletion:
| Data Type | Location Type | Deletion Condition | Safety Requirement |
|---|---|---|---|
| RawDataPackage | SOURCE Buffer | Exists in LTA DataLocation | ≥1 LTA DataLocation copy with PRESENT status |
| RawDataPackage | LTA Site Buffer | Exists in same site's LTA DataLocation | Same site LTA DataLocation copy with PRESENT status |
| RawDataPackage | LTA DataLocation | Never (automatic) | N/A - Permanent storage |
| DataTransferPackage | SOURCE Buffer | Verified at LTA site buffer | Completed transfer + unpack to ≥1 LTA site |
| DataTransferPackage | LTA Site Buffer | Replicated to all other LTA sites | Completed transfers to all LTA sites |
| DataTransferPackage | LTA DataLocation | Not stored here | N/A - Temporary containers only |
| RawDataFile | SOURCE/BUFFER | Parent package deleted + retention/buffer rules | DELETION_POSSIBLE status + policy compliance |
| RawDataFile | PROCESSING | No active staging jobs | All StagingJobs have active=False |
Worker Implementation#
Deletion tasks execute on workers with direct access to the storage locations.
Deletion Task Base Class#
- class ccat_data_transfer.deletion_manager.DeletionTask
Bases: CCATEnhancedSQLAlchemyTask
Base class for deletion tasks.
- __init__()
- get_retry_count(session, operation_id)
Get current retry count for this operation.
- reset_state_on_failure(session, physical_copy_id, exc)
Reset deletion state for retry.
- mark_permanent_failure(session, physical_copy_id, exc)
Mark deletion as permanently failed.
- get_operation_info(args, kwargs)
Get additional context for deletion tasks.
- acks_late = True
When enabled messages for this task will be acknowledged after the task has been executed, and not right before (the default behavior).
Please note that this means the task may be executed twice if the worker crashes mid execution.
The application default can be overridden with the task_acks_late setting.
- acks_on_failure_or_timeout = True
When enabled messages for this task will be acknowledged even if it fails or times out.
Configuring this setting only applies to tasks that are acknowledged after they have been executed and only if task_acks_late is enabled.
The application default can be overridden with the task_acks_on_failure_or_timeout setting.
- ignore_result = False
If enabled the worker won't store task state and return values for this task. Defaults to the task_ignore_result setting.
- priority = None
Default task priority.
- rate_limit = None
Rate limit for this task type. Examples: None (no rate limit), '100/s' (hundred tasks a second), '100/m' (hundred tasks a minute), '100/h' (hundred tasks an hour).
- reject_on_worker_lost = True
Even if acks_late is enabled, the worker will acknowledge tasks when the worker process executing them abruptly exits or is signaled (e.g., KILL/INT, etc.).
Setting this to true allows the message to be re-queued instead, so that the task will execute again by the same worker, or another worker.
Warning: Enabling this can cause message loops; make sure you know what you're doing.
- request_stack = <celery.utils.threads._LocalStack object>
Task request stack, the current request will be the topmost.
- serializer = 'json'
The name of a serializer that is registered with kombu.serialization.registry. Default is 'json'.
- store_eager_result = False
- store_errors_even_if_ignored = False
When enabled errors will be stored even if the task is otherwise configured to ignore results.
- track_started = True
If enabled the task will report its status as 'started' when the task is executed by a worker. Disabled by default as the normal behavior is to not report that level of granularity. Tasks are either pending, finished, or waiting to be retried.
Having a 'started' status can be useful for when there are long running tasks and there's a need to report what task is currently running.
The application default can be overridden using the task_track_started setting.
- typing = True
Enable argument checking. You can set this to false if you don't want the signature to be checked when calling the task. Defaults to app.strict_typing.
Single File Deletion#
ccat_data_transfer.deletion_manager.delete_physical_copy()
This Celery task handles deletion of a single physical copy:
@app.task(
    base=DeletionTask,
    name="ccat:data_transfer:delete:physical_copy",
    bind=True,
)
def delete_physical_copy(
    self,
    physical_copy_id: int,
    queue_name: str,
    session: Session = None,
) -> None:
    """Deletes a physical copy from specified archive.

    Parameters
    ----------
    self : celery.Task
        The Celery task instance.
    physical_copy_id : int
        The ID of the PhysicalCopy object in the database.
    queue_name : str
        The name of the queue to use for this task.
    session : Session, optional
        An existing database session to use. If None, a new session will be created.

    Returns
    -------
    None

    Raises
    ------
    ValueError
        If the physical copy is not found or if the file path is invalid.
    RuntimeError
        If the deletion operation fails.
    """
    # Set the queue dynamically
    self.request.delivery_info["routing_key"] = queue_name

    if session is None:
        with self.session_scope() as session:
            return _delete_physical_copy_internal(session, physical_copy_id)
    else:
        return _delete_physical_copy_internal(session, physical_copy_id)
Bulk Deletion Operations#
For efficiency, the system batches deletions:
Bulk RawDataFile Deletion:
ccat_data_transfer.deletion_manager.delete_bulk_raw_data_files()
Bulk RawDataPackage Deletion:
ccat_data_transfer.deletion_manager.delete_bulk_raw_data_packages()
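Both bulk tasks are dispatched the same way as delete_bulk_raw_data_packages in the code above: collect the physical-copy IDs, route by location, and submit one Celery task. A short sketch for the file variant, assuming it takes the same arguments as the package task (the production code routes this through schedule_bulk_file_deletions(); the helper below only illustrates the dispatch pattern):

from ccat_data_transfer.deletion_manager import delete_bulk_raw_data_files
from ccat_data_transfer.operation_types import OperationType
from ccat_data_transfer.queue_discovery import route_task_by_location

def schedule_bulk_file_deletion(physical_copy_ids, location):
    """Illustrative dispatch of one bulk file-deletion task for a single location."""
    queue_name = route_task_by_location(OperationType.DELETION, location)
    delete_bulk_raw_data_files.apply_async(
        args=[physical_copy_ids],
        kwargs={"queue_name": queue_name},
        queue=queue_name,
    )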
Internal Implementation#
The internal bulk deletion function handles the actual deletion work:
def _delete_bulk_raw_data_files_internal(
    session: Session, physical_copy_ids: List[int]
) -> None:
    """Internal function to handle bulk deletion of raw data file physical copies."""
    logger.info(
        "Starting bulk raw data file deletion",
        physical_copy_count=len(physical_copy_ids),
        timestamp=datetime.now(BERLIN_TZ).isoformat(),
    )

    successful_deletions = 0
    failed_deletions = 0

    for physical_copy_id in physical_copy_ids:
        try:
            # First get the base PhysicalCopy to determine the type
            base_physical_copy = (
                session.query(models.PhysicalCopy)
                .with_for_update()
                .get(physical_copy_id)
            )

            if not base_physical_copy:
                logger.warning(f"Physical copy {physical_copy_id} not found")
                failed_deletions += 1
                continue

            if (
                base_physical_copy.status
                != models.PhysicalCopyStatus.DELETION_SCHEDULED
            ):
                logger.warning(
                    f"Physical copy {physical_copy_id} is in unexpected state: {base_physical_copy.status}"
                )
                failed_deletions += 1
                continue

            # Now load the specific polymorphic subclass without with_for_update
            if base_physical_copy.type == "raw_data_file_physical_copy":
                physical_copy = (
                    session.query(models.RawDataFilePhysicalCopy)
                    .options(
                        joinedload(models.RawDataFilePhysicalCopy.raw_data_file),
                        joinedload(models.RawDataFilePhysicalCopy.data_location),
                    )
                    .get(physical_copy_id)
                )
            else:
                logger.warning(
                    f"Physical copy {physical_copy_id} is not a raw data file type: {base_physical_copy.type}"
                )
                failed_deletions += 1
                continue

            if not physical_copy:
                logger.warning(
                    f"Failed to load raw data file physical copy {physical_copy_id}"
                )
                failed_deletions += 1
                continue

            # Mark as in progress
            base_physical_copy.status = models.PhysicalCopyStatus.DELETION_IN_PROGRESS
            session.flush()

            # Delete the actual file
            if isinstance(physical_copy.data_location, models.DiskDataLocation):
                if os.path.exists(physical_copy.full_path):
                    os.remove(physical_copy.full_path)
                    logger.debug(f"Deleted disk file: {physical_copy.full_path}")
                else:
                    logger.debug(f"File already deleted: {physical_copy.full_path}")

            elif isinstance(physical_copy.data_location, models.S3DataLocation):
                s3_client = get_s3_client()
                s3_client.delete_object(
                    Bucket=physical_copy.data_location.bucket_name,
                    Key=physical_copy.full_path,
                )
                logger.debug(f"Deleted S3 object: {physical_copy.full_path}")

            elif isinstance(physical_copy.data_location, models.TapeDataLocation):
                logger.warning(
                    f"Tape deletion not implemented for: {physical_copy.full_path}"
                )
                # For now, just mark as deleted without actually deleting from tape
            else:
                raise RuntimeError(
                    f"Unsupported storage type: {type(physical_copy.data_location)}"
                )

            # Mark as deleted
            base_physical_copy.status = models.PhysicalCopyStatus.DELETED
            base_physical_copy.deleted_at = datetime.now(BERLIN_TZ)
            successful_deletions += 1

        except Exception as e:
            logger.error(f"Error deleting physical copy {physical_copy_id}: {str(e)}")
            failed_deletions += 1
            # Reset status for retry
            if "base_physical_copy" in locals():
                base_physical_copy.status = models.PhysicalCopyStatus.PRESENT
                if not hasattr(base_physical_copy, "attempt_count"):
                    base_physical_copy.attempt_count = 0
                base_physical_copy.attempt_count += 1

    # Commit all changes
    session.commit()

    # Publish results
    redis_.publish(
        "transfer:overview",
        json.dumps(
            {
                "type": "bulk_raw_data_file_deletion_completed",
                "data": {
                    "successful_deletions": successful_deletions,
                    "failed_deletions": failed_deletions,
                    "total_deletions": len(physical_copy_ids),
                },
            }
        ),
    )

    logger.info(
        "Bulk raw data file deletion completed",
        successful_deletions=successful_deletions,
        failed_deletions=failed_deletions,
        total_deletions=len(physical_copy_ids),
    )
Benefits of Bulk Operations:
Reduces number of Celery task submissions
Decreases database transaction overhead
Enables more efficient resource utilization
Faster overall deletion throughput
Buffer Management Integration#
The deletion manager integrates with the buffer monitoring system to respond to disk pressure.
Buffer Manager#
ccat_data_transfer.buffer_manager.BufferManager
The buffer manager continuously monitors disk usage:
def _check_thresholds(self):
    """Check buffer usage against configured thresholds."""
    with self._lock:
        usage = self._buffer_state["usage_percent"]

        # Check emergency threshold
        if usage >= ccat_data_transfer_settings.BUFFER_EMERGENCY_THRESHOLD_PERCENT:
            if not self._buffer_state["is_emergency"]:
                logger.warning(
                    "Buffer emergency threshold reached",
                    usage_percent=usage,
                    threshold=ccat_data_transfer_settings.BUFFER_EMERGENCY_THRESHOLD_PERCENT,
                )
            self._buffer_state["is_emergency"] = True
            self._buffer_state["is_critical"] = True

        # Check critical threshold
        elif usage >= ccat_data_transfer_settings.BUFFER_CRITICAL_THRESHOLD_PERCENT:
            if not self._buffer_state["is_critical"]:
                logger.warning(
                    "Buffer critical threshold reached",
                    usage_percent=usage,
                    threshold=ccat_data_transfer_settings.BUFFER_CRITICAL_THRESHOLD_PERCENT,
                )
            self._buffer_state["is_critical"] = True
            self._buffer_state["is_emergency"] = False

        # Check warning threshold
        elif usage >= ccat_data_transfer_settings.BUFFER_WARNING_THRESHOLD_PERCENT:
            logger.warning(
                "Buffer warning threshold reached",
                usage_percent=usage,
                threshold=ccat_data_transfer_settings.BUFFER_WARNING_THRESHOLD_PERCENT,
            )
            self._buffer_state["is_critical"] = False
            self._buffer_state["is_emergency"] = False

        # Check recovery threshold
        elif usage <= ccat_data_transfer_settings.BUFFER_RECOVERY_THRESHOLD_PERCENT:
            if (
                self._buffer_state["is_critical"]
                or self._buffer_state["is_emergency"]
            ):
                logger.info(
                    "Buffer recovered below critical threshold",
                    usage_percent=usage,
                    threshold=ccat_data_transfer_settings.BUFFER_RECOVERY_THRESHOLD_PERCENT,
                )
            self._buffer_state["is_critical"] = False
            self._buffer_state["is_emergency"] = False
Buffer Thresholds#
Thresholds are configured per environment in settings.toml:
BUFFER_WARNING_THRESHOLD_PERCENT = 70
BUFFER_CRITICAL_THRESHOLD_PERCENT = 85
BUFFER_EMERGENCY_THRESHOLD_PERCENT = 95
BUFFER_RECOVERY_THRESHOLD_PERCENT = 60
Buffer Status Integration#
ccat_data_transfer.deletion_manager.get_buffer_status_for_location()
ccat_data_transfer.deletion_manager.should_delete_based_on_buffer_status()
The system uses different thresholds for different location types:
def should_delete_based_on_buffer_status(
    location: models.DataLocation, buffer_status: dict
) -> bool:
    """Enhanced buffer status checking with location-specific logic."""
    if not buffer_status:
        return False

    # Different thresholds for different location types
    if location.location_type == models.LocationType.SOURCE:
        return buffer_status.get("disk_usage_percent", 0) > 80
    elif location.location_type == models.LocationType.BUFFER:
        return buffer_status.get("disk_usage_percent", 0) > 85
    else:
        return False
Escalating Response to Disk Pressure#
The system adapts its behavior based on buffer conditions:
< 70%: Normal operations
• Standard retention policies
• Full parallel transfer capacity
70-85%: Warning state
• Logged warnings
• Normal deletion continues
85-95%: Critical state
• Reduced parallel transfers
• Accelerated deletion of eligible data
• More frequent manager cycles
> 95%: Emergency state
• New data creation may be paused
• Aggressive cleanup of all eligible data
• Administrator alerts sent
• Minimal parallel transfers
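On the transfer side, this escalation maps to the MAX_CRITICAL_TRANSFERS / MAX_NORMAL_TRANSFERS settings described under Configuration below. A minimal sketch of that mapping, using the buffer-state flags maintained by BufferManager (the helper and the settings import path are illustrative):

from ccat_data_transfer.config.config import ccat_data_transfer_settings  # import path assumed

def max_parallel_transfers(buffer_state: dict) -> int:
    """Illustrative: pick transfer parallelism from the buffer-state flags."""
    if buffer_state.get("is_emergency") or buffer_state.get("is_critical"):
        return ccat_data_transfer_settings.MAX_CRITICAL_TRANSFERS  # 1 by default
    return ccat_data_transfer_settings.MAX_NORMAL_TRANSFERS  # 5 by default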
Configuration#
Deletion Manager Settings#
Key configuration parameters come from ccat_data_transfer.config.config and settings.toml; the most relevant groups are summarized below.
Manager Sleep Times#
Control how frequently each manager checks for work:
RAW_DATA_PACKAGE_MANAGER_SLEEP_TIME = 10 # seconds
DATA_TRANSFER_PACKAGE_MANAGER_SLEEP_TIME = 5
TRANSFER_MANAGER_SLEEP_TIME = 5
DELETION_MANAGER_SLEEP_TIME = 5 # Deletion check frequency
STAGING_MANAGER_SLEEP_TIME = 5
Retention Policies#
RETENTION_PERIOD_MINUTES - Default retention for processing data (30 days = 43200 minutes)
DISK_USAGE_THRESHOLD_PERCENT - Threshold that triggers accelerated cleanup
Transfer Limits#
These settings control how the system responds to buffer pressure:
MAX_CRITICAL_TRANSFERS - Maximum parallel transfers when buffer is critical (1)
MAX_NORMAL_TRANSFERS - Maximum parallel transfers under normal conditions (5)
Location-Specific Overrides#
Individual DataLocation instances can override defaults with custom retention policies.
Staging and STAGED Status#
The STAGED status has special meaning in PROCESSING locations.
What is STAGED?#
When a StagingJob completes:
RawDataPackage is transferred to PROCESSING location
Package (tar archive) is unpacked
Individual RawDataFiles are extracted
PhysicalCopy records created for each RawDataFile
Original package archive is deleted to save space
RawDataPackagePhysicalCopy status set to STAGED
This means “unpacked and archive removed”:
def _mark_package_as_staged_and_cleanup(
    session: Session,
    staging_job: models.StagingJob,
    raw_data_package: models.RawDataPackage,
    destination_path: str,
) -> None:
    """Mark RawDataPackage as STAGED and delete the physical package file.

    After unpacking and creating RawDataFile physical copies, we mark the package
    as STAGED and remove the physical package file to save space.
    """
    # Find or create the RawDataPackage physical copy record
    package_physical_copy = (
        session.query(models.RawDataPackagePhysicalCopy)
        .filter(
            and_(
                models.RawDataPackagePhysicalCopy.raw_data_package_id
                == raw_data_package.id,
                models.RawDataPackagePhysicalCopy.data_location_id
                == staging_job.destination_data_location_id,
            )
        )
        .first()
    )

    if not package_physical_copy:
        # Create new record if it doesn't exist
        package_physical_copy = models.RawDataPackagePhysicalCopy(
            raw_data_package_id=raw_data_package.id,
            data_location_id=staging_job.destination_data_location_id,
            status=models.PhysicalCopyStatus.STAGED,
            created_at=datetime.datetime.now(datetime.timezone.utc),
        )
        session.add(package_physical_copy)
    else:
        # Update existing record to STAGED
        package_physical_copy.status = models.PhysicalCopyStatus.STAGED

    # Delete the physical package file
    # For staging, the package file is stored in a temporary location
    # We need to find where the original package file was downloaded
    package_file_path = None

    # Look for the package file in the destination location's raw_data_packages directory
    if isinstance(staging_job.destination_data_location, models.DiskDataLocation):
        # Use just the filename to match the temporary path construction
        package_filename = os.path.basename(raw_data_package.relative_path)
        package_file_path = os.path.join(
            staging_job.destination_data_location.path,
            "raw_data_packages",
            package_filename,
        )

    if package_file_path and os.path.exists(package_file_path):
        try:
            os.remove(package_file_path)
            logger.info(f"Deleted physical package file: {package_file_path}")
        except OSError as e:
            logger.warning(
                f"Failed to delete physical package file {package_file_path}: {str(e)}"
            )
    else:
        logger.debug(
            f"Package file not found at expected location: {package_file_path}"
        )

    session.commit()
    logger.info(f"Marked RawDataPackage {raw_data_package.id} as STAGED")
Cleanup Process#
When staging jobs complete (active=False):
System identifies STAGED packages with inactive jobs
Finds all RawDataFile physical copies for these packages
Schedules bulk deletion of individual files
Updates RawDataPackagePhysicalCopy to DELETED
This two-phase approach (unpack then delete) allows:
Efficient access to individual files during processing
Space savings by removing redundant archives
Clean separation between “in use” and “cleanup ready” states
Deletion Audit Trail#
All deletions are logged and tracked for accountability.
Database Records#
PhysicalCopy records are never deleted from the database, only marked:
class PhysicalCopy:
    status: PhysicalCopyStatus  # DELETED
    deleted_at: datetime        # When deletion occurred
    # Additional tracking fields depend on subclass

PhysicalCopy subclasses retain their records to maintain a complete audit trail.
Deletion Logging#
The deletion manager includes helper functions for structured logging:
def _add_deletion_log(
    session: Session,
    physical_copy: models.PhysicalCopy,
    message: str,
) -> None:
    """Add deletion log entry for audit trail."""
    # Logs include:
    # - Timestamp
    # - Physical copy ID and type
    # - Location information
    # - Reason for deletion
    # - Success/failure status
Query Deletion History#
Database queries can retrieve deletion history:
-- Show all deletions in last 24 hours
SELECT
pc.id,
pc.type,
pc.status,
pc.deleted_at,
dl.name as location_name
FROM physical_copy pc
JOIN data_location dl ON pc.data_location_id = dl.id
WHERE pc.status = 'DELETED'
AND pc.deleted_at > NOW() - INTERVAL '24 hours'
ORDER BY pc.deleted_at DESC;
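The same history can be pulled through the ORM; a small illustrative equivalent of the SQL above, assuming deleted_at is stored timezone-aware:

from datetime import datetime, timedelta, timezone

from ccat_ops_db import models

def deletions_last_24h(session):
    """Illustrative ORM version of the deletion-history query above."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    return (
        session.query(models.PhysicalCopy)
        .filter(
            models.PhysicalCopy.status == models.PhysicalCopyStatus.DELETED,
            models.PhysicalCopy.deleted_at > cutoff,
        )
        .order_by(models.PhysicalCopy.deleted_at.desc())
        .all()
    )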
Log Files#
Structured logs capture deletion details using the centralized logging system:
{
"timestamp": "2024-11-27T10:30:00Z",
"level": "INFO",
"logger": "ccat_data_transfer.deletion_manager",
"event": "physical_copy_deleted",
"physical_copy_id": 12345,
"copy_type": "raw_data_file",
"location": "ccat_telescope_buffer",
"size_bytes": 1048576,
"reason": "parent_package_archived"
}
Manual Deletion#
Administrators can manually trigger deletion operations when needed.
Warning
Manual deletion should be used with caution. Always verify data exists in LTA locations before forcing deletion from SOURCE or BUFFER locations.
Available CLI Commands#
The system provides limited CLI commands for inspection:
List Data Locations:
# View all available locations
ccat_data_transfer list-locations
This shows all configured sites and their locations, useful for identifying location names for manual operations.
Monitor Disk Usage:
# Monitor all active disk locations
ccat_data_transfer disk-monitor --all
# Monitor specific location
ccat_data_transfer disk-monitor --location-name cologne_buffer
# Monitor by site
ccat_data_transfer disk-monitor --site cologne
Python API for Manual Operations#
For administrative scripting and manual deletion operations, use the Python API:
Inspect Deletable Data:
from ccat_data_transfer.deletion_manager import (
    find_deletable_raw_data_packages_by_location,
    find_deletable_data_transfer_packages,
)
from ccat_data_transfer.database import DatabaseConnection

# Get database connection
db = DatabaseConnection()
session, _ = db.get_connection()

try:
    # Find deletable RawDataPackages by location
    deletable_packages = find_deletable_raw_data_packages_by_location(session)

    print("\n=== Deletable RawDataPackages ===")
    for location, packages in deletable_packages.items():
        total_size = sum(p.size for p in packages)
        print(f"\nLocation: {location.name} ({location.location_type.value})")
        print(f" Site: {location.site.name}")
        print(f" Packages: {len(packages)}")
        print(f" Total size: {total_size / (1024**3):.2f} GB")

    # Find deletable DataTransferPackages
    deletable_transfers = find_deletable_data_transfer_packages(session)

    print("\n=== Deletable DataTransferPackages ===")
    for package, location in deletable_transfers:
        print(f"Package: {package.file_name}")
        print(f" Location: {location.name}")
        print(f" Size: {package.size / (1024**3):.2f} GB")
finally:
    session.close()
Trigger Manual Deletion Cycle:
from ccat_data_transfer.deletion_manager import delete_data_packages
# Run one deletion cycle with verbose logging
delete_data_packages(verbose=True)
print("Deletion cycle completed")
Schedule Specific Deletions:
from ccat_data_transfer.deletion_manager import delete_physical_copy
from ccat_data_transfer.queue_discovery import route_task_by_location
from ccat_data_transfer.operation_types import OperationType
from ccat_data_transfer.database import DatabaseConnection
from ccat_ops_db import models

db = DatabaseConnection()
session, _ = db.get_connection()

try:
    # Find a specific physical copy to delete
    physical_copy = session.query(models.RawDataPackagePhysicalCopy).filter(
        models.RawDataPackagePhysicalCopy.id == 12345,
        models.RawDataPackagePhysicalCopy.status == models.PhysicalCopyStatus.PRESENT,
    ).first()

    if physical_copy:
        # Safety check: Verify it's actually deletable
        # (Add your safety checks here based on location type and package state)

        # Mark as scheduled
        physical_copy.status = models.PhysicalCopyStatus.DELETION_SCHEDULED
        session.commit()

        # Route to appropriate queue
        queue_name = route_task_by_location(
            OperationType.DELETION,
            physical_copy.data_location,
        )

        # Schedule deletion task
        delete_physical_copy.apply_async(
            args=[physical_copy.id],
            kwargs={"queue_name": queue_name},
            queue=queue_name,
        )

        print(f"Scheduled deletion of physical copy {physical_copy.id}")
finally:
    session.close()
Deletion Service Management#
The deletion manager runs as a continuous service. To control it:
Start the deletion manager:
# Start as a service (runs continuously)
ccat_data_transfer deletion-manager
# Start with verbose logging
ccat_data_transfer deletion-manager -v
The deletion manager will run in a loop, checking for deletable data every DELETION_MANAGER_SLEEP_TIME seconds (default: 5 seconds).
In Docker Compose deployments:
The deletion manager runs as a service defined in docker-compose.yml. To restart:
# Restart the deletion manager service
docker-compose restart deletion-manager
# View deletion manager logs
docker-compose logs -f deletion-manager
Safety Considerations#
When performing manual deletions:
Verify LTA copies exist - Always check that data is safely in LTA before deleting from SOURCE
Check package state - Ensure RawDataPackage state is ARCHIVED
Review deletion logs - Check logs to understand why automatic deletion hasn't occurred
Test in development first - Run manual deletion scripts in dev environment
Use transactions - Wrap operations in database transactions for atomicity
Monitor disk space - Check if manual deletion is actually needed or if automatic cleanup is working
Data Recovery#
If data is accidentally deleted, recovery options depend on the location type.
Recovery from LTA#
If data was deleted from PROCESSING or BUFFER locations:
Verify data exists in DataLocation with type LONG_TERM_ARCHIVE
Create a new StagingJob to re-stage the data
System will retrieve data from LTA and unpack to PROCESSING location
No actual data loss, just need to re-copy
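A sketch of what re-staging could look like from the Python API; the StagingJob constructor arguments are assumptions based on the fields used elsewhere in this document (destination_data_location_id, active, raw_data_packages), not the documented API:

from ccat_data_transfer.database import DatabaseConnection
from ccat_ops_db import models

db = DatabaseConnection()
session, _ = db.get_connection()
try:
    package = session.query(models.RawDataPackage).get(12345)  # package to recover
    processing = (
        session.query(models.DataLocation)
        .filter(models.DataLocation.location_type == models.LocationType.PROCESSING)
        .first()
    )
    # Assumed constructor; the real StagingJob may require additional fields.
    staging_job = models.StagingJob(
        destination_data_location_id=processing.id,
        active=True,
    )
    staging_job.raw_data_packages.append(package)
    session.add(staging_job)
    session.commit()
finally:
    session.close()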
Recovery from SOURCE#
If data was deleted from SOURCE before reaching LTA (should never happen due to safety checks):
Check database for PhysicalCopy records
Verify if package exists in any LTA location
If in LTA: Can be recovered via staging
If not in LTA: Data may be permanently lost - check backup systems
Prevention Mechanisms#
Multiple safeguards prevent accidental deletion:
SOURCE deletions require package state ARCHIVED
Double-check in worker before actual file deletion
Database transactions ensure consistency
Deletion manager logs all decisions
Physical copy records retained for audit
Best Practices#
For Instrument Teams#
File data promptly - Use ops-db-api to register new data quickly
Never manually delete - Let the system manage lifecycle automatically
Monitor filing status - Check ops-db-ui for package states
Trust the system - Automatic lifecycle management is safer than manual intervention
For Administrators#
Monitor buffer trends - Add capacity before reaching warning thresholds (70%)
Review deletion logs - Periodically check for unexpected patterns
Adjust retention periods - Tune based on actual usage patterns and disk capacity
Test recovery procedures - Regularly verify staging from LTA works correctly
Monitor metrics - Use InfluxDB dashboards to track deletion rates
For Scientists#
Set appropriate retention - Configure StagingJob retention periods based on analysis needs
Mark jobs inactive - Set active=False when processing completes to enable cleanup
Don't rely on PROCESSING - Use LTA locations for long-term data access, not temporary processing areas
Plan disk usage - Consider data volume when creating multiple staging jobs
For Developers#
Always check package state - Verify ARCHIVED state before deleting from SOURCE
Use bulk operations - Batch deletions for efficiency when handling many files
Add generous logging - Structured logs are essential for debugging deletion issues
Test deletion logic - Thoroughly test edge cases in safety checks
Consider race conditions - Use database transactions and locks appropriately
Troubleshooting#
Common Issues#
Data not deleting from SOURCE
Check:
Package state is ARCHIVED (not just TRANSFERRING)
Physical copy exists in an LTA location with status PRESENT
Deletion manager is running and processing the location
Check logs for errors in deletion manager cycle
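The first two checks can be run quickly from a Python shell; a minimal diagnostic sketch, assuming the package state is exposed as RawDataPackage.state:

from ccat_data_transfer.database import DatabaseConnection
from ccat_ops_db import models

db = DatabaseConnection()
session, _ = db.get_connection()
try:
    package = session.query(models.RawDataPackage).get(12345)  # package that is not being cleaned up
    print(f"Package state: {package.state}")  # expect ARCHIVED
    copies = (
        session.query(models.RawDataPackagePhysicalCopy)
        .filter(models.RawDataPackagePhysicalCopy.raw_data_package_id == package.id)
        .all()
    )
    for copy in copies:
        print(f"{copy.data_location.name} ({copy.data_location.location_type.value}): {copy.status}")
finally:
    session.close()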
Buffer filling up
Solutions:
Verify deletion manager is running correctly
Check if data is actually reaching LTA
Review buffer thresholds in configuration
Consider decreasing DELETION_MANAGER_SLEEP_TIME (for more frequent cycles)
Manually trigger cleanup if needed
Files stuck in DELETION_POSSIBLE
This means files are waiting for retention/buffer policies:
Check buffer status for the location
Verify retention period settings
Review the should_delete_based_on_buffer_status logic
Check if buffer monitoring is active
Debugging#
Enable verbose logging:
from ccat_data_transfer.deletion_manager import delete_data_packages
# Run with verbose logging
delete_data_packages(verbose=True)
Check Redis for buffer status:
from ccat_data_transfer.deletion_manager import get_buffer_status_for_location
status = get_buffer_status_for_location("cologne_buffer")
print(f"Disk usage: {status.get('disk_usage_percent')}%")
Query database for deletion candidates:
from ccat_data_transfer.deletion_manager import (
find_deletable_raw_data_packages_by_location
)
from ccat_data_transfer.database import DatabaseConnection
db = DatabaseConnection()
session, _ = db.get_connection()
deletable = find_deletable_raw_data_packages_by_location(session)
for location, packages in deletable.items():
    print(f"{location.name}: {len(packages)} packages")
Next Steps#
Monitoring & Failure Recovery - Buffer monitoring, metrics, and alerting
Pipeline Architecture - Complete data flow including lifecycle stages
Core Concepts - Data model fundamentals (PhysicalCopy, Package concepts)
API Reference - Complete API documentation for deletion functions
See also
Related Modules:
ccat_data_transfer.deletion_manager - Deletion orchestration
ccat_data_transfer.buffer_manager - Buffer monitoring
ccat_data_transfer.staging_manager - Staging operations that create STAGED status
ccat_ops_db.models - Database models including PhysicalCopy and PackageState