QutEcoacoustics/baw-server

Failed to copy file in harvest job

Opened this issue · 2 comments

Getting a lot of variants on this theme recently:

A RuntimeError occurred in background at 2023-11-08 22:40:12 UTC :

  Failed to move /data/harvester_to_do/harvest_428/[REDACTED].flac to any of [#<Pathname:/data/original_audio/de/deaa90c7-32e7-458c-b380-db45aae829f4.flac>]
  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/file_info.rb:106:in `copy_to_any'

-------------------------------
Backtrace:
-------------------------------

  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/file_info.rb:106:in `copy_to_any'
  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/jobs/harvest/harvest_job.rb:557:in `block in copy_file'
  /usr/local/bundle/gems/semantic_logger-4.11.0/lib/semantic_logger/base.rb:374:in `measure_internal'
  /usr/local/bundle/gems/semantic_logger-4.11.0/lib/semantic_logger/base.rb:97:in `measure_info'
  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/jobs/harvest/harvest_job.rb:556:in `copy_file'
  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/jobs/harvest/harvest_job.rb:491:in `harvest_file'
  /usr/local/bundle/gems/semantic_logger-4.11.0/lib/semantic_logger/loggable.rb:110:in `harvest_file'
  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/jobs/harvest/harvest_job.rb:233:in `process_steps'
  /usr/local/bundle/gems/semantic_logger-4.11.0/lib/semantic_logger/loggable.rb:110:in `process_steps'
  /home/baw_web/baw-server/lib/gems/baw-workers/lib/baw_workers/jobs/harvest/harvest_job.rb:171:in `perform'
  /usr/local/bundle/gems/activejob-7.0.3.1/lib/active_job/execution.rb:59:in `block in _perform_job'

The errors occasionally happen.

Originally, I thought #660 was the root cause (a double harvest item was triggering a double harvest job).
But that doesn't match up with database state.

  • we have one harvest item
  • the associated audio_recording is set and linked to audio_recording_id
  • the final result is audio_recordings.status is aborted because copy_file throws
  • this can't be a uuid reuse situation because the audio recording is already saved to the database and the unique index would have prevented duplicates
  • the files appear to exist already
    • the existing file has a different hash to the source hash
    • the existing file has a recent modification date!
    • the existing file is empty! 0 bytes!

Ok theory:

  • the file is opened
  • something fails... somehow
  • then we... somehow... attempt the copy again
    • this bit confuses me though because we'd have to retry the operation somehow
    • the job execution count is 0
    • checked logs the job only ran on one server and only ran once
    • it looks like we could be:
      def copy_to_any(source, targets)
      raise "File not found: #{source}" unless source.exist?
      raise 'No targets provided' if targets.empty?
      # try each target
      targets.each do |target|
      raise "Target is not a Pathname: #{target}" unless target.is_a?(Pathname)
      target.dirname.mkpath
      FileUtils.copy(source, target)
      exist = target.exist?
      size = target.size
      unless exist && size == source.size
      raise "Failed to copy #{source} to #{target}, exists?:#{exist}, size:#{size}, source_size:#{source.size}"
      end
      return target
      rescue StandardError => e
      BawWorkers::Config.logger_worker.warn "Failed to move #{source} to #{target}: #{e}"
      end
      # if we get here, all targets failed
      raise "Failed to move #{source} to any of #{targets}"
    • but targets is an array of one path? so no
  • ✅ which fails because the file exists (but is empty)
    • This was confirmed by the logs
    2023-11-08T22:40:10.671 W [96661:9300] [BawWorkers::Jobs::Harvest::HarvestJob] [harvest_job:item=774983:should_harvest=true] BawWorkers -- Failed to move /data/harvester_to_do/harvest_428/[REDACTED].flac to /data/original_audio/de/deaa90c7-32e7-458c-b380-db45aae829f4.flac: Failed to copy /data/harvester_to_do/harvest_428/[REDACTED] to /data/original_audio/de/deaa90c7-32e7-458c-b380-db45aae829f4.flac, exists?:true, size:0, source_size:94646884
    

This would match up with our recent media issues #663 which are also suffering because of 0 byte files. This is likely tied to our recent network storage change.

Possible courses of action:

  • change copy method (FileUtils.copy seems to provide no error information at all)
  • check to see if path exists before copying
    • diagnostically useful for this bug
    • but also allows us to confidently know we're not mutating data from another file
  • clean up after ourselves if copy fails (remove the zero byte stub)
  • check to see if this an artefact of the current file system in use by Ecosounds
  • repair files that have failed unnecessarily
    • find any status aborted recordings from the last month
    • delete the audio at the path if they are zero-bytes files
    • can i retry the harvest job?
      • looks like I can
        result = catch(:halt) {
        # if we're retrying this harvest the audio recording might already exist
        if harvest_item.audio_recording_id
        logger.debug('Audio recording already exists, fetching',
        audio_recording_id: harvest_item.audio_recording_id)
        @audio_recording = AudioRecording.find(harvest_item.audio_recording_id)
        harvest_item.audio_recording_id = audio_recording.id
        return true
        end

This is definitely a recent problem:

image

An important update: our vendor noticed partial files were copied. Can't rely on the destination being 0 bytes as a reliable indicator of this problem.