Long-form audio speaker diarization OOM in clustering

Question

remenberl opened this issue 10 months ago · 1 comments

Hi,

Thanks for the recent development of long-form audio speaker diarization in NVIDIA/NeMo#7737. I recently ran it on a 4-hour-long audio file and hit an out-of-memory (OOM) error in system RAM (not VRAM).

It happens after the last iteration of "Extracting embeddings for Diarization" is printed to the screen. The program was consuming more than 64GB of memory when I saw the job get killed. For reference, here is the log:

[NeMo I 2023-11-19 20:54:29 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-19 20:54:29 collections:445] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-19 20:54:29 collections:446] Dataset loaded with 52949 items, total duration of  7.25 hours.
[NeMo I 2023-11-19 20:54:29 collections:448] # 52949 files loaded accounting to # 1 labels

My telephonic config file:

  clustering:
    parameters:
      oracle_num_speakers: False
      max_num_speakers: 8
      enhanced_count_thres: 80
      max_rp_threshold: 0.25
      sparse_search_volume: 30
      maj_vote_spk_count: False 
      chunk_cluster_count: 50
      embeddings_per_chunk: 10000
  msdd_model:
    model_path: diar_msdd_telephonic
    parameters:
      use_speaker_model_from_ckpt: True 
      infer_batch_size: 25
      sigmoid_threshold: [0.7] 
      seq_eval_mode: False
      split_infer: True
      diar_window_length: 50
      overlap_infer_spk_limit: 5
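
For a rough sense of scale, here is a back-of-the-envelope sketch of how much RAM a dense pairwise affinity matrix would need. It rests on assumptions that are not confirmed anywhere in this thread: that clustering materializes a dense float32 N x N matrix, and that N is either embeddings_per_chunk (10000, from the config above) or the full 52949 items reported in the log (which may count segments from several scales). The helper names are hypothetical and are not part of NeMo.

```python
# Rough RAM estimate for a dense N x N affinity matrix.
# Assumption: float32 values (4 bytes each) and a fully materialized matrix;
# this has not been checked against the actual NeMo clustering code.

def dense_affinity_bytes(num_embeddings: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold a dense num_embeddings x num_embeddings matrix."""
    return num_embeddings * num_embeddings * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# The two sizes come from the config and log quoted in this issue.
for label, n in [("embeddings_per_chunk", 10_000), ("items in the log", 52_949)]:
    size = dense_affinity_bytes(n)
    print(f"{label}: n={n}, one dense float32 matrix is about {to_gib(size):.2f} GiB")
```

Under these assumptions a single matrix stays well below 64GB in both cases, so the estimate alone does not explain the kill; intermediate copies or other buffers would have to account for the rest.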

Answer 1 · 2023-11-20T02:16:01.000Z

Submitted to the wrong repo.