arcee-ai/mergekit

Base model generation time increases when passed through MergeKit

ahmedamrelhefnawy opened this issue · 0 comments

I am currently evaluating the performance efficiency of a Hugging Face model by comparing two approaches: using the model directly through the Hugging Face model class versus disassembling and reassembling its 32 layers sequentially with the passthrough method from MergeKit.

Configuration Details

Below is the YAML configuration file used for the experiment:

slices:
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [0,1]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [1,2]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [2,3]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [3,4]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [4,5]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [5,6]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [6,7]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [7,8]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [8,9]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [9,10]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [10,11]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [11,12]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [12,13]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [13,14]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [14,15]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [15,16]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [16,17]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [17,18]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [18,19]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [19,20]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [20,21]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [21,22]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [22,23]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [23,24]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [24,25]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [25,26]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [26,27]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [27,28]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [28,29]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [29,30]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [30,31]
      - sources:
        - model: internlm/internlm2_5-7b-chat
          layer_range: [31,32]
    

merge_method: passthrough
dtype: bfloat16
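
The 32 single-layer slices above follow a regular pattern, so a config of this shape can be generated instead of written by hand. Below is a minimal sketch using plain string formatting, with no YAML library. The helper name `build_passthrough_config` and its `skip` parameter are hypothetical and not part of MergeKit; `skip` is only there to illustrate dropping a layer, as in the experiment described at the end of this issue.

```python
def build_passthrough_config(model, num_layers, skip=(), dtype="bfloat16"):
    """Return YAML text with one single-layer slice per kept layer.

    Hypothetical helper for illustration; it only builds text and does
    not call MergeKit.
    """
    lines = ["slices:"]
    for i in range(num_layers):
        if i in skip:
            continue  # omit this layer from the reassembled model
        lines += [
            "  - sources:",
            f"    - model: {model}",
            f"      layer_range: [{i}, {i + 1}]",
        ]
    lines += ["merge_method: passthrough", f"dtype: {dtype}"]
    return "\n".join(lines) + "\n"


# Same structure as the config above: 32 slices, one layer each.
config_text = build_passthrough_config("internlm/internlm2_5-7b-chat", 32)

# Variant with one layer removed (layer index 5 is an arbitrary choice).
pruned_text = build_passthrough_config(
    "internlm/internlm2_5-7b-chat", 32, skip={5}
)
```

Writing `config_text` to a file should give a config with the same structure as the one used here, apart from whitespace.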

Performance Metrics

The metric used for evaluation is generation time per token, as detailed below:

  • Input of 575 tokens:

    • Direct model usage: 3.4767779807548025 seconds per token
    • MergeKit passthrough model: 4.2156252472011655 seconds per token
  • Input of 311 tokens:

    • Direct model usage: 3.32432980222387 seconds per token
    • MergeKit passthrough model: 4.17318613631828 seconds per token
  • Input of 107 tokens:

    • Direct model usage: 2.503785534783288 seconds per token
    • MergeKit passthrough model: 4.000283993042268 seconds per token
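
For reference, here is a minimal sketch of how a seconds-per-token figure like the ones above could be measured and compared. It is not the exact script used for these numbers. The names `seconds_per_token`, `fake_generate`, and `reported` are hypothetical, and `fake_generate` is a stand-in for a real generation call so that the snippet runs without a model. When timing on a GPU, the device would also need to be synchronized before reading the clock.

```python
import time
from statistics import mean


def seconds_per_token(generate, max_new_tokens, warmup=1, repeats=3):
    """Average wall-clock seconds per generated token.

    `generate(max_new_tokens)` must return the number of tokens it
    actually produced. Warm-up runs are executed but not timed.
    """
    for _ in range(warmup):
        generate(max_new_tokens)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        produced = generate(max_new_tokens)
        elapsed = time.perf_counter() - start
        samples.append(elapsed / max(produced, 1))
    return mean(samples)


def fake_generate(max_new_tokens):
    # Stand-in for a real model call (assumption for illustration only).
    time.sleep(0.001 * max_new_tokens)
    return max_new_tokens


measured = seconds_per_token(fake_generate, 5, warmup=0, repeats=2)

# Figures reported above: input length -> (direct, passthrough) in s/token.
reported = {
    575: (3.4767779807548025, 4.2156252472011655),
    311: (3.32432980222387, 4.17318613631828),
    107: (2.503785534783288, 4.000283993042268),
}
# Relative slowdown of the passthrough model for each input length.
slowdown = {n: merged / direct for n, (direct, merged) in reported.items()}
```

With the reported figures, the passthrough model is roughly 21%, 26%, and 60% slower for the 575-, 311-, and 107-token inputs respectively.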

Why does this happen, and how can I fix it?
I noticed this when I removed one layer from the model to test its performance: unexpectedly, the time per token increased instead of decreasing.