
Font Animation + Speech2Text Custom Nodes for ComfyUI

Primary language: Python (MIT License)

ComfyUI-Mana-Nodes

Collection of custom nodes for ComfyUI.

Installation

Simply clone the repo into the custom_nodes directory with this command:

git clone https://github.com/ForeignGods/ComfyUI-Mana-Nodes.git

and install the requirements using:

.\python_embed\python.exe -s -m pip install -r requirements.txt --user

If you are using a venv, make sure you have it activated before installation and use:

pip install -r requirements.txt

Demo

(Animated font-animation demo GIFs and a speech2text.mp4 demo video.)

To-Do

  • Font to Image Batch Animation
  • Split Video to Frames and Audio
  • Speech-to-Text Conversion
  • SVG Loader/Animator
  • Font to Image Alpha Channel
  • Keyframe model/LoRA switcher for AnimateDiff
  • Animated transition from images to videos
  • Add font support for other languages

Nodes

font2img Node

Required Inputs

Configure the font2img node by setting the following parameters in ComfyUI:

  • font_file: Font file from the custom_nodes\ComfyUI-Mana-Nodes\font directory, e.g. example_font.ttf (supports .ttf, .otf, .woff, .woff2).
  • font_color: Color of the text. (https://www.w3.org/wiki/CSS3/Color/Extended_color_keywords)
  • background_color: Background color of the image.
  • border_color: Color of the border around the text.
  • border_width: Width of the text border.
  • shadow_color: Color of the text shadow.
  • shadow_offset_x: Horizontal offset of the shadow.
  • shadow_offset_y: Vertical offset of the shadow.
  • line_spacing: Spacing between lines of text.
  • kerning: Spacing between characters of the font.
  • padding: Padding between image border and font.
  • frame_count: Number of frames (images) to generate.
  • image_width: Width of the generated images.
  • image_height: Height of the generated images.
  • transcription_mode: Mode of text transcription ('word', 'line', 'fill').
  • text_alignment: Alignment of the text in the image.
  • text_interpolation_options: Mode of text interpolation ('strict', 'interpolation', 'cumulative').
  • text: The text to render in the images. (Ignored when the optional transcription input is given.)
  • animation_reset: Defines when the animation resets ('word', 'line', 'never').
  • animation_easing: Easing function for animation (e.g., 'linear', 'exponential').
  • animation_duration: Duration of the animation.
  • start_font_size, end_font_size: Starting and ending size of the font.
  • start_x_offset, end_x_offset, start_y_offset, end_y_offset: Offsets for text positioning.
  • start_rotation, end_rotation: Rotation angles for the text.
  • rotation_anchor_x, rotation_anchor_y: Offset of the rotation anchor point, relative to the text's initial position.

Optional Inputs

  • input_images: Text will be overlaid on input_images instead of the background_color.
  • transcription: Transcription from the speech2text node; contains a dict with timestamps, frame rate, and transcribed words.

Outputs

  • images: The generated images with the specified text and configurations.
  • transcription_framestamps: Outputs a string containing the framestamps; line breaks are calculated based on the image width. (Useful for manually correcting speech-recognition mistakes.)
    • Example: Save this output with string2file -> correct the mistakes -> remove the transcription input from font2img -> paste the corrected framestamps into the text input field of the font2img node.

Parameters Explanation

text

  • Specifies the text to be rendered on the images. Supports multiline text input for rendering on separate lines.
    • For simple text: Input the text directly as a string.
    • For frame-specific text (in modes like 'strict' or 'cumulative'): Use a JSON-like format where each line specifies a frame number and the corresponding text. Example:
      "1": "Hello",
      "10": "World",
      "20": "End"
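
Once wrapped in braces, this frame-keyed format can be parsed as ordinary JSON. A minimal sketch (the parse_frame_text helper is hypothetical, not part of the node):

```python
import json

def parse_frame_text(raw: str) -> dict:
    # Hypothetical helper: the frame-keyed text is JSON-like but lacks
    # surrounding braces, so wrap it and drop any trailing comma first.
    body = raw.strip().rstrip(",")
    entries = json.loads("{" + body + "}")
    # Keys are frame numbers as strings; convert them to ints.
    return {int(frame): text for frame, text in entries.items()}

frames = parse_frame_text('"1": "Hello",\n"10": "World",\n"20": "End"')
# frames == {1: "Hello", 10: "World", 20: "End"}
```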
      

text_interpolation_options

  • Defines the mode of text interpolation between frames.
    • strict: Text is only inserted at specified frames.
    • interpolation: Gradually interpolates text characters between frames.
    • cumulative: Text set for a frame persists until updated in a subsequent frame.
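
As a rough illustration of how 'strict' and 'cumulative' differ, here is a sketch (the resolve_text helper is hypothetical; 'interpolation' is omitted because its character-level blending depends on node internals):

```python
def resolve_text(frame_text: dict, frame: int, mode: str) -> str:
    """Return the text shown on a given frame (hypothetical sketch).

    strict:     text appears only on the exact frames it is keyed to.
    cumulative: the most recent keyed text persists until replaced.
    """
    if mode == "strict":
        return frame_text.get(frame, "")
    if mode == "cumulative":
        keyed = [f for f in sorted(frame_text) if f <= frame]
        return frame_text[keyed[-1]] if keyed else ""
    raise ValueError(f"unsupported mode: {mode}")

frame_text = {1: "Hello", 10: "World"}
resolve_text(frame_text, 5, "strict")      # "" (frame 5 has no keyed text)
resolve_text(frame_text, 5, "cumulative")  # "Hello" (persists from frame 1)
```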

start_x_offset, end_x_offset, start_y_offset, end_y_offset

  • Sets the starting and ending offsets for text positioning on the X and Y axes, allowing for text transition across the image.
  • Input as integers. Example: start_x_offset = 10, end_x_offset = 50 moves the text from 10 pixels from the left to 50 pixels from the left across frames.
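
The offset example above amounts to a linear interpolation across the frame range; a sketch (the helper name is hypothetical):

```python
def interpolate_offset(start: int, end: int, frame: int, frame_count: int) -> float:
    # Hypothetical sketch: linearly interpolate an offset across frames.
    if frame_count <= 1:
        return float(end)
    t = frame / (frame_count - 1)  # normalized progress, 0.0 .. 1.0
    return start + (end - start) * t

interpolate_offset(10, 50, 0, 5)  # first frame: 10.0
interpolate_offset(10, 50, 4, 5)  # last frame: 50.0
```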

start_rotation, end_rotation

  • Defines the starting and ending rotation angles for the text, enabling it to rotate between these angles.
  • Input as integers in degrees. Example: start_rotation = 0, end_rotation = 180 rotates the text from 0 to 180 degrees across frames.

start_font_size, end_font_size

  • Sets the starting and ending font sizes for the text, allowing the text size to dynamically change across frames.
  • Input as integers representing the font size in points. Example: start_font_size = 12, end_font_size = 24 will gradually increase the text size from 12 to 24 points across the frames.

animation_reset

  • Dictates when the animation effect resets to its starting conditions.
    • word: Resets animation with each new word.
    • line: Resets animation at the beginning of each new line of text.
    • never: The animation does not reset, but continues throughout.

animation_easing

  • Controls the pacing of the animation.
    • Examples include linear, exponential, quadratic, cubic, elastic, bounce, back, ease_in_out_sine, ease_out_back, ease_in_out_expo.
    • Each option provides a different acceleration curve for the animation, affecting how the text transitions and rotates.
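
A few of the listed easing curves, expressed as functions of normalized progress t in [0, 1]. These are the standard formulas; whether the node uses exactly these implementations is an assumption:

```python
import math

def linear(t: float) -> float:
    # Constant speed throughout.
    return t

def ease_in_out_sine(t: float) -> float:
    # Slow start and end, fastest in the middle.
    return -(math.cos(math.pi * t) - 1) / 2

def ease_in_out_expo(t: float) -> float:
    # Very slow start and end with a sharp middle transition.
    if t in (0.0, 1.0):
        return t
    return 2 ** (20 * t - 10) / 2 if t < 0.5 else (2 - 2 ** (-20 * t + 10)) / 2
```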

animation_duration

  • The length of time each animation takes to complete, measured in frames.
  • A larger value means a slower, more gradual transition, while a smaller value results in a quicker animation.

transcription_mode

  • Determines how the transcribed text is applied across frames.
    • word: Each word appears on its corresponding frame based on the transcription timestamps.
    • line: Similar to word, but text is added line by line.
    • fill: Continuously fills the frame with text, adding new words at their specific timestamps.

video2audio Node

Extracts frames and audio from a video file.

Required Inputs

  • video: Path to the video file.
  • frame_limit: Maximum number of frames to extract from the video.
  • frame_start: Starting frame number for extraction.
  • filename_prefix: Prefix for naming the extracted audio file. (relative to .\ComfyUI-Mana-Nodes)

Outputs

  • frames: Extracted frames as image tensors.
  • frame_count: Total number of frames extracted.
  • audio: Path of the extracted audio file.
  • fps: Frames per second of the video.
  • height, width: Dimensions of the extracted frames.

speech2text Node

Converts spoken words in an audio file to text using a deep learning model.

Required Inputs

  • audio: Audio file path or URL.
  • wav2vec2_model: The Wav2Vec2 model used for speech recognition. (https://huggingface.co/models?search=wav2vec2)
  • spell_check_language: Language for the spell checker.
  • framestamps_max_chars: Maximum number of characters allowed before a new framestamp line is created.

Optional Inputs

  • fps: Frames per second, used for synchronizing with video. (Default set to 30)

Outputs

  • transcription: Text transcription of the audio. (Intended to be used only as the transcription input of the font2img node.)
  • raw_string: Raw string of the transcription without timestamps.
  • framestamps_string: Frame-stamped transcription.
  • timestamps_string: Transcription with timestamps.

Example Outputs

  • raw_string: Returns the transcribed text as one line.
THE GREATEST TRICK THE DEVIL EVER PULLED WAS CONVINCING THE WORLD HE DIDN'T EXIST
  • framestamps_string: Depending on the framestamps_max_chars parameter, the line is cleared and rebuilt word by word until the character limit is reached again.
    • In this example framestamps_max_chars is set to 25.
"27": "THE",
"31": "THE GREATEST",
"43": "THE GREATEST TRICK",
"73": "THE GREATEST TRICK THE",
"77": "DEVIL",
"88": "DEVIL EVER",
"94": "DEVIL EVER PULLED",
"127": "DEVIL EVER PULLED WAS",
"133": "CONVINCING",
"150": "CONVINCING THE",
"154": "CONVINCING THE WORLD",
"167": "CONVINCING THE WORLD HE",
"171": "DIDN'T",
"178": "DIDN'T EXIST",

  • timestamps_string: Returns all transcribed words with their start_time and end_time, as a JSON-formatted string.

[
  {
    "word": "THE",
    "start_time": 0.9,
    "end_time": 0.98
  },
  {
    "word": "GREATEST",
    "start_time": 1.04,
    "end_time": 1.36
  },
  {
    "word": "TRICK",
    "start_time": 1.44,
    "end_time": 1.68
  },
  {
    "word": "THE",
    "start_time": 2.42,
    "end_time": 2.5
  },
  {
    "word": "DEVIL",
    "start_time": 2.58,
    "end_time": 2.82
  },
  {
    "word": "EVER",
    "start_time": 2.92,
    "end_time": 3.04
  },
  {
    "word": "PULLED",
    "start_time": 3.14,
    "end_time": 3.44
  },
  {
    "word": "WAS",
    "start_time": 4.22,
    "end_time": 4.34
  },
  {
    "word": "CONVINCING",
    "start_time": 4.44,
    "end_time": 4.92
  },
  {
    "word": "THE",
    "start_time": 5.0,
    "end_time": 5.06
  },
  {
    "word": "WORLD",
    "start_time": 5.12,
    "end_time": 5.42
  },
  {
    "word": "HE",
    "start_time": 5.58,
    "end_time": 5.62
  },
  {
    "word": "DIDN'T",
    "start_time": 5.7,
    "end_time": 5.88
  },
  {
    "word": "EXIST",
    "start_time": 5.94,
    "end_time": 6.28
  }
]
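
The frame numbers in framestamps_string appear to follow from these timestamps and the fps input (e.g. 0.9 s × 30 fps ≈ frame 27). A sketch of that conversion under this assumption (the timestamps_to_frames helper is hypothetical):

```python
def timestamps_to_frames(words: list, fps: float = 30.0) -> dict:
    # Assumed mapping: each word's start_time (seconds) * fps, rounded,
    # gives the frame on which the word first appears.
    return {round(w["start_time"] * fps): w["word"] for w in words}

words = [
    {"word": "THE", "start_time": 0.9, "end_time": 0.98},
    {"word": "GREATEST", "start_time": 1.04, "end_time": 1.36},
]
timestamps_to_frames(words)  # {27: 'THE', 31: 'GREATEST'}
```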

string2file Node

Writes a given string to a text file.

Required Inputs

  • string: The string to be written to the file.
  • filename_prefix: Prefix for naming the text file. (relative to .\ComfyUI-Mana-Nodes)

audio2video Node

Combines a sequence of images (frames) with an audio file to create a video.

Required Inputs

  • audio: Audio file path or URL.
  • frames: Sequence of images to be used as video frames.
  • filename_prefix: Prefix for naming the video file. (relative to .\ComfyUI-Mana-Nodes)
  • fps: Frames per second for the video.

Outputs

  • video_file_path: Path to the created video file.

Example Workflows

Font Animation

These workflows are included in the example_workflows directory:

example_workflow_1.json


example_workflow_2.json


Speech Recognition

Font Licences

  • Personal Use: The included fonts are for personal, non-commercial use. Please refrain from using these fonts in any commercial project without obtaining the appropriate licenses.
  • License Compliance: Each font may come with its own license agreement. It is the responsibility of the user to review and comply with these agreements. Some fonts may require a license for commercial use, modification, or distribution.
  • Removing Fonts: If any font creator or copyright holder wishes their font to be removed from this repository, please contact us, and we will promptly comply with your request.

Font Links

Contributing


Your contributions to improve Mana Nodes are welcome! If you have suggestions or enhancements, feel free to fork this repository, apply your changes, and create a pull request. For significant modifications or feature requests, please open an issue first to discuss what you'd like to change.