
Font Animation + Speech2Text Custom Nodes for ComfyUI

Primary language: Python (MIT License)

ComfyUI-Mana-Nodes

Collection of custom nodes for ComfyUI.

Installation

Simply clone the repo into the custom_nodes directory with this command:

git clone https://github.com/ForeignGods/ComfyUI-Mana-Nodes.git

and install the requirements using:

.\python_embed\python.exe -s -m pip install -r requirements.txt --user

If you are using a venv, make sure you have it activated before installation and use:

pip install -r requirements.txt

Demo

(Animated font-animation demo GIFs and a speech2text.mp4 demo video.)

To-Do

  • Font to Image Batch Animation
  • Split Video to Frames and Audio
  • Speech-to-Text Conversion
  • SVG Loader/Animator
  • Font to Image Alpha Channel
  • Keyframe model/LoRA switcher for AnimateDiff
  • Animated transition from images to videos
  • Add font support for other languages

Nodes

font2img Node

Required Inputs

Configure the font2img node by setting the following parameters in ComfyUI:

  • font_file: Font file from the custom_nodes\ComfyUI-Mana-Nodes\font directory, e.g. example_font.ttf (supports .ttf, .otf, .woff, .woff2).
  • font_color: Color of the text. (https://www.w3.org/wiki/CSS3/Color/Extended_color_keywords)
  • background_color: Background color of the image.
  • border_color: Color of the border around the text.
  • border_width: Width of the text border.
  • shadow_color: Color of the text shadow.
  • shadow_offset_x: Horizontal offset of the shadow.
  • shadow_offset_y: Vertical offset of the shadow.
  • line_spacing: Spacing between lines of text.
  • kerning: Spacing between characters of the font.
  • padding: Padding between image border and font.
  • frame_count: Number of frames (images) to generate.
  • image_width: Width of the generated images.
  • image_height: Height of the generated images.
  • transcription_mode: Mode of text transcription ('word', 'line', 'fill').
  • text_alignment: Alignment of the text in the image.
  • text_interpolation_options: Mode of text interpolation ('strict', 'interpolation', 'cumulative').
  • text: The text to render in the images. (Ignored when the optional transcription input is given.)
  • animation_reset: Defines when the animation resets ('word', 'line', 'never').
  • animation_easing: Easing function for animation (e.g., 'linear', 'exponential').
  • animation_duration: Duration of the animation.
  • start_font_size, end_font_size: Starting and ending size of the font.
  • start_x_offset, end_x_offset, start_y_offset, end_y_offset: Offsets for text positioning.
  • start_rotation, end_rotation: Rotation angles for the text.
  • rotation_anchor_x, rotation_anchor_y: Offset of the rotation anchor point, relative to the text's initial position.

Optional Inputs

  • input_images: Text will be overlaid on input_images instead of the background_color.
  • transcription: Transcription from the speech2text node; contains a dict with timestamps, frame rate, and transcribed words.

Outputs

  • images: The generated images with the specified text and configurations.
  • transcription_framestamps: Outputs a string containing the framestamps; line breaks are calculated based on the image width. (Useful for manually correcting speech-recognition mistakes.)
    • Example: Save this output with string2file -> correct the mistakes -> remove the transcription input from font2img -> paste the corrected framestamps into the text input field of the font2img node.

Parameters Explanation

text

  • Specifies the text to be rendered on the images. Supports multiline text input for rendering on separate lines.
    • For simple text: Input the text directly as a string.
    • For frame-specific text (in modes like 'strict' or 'cumulative'): Use a JSON-like format where each line specifies a frame number and the corresponding text. Example:
      "1": "Hello",
      "10": "World",
      "20": "End"
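
Once wrapped in braces, this frame-keyed format can be parsed as ordinary JSON. A minimal sketch (the parse_frame_text helper is hypothetical, not part of the node):

```python
import json

def parse_frame_text(raw: str) -> dict:
    # Hypothetical helper: the frame-keyed text is JSON-like but lacks
    # surrounding braces, so wrap it and drop any trailing comma first.
    body = raw.strip().rstrip(",")
    entries = json.loads("{" + body + "}")
    # Keys are frame numbers as strings; convert them to ints.
    return {int(frame): text for frame, text in entries.items()}

frames = parse_frame_text('"1": "Hello",\n"10": "World",\n"20": "End"')
# frames == {1: "Hello", 10: "World", 20: "End"}
```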
      

text_interpolation_options

  • Defines the mode of text interpolation between frames.
    • strict: Text is only inserted at specified frames.
    • interpolation: Gradually interpolates text characters between frames.
    • cumulative: Text set for a frame persists until updated in a subsequent frame.
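
As a rough illustration of how 'strict' and 'cumulative' differ, here is a sketch (the resolve_text helper is hypothetical; 'interpolation' is omitted because its character-level blending depends on node internals):

```python
def resolve_text(frame_text: dict, frame: int, mode: str) -> str:
    """Return the text shown on a given frame (hypothetical sketch).

    strict:     text appears only on the exact frames it is keyed to.
    cumulative: the most recent keyed text persists until replaced.
    """
    if mode == "strict":
        return frame_text.get(frame, "")
    if mode == "cumulative":
        keyed = [f for f in sorted(frame_text) if f <= frame]
        return frame_text[keyed[-1]] if keyed else ""
    raise ValueError(f"unsupported mode: {mode}")

frame_text = {1: "Hello", 10: "World"}
resolve_text(frame_text, 5, "strict")      # "" (frame 5 has no keyed text)
resolve_text(frame_text, 5, "cumulative")  # "Hello" (persists from frame 1)
```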

start_x_offset, end_x_offset, start_y_offset, end_y_offset

  • Sets the starting and ending offsets for text positioning on the X and Y axes, allowing for text transition across the image.
  • Input as integers. Example: start_x_offset = 10, end_x_offset = 50 moves the text from 10 pixels from the left to 50 pixels from the left across frames.
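
The offset example above amounts to a linear interpolation across the frame range; a sketch (the helper name is hypothetical):

```python
def interpolate_offset(start: int, end: int, frame: int, frame_count: int) -> float:
    # Hypothetical sketch: linearly interpolate an offset across frames.
    if frame_count <= 1:
        return float(end)
    t = frame / (frame_count - 1)  # normalized progress, 0.0 .. 1.0
    return start + (end - start) * t

interpolate_offset(10, 50, 0, 5)  # first frame: 10.0
interpolate_offset(10, 50, 4, 5)  # last frame: 50.0
```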

start_rotation, end_rotation

  • Defines the starting and ending rotation angles for the text, enabling it to rotate between these angles.
  • Input as integers in degrees. Example: start_rotation = 0, end_rotation = 180 rotates the text from 0 to 180 degrees across frames.

start_font_size, end_font_size

  • Sets the starting and ending font sizes for the text, allowing the text size to dynamically change across frames.
  • Input as integers representing the font size in points. Example: start_font_size = 12, end_font_size = 24 will gradually increase the text size from 12 to 24 points across the frames.

animation_reset

  • Dictates when the animation effect resets to its starting conditions.
    • word: Resets animation with each new word.
    • line: Resets animation at the beginning of each new line of text.
    • never: The animation does not reset, but continues throughout.

animation_easing

  • Controls the pacing of the animation.
    • Examples include linear, exponential, quadratic, cubic, elastic, bounce, back, ease_in_out_sine, ease_out_back, ease_in_out_expo.
    • Each option provides a different acceleration curve for the animation, affecting how the text transitions and rotates.
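
A few of the listed easing curves, expressed as functions of normalized progress t in [0, 1]. These are the standard formulas; whether the node uses exactly these implementations is an assumption:

```python
import math

def linear(t: float) -> float:
    # Constant speed throughout.
    return t

def ease_in_out_sine(t: float) -> float:
    # Slow start and end, fastest in the middle.
    return -(math.cos(math.pi * t) - 1) / 2

def ease_in_out_expo(t: float) -> float:
    # Very slow start and end with a sharp middle transition.
    if t in (0.0, 1.0):
        return t
    return 2 ** (20 * t - 10) / 2 if t < 0.5 else (2 - 2 ** (-20 * t + 10)) / 2
```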

animation_duration

  • The length of time each animation takes to complete, measured in frames.
  • A larger value means a slower, more gradual transition, while a smaller value results in a quicker animation.

transcription_mode

  • Determines how the transcribed text is applied across frames.
    • word: Each word appears on its corresponding frame based on the transcription timestamps.
    • line: Similar to word, but text is added line by line.
    • fill: Continuously fills the frame with text, adding new words at their specific timestamps.

video2audio Node

Extracts frames and audio from a video file.

Required Inputs

  • video: Path to the video file.
  • frame_limit: Maximum number of frames to extract from the video.
  • frame_start: Starting frame number for extraction.
  • filename_prefix: Prefix for naming the extracted audio file. (relative to .\ComfyUI-Mana-Nodes)

Outputs

  • frames: Extracted frames as image tensors.
  • frame_count: Total number of frames extracted.
  • audio: Path of the extracted audio file.
  • fps: Frames per second of the video.
  • height, width: Dimensions of the extracted frames.

speech2text Node

Converts spoken words in an audio file to text using a deep learning model.

Required Inputs

  • audio: Audio file path or URL.
  • wav2vec2_model: The Wav2Vec2 model used for speech recognition. (https://huggingface.co/models?search=wav2vec2)
  • spell_check_language: Language for the spell checker.
  • framestamps_max_chars: Maximum number of characters allowed before a new framestamp line is created.

Optional Inputs

  • fps: Frames per second, used for synchronizing with video. (Default set to 30)

Outputs

  • transcription: Text transcription of the audio. (Intended to be used only as the transcription input of the font2img node.)
  • raw_string: Raw string of the transcription without timestamps.
  • framestamps_string: Frame-stamped transcription.
  • timestamps_string: Transcription with timestamps.

Example Outputs

  • raw_string: Returns the transcribed text as one line.
THE GREATEST TRICK THE DEVIL EVER PULLED WAS CONVINCING THE WORLD HE DIDN'T EXIST
  • framestamps_string: Depending on the framestamps_max_chars parameter, the line is cleared and rebuilt word by word until the character limit is reached again.
    • In this example framestamps_max_chars is set to 25.
"27": "THE",
"31": "THE GREATEST",
"43": "THE GREATEST TRICK",
"73": "THE GREATEST TRICK THE",
"77": "DEVIL",
"88": "DEVIL EVER",
"94": "DEVIL EVER PULLED",
"127": "DEVIL EVER PULLED WAS",
"133": "CONVINCING",
"150": "CONVINCING THE",
"154": "CONVINCING THE WORLD",
"167": "CONVINCING THE WORLD HE",
"171": "DIDN'T",
"178": "DIDN'T EXIST",

  • timestamps_string: Returns all transcribed words with their start_time and end_time, as a JSON-formatted string.

[
  {
    "word": "THE",
    "start_time": 0.9,
    "end_time": 0.98
  },
  {
    "word": "GREATEST",
    "start_time": 1.04,
    "end_time": 1.36
  },
  {
    "word": "TRICK",
    "start_time": 1.44,
    "end_time": 1.68
  },
  {
    "word": "THE",
    "start_time": 2.42,
    "end_time": 2.5
  },
  {
    "word": "DEVIL",
    "start_time": 2.58,
    "end_time": 2.82
  },
  {
    "word": "EVER",
    "start_time": 2.92,
    "end_time": 3.04
  },
  {
    "word": "PULLED",
    "start_time": 3.14,
    "end_time": 3.44
  },
  {
    "word": "WAS",
    "start_time": 4.22,
    "end_time": 4.34
  },
  {
    "word": "CONVINCING",
    "start_time": 4.44,
    "end_time": 4.92
  },
  {
    "word": "THE",
    "start_time": 5.0,
    "end_time": 5.06
  },
  {
    "word": "WORLD",
    "start_time": 5.12,
    "end_time": 5.42
  },
  {
    "word": "HE",
    "start_time": 5.58,
    "end_time": 5.62
  },
  {
    "word": "DIDN'T",
    "start_time": 5.7,
    "end_time": 5.88
  },
  {
    "word": "EXIST",
    "start_time": 5.94,
    "end_time": 6.28
  }
]
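
The frame numbers in framestamps_string appear to follow from these timestamps and the fps input (e.g. 0.9 s × 30 fps ≈ frame 27). A sketch of that conversion under this assumption (the timestamps_to_frames helper is hypothetical):

```python
def timestamps_to_frames(words: list, fps: float = 30.0) -> dict:
    # Assumed mapping: each word's start_time (seconds) * fps, rounded,
    # gives the frame on which the word first appears.
    return {round(w["start_time"] * fps): w["word"] for w in words}

words = [
    {"word": "THE", "start_time": 0.9, "end_time": 0.98},
    {"word": "GREATEST", "start_time": 1.04, "end_time": 1.36},
]
timestamps_to_frames(words)  # {27: 'THE', 31: 'GREATEST'}
```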

string2file Node

Writes a given string to a text file.

Required Inputs

  • string: The string to be written to the file.
  • filename_prefix: Prefix for naming the text file. (relative to .\ComfyUI-Mana-Nodes)

audio2video Node

Combines a sequence of images (frames) with an audio file to create a video.

Required Inputs

  • audio: Audio file path or URL.
  • frames: Sequence of images to be used as video frames.
  • filename_prefix: Prefix for naming the video file. (relative to .\ComfyUI-Mana-Nodes)
  • fps: Frames per second for the video.

Outputs

  • video_file_path: Path to the created video file.

Example Workflows

Font Animation

These workflows are included in the example_workflows directory:

example_workflow_1.json


example_workflow_2.json


Speech Recognition

Font Licences

  • Personal Use: The included fonts are for personal, non-commercial use. Please refrain from using these fonts in any commercial project without obtaining the appropriate licenses.
  • License Compliance: Each font may come with its own license agreement. It is the responsibility of the user to review and comply with these agreements. Some fonts may require a license for commercial use, modification, or distribution.
  • Removing Fonts: If any font creator or copyright holder wishes their font to be removed from this repository, please contact us, and we will promptly comply with your request.

Font Links

Contributing


Your contributions to improve Mana Nodes are welcome! If you have suggestions or enhancements, feel free to fork this repository, apply your changes, and create a pull request. For significant modifications or feature requests, please open an issue first to discuss what you'd like to change.