For training and evaluating a model, feel free to check out this getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
Create a conda environment and activate it via
```sh
conda create -n modalities python=3.10
conda activate modalities
```
then, install the repository via
```sh
pip install -e .
```
If you want to contribute, have a look at `CONTRIBUTING.md`.
For running the training endpoint on multiple GPUs, run
```sh
CUDA_VISIBLE_DEVICES=2,3 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 src/modalities/__main__.py run --config_file_path config_files/config.yaml
```
Or, if you are a VS Code user, add this to your `launch.json`:
```json
{
    "name": "Torchrun Main",
    "type": "python",
    "request": "launch",
    "module": "torch.distributed.run",
    "env": {
        "CUDA_VISIBLE_DEVICES": "0"
    },
    "args": [
        "--nnodes",
        "1",
        "--nproc_per_node",
        "2",
        "--rdzv-endpoint=0.0.0.0:29503",
        "src/modalities/__main__.py",
        "run",
        "--config_file_path",
        "config_files/config.yaml"
    ],
    "console": "integratedTerminal",
    "justMyCode": true,
    "envFile": "${workspaceFolder}/.env"
}
```
The mechanism introduced to instantiate classes via `type_hint` in the `config.yaml` utilizes
- OmegaConf to load the config YAML file
- Pydantic for the validation of the config
- ClassResolver to instantiate the correct, concrete class of a class hierarchy.
Firstly, OmegaConf loads the config YAML file and resolves internal references such as `${subconfig.attribute}`.
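For illustration only, the following minimal snippet (with made-up attribute names, not taken from the repository's config files) shows how OmegaConf resolves such an internal reference:
```python
from omegaconf import OmegaConf

# Hypothetical config fragment; the attribute names are invented for this example.
yaml_str = """
data:
  sequence_length: 1024
model:
  block_size: ${data.sequence_length}  # internal reference, resolved by OmegaConf
"""

cfg = OmegaConf.create(yaml_str)
print(OmegaConf.to_container(cfg, resolve=True)["model"]["block_size"])  # -> 1024
```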
Then, Pydantic validates the whole config as-is and checks that each of the sub-configs is a `pydantic.BaseModel` class.
For configs that allow different concrete classes to be instantiated by `ClassResolver`, the special member names `type_hint` and `config` are introduced.
With this, we utilize Pydantic's ability to auto-select a fitting type based on the keys in the config YAML file.
`ClassResolver` replaces large if-else control structures for inferring the correct concrete type, with a `type_hint` used for class selection:
```python
from class_resolver import ClassResolver
from torch import nn

activation_resolver = ClassResolver(
    [nn.ReLU, nn.Tanh, nn.Hardtanh],
    base=nn.Module,
    default=nn.ReLU,
)

type_hint = "ReLU"
activation_kwargs = {}  # keyword arguments passed to the selected activation class
activation = activation_resolver.make(type_hint, activation_kwargs)
```
In our implementation we go a step further, as both
- a `type_hint` in a `BaseModel` config must be of type `modalities.config.lookup_types.LookupEnum`, and
- `config` is a union of allowed concrete configs of base type `BaseModel`.

`config` hereby replaces `activation_kwargs` from the example above with pydantic-validated `BaseModel` configs.
With this, a mapping between the type hint strings needed for `class-resolver` and the concrete classes is introduced, while allowing Pydantic to select the correct concrete config:
```python
import torch
from enum import Enum
from pydantic import BaseModel, PositiveInt, PositiveFloat, conint, confloat


class LookupEnum(Enum):
    @classmethod
    def _missing_(cls, value: str) -> type:
        """constructs Enum by member name, if not constructable by value"""
        return cls.__dict__[value]


class SchedulerTypes(LookupEnum):
    StepLR = torch.optim.lr_scheduler.StepLR
    ConstantLR = torch.optim.lr_scheduler.ConstantLR


class StepLRConfig(BaseModel):
    step_size: conint(ge=1)
    gamma: confloat(ge=0.0)


class ConstantLRConfig(BaseModel):
    factor: PositiveFloat
    total_iters: PositiveInt


class SchedulerConfig(BaseModel):
    type_hint: SchedulerTypes
    config: StepLRConfig | ConstantLRConfig
```
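As a sketch of how this plays out (the concrete values below are made up for illustration), a scheduler sub-config with the special `type_hint` and `config` members could then be validated as follows, with Pydantic picking the matching member of the union:
```python
# Hypothetical scheduler sub-config as it might appear after loading the YAML file;
# the concrete values are invented for this illustration.
scheduler_dict = {
    "type_hint": "StepLR",
    "config": {"step_size": 10, "gamma": 0.1},
}

scheduler_config = SchedulerConfig(**scheduler_dict)
print(scheduler_config.type_hint)      # SchedulerTypes.StepLR (resolved via LookupEnum._missing_)
print(type(scheduler_config.config))   # StepLRConfig, auto-selected by Pydantic from the union
```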
To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry`, and `build_component_by_config` is introduced as a convenience function. Dependencies can be passed through with the `extra_kwargs` argument:
```python
resolvers = ResolverRegister(config=config)
optimizer = ...  # our example dependency
scheduler = resolvers.build_component_by_config(config=config.scheduler, extra_kwargs=dict(optimizer=optimizer))
```
To add a new resolver, use `add_resolver`; the added resolver will then be accessible via the `register_key` given during adding.
For access, use the `build_component_by_key_query` function of the `ResolverRegistry`.
We use click as a tool to add new entry points and their CLI arguments. For this, we have a main entry point from which all other entry points are started.
The main entry point is `src/modalities/__main__.py:main()`.
We register other sub-entrypoints via our main `click.group`, called `main`, as follows:
```python
@main.command(name="my_new_entry_point")
```
See the following full example:
```python
from pathlib import Path

import click
import click_pathlib


@click.group()
def main() -> None:
    pass


config_option = click.option(
    "--config_file_path",
    type=click_pathlib.Path(exists=False),
    required=True,
    help="Path to a file with the YAML config file.",
)


@main.command(name="do_stuff")
@config_option
@click.option(
    "--my_cli_argument",
    type=int,
    required=True,
    help="New integer argument",
)
def entry_point_do_stuff(config_file_path: Path, my_cli_argument: int):
    print(f"Do stuff with {config_file_path} and {my_cli_argument}...")
    ...


if __name__ == "__main__":
    main()
```
With
```toml
[project.scripts]
modalities = "modalities.__main__:main"
```
in our `pyproject.toml`, we can start only `main` with `modalities` (which does nothing), or a specific sub-entrypoint, e.g. `modalities do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
Alternatively, directly use `src/modalities/__main__.py do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
The `MemMapDataset` requires an index file providing the necessary pointers into the raw data file. The `MemMapDataset` can create the index file lazily; however, it is advised to create it beforehand. This can be done by running
```sh
modalities create_memmap_index <path/to/jsonl/file>
```
The index will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities create_memmap_index --help`.
The `PackedMemMapDatasetContinuous` and `PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a `MemMapDataset` index file as described above. Assuming the index and raw data are located in the same directory, you can simply execute the following command:
```sh
modalities create_packed_data <path/to/jsonl/file>
```
The packed data file will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities create_packed_data --help`.
The packed data file is a bytestream containing both the tokenized data as well as an index denoting the start and length of the tokenized documents inside the bytestream. The data file consists of 3 concatenated parts:

header segment | data segment | index segment

- header segment: This section is an 8-byte integer which encodes the length of the data segment in bytes.
- data segment: This section contains a concatenation of all documents in the form of 4-byte tokens. An end-of-sequence token is placed between consecutive documents.
- index segment: This section contains a pickled index which locates the documents inside the data segment. The index is basically a list of tuples, where each tuple contains the start position and length in bytes for the corresponding document, e.g., `[(start_doc1, len_doc1), (start_doc2, len_doc2), ....]`.
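As a rough illustration of this layout (this is not the actual modalities reader; the file name, byte order, and whether the index offsets are absolute or relative to the data segment are assumptions here), such a file could be parsed along these lines:
```python
import pickle

# Illustrative sketch only; constants, byte order, and offset semantics are assumptions.
HEADER_SIZE_IN_BYTES = 8   # header: length of the data segment in bytes
TOKEN_SIZE_IN_BYTES = 4    # data segment: concatenation of 4-byte tokens

with open("example_packed_data.bin", "rb") as f:  # hypothetical file name
    raw = f.read()

# header segment
data_segment_length = int.from_bytes(raw[:HEADER_SIZE_IN_BYTES], byteorder="little")

# data segment
data_start = HEADER_SIZE_IN_BYTES
data_segment = raw[data_start:data_start + data_segment_length]

# index segment: pickled list of (start, length) tuples in bytes
index = pickle.loads(raw[data_start + data_segment_length:])

# decode the first document back into token ids (assuming offsets are relative to the data segment)
start, length = index[0]
doc_bytes = data_segment[start:start + length]
tokens = [
    int.from_bytes(doc_bytes[i:i + TOKEN_SIZE_IN_BYTES], byteorder="little")
    for i in range(0, len(doc_bytes), TOKEN_SIZE_IN_BYTES)
]
print(f"first document has {len(tokens)} tokens")
```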