ChenghaoMou/text-dedup

Little refactor to allow imports from python instead of cli/subprocess

wuodar opened this issue · 3 comments

Currently, there is no real way to import a deduplication algorithm and use it as a dependency in my Python code without almost completely rewriting the content of main (the code under if __name__ == "__main__"). I think it would be beneficial to extract the logic into something like a main() function, and then have the real "main" block just construct the parser and pass the parsed parameters to main().
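To make the proposal concrete, here is a minimal sketch of the layout I have in mind. The names build_parser and main, the argument set, and the placeholder body are all hypothetical; this is not the actual text-dedup code, just an illustration of separating the importable logic from the CLI wiring.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical arguments; the real scripts define their own options.
    parser = argparse.ArgumentParser(description="Hypothetical dedup script")
    parser.add_argument("--path", default="allenai/c4")
    parser.add_argument("--column", default="text")
    parser.add_argument("--batch_size", type=int, default=10000)
    return parser


def main(path: str, column: str = "text", batch_size: int = 10000) -> dict:
    # Placeholder for the deduplication logic that currently lives under
    # `if __name__ == "__main__"`. Here it only echoes its configuration.
    return {"path": path, "column": column, "batch_size": batch_size}


if __name__ == "__main__":
    # The CLI entry point only parses arguments and delegates to main().
    args = build_parser().parse_args()
    print(main(**vars(args)))
```

With this split, other code can do `from my_script import main` (my_script being a stand-in module name) and call main(path=...) directly, while the command-line behaviour stays the same.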

This is definitely possible, but I will need some time to test the changes, since some scripts depend on global variables and a specific multiprocessing setup.

Now it is possible to call each script's main function like this:

import click

from text_dedup.bloom_filter import main as bf_main
from text_dedup.utils import BloomFilterArgs
from text_dedup.utils import IOArgs
from text_dedup.utils import MetaArgs


# Wrap the click command in a Context so it can be invoked with keyword arguments.
ctx = click.Context(bf_main)
ctx.invoke(
    bf_main,
    io_args=IOArgs(
        path="allenai/c4",
        name="xh",
        split="train",
        cache_dir=".cache",
        output=".temp-output",
    ),
    meta_args=MetaArgs(
        column="text",
        batch_size=10000
    ),
    bloom_filter_args=BloomFilterArgs()
)

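* The click context is necessary to bridge the CLI and inline function calls: each main is a click command, and invoking it through a context lets you pass its arguments as Python keyword arguments instead of a parsed command line.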
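For anyone unfamiliar with this pattern, the same mechanism can be tried on a toy command. This is only a sketch: toy_main and its options are made up for illustration and have nothing to do with the text_dedup scripts. It assumes click is installed.

```python
import click


@click.command()
@click.option("--column", default="text")
@click.option("--batch_size", type=int, default=10000)
def toy_main(column: str, batch_size: int) -> str:
    # Hypothetical command body; it only reports the values it received.
    return f"column={column} batch_size={batch_size}"


# Build a context for the command, then invoke it with keyword arguments.
# Options that are not passed explicitly fall back to their declared defaults.
ctx = click.Context(toy_main)
result = ctx.invoke(toy_main, column="content")
print(result)
```

Because invoke returns whatever the command's function returns, the caller gets a normal Python value back without going through argument parsing or a subprocess.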

Of course, you can always call each script with subprocess:

import subprocess

subprocess.run(
    [
        "python",
        "-m",
        "text_dedup.minhash",
        "--path",
        "allenai/c4",
        "--name",
        "xh",
        "--split",
        "train",
        "--cache_dir",
        ".cache",
        "--output",
        ".temp-output",
        "--column",
        "text",
        "--batch_size",
        "10000",
    ],
    capture_output=True,
    text=True,
)
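One caveat with the subprocess route: with capture_output=True, a failing script does not raise and its error output is not shown unless you inspect the result. A small wrapper along these lines can surface failures. The run_cli helper is a hypothetical name, and the demo command is a trivial Python one-liner rather than a text_dedup script.

```python
import subprocess
import sys


def run_cli(cmd: list[str]) -> str:
    """Run a command and return its stdout, raising if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Include the captured stderr so the failure reason is not lost.
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # sys.executable ensures the child uses the same interpreter as the caller.
    print(run_cli([sys.executable, "-c", "print('ok')"]))
```

The same helper could be given the argument list shown above, with sys.executable in place of the literal "python".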
