Little refactor to allow imports from python instead of cli/subprocess
wuodar opened this issue · 3 comments
Currently, there is no real way to import a deduplication algorithm and use it as a dependency in my Python code without almost totally rewriting the contents of the main block (the code under if __name__ == "__main__"). I think it would be beneficial to simply extract the logic into something like a main() function, and then in the real "main" block just construct the parser and pass the parsed parameters to main().
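To make the request concrete, here is a minimal sketch of the layout I have in mind. It assumes an argparse-based script; the names main, build_parser, and cli, the argument set, and the defaults are all hypothetical and are not taken from the current text_dedup code.

```python
import argparse
from typing import Optional, Sequence


def main(path: str, column: str = "text", batch_size: int = 10000) -> dict:
    """Hypothetical entry point: all real logic lives here, so it is importable."""
    # Placeholder for the actual deduplication work; it only echoes its inputs.
    return {"path": path, "column": column, "batch_size": batch_size}


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser; option names mirror main()'s parameters."""
    parser = argparse.ArgumentParser(description="Illustrative dedup CLI")
    parser.add_argument("--path", default="allenai/c4")  # default is for illustration only
    parser.add_argument("--column", default="text")
    parser.add_argument("--batch_size", type=int, default=10000)
    return parser


def cli(argv: Optional[Sequence[str]] = None) -> dict:
    """Parse arguments (sys.argv[1:] when argv is None) and forward them to main()."""
    args = build_parser().parse_args(argv)
    return main(**vars(args))


if __name__ == "__main__":
    # The "real main" stays tiny: parse, then delegate.
    cli()
```

With this split, library users could write `from some_module import main` and call it with plain keyword arguments, while the command line keeps working through cli().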
It is definitely possible, but I will need some time to test the scripts, as some of them depend on global variables and a specific multiprocessing setup.
Now it is possible to call each script's main function like this:
    import click
    from text_dedup.bloom_filter import main as bf_main
    from text_dedup.utils import BloomFilterArgs
    from text_dedup.utils import IOArgs
    from text_dedup.utils import MetaArgs

    ctx = click.Context(bf_main)
    ctx.invoke(
        bf_main,
        io_args=IOArgs(
            path="allenai/c4",
            name="xh",
            split="train",
            cache_dir=".cache",
            output=".temp-output",
        ),
        meta_args=MetaArgs(
            column="text",
            batch_size=10000,
        ),
        bloom_filter_args=BloomFilterArgs(),
    )
* The click context is necessary to bridge the CLI command and a direct function call.
Of course, you can always call each script with subprocess:
    import subprocess

    subprocess.run(
        [
            "python",
            "-m",
            "text_dedup.minhash",
            "--path",
            "allenai/c4",
            "--name",
            "xh",
            "--split",
            "train",
            "--cache_dir",
            ".cache",
            "--output",
            ".temp-output",
            "--column",
            "text",
            "--batch_size",
            "10000",
        ],
        capture_output=True,
        text=True,
    )
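One caveat with the subprocess route: with capture_output=True a failing script is silent unless you inspect the result yourself. A small wrapper along these lines may help. The helper name run_module is hypothetical, and using sys.executable instead of a bare "python" is my own choice so that the child process runs under the same interpreter (and virtualenv) as the caller.

```python
import subprocess
import sys


def run_module(module: str, *cli_args: str) -> str:
    """Run `python -m <module> <args>` and return its stdout, raising on failure."""
    result = subprocess.run(
        [sys.executable, "-m", module, *cli_args],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the captured stderr instead of silently discarding it.
        raise RuntimeError(
            f"{module} exited with code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout
```

It would be called with the same arguments as above, e.g. run_module("text_dedup.minhash", "--path", "allenai/c4", "--name", "xh", "--split", "train").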