clipperhouse/jargon

Pipe from Stdin instead of fetching the bytes

clipperhouse opened this issue · 2 comments

Currently, the jargon command line takes its input from a source specified via one of these flags:

  -f string
    	A file path to lemmatize
  -s string
    	A (quoted) string to lemmatize
  -u string
    	A URL to fetch and lemmatize

It occurs to me that jargon would play better with other tools simply by accepting Stdin.

There are already fine tools for reading files (cat) and fetching URLs (curl). jargon should just accept bytes piped from other tools.

Files

cat file.txt | jargon

replaces

jargon -f file.txt

URLs

curl https://example.com | jargon

replaces

jargon -u https://example.com

Strings

echo "I luv Rails" | jargon

replaces

jargon -s "I luv Rails"

@kevin-montrose suggested leaving both options open: support Stdin but also keep the flags. The theory is that the shell piping might be a perf hit vs a direct file read by jargon itself.

On my machine, I timed it both ways with a 22MB file, two runs each:

time jargon -f ~/Downloads/cities1000.txt > /dev/null

real	0m2.470s
user	0m2.478s
sys	0m0.038s

time jargon -f ~/Downloads/cities1000.txt > /dev/null

real	0m2.460s
user	0m2.476s
sys	0m0.034s

time cat ~/Downloads/cities1000.txt | jargon > /dev/null

real	0m2.443s
user	0m2.466s
sys	0m0.049s

time cat ~/Downloads/cities1000.txt | jargon > /dev/null

real	0m2.450s
user	0m2.473s
sys	0m0.049s

I don’t see a significant difference, though of course this is just my machine, and the test is not super rigorous.

New branch that allows both Stdin and flags: https://github.com/clipperhouse/jargon/compare/stdin-flags