Star Trek Transcript Search

Transcripts of all of Star Trek and some commands to search them

getting started

Clone this repo somewhere convenient, then change into its scripts directory:

$ git clone https://github.com/varenc/star_trek_transcript_search
$ cd star_trek_transcript_search/scripts
$ ls
DS9		Discovery	Enterprise	Movies		NextGen		TAS		TOS		Voyager

search commands

Now search the transcripts. Here's the most basic command for searching:

grep -rin "<search term>" .

For example:

$ grep -rin "terraform.*venus" .
./DS9/200457.txt:401:O'BRIEN: All of it. The Utopia Planitia yards on Mars, the terraforming stations on Venus, Starfleet Headquarters. I'm not detecting a single sign of Starfleet activity anywhere in this sector.
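In the grep command above, -r searches directories recursively, -i ignores case, and -n prints the line number of each match. As a small sketch of how to narrow a search, the helper below searches a single series directory and prints a few lines of context around each match using grep's -C option. The name trekGrepIn is made up for this example and is not part of the repo.

```shell
# Hypothetical helper (not part of the repo): search one series directory
# and show context lines around each match.
# Usage: trekGrepIn <series-dir> <pattern> [context-lines]
trekGrepIn() {
	# -r recurse, -i ignore case, -n show line numbers,
	# -C N print N lines of context before and after each match (default 1)
	grep -rin -C "${3:-1}" -- "$2" "$1"
}
```

For example, run from the scripts directory, trekGrepIn DS9 "terraform.*venus" 2 would search only the DS9 transcripts and show two lines of dialogue on either side of each match.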

For a better experience, use The Silver Searcher (ag) instead of grep:

$ ag "terraform.*venus" .
DS9/200457.txt
401:O'BRIEN: All of it. The Utopia Planitia yards on Mars, the terraforming stations on Venus, Starfleet Headquarters. I'm not detecting a single sign of Starfleet activity anywhere in this sector.

To make this easy to reuse, add this function to your .bashrc or .zshrc:

function trekLines() {
	# Run in a subshell so your current directory is left unchanged.
	(cd /path/to/star_trek_transcript_search/scripts/ && ag "${1}" .)
}

Then just call trekLines "terraform.*venus" to do a search.
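If ag is not installed everywhere you work, one option is a variant that falls back to grep. This is only a sketch: the function name trekLinesAny and the TREK_DIR variable are assumptions made for this example, and the output format differs slightly between the two tools.

```shell
# Assumed variable: where the transcripts live. Adjust to your checkout.
TREK_DIR="${TREK_DIR:-/path/to/star_trek_transcript_search/scripts}"

# Hypothetical variant of trekLines: use ag when available, otherwise grep.
trekLinesAny() {
	(
		# Subshell, so the caller's working directory is not changed.
		cd "$TREK_DIR" || exit 1
		if command -v ag >/dev/null 2>&1; then
			ag "$1" .
		else
			grep -rin -- "$1" .
		fi
	)
}
```

Call it the same way as trekLines, for example trekLinesAny "terraform.*venus".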

There is a lot more you can do as well, such as counting the number of lines per character per episode, or finding the episode where each character spoke the fewest words. More examples of how to do that may be added later.
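As a rough sketch of the first idea, the function below counts spoken lines per character in a single transcript file using only grep, sed, sort and uniq. It assumes speaker lines start with an upper-case name followed by a colon, optionally with a bracketed tag such as [OC] in between; the name trekLineCounts is made up for this example.

```shell
# Hypothetical helper: count spoken lines per character in one transcript.
# Assumes lines look like "NAME: text" or "NAME [OC]: text".
trekLineCounts() {
	# 1. Extract the speaker label at the start of each line.
	# 2. Strip the optional bracketed tag and the trailing colon.
	# 3. Count identical names and sort by count, highest first.
	grep -oE "^[[:upper:]][[:upper:]' .-]*( \[[^]]*\])?:" "$1" \
		| sed -E 's/( \[[^]]*\])?:$//' \
		| sort | uniq -c | sort -rn
}
```

To get counts per episode for a whole series, loop over the files, for example: for f in DS9/*.txt; do echo "== $f"; trekLineCounts "$f"; done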

fun examples

Get the average words per episode for each series:

$ printf "%-15s %-12s %-12s %-12s\n" "SERIES" "TOTAL_WORDS" "EPISODES" "WORD_PER_EP"; for f in *; do
	W=$(cat "$f"/*.txt | wc -w)
	E=$(ls "$f"/*.txt | wc -l)
	printf "%-15s %-12s %-12s %-12s\n" "$f" $W $E $((W / E))
done
SERIES          TOTAL_WORDS  EPISODES     WORD_PER_EP
DS9             949778       173          5490
Discovery       69637        15           4642
Enterprise      472007       97           4866
Movies          93772        10           9377
NextGen         908469       176          5161
TAS             67066        22           3048
TOS             423886       79           5365
Voyager         960510       160          6003

(Note: these counts are only approximate, since the transcripts include some descriptions of what's happening on screen and the name of each speaker. Running this on subtitles instead of transcripts would be more accurate.)
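One way to get a little closer to spoken words only is to strip the speaker labels and stage directions before counting. The sketch below assumes that speaker labels look like "NAME:" or "NAME [OC]:" at the start of a line, that locations appear in square brackets, and that on-screen descriptions appear in parentheses. Those are assumptions about the transcript format, so check a few files before trusting the numbers. The name trekSpokenWords is made up for this example.

```shell
# Hypothetical helper: count words after removing speaker labels and
# bracketed or parenthesised descriptions. Accepts one or more files.
trekSpokenWords() {
	sed -E \
		-e "s/^[[:upper:]][[:upper:]' .-]*( \[[^]]*\])?: *//" \
		-e 's/\[[^]]*\]//g' \
		-e 's/\([^)]*\)//g' \
		"$@" | wc -w
}
```

For example, trekSpokenWords DS9/*.txt prints a single total for the series, which can be compared with the TOTAL_WORDS column above.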

Make a function to find the episodes where a character has the fewest lines, and then run it on Worf and then Tom Paris:

$ trekQuietestEpisodesFor () {
	limit_num=${2:-10}
	for p in "$1"
	do
		echo -n "\n\n========= $p ========"
		for f in $(ag -ti "${p} ?(\[[\w\s]+\])?:" --count | sed "s/txt:/txt /"  | sort -nr -k 2 -k 1  | tail -n $limit_num | cut -d ' ' -f 1)
		do
			echo "\n=== Episode ${f:r} ==="
			cat "$f" | ag -ti "${p} ?(\[[\w\s]+\])?:"
		done
	done
}
$ trekQuietestEpisodesFor Worf 3

========= Worf ========
=== Episode DS9/200510 ===
WORF: Constable. Why are you talking to your beverage?

=== Episode DS9/200507 ===
WORF: We found him on top of the mountain, slumped over a subspace transmitter.

=== Episode DS9/200493 ===
WORF: I would.

$ trekQuietestEpisodesFor Paris 2

========= Paris ========
=== Episode Voyager/300622 ===
PARIS: Yes, ma'am.

=== Episode Voyager/300225 ===
PARIS: I'm picking up a lot of plasmatic turbulence in there. It might be a bumpy ride.

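(Note: The above requires ag, and zsh rather than bash: it relies on zsh's ${f:r} modifier to strip the file extension and on zsh's echo interpreting \n as a newline.)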
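If you would rather not depend on zsh or ag, here is a rough sketch of a similar function that should run in bash as well. It is an adaptation rather than a drop-in replacement: the name trekQuietestEpisodesPortable is made up, it assumes a grep that supports --include (GNU and BSD grep both do), and it lists episodes in ascending order of line count, so the order can differ from the output above.

```shell
# Hypothetical bash-compatible variant; run it from the scripts directory.
# Usage: trekQuietestEpisodesPortable <character> [number-of-episodes]
trekQuietestEpisodesPortable() {
	local name limit pattern count entry f ep
	name=$1
	limit=${2:-10}
	# Speaker label at line start, e.g. "WORF:" or "WORF [OC]:"
	pattern="^${name}( \[[^]]*\])?:"
	printf '\n========= %s ========\n' "$name"
	# Count matching lines per file, drop files with no matches,
	# then keep the files with the lowest counts.
	grep -rciE --include='*.txt' -- "$pattern" . \
		| awk -F: '$NF > 0 { print $NF, $0 }' \
		| sort -n | head -n "$limit" \
		| while read -r count entry; do
			f=${entry%:*}      # strip the trailing ":count"
			ep=${f#./}         # strip the leading "./"
			ep=${ep%.txt}      # strip the extension
			printf '\n=== Episode %s ===\n' "$ep"
			grep -iE -- "$pattern" "$f"
		done
}
```

Usage matches the original, for example trekQuietestEpisodesPortable Worf 3.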