Parser and Elasticsearch indexer for Leo Laporte & Steve Gibson's radio show transcripts

elasticsn

This is an ETL-like program that parses the transcripts of Leo Laporte & Steve Gibson's podcast Security Now and indexes them in Elasticsearch.

Once indexed, the text can be searched for terms and phrases, and statistics can be extracted from it.

e.g.: When was the first time someone said "complexity is the enemy of security" on the show? In which episodes has it been repeated?

/tmp/phrase.json:

{
  "script_fields": {
    "airedOn": {
      "script": {
        "source": "new Date(doc['header.date'].getValue())"
      }
    }
  },
  "_source": ["header.number", "header.audio"],
  "sort": [
    {
      "header.date": {
        "order": "asc"
      }
    }
  ],
  "query": {
    "nested": {
      "path": "text",
      "query": {
        "match_phrase": {
          "text.line": "complexity is the enemy of security"
        }
      },
      "inner_hits": {}
    }
  }
}

Client request using curl:

curl -u USER:PASS \
    -X GET https://HOST:PORT/securitynow/episode/_search \
    -d @/tmp/phrase.json | \
    jq '.hits.hits[] | { episode: ._source.header.number, airedOn: .fields.airedOn[0], saidBy: .inner_hits.text.hits.hits[0]._source.speaker, phrase: .inner_hits.text.hits.hits[0]._source.line}'

...and BOOM!:

results
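The request body above is fixed to one phrase. As a minimal sketch (not part of the project), the same body could be generated for any phrase with plain Scala. The `PhraseQuery` object and its `escape` and `build` helpers are hypothetical names introduced for illustration; the field names are the ones used in the query above.

```scala
object PhraseQuery {
  // Escape the characters that must not appear raw inside a JSON string.
  def escape(s: String): String = s.flatMap {
    case '"'          => "\\\""
    case '\\'         => "\\\\"
    case '\n'         => "\\n"
    case c if c < ' ' => "\\u%04x".format(c.toInt)
    case c            => c.toString
  }

  // Build the nested match_phrase request body for an arbitrary phrase.
  def build(phrase: String): String =
    s"""{
       |  "script_fields": {
       |    "airedOn": {
       |      "script": { "source": "new Date(doc['header.date'].getValue())" }
       |    }
       |  },
       |  "_source": ["header.number", "header.audio"],
       |  "sort": [{ "header.date": { "order": "asc" } }],
       |  "query": {
       |    "nested": {
       |      "path": "text",
       |      "query": { "match_phrase": { "text.line": "${escape(phrase)}" } },
       |      "inner_hits": {}
       |    }
       |  }
       |}""".stripMargin
}
```

Writing the result of `PhraseQuery.build("complexity is the enemy of security")` to /tmp/phrase.json should give a body equivalent to the one shown earlier.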

e.g.: Who are the show hosts? Who intervenes more often? Who is more talkative?

Just use Kibana!

kibana stats

Leo and Steve are the main hosts, and they speak a similar number of sentences per podcast. However, the word count per speaker makes it obvious that Steve leads the discourse in each session.

Indices created by the ETL Scala program

All relevant episode information is stored in the securitynow index. A second index, securitynow_words, stores each spoken word as a keyword so that Kibana can run aggregations on it.
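To illustrate the idea behind the word-level index, here is a minimal sketch of how spoken lines might be split into one record per word and counted per speaker. The `Line` and `Word` case classes and the tokenization rule (lower-casing, splitting on anything that is not a letter, digit or apostrophe) are assumptions made for this example, not the project's actual code.

```scala
object WordStats {
  // Hypothetical shapes: one spoken line, and one word attributed to a speaker.
  final case class Line(speaker: String, line: String)
  final case class Word(speaker: String, word: String)

  // Split a line into lower-cased words, dropping punctuation and empty tokens.
  def words(l: Line): Seq[Word] =
    l.line.toLowerCase
      .split("[^\\p{L}\\p{N}']+")
      .toSeq
      .filter(_.nonEmpty)
      .map(Word(l.speaker, _))

  // Total number of words spoken by each speaker.
  def wordCountBySpeaker(lines: Seq[Line]): Map[String, Int] =
    lines.flatMap(words).groupBy(_.speaker).map { case (s, ws) => s -> ws.size }
}
```

Each `Word` corresponds roughly to one document in a per-word index; `wordCountBySpeaker` is the kind of aggregation that Kibana would compute over the keyword field.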

securitynow index

As stated above, this index should be enough to run analyses and answer search queries. It contains two mappings, episode and episodeLine. The episode mapping keeps the information of every aired show along with all of its interventions; it is simply an indexed version of the parsed transcript. Its mapping is:

"episode": {
        "properties": {
          "audioURL": {
            "type": "text"
          },
          "date": {
            "type": "date"
          },
          "header": {
            "properties": {
              "audio": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "date": {
                "type": "long"
              },
              "hosts": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "number": {
                "type": "long"
              },
              "site": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "title": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "notesURL": {
            "type": "text"
          },
          "number": {
            "type": "integer"
          },
          "speakers": {
            "type": "text"
          },
          "text": {
            "type": "nested",
            "properties": {
              "line": {
                "type": "text"
              },
              "speaker": {
                "type": "keyword"
              }
            }
          },
          "title": {
            "type": "text"
          }
        }
      }

Note how each line of dialog is stored within the nested object text so Elasticsearch can query those lines as in the example shown at the beginning of this document.
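As a rough sketch of how raw transcript text could become the speaker/line pairs held in the nested text object, the example below assumes that each intervention starts with an upper-case speaker name followed by a colon (for example `LEO: ...`) and that lines without such a prefix continue the previous intervention. Both the format and the `TranscriptLines` and `TextLine` names are assumptions for illustration; the project's real parser may work differently.

```scala
object TranscriptLines {
  // Mirrors the nested "text" object of the mapping: speaker + line.
  final case class TextLine(speaker: String, line: String)

  // Assumed format: "SPEAKER: what was said", with the speaker in upper case.
  private val Spoken = """^([A-Z][A-Z .'-]*):\s+(.*)$""".r

  def parse(raw: Seq[String]): Vector[TextLine] =
    raw.map(_.trim).filter(_.nonEmpty).foldLeft(Vector.empty[TextLine]) {
      // A new intervention starts.
      case (acc, Spoken(speaker, text)) => acc :+ TextLine(speaker, text)
      // No speaker prefix: append to the previous intervention.
      case (acc, continuation) if acc.nonEmpty =>
        acc.init :+ acc.last.copy(line = acc.last.line + " " + continuation)
      // Text before the first speaker is dropped.
      case (acc, _) => acc
    }
}
```

Each resulting `TextLine` maps onto one element of the text array, which is what allows the nested query with inner_hits to report who said a given phrase.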

As with the securitynow_words index, the episodeLine mapping exists to serve Kibana and includes some keyword fields.

TODO

  • Unit tests.
  • Factor the Upload object into a generic indexer, allowing modularization when creating different views in the form of indices and mappings.
  • Replace the ad-hoc indices and types for Kibana with aggregation results from securitynow/episode to greatly improve performance (time & space).
  • Add a "listener" that turns this ETL into a stream endpoint able to keep the index updated with new shows.