/data-scripts-star-wars

A repo useful for data visualization, among other things. It contains the scripts of the Star Wars movies as well as refined versions with only the dialogues.

Primary language: Rich Text Format. License: MIT.

Scripts from Star Wars movies

Overview

The complete scripts of the first 6 Star Wars movies, along with a light version of each containing only the dialogues and the places where they occur.

You will also find images of every character who has an interaction in the films.

The data folder contains one CSV file per film with information on the dialogues, such as the number of words per interaction, the types of words, the duration of the interaction, who speaks to whom, and the location.

If you like the work, do not hesitate to visit my site.

Why this repo

This started as a personal project, but given the work accomplished, I find it important to share it with the community.

The idea for the Star Wars project came to me after a conference on data visualization at the KIKK festival in Namur, Belgium. The speaker (Nadieh Bremer) made me want to create a data visualization, and what could be better than the theme of Star Wars? My data visualization will be made available later.

First, I retrieved the scripts of the first 6 films. I laid them out in markdown files, keeping only the dialogues, the speakers and the places. I accompanied each script file with one file per film listing all the characters.

Next, I converted the markdown files to HTML so that I could automatically extract the data I wanted with JavaScript. At the same time, I wrote a small script that counts the words.

The second part of my work consisted of watching all the films and checking that the scripts were correct.

Once this was done, I encoded each film in a numbers file with several fields, including the speaker, the interlocutor, the content, the duration, the place, the number of words and the type of words. Thanks to the subtitle files, I was able to recover the duration of each line and check all the data a second time.

Progression in the project

A small overview of the progress of the project.

  • Recover script files (3rd January 2020)
  • Transcription and cleaning in markdown (8th January 2020)
  • Adding data in the sheet (8th January 2020)
  • Adding listeners
  • Adding durations
  • Adding types of words
  • Creating CSV files for each movie
  • Creating JSON files for each movie

What can you find here

In this repo, you can find several files about the Star Wars universe.

Folder Description
📂 Sources All source files that were used to collect the data
📂 Markdown Markdown files with dialogues, speakers and locations
📂 JSON for sheet JSON files used to populate the Sheet file
📂 Data sheet Sheet file which gathers all the information for each film
📂 Data CSV CSV file ready to use for each film
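
The credits list an online converter for turning JSON into CSV. As a rough illustration of that step, here is a minimal sketch that does the conversion in plain JavaScript. The helper name toCsv, the column list and the sample records are assumptions made up for this example. They are not taken from the repo's files, and the real CSV files may use different columns.

```javascript
// Hypothetical helper: turn an array of flat records into CSV text.
function toCsv(records, columns) {
    // Quote a value when it contains a comma, a double quote or a line break,
    // and double any embedded double quotes.
    const quote = (value) => {
        const s = value === null || value === undefined ? "" : String(value)
        return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s
    }
    const header = columns.join(",")
    const rows = records.map((r) => columns.map((c) => quote(r[c])).join(","))
    return [header, ...rows].join("\n")
}

// Made-up sample records, for illustration only.
const sample = [
    { from: "SPEAKER A", to: "SPEAKER B", text: "Hello, there", where: "PLACE 1", number: 2 },
    { from: "SPEAKER B", to: "SPEAKER A", text: 'He said "no"', where: "PLACE 1", number: 3 }
]

console.log(toCsv(sample, ["from", "to", "text", "where", "number"]))
```

Quoting matters here because dialogue lines often contain commas, which would otherwise shift the columns.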

Some code

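The word-count function used for the project:

function countWords(s) {
    // Drop stage directions in parentheses, e.g. "(whispering)".
    // [^)]* stops at the first ")", so several parentheticals in one line
    // are removed separately instead of swallowing the words between them.
    s = s.replace(/\([^)]*\)/g, " ")
    // Treat dots (including ellipses) as word separators.
    s = s.replace(/\./g, " ")
    // Split on any run of whitespace (spaces and line breaks) and ignore empty strings.
    return s.split(/\s+/).filter(function (str) { return str !== "" }).length
}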

Code to build the JSON

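// Each <ul> holds a group of dialogue lines; a <p> right before it names the place.
const uls = document.querySelectorAll('ul')
var global = []
var wordsGlobal = [] // not used in this snippet
let where = null

// Rounded time-per-word ratio (83672 / 9595 ≈ 8.72, rounded to 9)
let timePerWords = Math.round(83672 / 9595);

uls.forEach(ul => {
    let previousEl = ul.previousElementSibling
    let lis = [...ul.getElementsByTagName('li')]

    // Update the current place only when the list is preceded by a paragraph
    if (previousEl !== null) {
        if (previousEl.nodeName === "P") {
            where = previousEl.innerText
        }
    }

    lis.forEach(li => {
        // Each item looks like "SPEAKER to LISTENER : text"
        let content = li.innerText.split(" : ")
        let number = 0
        let text = content[1]
        let textFormat = null
        let peoples = content[0].split(' to ')

        if (typeof text !== 'undefined') {
            // formatSentence is defined elsewhere in the project (not shown here)
            textFormat = formatSentence(text)
            number = countWords(textFormat)
        }

        global.push({
            "from": peoples[0],
            "to": peoples[1],
            "text": text,
            "where": where,
            "number": number,
            // Estimated duration: word count times the ratio above
            "time": number * timePerWords
        })
    })
})

// Serialize all collected records
let data = JSON.stringify(global)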
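
Once the records exist, they can be aggregated directly. The following is a minimal sketch, not part of the original project, that totals the number field per speaker. The helper name wordsPerSpeaker and the sample records are made up for illustration. Only the keys (from, to, text, where, number, time) come from the code above.

```javascript
// Hypothetical helper: sum the "number" (word count) field per speaker.
function wordsPerSpeaker(records) {
    const totals = new Map()
    for (const r of records) {
        totals.set(r.from, (totals.get(r.from) || 0) + r.number)
    }
    // Return [speaker, total] pairs, sorted from most to fewest words.
    return [...totals.entries()].sort((a, b) => b[1] - a[1])
}

// Made-up records with the same keys as the objects pushed into `global`.
const records = [
    { from: "SPEAKER A", to: "SPEAKER B", text: "...", where: "PLACE 1", number: 4, time: 36 },
    { from: "SPEAKER B", to: "SPEAKER A", text: "...", where: "PLACE 1", number: 10, time: 90 },
    { from: "SPEAKER A", to: "SPEAKER B", text: "...", where: "PLACE 2", number: 3, time: 27 }
]

console.log(wordsPerSpeaker(records))
// [ [ 'SPEAKER B', 10 ], [ 'SPEAKER A', 7 ] ]
```

The same pattern works for grouping by place (key on r.where) or for summing the time field instead of number.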

My PHP script to get the total speech time on screen, based on the SRT subtitle file

  $data = file_get_contents('./srt.txt', false);
  $res = preg_replace("/\*([0-9])\*/", "", $data);
  $res2 = preg_replace("/[^0-9:,>]/", " ", $res);
  $res3 = preg_replace('!\s+!', ' ', $res2);
  $res4 = preg_replace('/ , /', ' ', $res3);
  $res5 = preg_replace('/ > /', '>', $res4);
  $res6 = preg_replace('!\s+!', ' ', $res5);
  $res7 = explode(" ", $res6);

  $totalMs = 0;

  // Convert an "HH:MM:SS,mmm" timestamp to milliseconds (null if malformed).
  function toMs($timestamp) {
    $p = explode(':', str_replace(',', ':', $timestamp));
    if (count($p) != 4) {
      return null;
    }
    return (((int)$p[0] * 60 + (int)$p[1]) * 60 + (int)$p[2]) * 1000 + (int)$p[3];
  }

  foreach ($res7 as $part) {
    // Keep only "start>end" pairs; subtitle index numbers are skipped.
    $explode = explode('>', $part);

    if (sizeof($explode) == 2) {
      $start = toMs($explode[0]);
      $end = toMs($explode[1]);

      // Hours, minutes, seconds and milliseconds all count toward the total.
      if ($start !== null && $end !== null) {
        $totalMs += $end - $start;
      }
    }
  }

  // Total in milliseconds, then in whole minutes.
  echo $totalMs . ' ms (' . floor($totalMs / 60000) . ' min)';
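
For readers who prefer to stay in JavaScript, here is a rough sketch of the same idea: find each cue line in the SRT text and add up the time between its start and end stamps. It assumes standard SRT cue lines of the form "HH:MM:SS,mmm --> HH:MM:SS,mmm". The function names srtTimeToMs and totalCueDurationMs and the two-cue sample are made up for this example. Note that the result is the time subtitles stay on screen, which only approximates actual speech time.

```javascript
// Convert an SRT timestamp such as "00:01:02,500" to milliseconds.
function srtTimeToMs(stamp) {
    const m = /^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/.exec(stamp.trim())
    if (m === null) return null
    const [h, min, s, ms] = m.slice(1).map(Number)
    return ((h * 60 + min) * 60 + s) * 1000 + ms
}

// Sum the on-screen duration of every cue found in the SRT text.
function totalCueDurationMs(srtText) {
    const cue = /(\d+:\d{2}:\d{2}[,.]\d{3})\s*-->\s*(\d+:\d{2}:\d{2}[,.]\d{3})/g
    let total = 0
    for (const match of srtText.matchAll(cue)) {
        total += srtTimeToMs(match[2]) - srtTimeToMs(match[1])
    }
    return total
}

// Made-up two-cue subtitle file, for illustration only.
const srt = [
    "1",
    "00:00:01,000 --> 00:00:03,500",
    "First line",
    "",
    "2",
    "00:59:59,900 --> 01:00:01,100",
    "Second line"
].join("\n")

console.log(totalCueDurationMs(srt)) // 3700
```

Converting each stamp to milliseconds before subtracting avoids handling hours, minutes, seconds and milliseconds separately, so a cue that crosses a minute or hour boundary is still measured correctly.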

Credits

The storyline, characters and images represented in this repo belong to their respective owners.

Links Description
DISNEY All content belongs to Disney
IMSDB Scripts from movies
YIFY Subtitles files
MARKDOWN TO HTML Convert markdown to html
JSON PARSER Parse the JSON
CSV-JSON Convert JSON to CSV
WORDCOUNTER Count words if needed
REGEXR For Regular Expressions
WORDPOS Part-of-speech for type of words