HTML5 Audio Read-Along

Screenshot of app

Jump straight to the live demo.

When I was in college, my most valuable tool for writing papers was a text-to-speech (TTS) program. I could paste in a draft of my paper and it would highlight each word as it was spoken, so I could give my proofreading eyes a break and do proof-listening while I read along; I caught many mistakes I would have missed. Likewise, for powering through course readings I would copy the material into the TTS program whenever possible and speed up the reading rate; because the words are highlighted, it's easy to re-find your place if you look away and just listen for a while. (I constantly use OS X's selected-text speech feature, but unfortunately it does not highlight words.) A decade after my college days, I would have hoped that such TTS read-alongs would have become common on the Web (though there is a work-in-progress Chrome API and a W3C draft spec now under development), especially since read-along apps are plentiful in places like the Apple App Store for kids' books.

I created an initial version of this read-along demo in December 2009, so the passage I chose was naturally the nativity story, specifically from the Gospel of Luke in the English Standard Version (ESV). I chose a biblical passage not only in keeping with the Christmas spirit but also because the ESV has an excellent API which allows both a passage’s text and audio to be queried. With the text and audio in hand, each of the words in the text had to be time-indexed for its begin time and duration in the corresponding audio. In the past, audio Bibles were divided into chapter segments only and that was as granular as you could go; the ESV team innovated by taking this granularity down to the verse level. Unfortunately, however, word-level granularity is not available. Therefore, in order to make this read-along demo work, I manually traversed the audio to find each word’s begin time and duration, and I added these time indices to the word markup as data-begin and data-dur attributes, akin to SMIL’s begin and dur attributes. (As an aside, it took me a tedious four hours to manually obtain the time indices for this passage. Because of the pain endured, in 2011 I set out to find a way to automate the process of finding time indices, and I had some success, which can be found in my ESV Text/Audio Aligner project.)
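To make the data-begin and data-dur idea concrete, here is a minimal sketch of how such attributes could be parsed and searched. It is not the demo's actual code: the function names `parseTiming` and `findWordIndex` are hypothetical, the timing values are invented, and the search assumes the words are sorted by begin time and do not overlap.

```javascript
// Illustrative only: these timings are invented, not taken from the demo.
// In markup, each word might look like:
//   <span data-begin="0.5" data-dur="0.25">days</span>
// In a browser, `attrs` below could be that element's `dataset`.

// Convert the string attributes of one word into numeric begin/end seconds.
function parseTiming(attrs) {
  const begin = Number.parseFloat(attrs.begin);
  const dur = Number.parseFloat(attrs.dur);
  if (!Number.isFinite(begin) || !Number.isFinite(dur) || dur < 0) {
    throw new RangeError(`Invalid timing: begin=${attrs.begin}, dur=${attrs.dur}`);
  }
  return { begin, end: begin + dur };
}

// Binary search for the word being spoken at `time` (seconds).
// Assumes `timings` is sorted by begin and the words do not overlap.
// Returns -1 when `time` falls in a gap between words or outside the passage.
function findWordIndex(timings, time) {
  let lo = 0;
  let hi = timings.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (time < timings[mid].begin) {
      hi = mid - 1;
    } else if (time >= timings[mid].end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}

// Three made-up words, given as the strings an attribute lookup would return.
const timings = [
  { begin: "0", dur: "0.25" },
  { begin: "0.5", dur: "0.25" },
  { begin: "1", dur: "0.5" },
].map(parseTiming);
```

With an index like this, the word to highlight at any audio position is a single lookup, for example `findWordIndex(timings, audio.currentTime)`.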

My Wish

The ultimate goal I would have for this demo would be that it would inspire e-book publishers to work toward adding read-along functionality to their applications. Specifically, I have my eye on Amazon here. I think it is tragic that Amazon now owns Audible and has access to a vast amount of high-quality audio books, but that they are disconnected from Amazon's vast array of e-books in the Kindle store. Amazon needs to work toward integrating Kindle and Audible into one product. When I purchase a Kindle book there should be an option to also purchase the Audible book as part of a package. Then, instead of only being able to use the Kindle device's TTS to listen to the book (it is frustrating that TTS is only available on the Kindle device and not in any of the Kindle apps), I should be able to listen to the Audible audio book while I am reading the Kindle book, all from the same Kindle app on any supported device, even on the Cloud Reader (to which this demo could be directly applied). Amazon would just have to align their Audible audio books with the respective texts in their Kindle e-books, and there are text-audio alignment tools available for this purpose, as mentioned above.

I imagine a Kindle/Audible app which would allow you to seamlessly switch between audio and visual reading modes. Think of listening to a book on your drive home from work, and then picking up where you left off at home with visual reading. If you're at an unimportant passage and start multitasking (e.g. in the kitchen), you could take your eyes off the screen and easily re-find your place since each word is highlighted as it is spoken. Furthermore, having the text-audio alignment would enable highlights and note-taking while just listening to the audio; there could be an app button, for example, that when pressed would cause the spoken audio to be highlighted in the Kindle book; likewise, there could be a button to add a voice memo to the book at that point and it would appear in the Kindle book as a text note via speech recognition/dictation. With integrated audio and text, new modes of reading would be enabled and reading would be much more accessible. Amazon and other e-book publishers, hear me!

Instructions

Upon playing the audio, the word in the text corresponding to the one currently being spoken in the audio is highlighted. When you manually adjust the seek position, the word corresponding to the new audio position is highlighted; conversely, clicking a word causes the audio to seek to its corresponding position (and double-clicking will then cause it to start playing). Thus the text itself serves as an interface for navigating the audio. There is also a keyboard interface for navigating the text. Each word in the text is focusable, and upon tabbing to a word you may hit Enter to seek the audio to that point; there is also a checkbox toggle for whether highlighted words should be auto-focused. Hitting Spacebar toggles play/pause.
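The interactions described above can be sketched as a small controller that is independent of the DOM. This is an assumption-laden illustration, not the demo's source: `createReadAlongController` and its method names are hypothetical, `audio` is any object shaped like an HTMLMediaElement (`currentTime`, `paused`, `play()`, `pause()`), and `timings` is an array of per-word begin times in seconds.

```javascript
// Hypothetical controller: maps the click and keyboard behavior described
// above onto an audio-like object. Names here are illustrative only.
function createReadAlongController(audio, timings) {
  function seekToWord(index) {
    audio.currentTime = timings[index].begin;
  }

  return {
    // Single click: move the playhead to the word without starting playback.
    click(index) {
      seekToWord(index);
    },
    // Double click: seek to the word, then start playing from it.
    doubleClick(index) {
      seekToWord(index);
      audio.play();
    },
    // Keyboard: Enter seeks to the focused word, Space toggles play/pause.
    // Returns true when the key was handled, so a caller can preventDefault().
    keyDown(key, focusedIndex) {
      if (key === "Enter" && focusedIndex >= 0) {
        seekToWord(focusedIndex);
        return true;
      }
      if (key === " ") {
        if (audio.paused) {
          audio.play();
        } else {
          audio.pause();
        }
        return true;
      }
      return false;
    },
  };
}

// Browser wiring might look like this (not executed here; `wordEls` and
// `controller` are assumed to exist):
//   wordEls.forEach((el, i) => {
//     el.tabIndex = 0; // make each word focusable
//     el.addEventListener("click", () => controller.click(i));
//     el.addEventListener("dblclick", () => controller.doubleClick(i));
//   });
```

Keeping the logic separate from the event listeners makes it possible to exercise the behavior with a plain stub object in place of a real audio element.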

Browser Support

The read-along demo works in the latest stable versions of Firefox, Chrome, Safari, and Opera (it may even work in Internet Explorer 9); I've also tested on iPhone and iPad (iOS 5). Safari and Chrome play the MP3 as served from the ESV API. Firefox doesn’t support MP3, so I include an Ogg Vorbis source as well. There is also an 8kHz WAV fallback. Note that the speech rate control only works in browsers (e.g. Chrome and Safari) that implement the HTMLMediaElement.playbackRate property; currently, detection of playbackRate support in iOS is failing, so changing the range control in iOS will have no effect. Note also that increasing the reading rate will decrease the accuracy of the word highlights, since the words are no longer spoken long enough for setTimeout to fire in time.
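To illustrate why a faster reading rate strains timer-based highlighting, here is a rough sketch of the arithmetic involved. Both function names are hypothetical and the property check is a simplistic assumption rather than the demo's detection logic; as noted above, such detection can misreport on some platforms.

```javascript
// Naive feature check: does this audio-like object expose a numeric
// playbackRate? (An assumption for illustration; real-world detection
// can be unreliable on some platforms.)
function supportsPlaybackRate(audio) {
  return typeof audio.playbackRate === "number";
}

// Wall-clock milliseconds until the audio position `nextBegin` (seconds)
// is reached, given the current position and the playback rate.
// A value like this could be passed to setTimeout to schedule the next
// highlight; it shrinks as the rate grows.
function msUntilNextWord(nextBegin, currentTime, playbackRate) {
  const rate = playbackRate > 0 ? playbackRate : 1; // guard against 0 or negatives
  return Math.max(0, ((nextBegin - currentTime) / rate) * 1000);
}
```

At twice the normal rate, a half-second gap until the next word leaves only a quarter of a second for the timer, and short words leave far less, which is where highlights start to lag or get skipped.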

Credits

Demo created by Weston Ruter (@westonruter), X-Team. Code is licensed MIT/GPL.

Scripture taken from The Holy Bible, English Standard Version. Copyright ©2001 by Crossway Bibles, a publishing ministry of Good News Publishers. Used by permission. All rights reserved. Data obtained from the ESV Bible Web Service.


I have put redirects from my blog to GitHub, so the comments on my blog are no longer accessible there. For archive purposes, I've posted the raw comments to gh-pages.