Azure-Samples/Cognitive-Speech-TTS

Pronunciation assessment not processing audio files.

LaBRat2022 opened this issue · 3 comments

Hello team and thanks for all the amazing services you offer!

I am currently using a Django server to implement a pronunciation assessment widget in which the user provides the reference text to evaluate and submits an audio recording as a request. This is a simple implementation, and being a developer with only basic skills, I decided to approach it by extending the Python example you guys provide and implementing it on my Django website.

On the front end I am capturing the audio using JavaScript and sending it to my view function as a file, which is saved on the server to be processed by the backend.

My issue comes when I try to send recorded audio: the score always comes back as a flat 0. As far as I can tell, the file traffic itself works. I can listen to the audio both in the browser and directly from the file that the view stored, and they sound the same. I tested the Django-to-Azure traffic by sending the PCM file provided in the examples, and that one is recognized by Azure and scored accordingly. The other error I sometimes get is "InitialSilenceTimeout".
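For reference, here is a quick sanity check I can run on the saved upload to see what container it actually is (a rough sketch; the path is a placeholder for wherever Django saved the file):

# Quick sanity check on the saved upload. A 16 kHz PCM WAV file starts with
# 'RIFF'; MediaRecorder output in most browsers starts with the WebM/EBML
# magic bytes 1A 45 DF A3 or the Ogg magic 'OggS' instead.
with open('media/audio.wav', 'rb') as f:
    magic = f.read(4)

if magic == b'RIFF':
    print('looks like a WAV/RIFF file')
elif magic == b'\x1a\x45\xdf\xa3':
    print('looks like WebM/Matroska (EBML)')
elif magic == b'OggS':
    print('looks like an Ogg container')
else:
    print('unknown container:', magic.hex())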

Can you guys please take a look at my code and kindly point me in the right direction?

This is my view function:

import base64
import json
import time

import requests
from django.contrib.auth.decorators import login_required
from django.core.files.storage import default_storage
from django.http import HttpResponse
from django.template import loader

# subscriptionKey and region are defined at module level

@login_required(login_url="/login/")
def pronunciation(request):

    if request.method == 'POST':

        referenceText = request.POST.get('reference')
        print(f"request.POST: {request.POST}")
        print(f"request.FILES: {request.FILES}")
        print(referenceText)

        if referenceText == '':
            resultJson = 'error'
            context = {'result': resultJson}

            html_template = loader.get_template('home/aaa.html')
            return HttpResponse(html_template.render(context, request))

        else:

            # A common wave header with zero audio length. The stream data
            # doesn't contain a header, but the API requires one to fetch the
            # format information, so this header is posted as the first chunk
            # of each request.
            WaveHeader16K16BitMono = bytes([82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69, 102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0, 128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0, 100, 97, 116, 97, 0, 0, 0, 0])

            # A generator which reads audio data chunk by chunk. The
            # audio_source can be any audio input stream which provides a
            # read() method, e.g. an audio file, microphone, or memory stream.
            def get_chunk(audio_source, chunk_size=1024):
                yield WaveHeader16K16BitMono
                while True:
                    time.sleep(chunk_size / 32000)  # simulate a human speaking rate
                    chunk = audio_source.read(chunk_size)
                    if not chunk:
                        global uploadFinishTime
                        uploadFinishTime = time.time()
                        break
                    yield chunk

            # build pronunciation assessment parameters (json.dumps handles
            # escaping in case the reference text contains quotes)
            pronAssessmentParamsJson = json.dumps({
                "ReferenceText": referenceText,
                "GradingSystem": "HundredMark",
                "Dimension": "Comprehensive",
            })
            pronAssessmentParamsBase64 = base64.b64encode(bytes(pronAssessmentParamsJson, 'utf-8'))
            pronAssessmentParams = str(pronAssessmentParamsBase64, "utf-8")

            # build request
            url = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-us" % region
            headers = {'Accept': 'application/json;text/xml',
                       'Connection': 'Keep-Alive',
                       'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
                       'Ocp-Apim-Subscription-Key': subscriptionKey,
                       'Pronunciation-Assessment': pronAssessmentParams,
                       'InitialSilenceTimeoutMs': '45000',
                       'Transfer-Encoding': 'chunked',
                       'Expect': '100-continue'}

            file = request.FILES['audio']
            filename = default_storage.save(file.name, file)
            print(f"Received {file.size} bytes in the audio upload")

            # default_storage saves under MEDIA_ROOT (C:/djangoapp/core/media
            # in my setup); resolve the absolute path of the saved file
            file_path = default_storage.path(filename)
            print(file_path)

            audioFile = open(file_path, 'rb')
            # audioFile = open('C://PronunciationAssessment/PronunciationAssessment/goodmorning.pcm', 'rb')

            # send request with chunked data
            response = requests.post(url=url, data=get_chunk(audioFile), headers=headers)
            getResponseTime = time.time()
            audioFile.close()

            latency = getResponseTime - uploadFinishTime
            print("Latency = %sms" % int(latency * 1000))

            resultJson = json.loads(response.text)
            print(json.dumps(resultJson, indent=4))

            context = {'result': resultJson}

            html_template = loader.get_template('home/aaa.html')
            return HttpResponse(html_template.render(context, request))
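One detail worth flagging in the view above: get_chunk() always prepends the zero-length PCM header, so whatever bytes are in the uploaded file get treated as raw 16 kHz 16-bit mono samples. If the upload were itself a WAV file, its own header would be streamed as if it were audio. A hypothetical guard (assuming the canonical 44-byte header):

# Hypothetical guard: if the saved upload already starts with a RIFF header,
# skip past it so only raw PCM samples are streamed. This assumes the
# canonical 44-byte WAV header; files with extra chunks would need real parsing.
def open_raw_pcm(path):
    f = open(path, 'rb')
    if f.read(4) == b'RIFF':
        f.seek(44)  # skip the standard 44-byte WAV header
    else:
        f.seek(0)   # not a WAV; stream the bytes unchanged
    return f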

And this is the widget I am using to capture the audio and get the reference text:

<div class="col-md-6 grid-margin stretch-card">
  <div class="card">
      <div class="card-body">

          <h4 class="card-title">Pronunciation Coach:</h4>

          <form method="POST" action="{% url 'pronunciation' %}">
              {% csrf_token %}
              <div class="form-group">
                  <label for="word-input">Enter a word or sentence to practice</label>
                  <input type="text" class="form-control" id="word-input" name="reference">
              </div>                 

              <!-- Voice Recorder -->
              <div id="voice-recorder">
                <div class="row" style="margin-bottom: 15px;">

                  <button type="submit" class="btn btn-primary mb-2" style="margin-left: 12px">Submit</button>
                  <button id="start-recording" class="btn btn-success mb-2" style="margin-left: 12px ; margin-right: 15px" type="button">Start Recording</button>
                  <button id="stop-recording" class="btn btn-danger mb-2" style="display: none; margin-left: 12px; margin-right: 15px;" type="button">Stop Recording</button>
                  
                </div>    
                               
                  <audio id="audio-stream" controls name="audiox"></audio>
                  
              </div>

          </form>

          <div class="card-body">
              {% if result == 'error' %}

                  <h4 class="card-text" style="color:crimson">Please enter a target word or sentence and click Submit</h4>

                  {% elif result %}
                  
                  <h4 class="card-text">Total Score:  {{ result.NBest.0.AccuracyScore }}</h4>

                  {% for word in result.NBest.0.Words %}
                  <h4>{{ word.Word }}</h4>
                  <ul>
                    {% for phoneme in word.Phonemes %}
                    
                  {% if phoneme.AccuracyScore >= 90 %}
                          <span style="color: green">{{ phoneme.Phoneme }}</span>
                  {% elif phoneme.AccuracyScore >= 60 %}
                          <span style="color: yellow">{{ phoneme.Phoneme }}</span>
                  {% else %}
                          <span style="color: red">{{ phoneme.Phoneme }}</span>
                  {% endif %}

                    {% endfor %}
                  </ul>
                {% endfor %}
                                                          
              {% else %}
                  {% if word %}
                      <h4 class="card-text">Sorry, the word you entered was not found in the dictionary.</h4>
                  {% endif %}
              {% endif %}
          </div>

          <div class="card-body">

          </div>

      </div>
  </div>
</div>

<script>

  const startRecordingBtn = document.getElementById("start-recording");
  const stopRecordingBtn = document.getElementById("stop-recording");
  const audioStream = document.getElementById("audio-stream");
  const form = document.querySelector('form');

  let mediaRecorder;
  let recordedChunks = [];
  let recordedAudio;

  startRecordingBtn.addEventListener("click", () => {

    recordedChunks = [];
    audioStream.src = '';
    navigator.mediaDevices.getUserMedia({ audio: true })
      .then(stream => {
        mediaRecorder = new MediaRecorder(stream);
        mediaRecorder.start();
      
        mediaRecorder.addEventListener("dataavailable", event => {
          recordedChunks.push(event.data);
        });
      
        mediaRecorder.addEventListener("stop", () => {
          const audioBlob = new Blob(recordedChunks, { type: "audio/wav" });
          audioStream.src = URL.createObjectURL(audioBlob);
          recordedAudio = audioBlob;
        });
      
        startRecordingBtn.style.display = "none";
        stopRecordingBtn.style.display = "inline-block";
      });
  });

  stopRecordingBtn.addEventListener("click", () => {
    mediaRecorder.stop();
    startRecordingBtn.style.display = "inline-block";
    stopRecordingBtn.style.display = "none";
  });

  form.addEventListener('submit', function(e) {
    e.preventDefault();

    const csrfToken = document.querySelector('input[name="csrfmiddlewaretoken"]').value;

    const formData = new FormData();
    formData.append('reference', document.querySelector('input[name="reference"]').value);
    formData.append('audio', recordedAudio, "audio.wav");
        
    const xhr = new XMLHttpRequest();
    xhr.open('POST', '{% url "pronunciation" %}', true);
    xhr.setRequestHeader("X-CSRFToken", csrfToken);
    xhr.send(formData);
  });

</script>
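One more thing I am unsure about: as far as I know, MediaRecorder in mainstream browsers produces WebM/Opus or Ogg rather than WAV, and the Blob type label does not transcode anything, so the upload may not be the 16 kHz 16-bit mono PCM the endpoint expects. If that turns out to be the case, one option might be converting it server-side before streaming; a rough sketch, assuming ffmpeg is installed on the server (paths are placeholders):

# Rough sketch, assuming ffmpeg is available on the server: convert whatever
# container the browser uploaded into raw 16 kHz 16-bit mono PCM, which is
# what the wave header in the view above advertises. Paths are placeholders.
import subprocess

def to_pcm_16k_mono(src_path, dst_path):
    subprocess.run([
        'ffmpeg', '-y',
        '-i', src_path,   # input in whatever format the browser produced
        '-ar', '16000',   # resample to 16 kHz
        '-ac', '1',       # downmix to mono
        '-f', 's16le',    # raw 16-bit little-endian PCM, no container header
        dst_path,
    ], check=True)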

Hi @LaBRat2022, well received. Let me forward this to our team for investigation.

Hello Kerry, and thank you so much for the support.

In the meantime I was able to get going with the Python SDK and got around the problem, so we can close this one out!
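In case it helps anyone who lands here later, the SDK route looks roughly like this (a minimal sketch following the SDK's pronunciation assessment pattern, not my exact code; the key, region, and file name are placeholders):

# Minimal sketch of pronunciation assessment via the Speech SDK; the key,
# region, and WAV file name are placeholders, not my actual configuration.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<key>", region="<region>")
speech_config.speech_recognition_language = "en-US"
audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")

pron_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="good morning",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)
pron_config.apply_to(recognizer)

result = recognizer.recognize_once()
pron_result = speechsdk.PronunciationAssessmentResult(result)
print(pron_result.accuracy_score, pron_result.pronunciation_score)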

Again thank you a bunch for your kind attention and support!

GG

No problem, thanks again for the feedback!