Pronunciation assessment not processing audio files
LaBRat2022 opened this issue · 3 comments
Hello team and thanks for all the amazing services you offer!
I am currently using a Django server to implement a pronunciation assessment widget in which the user provides the reference text to evaluate and submits an audio recording as a request. This is a simple implementation and, being a developer with basic skills, I decided to approach it by extending the Python example you provide and implementing it on my Django website.
On the front end, I capture the audio using JavaScript and send it to my view function as a file, which is saved on the server to be processed by the backend.
My issue appears when I send recorded audio: the score always comes back as 0. The file transfer itself seems to work so far. I can listen to the audio both in the browser and directly from the file that the view stored, and they sound the same. I also tested the Django-to-Azure transfer by sending the PCM file provided in the examples; that one is recognized by Azure and I get a score accordingly. Another error I get is "InitialSilenceTimeout".
Could you please take a look at my code and kindly point me in the right direction?
This is my view function:
@login_required(login_url="/login/")
def pronunciation(request):
    if request.method == 'POST':
        referenceText = request.POST.get('reference')
        print(f"request.POST: {request.POST}")
        print(f"request.FILES: {request.FILES}")
        print(referenceText)
        if referenceText == '':
            resultJson = 'error'
            context = {'result': resultJson}
            html_template = loader.get_template('home/aaa.html')
            return HttpResponse(html_template.render(context, request))
        else:
            # a common wave header, with zero audio length
            # since stream data doesn't contain header, but the API requires header to fetch format information, so you need post this header as first chunk for each query
            WaveHeader16K16BitMono = bytes([ 82, 73, 70, 70, 78, 128, 0, 0, 87, 65, 86, 69, 102, 109, 116, 32, 18, 0, 0, 0, 1, 0, 1, 0, 128, 62, 0, 0, 0, 125, 0, 0, 2, 0, 16, 0, 0, 0, 100, 97, 116, 97, 0, 0, 0, 0 ])

            # a generator which reads audio data chunk by chunk
            # the audio_source can be any audio input stream which provides read() method, e.g. audio file, microphone, memory stream, etc.
            def get_chunk(audio_source, chunk_size=1024):
                yield WaveHeader16K16BitMono
                while True:
                    time.sleep(chunk_size / 32000) # to simulate human speaking rate
                    chunk = audio_source.read(chunk_size)
                    if not chunk:
                        global uploadFinishTime
                        uploadFinishTime = time.time()
                        break
                    yield chunk

            # build pronunciation assessment parameters
            pronAssessmentParamsJson = "{\"ReferenceText\":\"%s\",\"GradingSystem\":\"HundredMark\",\"Dimension\":\"Comprehensive\"}" % referenceText
            pronAssessmentParamsBase64 = base64.b64encode(bytes(pronAssessmentParamsJson, 'utf-8'))
            pronAssessmentParams = str(pronAssessmentParamsBase64, "utf-8")

            # build request
            url = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-us" % region
            headers = {
                'Accept': 'application/json;text/xml',
                'Connection': 'Keep-Alive',
                'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
                'Ocp-Apim-Subscription-Key': subscriptionKey,
                'Pronunciation-Assessment': pronAssessmentParams,
                'InitialSilenceTimeoutMs' : '45000',
                'Transfer-Encoding': 'chunked',
                'Expect': '100-continue'
            }

            file = request.FILES['audio']
            filename = default_storage.save(file.name, file)
            audio_content = request.FILES['audio']
            print(f"Read {len(audio_content)} bytes from audio file")
            #audioFile = open(audio_content, 'rb')
            file_path = os.path.join(os.getcwd(), filename)
            relative_path = os.path.relpath(file_path)
            print(relative_path)
            audioFile = open('C:/djangoapp/core/media/'+relative_path, 'rb')
            #audioFile = open('C://PronunciationAssessment/PronunciationAssessment/goodmorning.pcm', 'rb')

            # send request with chunked data
            response = requests.post(url=url, data=get_chunk(audioFile), headers=headers)
            getResponseTime = time.time()
            audioFile.close()

            latency = getResponseTime - uploadFinishTime
            print("Latency = %sms" % int(latency * 1000))

            resultJson = json.loads(response.text)
            print(json.dumps(resultJson, indent=4))
            context = {'result': resultJson}
            html_template = loader.get_template('home/aaa.html')
            return HttpResponse(html_template.render(context, request))
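The view above builds the `Pronunciation-Assessment` header by formatting the reference text into a JSON string with `%`. As a minimal sketch (the helper name `build_pron_assessment_header` is hypothetical and not part of the sample), the same three parameters could be serialized with `json.dumps`, so that reference text containing quotes or backslashes still produces valid JSON:

```python
import base64
import json


def build_pron_assessment_header(reference_text: str) -> str:
    """Return the base64-encoded JSON value for the Pronunciation-Assessment header.

    Hypothetical helper: it uses the same three parameters as the view above,
    but lets json.dumps handle the escaping of the reference text.
    """
    params = {
        "ReferenceText": reference_text,
        "GradingSystem": "HundredMark",
        "Dimension": "Comprehensive",
    }
    params_json = json.dumps(params)
    return base64.b64encode(params_json.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    header_value = build_pron_assessment_header('She said "good morning"')
    # Decode again to confirm that the header round-trips to the same parameters.
    print(json.loads(base64.b64decode(header_value)))
```

This does not by itself explain a score of 0; it only hardens the header construction against input that the `%`-formatted string would turn into invalid JSON.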
And this is the widget that I am using to capture the audio and get the reference text:
<div class="col-md-6 grid-margin stretch-card">
    <div class="card">
        <div class="card-body">
            <h4 class="card-title">Pronunciation Coach:</h4>
            <form method="POST" action="{% url 'pronunciation' %}">
                {% csrf_token %}
                <div class="form-group">
                    <label for="word-input">Enter a word or sentence to practice</label>
                    <input type="text" class="form-control" id="word-input" name="reference">
                </div>
                <!-- Voice Recorder -->
                <div id="voice-recorder">
                    <div class="row" style="margin-bottom: 15px;">
                        <button type="submit" class="btn btn-primary mb-2" style="margin-left: 12px">Submit</button>
                        <button id="start-recording" class="btn btn-success mb-2" style="margin-left: 12px ; margin-right: 15px" type="button">Start Recording</button>
                        <button id="stop-recording" class="btn btn-danger mb-2" style="display: none; margin-left: 12px; margin-right: 15px;" type="button">Stop Recording</button>
                    </div>
                    <audio id="audio-stream" controls name="audiox"></audio>
                </div>
            </form>
            <div class="card-body">
                {% if result == 'error' %}
                    <h4 class="card-text" style="color:crimson">Please enter a target word or sentence and click Submit</h4>
                {% elif result %}
                    <h4 class="card-text">Total Score: {{ result.NBest.0.AccuracyScore }}</h4>
                    {% for word in result.NBest.0.Words %}
                        <h4>{{ word.Word }}</h4>
                        <ul>
                            {% for phoneme in word.Phonemes %}
                                {% if phoneme.AccuracyScore >= 90 %}
                                    <span style="color: green">{{ phoneme.Phoneme }}</span>
                                {% elif phoneme.AccuracyScore >= 60 %}
                                    <span style="color: yellow">{{ phoneme.Phoneme }}</span>
                                {% else %}
                                    <span style="color: red">{{ phoneme.Phoneme }}</span>
                                {% endif %}
                            {% endfor %}
                        </ul>
                    {% endfor %}
                {% else %}
                    {% if word %}
                        <h4 class="card-text">Sorry, the word you entered was not found in the dictionary.</h4>
                    {% endif %}
                {% endif %}
            </div>
            <div class="card-body">
            </div>
        </div>
    </div>
</div>
<script>
    const startRecordingBtn = document.getElementById("start-recording");
    const stopRecordingBtn = document.getElementById("stop-recording");
    const audioStream = document.getElementById("audio-stream");
    const form = document.querySelector('form');
    let mediaRecorder;
    let recordedChunks = [];
    let recordedAudio;

    startRecordingBtn.addEventListener("click", () => {
        recordedChunks = [];
        audioStream.src = '';
        navigator.mediaDevices.getUserMedia({ audio: true })
            .then(stream => {
                mediaRecorder = new MediaRecorder(stream);
                mediaRecorder.start();
                mediaRecorder.addEventListener("dataavailable", event => {
                    recordedChunks.push(event.data);
                });
                mediaRecorder.addEventListener("stop", () => {
                    const audioBlob = new Blob(recordedChunks, { type: "audio/wav;" });
                    audioStream.src = URL.createObjectURL(audioBlob);
                    recordedAudio = audioBlob;
                });
                startRecordingBtn.style.display = "none";
                stopRecordingBtn.style.display = "inline-block";
            });
    });

    stopRecordingBtn.addEventListener("click", () => {
        mediaRecorder.stop();
        startRecordingBtn.style.display = "inline-block";
        stopRecordingBtn.style.display = "none";
    });

    document.querySelector('form').addEventListener('submit', function(e) {
        e.preventDefault();
        const csrfToken = document.querySelector('input[name="csrfmiddlewaretoken"]').value;
        const formData = new FormData();
        formData.append('reference', document.querySelector('input[name="reference"]').value);
        formData.append('audio', recordedAudio, "audio.wav");
        const xhr = new XMLHttpRequest();
        xhr.open('POST', '{% url "pronunciation" %}'.replace('%', '%%'), true);
        xhr.setRequestHeader("X-CSRFToken", csrfToken);
        xhr.send(formData);
    });
</script>
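One thing that may be worth checking with the recorder above: the `type` passed to `new Blob(...)` only labels the data and does not convert it, and `MediaRecorder` in most browsers emits a compressed container such as WebM or Ogg rather than PCM WAV. The sketch below is a diagnostic under that assumption, not a confirmed cause of this issue. It uses only the Python standard library, and the helper names `describe_audio` and `matches_sample_format` are hypothetical. It inspects the first bytes of the saved upload and, for real WAV files, compares the format with the 16 kHz, 16-bit, mono layout declared by `WaveHeader16K16BitMono`:

```python
import io
import wave


def describe_audio(data: bytes) -> dict:
    """Report which container the leading bytes of an upload indicate."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        try:
            with wave.open(io.BytesIO(data), "rb") as wav:
                return {
                    "container": "wav",
                    "channels": wav.getnchannels(),
                    "sample_rate": wav.getframerate(),
                    "bits_per_sample": wav.getsampwidth() * 8,
                }
        except wave.Error as exc:
            # RIFF/WAVE file that the wave module cannot parse (e.g. not PCM).
            return {"container": "wav", "error": str(exc)}
    if data[:4] == b"\x1a\x45\xdf\xa3":  # EBML magic number (WebM / Matroska)
        return {"container": "webm/matroska"}
    if data[:4] == b"OggS":  # Ogg page header
        return {"container": "ogg"}
    return {"container": "unknown"}


def matches_sample_format(info: dict) -> bool:
    """True if the file is 16 kHz, 16-bit, mono PCM WAV."""
    return info == {
        "container": "wav",
        "channels": 1,
        "sample_rate": 16000,
        "bits_per_sample": 16,
    }
```

For example, `describe_audio(open(path, 'rb').read())` could be called on the file saved by `default_storage.save` before it is posted to the service. A headerless raw PCM file is reported as `unknown`, which is expected for that case because the view supplies the WAV header itself.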
Hi @LaBRat2022, thanks, we have received this. Let me pass it on to our team for investigation.
Hello Kerry, and thank you so much for the support.
I have since been able to get going with the SDK for Python and worked around the problem, so we can close this one out!
Again, thank you a bunch for your kind attention and support!
GG
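The thread does not show the SDK code that was used for the workaround. Purely as an illustration of what a pronunciation assessment call through the Speech SDK for Python (`azure-cognitiveservices-speech`) can look like, here is a minimal sketch. The function name `assess_pronunciation` and its parameters are hypothetical, and the snippet assumes the input is a WAV file on disk and that the language is `en-US`:

```python
def assess_pronunciation(wav_path, reference_text, subscription_key, region):
    """Run a single-utterance pronunciation assessment on a WAV file.

    Hypothetical helper, not the code used in this thread. The SDK is imported
    lazily so that this module can be loaded without it being installed.
    """
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=subscription_key, region=region)
    speech_config.speech_recognition_language = "en-US"  # assumed language
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)

    # Same grading system as the REST request in the issue (HundredMark).
    pron_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
    )

    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
    pron_config.apply_to(recognizer)

    result = recognizer.recognize_once()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        # No speech recognized, or the request was canceled.
        return None

    assessment = speechsdk.PronunciationAssessmentResult(result)
    return {
        "accuracy": assessment.accuracy_score,
        "fluency": assessment.fluency_score,
        "completeness": assessment.completeness_score,
        "pronunciation": assessment.pronunciation_score,
    }
```

As far as I know, `AudioConfig(filename=...)` expects a WAV file by default, so a browser recording in another container would likely still need to be converted before being passed in.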
No problem, thanks again for the feedback!