IBMStreams/streamsx.speech2text

Naming of some attributes/types is deceptive

Alex-Cook4 opened this issue · 5 comments

I haven't started changing attribute names yet because I want to try to keep the work backwards compatible for now. This issue is more for guidance when we decide we can really overhaul things.

Since I keep getting confused, I'm working on adding documentation on what each type means. I will call out attributes that I believe are misleading.

In the Utterance:

type Utterance =
    tuple<rstring callId,                             // ipAddress + captureSeconds -> In an environment where
                                                      // each speaker is on a separate RTP stream, the callId
                                                      // is effectively the ID for that speaker's stream.
                                                      // In the one-speaker to one-stream case, CTI correlation
                                                      // must be done.
          int32 utteranceNumber,                      // The utterance number for a given RTP stream, [0,1,...]
          float64 utteranceStartTime,                 // Seconds of audio processed for a given RTP stream
                                                      // up to the start of the utterance
          float64 utteranceEndTime,                   // Seconds of audio processed for a given RTP stream
                                                      // up to the end of the utterance
          uint32 captureSeconds,                      // This refers to the capture time in seconds of the first
                                                      // RTP packet in the SSRC stream
          rstring role,                               // role = "AGENT" -- this is currently useless
          rstring utterance,                          // The text of a single utterance
          int32 speakerId,                            // Not used - based on a channel id that is set to 0, since
                                                      // we only handle a single channel at a time
          rstring callCenter,                         // ID for the call center the utterance is coming from
          float64 utteranceConfidence,                // Statistical confidence in the transcription of the utterance
          list<float64> utteranceTokenConfidences/*,  // Statistical confidence in each token/word of the utterance
          list<int32> utteranceSpeakers,              // If using diarization, speaker of each token/word
          list<rstring> nBestHypotheses*/>;           // Alternative guesses for the utterance text

I recommend the following:

  • callId -> rtpStreamId: this isn't actually the ID of a call, since it only covers a single speaker. The true call ID comes from CTI correlation and would span several of these "callId"s.
  • captureSeconds -> rtpStreamsStartTime: since it actually refers to the captureSeconds of the first packet in the RTP stream
  • role -> REMOVE: unless there are plans to support this in some way
  • speakerId -> REMOVE: unless there are plans to support this in some way

As I see other types/attributes I think could be cleaned up, I will add them to this issue.

@Alex-Cook4, depending on a specific environment, the role of a speaker (Agent/Client; Trader/Client) can possibly be retrieved and assigned to each individual rtpStreamId.
I'd suggest keeping it, but producing a default value such as Unassigned for environments where the role cannot be determined. I also suggest renaming it from role to speakerRole.

Regarding the speakerId, isn't it required for environments where speaker diarization is required? In that case, there may be multiple different values of speakerId (1,2,3...) for each individual rtpStreamId.

I agree with the renaming suggestions, with one minor correction: rtpStreamStartTime, not rtpStreamsStartTime.

What do you think?

@mgorbat Thanks for the input.
With regard to:

depending on a specific environment, the role of a speaker (Agent/Client; Trader/Client) can possibly be retrieved and assigned to each individual rtpStreamId.

Is that information available from the RTP packets themselves? I understand that we can get it from a CTI feed that we later correlate with the identifiers in the RTP stream, but if that's what you're talking about, then in my opinion those attributes shouldn't show up until later.

Regarding:

Regarding the speakerId, isn't it required for environments where speaker diarization is required? In that case, there may be multiple different values of speakerId (1,2,3...) for each individual rtpStreamId.

Diarization results are currently placed in the list<int32> utteranceSpeakers since a given utterance may have multiple speaker identifiers.

@Alex-Cook4
My understanding is that in certain cases the range or list of IP addresses and ports present in the RTP packets, together with the direction of a stream, can identify the stream as originating from an agent or a client. It is hard to say, though, how frequent those cases are, and whether the direction of an RTP stream (forward/reverse) survives a network tap point.

I agree with the removal of speakerId; I hadn't noticed that there is already an attribute for this.

In that case, I'm fine with keeping the role attribute as an indicator that this is something that can potentially be set in customized situations. My updated proposal would be the following:

  • callId -> rtpStreamId: this isn't actually the ID of a call, since it only covers a single speaker. The true call ID comes from CTI correlation and would span several of these "callId"s.
  • captureSeconds -> rtpStreamStartTime: since it actually refers to the captureSeconds of the first packet in the RTP stream
  • role -> keep, but document that it is currently unusable
  • speakerId -> REMOVE: unless there are plans to support this in some way

I would also like to add:

  • rtpStreamComplete - boolean attribute to indicate that this is the last utterance from a stream
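
For reference, here is a rough sketch of what the Utterance type might look like with the updated proposal applied. This is not a committed schema. The attribute order, keeping uint32 for rtpStreamStartTime, and the exact declaration of rtpStreamComplete are assumptions, and the utteranceSpeakers and nBestHypotheses attributes that are currently commented out are left out here.

```spl
// Sketch only: Utterance with the renames proposed in this issue applied.
// speakerId is dropped; everything else keeps its current type.
type Utterance =
    tuple<rstring rtpStreamId,                        // was callId: ID of a single speaker's RTP stream
          int32 utteranceNumber,                      // utterance number within the RTP stream
          float64 utteranceStartTime,                 // seconds of audio up to the start of the utterance
          float64 utteranceEndTime,                   // seconds of audio up to the end of the utterance
          uint32 rtpStreamStartTime,                  // was captureSeconds: capture time of the first RTP packet
          rstring role,                               // kept, but currently unusable
          rstring utterance,                          // text of a single utterance
          rstring callCenter,                         // ID of the originating call center
          float64 utteranceConfidence,                // confidence in the transcription
          list<float64> utteranceTokenConfidences,    // confidence per token/word
          boolean rtpStreamComplete>;                 // new (assumed boolean): true on the last utterance of a stream
```

Since this changes attribute names, it would break existing consumers of Utterance, so it only applies once we decide backwards compatibility can be dropped.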