
    My efforts are dedicated to my "Source of Inspiration..."


 


For OEMs (PC, sound card, and voice modem manufacturers)

Many sound cards and sound devices shipped today do not work well with Speech Recognition and Text-to-Speech because of a few small flaws in their design. Luckily, the flaws are easy to fix, and fixing them significantly improves the usability of speech for the user. This document describes the work that an OEM should do to ensure that speech recognition and text-to-speech work well on their sound device.

Note: This document references the Telex desktop microphone and Telex Nomad microphone as a means of testing the sound device. The intent is NOT to promote any particular microphone. Telex microphones are mentioned because they are widely available.

CPU Wave-in/out Functionality

This section describes the functionality that a sound device and microphone should have to produce good speech recognition and text-to-speech from the PC’s local microphone and speaker.

Wave Out

For text-to-speech to work well on a sound card, the sound card’s DAC should support:

16-bit, mono audio at 8 kHz, 11.025 kHz, 16 kHz, and 22.05 kHz sampling rates.
Volume control for the digital output also needs to be supported. Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB. 0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
waveOutGetPosition() needs to be accurate to within 1/60th of a second so that lip synchronization code works well.
The DMA ping-pong buffers should be less than or equal to 1/16th of a second apiece to ensure fast response times for starting and stopping playback.
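
As a rough illustration of the buffer and position limits above, the following Python sketch converts the 1/16-second and 1/60-second figures into sample counts for 16-bit mono audio. The function names are hypothetical, and the 11,025 Hz and 22,050 Hz values assume that the standard Windows sampling rates are the ones intended.

```python
# Sketch only: these names are illustrative and not part of any driver API.
RATES_HZ = (8000, 11025, 16000, 22050)  # assumed standard sampling rates
BYTES_PER_FRAME = 2                     # 16-bit mono: one 2-byte sample per frame


def max_buffer_bytes(rate_hz, fraction_of_second=16):
    """Largest DMA buffer, in bytes, that holds no more than 1/16th of a second."""
    frames = rate_hz // fraction_of_second  # round down so the limit is never exceeded
    return frames * BYTES_PER_FRAME


def position_tolerance_frames(rate_hz, fraction_of_second=60):
    """Number of whole frames that fit within 1/60th of a second."""
    return rate_hz // fraction_of_second


if __name__ == "__main__":
    for rate in RATES_HZ:
        print(rate, max_buffer_bytes(rate), position_tolerance_frames(rate))
```

At 8 kHz, for example, each ping-pong buffer would be at most 500 frames (1000 bytes), and the reported playback position would need to be within about 133 frames of the true position.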

Wave In

For speech recognition to work well, the wave in driver should support:

16-bit, mono audio at 8 kHz, 11.025 kHz, 16 kHz, and 22.05 kHz sampling rates.
Volume control for the digital record also needs to be supported. Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB. 0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
The attenuation should range from at least 0.0 dB to -24 dB. A range of 40 dB is desirable to accommodate more microphones.
The DMA ping-pong buffers should be less than or equal to 1/16th of a second apiece to ensure fast response times for starting and stopping recording. Smaller DMA buffers also improve speech recognition response time.
Enough gain so that when a Telex Nomad is worn (or a Telex desktop microphone is spoken to from 6" to 12" away) and attenuation is -12 dB from maximum, the user’s normal speech peaks on the VU meter at about -6 dB from maximum. See below for information about the test application that we supply.
The combination of the ambient noise from the microphone and sound card should not be louder than -33 dB. It should be closer to -45 dB for dictation to work well. See below for information about the test application.
If the user speaks too loudly and clipping occurs, make sure that the signal is distorted as little as possible.
Watch out for 60 Hz hum.
Hardware automatic gain decreases speech recognition accuracy. It’s best to let the speech recognizer control the volume.
(Optional) The Microsoft Multimedia DDK describes a "low priority" wave-in driver which allows speech recognition to listen all of the time. If another application plays a sound or records, the speech recognition "device" is preempted while the other sound is playing/recording. When the other sound finishes, speech recognition regains control. If a sound device does not support low priority then no sounds can be played while speech recognition is listening. If low priority is supported, then modifications need to be made to the mixer. Make sure that low priority is thoroughly tested; many implementations that we've seen have been buggy.
(Optional) The wave-in and wave-out should be "full duplex" so it’s possible to record and play at the same time. However, most CODECs only handle full duplex of the same sampling rates. An ideal solution would be to allow recording at one sampling rate, and playback at another sampling rate.
(Optional) If the system is full duplex, it would be beneficial for echo-canceling chipsets to automatically eliminate the PC’s output waveform from the microphone. This ensures that the PC doesn’t listen to itself. The echo canceling should not only remove wave-out, but also MIDI and CD audio signals so that a user can speak to his computer while MIDI or CD music is playing.
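
The list above warns about 60 Hz hum but does not say how to measure it. One possible approach, sketched here in Python, is to run the Goertzel algorithm on a recording of silence and report the level of the 60 Hz component relative to 16-bit full scale. The function names, the choice of Goertzel, and the full-scale reference of 32768 are all assumptions made for illustration; no pass/fail threshold is implied.

```python
import math


def goertzel_power(samples, rate_hz, freq_hz):
    """Squared magnitude of the freq_hz component of samples (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / rate_hz)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def hum_level_db(samples, rate_hz, hum_hz=60.0, full_scale=32768.0):
    """Estimated amplitude of the hum component, in dB relative to full scale.

    Most accurate when the recording spans a whole number of hum cycles.
    """
    n = len(samples)
    if n == 0:
        return -math.inf
    power = max(goertzel_power(samples, rate_hz, hum_hz), 0.0)
    amplitude = 2.0 * math.sqrt(power) / n  # a sine of amplitude A yields |X| = A*n/2
    if amplitude == 0.0:
        return -math.inf
    return 20.0 * math.log10(amplitude / full_scale)
```

Passing `hum_hz=50.0` would check for mains hum in regions that use 50 Hz power.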

Mixer

The sound card/device driver needs to support the mixer architecture, described in the Windows DDK. The mixer should have:

Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB. 0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
Playback Line

Master volume control - Set to a reasonable level by default.

Wave-out volume - Set to a reasonable level by default.

Mute ability

(Optional) - Mixer control so the user can hear what the microphone hears. This can only be turned on if the user has headphones, otherwise a feedback loop will be produced. This should be off by default just in case the user plugs in speakers.

Recording Line

Microphone volume - Set to 0x4000 (-12 dB) by default.

Line-in volume - Some users will plug an amplified microphone signal into their line-in. Set to 0x4000 (-12 dB) by default.

Mixer control or multiplexer control to select between microphone and line-in. Set to the microphone by default.

If the volume is 0 then mute the incoming audio.

In general, when the user plugs in the microphone all defaults should be set so "it just works". See below for the test application.
(Optional) If the wave-in device supports "low priority" (see the Microsoft DDK) then it also needs to support a "Voice Commands" mixer line with the same functionality as the "Recording Line".
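
The linear-to-decibel conversion described above can be sketched as follows in Python. This is a minimal illustration, assuming that "every halving is 6 dB" refers to the usual 20·log10 amplitude law, with 0xffff as the 0 dB reference. The function names are hypothetical and a real driver would also clamp the result to the range its mixer chip supports.

```python
import math

FULL_SCALE = 0xFFFF  # linear volume that corresponds to 0 dB


def linear_to_db(volume):
    """Convert a linear volume (0x0000..0xffff) to attenuation in dB."""
    if not 0 <= volume <= FULL_SCALE:
        raise ValueError("volume must be between 0x0000 and 0xffff")
    if volume == 0:
        return -math.inf  # a volume of 0 means the line is muted
    return 20.0 * math.log10(volume / FULL_SCALE)


def db_to_linear(db):
    """Convert an attenuation in dB (0 or negative) to a linear volume."""
    if db > 0:
        raise ValueError("attenuation must be 0 dB or less")
    return round(FULL_SCALE * 10.0 ** (db / 20.0))


if __name__ == "__main__":
    for volume in (0xFFFF, 0x8000, 0x4000):
        print(hex(volume), round(linear_to_db(volume), 2))
```

Under this assumption 0x8000 works out to roughly -6.02 dB and 0x4000 to roughly -12.04 dB, which matches the rounded figures quoted in the text.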

Microphone

The minimum microphone required has these characteristics:

It rests on the desktop, sits on top of the monitor, is built into the monitor, or clips on to the user’s shirt. Microphones built into the keyboard have problems because they pick up keyboard clicks. Microphones should be designed so that users place them no more than 18" from their mouth; the closer the better. The best microphones are close-talk, ear-piece, or handset microphones.
Basically the same frequency response as the Telex desktop or Nomad. Having a significantly different frequency response will result in lower out-of-the-box accuracy since most engine vendors have built their speaker-independent recognition models based upon the Telex microphone.
Directional (cardioid or hypercardioid). Omni-directional microphones pick up too much noise from the air-conditioning and the CPU fan.
(Optional) Close-talk microphone. A close-talk microphone is worn on the user’s head and provides much better speech recognition accuracy than a desktop because the microphone element is close to the user’s mouth. When bundling a close-talk microphone, look for:

Near-field so it works better in noisy environments.

The microphone should be comfortable enough to wear for long periods of time.

It should be easy to put on and take off.

Make sure it doesn’t flop or move around when the user moves his/her head.

The cord should be long enough so the user can lean back in his/her chair.

CPU Housing

The CPU housing should be designed with the following in mind:

Users should be able to plug in Telex Nomad or Telex desktop directly, without the need for adapters and amplifiers. Some sound cards require adapters right now, causing the users unnecessary confusion.
Clearly label the jacks that the microphone and headphones plug into. The jacks might even be on the front of the CPU so they’re easily accessed.
Use good connectors. Bad ones cause noise/static when the microphone cord is tugged slightly.
(Optional) Ultra-quiet fans reduce the ambient noise in the room and improve accuracy.
(Optional) The CPU or monitor should come with built-in sound so the user doesn’t have to do extra work to plug in speakers.

VU Test Application

The VU test application displays a VU meter and shows the latest peak signal level in dB. To run the test application click here.


If a wave-in device is working well, the test application will produce the following results on a clean (new install of the sound driver) machine. These tests assume that a Telex Nomad microphone is worn, or that a Telex desktop microphone is about 12" away from the speaker’s mouth. The speaker is talking at a comfortable volume:

First of all, the VU meter applet must start up, and its VU meter must show that it hears the user.
The volume control slider must appear. If it does not, speech recognition cannot control the volume of the source. Moving the slider up or down must change the volume of the microphone.
When the user is quiet, the peak level will be no more than -33 dB. For dictation to work well it needs to be around -45 dB. (The volume is assumed to be set to 0x4000).
When the user speaks, the peak level is between -9 dB and -3 dB. It should not clip. (The volume is assumed to be set to 0x4000.)
(Optional) A tester can bring up Sound Recorder (or some other application) and play a digital audio sound while the VU meter is running. In a low-priority system the VU audio will temporarily pause while the sound is being played. In a full-duplex system recording will continue uninterrupted while the sound is played.
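
The level checks above can be approximated offline on a captured buffer of 16-bit samples. The Python sketch below is only an illustration: it assumes the dB figures are relative to 16-bit full scale (32768), which the text does not state explicitly, and the function names are hypothetical rather than taken from the VU test application.

```python
import math

FULL_SCALE = 32768  # magnitude of the most negative 16-bit sample


def peak_db(samples):
    """Peak level of signed 16-bit samples in dB relative to full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return -math.inf
    return 20.0 * math.log10(peak / FULL_SCALE)


def quiet_ok(samples, limit_db=-33.0):
    """True if the ambient noise peak is no louder than limit_db."""
    return peak_db(samples) <= limit_db


def speech_ok(samples, low_db=-9.0, high_db=-3.0):
    """True if the speech peak falls between low_db and high_db."""
    return low_db <= peak_db(samples) <= high_db


def is_clipped(samples):
    """True if any sample sits at the 16-bit limits."""
    return any(s >= 32767 or s <= -32768 for s in samples)
```

For dictation, the same `quiet_ok` check could be run with `limit_db=-45.0`, the stricter figure given above.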

Voice-Modem Wave-in/out Functionality

This section describes the functionality that a voice-modem sound device should have to produce good speech recognition and text-to-speech.

TAPI

The Voice Modem should support TAPI and allow the application to acquire the wave in/out device through TAPI.

Wave Out

For text-to-speech to work well on a voice modem, the voice modem’s DAC should support:

16-bit, mono, 8 kHz PCM needs to be supported by the DAC.
Volume control for the digital output also needs to be supported. Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB. 0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
Response time for opening, pausing, resuming, and closing the wave-out device should be very quick. Voice modems that use a serial connection do not provide fast enough response time.
Overhead for playing PCM data should be low. Voice modems that send compressed digital audio over the serial connection are too slow.
The DMA ping-pong buffers should be less than or equal to 1/16th of a second apiece to ensure fast response times for starting and stopping playback.

Wave In

For speech recognition to work well, the voice modem’s wave in driver should support:

16-bit, mono, 8 kHz PCM needs to be supported by the ADC.
Volume control for the digital record also needs to be supported. Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB. 0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
Response time for opening and closing the wave-in device should be very quick. Voice modems that use a serial connection do not provide fast enough response time.
Overhead for recording PCM data should be low. Voice modems that send compressed digital audio over the serial connection are too slow.
The DMA ping-pong buffers should be less than or equal to 1/16th of a second apiece to ensure fast response times for starting and stopping recording. Smaller DMA buffers also improve speech recognition response time.
If the user speaks too loudly and clipping occurs, make sure that the signal is distorted as little as possible.
Watch out for 60 Hz hum.
(Optional) The wave-in and wave-out should be "full duplex" so it’s possible to record and play at the same time. If the system is full duplex, it’s necessary for echo-canceling chipsets to automatically eliminate the output waveform from the input waveform. This ensures that the speech recognition doesn’t listen to audio being played out. The full duplex and echo-canceling feature allows the user to interrupt the computer while the computer is speaking a prompt.
 


 

 


Send mail to askazad@hotmail.com with questions or comments about this web site.
Copyright © 2001 Engineered Station
Last modified: July 06, 2001
