For OEMs (PC, sound card, and voice modem manufacturers)
Many sound cards and sound devices shipped today do not work well
with Speech Recognition and Text-to-Speech because of a few small flaws in their design.
Luckily, the flaws are easy to fix, and fixing them significantly improves the usability of
speech for the user. This document describes the work that an OEM should do to ensure that
speech recognition and text-to-speech work well on their sound device.
Note: This document references the Telex desktop microphone and
Telex Nomad microphone as a means of testing the sound device. The intent is NOT to
promote any particular microphone. Telex microphones are mentioned because they are widely
available.
CPU Wave-in/out Functionality
This section describes the functionality that a sound device and
microphone should have to produce good speech recognition and text-to-speech from the PC's
local microphone and speaker.
Wave Out
For text-to-speech to work well on a sound card, the sound card's
DAC should support:
| 16-bit, mono output at 8 kHz, 11.025 kHz, 16 kHz, and 22.05 kHz sampling
rates. |
| Volume control for the digital output also needs to be supported.
Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support
decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB.
0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
|
| waveOutGetPosition() needs to be accurate to within 1/60th of a second so that
lip synchronization code works well. |
| The DMA ping-pong buffers should be less than or equal to 1/16th of a
second apiece to ensure fast response times for starting and stopping playback. |
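The linear-to-decibel volume conversion described above can be sketched as follows. This is a minimal illustration, not driver code: the function names are hypothetical, and it assumes the mapping is 20*log10(volume / 0xffff), which matches the stated rule that every halving of the volume is about 6 dB less.

```python
import math

FULL_SCALE = 0xFFFF  # maximum linear volume; corresponds to 0 dB


def linear_to_db(volume: int) -> float:
    """Convert a linear volume (0x0000..0xFFFF) to attenuation in dB.

    Hypothetical helper: assumes dB = 20 * log10(volume / 0xFFFF).
    """
    if not 0 <= volume <= FULL_SCALE:
        raise ValueError("volume must be in the range 0x0000..0xFFFF")
    if volume == 0:
        return float("-inf")  # no finite dB value; a driver would mute here
    return 20.0 * math.log10(volume / FULL_SCALE)


def db_to_linear(db: float) -> int:
    """Inverse conversion: attenuation in dB (<= 0) back to a linear volume."""
    if db > 0:
        raise ValueError("attenuation must be 0 dB or lower")
    return round(FULL_SCALE * 10.0 ** (db / 20.0))


if __name__ == "__main__":
    for volume in (0xFFFF, 0x8000, 0x4000, 0x2000):
        print(f"0x{volume:04x} -> {linear_to_db(volume):.1f} dB")
```

A real mixer chip usually exposes attenuation in fixed steps, so a driver would additionally round the computed value to the nearest step the hardware supports.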
Wave In
For speech recognition to work well, the wave in driver should
support:
| 16-bit, mono input at 8 kHz, 11.025 kHz, 16 kHz, and 22.05 kHz sampling rates. |
| Volume control for the digital record also needs to be supported.
Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support
decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB.
0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
|
| The attenuation should range from at least 0.0 dB to -24 dB. A range
of 40 dB is desirable to accommodate more microphones. |
| The DMA ping-pong buffers should be less than or equal to 1/16th of a
second apiece to ensure fast response times for starting and stopping recording. Smaller
DMA buffers also improve speech recognition response time. |
| Enough gain so that when a Telex Nomad is worn (or a Telex desktop microphone is
spoken to from a distance of 6" to 12") and attenuation is -12 dB from maximum, the
user's normal speech peaks on the VU meter at about -6 dB from maximum. See below for
information about the test application. |
| The combination of the ambient noise from the microphone and sound
card should not be louder than -33 dB. It should be closer to -45 dB for dictation to work
well. See below for information about the test application. |
| If the user speaks too loudly and clipping occurs, make sure that the
signal is distorted as little as possible. |
| Watch out for 60 Hz hum. |
| Hardware automatic gain control decreases speech recognition accuracy. It's
best to let the speech recognizer control the volume. |
| (Optional) The Microsoft Multimedia DDK describes a "low
priority" wave-in driver which allows speech recognition to listen all of the time.
If another application plays a sound or records, the speech recognition "device"
is preempted while the other sound is playing/recording. When the other sound finishes,
speech recognition regains control. If a sound device does not support low priority then
no sounds can be played while speech recognition is listening. If low priority is
supported, then modifications need to be made to the mixer. Make sure that low priority is
thoroughly tested; many implementations that we've seen have been buggy. |
| (Optional) The wave-in and wave-out should be "full duplex"
so it's possible to record and play at the same time. However, most CODECs only
handle full duplex when both directions use the same sampling rate. An ideal solution
would be to allow recording at one sampling rate and playback at another. |
| (Optional) If the system is full duplex, it would be beneficial for
echo-canceling chipsets to automatically eliminate the PC's output waveform from the
microphone signal. This ensures that the PC doesn't listen to itself. The echo canceling
should remove not only wave-out, but also MIDI and CD audio signals, so that a user can
speak to the computer while MIDI or CD music is playing. |
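To make the 1/16th-second DMA buffer limit concrete, the sketch below computes the largest buffer size in bytes for 16-bit mono audio at each sampling rate. The function name is hypothetical, and the rates are assumed to be the standard 8,000, 11,025, 16,000, and 22,050 Hz; the size is rounded down to a whole number of samples.

```python
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono
BUFFERS_PER_SECOND = 16   # each buffer may hold at most 1/16th of a second


def max_dma_buffer_bytes(sample_rate_hz: int) -> int:
    """Largest buffer (in bytes) that holds no more than 1/16th of a second."""
    frames = sample_rate_hz // BUFFERS_PER_SECOND  # round down to whole frames
    return frames * BYTES_PER_SAMPLE * CHANNELS


if __name__ == "__main__":
    for rate in (8000, 11025, 16000, 22050):
        print(f"{rate} Hz: at most {max_dma_buffer_bytes(rate)} bytes per buffer")
```

Smaller buffers than these are acceptable and, as noted above, improve speech recognition response time.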
Mixer
The sound card/device driver needs to support the mixer
architecture, described in the Windows DDK. The mixer should have:
| Volume should be on a linear scale from 0x0000 to 0xffff. Since most
mixer chips support decibels, the driver needs to convert between linear and logarithmic.
0xffff is -0.0 dB. 0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume
is 6 dB less. |
| Playback Line:
Master volume control - Set to a reasonable level by default.
Wave-out volume - Set to a reasonable level by default.
Mute ability.
(Optional) Mixer control so the user can hear what the microphone
hears. This can be turned on only if the user has headphones; otherwise a feedback loop
will be produced. It should be off by default in case the user plugs in speakers.
|
| Recording Line:
Microphone volume - Set to 0x4000 (-12 dB) by default.
Line-in volume - Some users will plug an amplified microphone signal
into their line-in. Set to 0x4000 (-12 dB) by default.
Mixer control or multiplexer control to select between microphone
and line-in. Set to the microphone by default.
If the volume is 0, mute the incoming audio.
|
| In general, when the user plugs in the microphone all defaults should
be set so "it just works". See below for the test application. |
| (Optional) If the wave-in device supports "low priority"
(see the Microsoft DDK) then it also needs to support a "Voice Commands" mixer
line with the same functionality as the "Recording Line". |
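The sketch below shows one way a driver might map the linear mixer volume onto a chip that attenuates in fixed decibel steps, including the rule that a volume of 0 mutes the input. The chip parameters (1.5 dB steps, 31 steps) and the function name are hypothetical and chosen only for illustration; real hardware will differ.

```python
import math

FULL_SCALE = 0xFFFF  # maximum linear volume; 0 dB
STEP_DB = 1.5        # hypothetical attenuation step size of the mixer chip
MAX_STEP = 31        # hypothetical step count: 31 * 1.5 dB = 46.5 dB of range


def volume_to_mixer(volume: int) -> tuple[bool, int]:
    """Return (mute, attenuation_step) for a linear volume 0x0000..0xFFFF."""
    if not 0 <= volume <= FULL_SCALE:
        raise ValueError("volume must be in the range 0x0000..0xFFFF")
    if volume == 0:
        return True, MAX_STEP  # volume 0 mutes the incoming audio
    attenuation_db = -20.0 * math.log10(volume / FULL_SCALE)
    step = min(MAX_STEP, round(attenuation_db / STEP_DB))  # clamp to chip range
    return False, step


if __name__ == "__main__":
    # 0x4000 is the default recording volume (-12 dB) recommended above.
    print(volume_to_mixer(0x4000))
```

With these assumed parameters the default of 0x4000 lands on step 8, which is 12 dB of attenuation.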
Microphone
The minimum microphone required is one that meets the following:
| Rests on the desktop, sits on top of the monitor, is built
into the monitor, or clips on to the user's shirt. Microphones built into the
keyboard have problems because they pick up keyboard clicks. Microphones should be
designed so that users place them no more than 18" from their mouth; the closer the
better. The best microphones are close-talk, ear-piece, or handset microphones. |
| Has basically the same frequency response as the Telex desktop or Nomad.
A significantly different frequency response will result in lower out-of-the-box
accuracy, since most engine vendors have built their speaker-independent recognition models
based upon the Telex microphone. |
| Is directional (cardioid or hypercardioid). Omni-directional
microphones pick up too much fan noise from the air-conditioning and CPU fan. |
| (Optional) Close-talk microphone. A close-talk microphone is worn on
the user's head and provides much better speech recognition accuracy than a desktop
microphone because the microphone element is close to the user's mouth. When bundling a
close-talk microphone, look for the following:
It should be near-field so it works better in noisy environments.
It should be comfortable enough to wear for long periods of time.
It should be easy to put on and take off.
It should not flop or move around when the user moves his or her head.
The cord should be long enough so the user can lean back in his or her chair.
|
CPU Housing
The CPU housing should be designed as follows:
| Users should be able to plug in a Telex Nomad or Telex desktop microphone
directly, without the need for adapters or amplifiers. Some sound cards require adapters
right now, causing users unnecessary confusion. |
| Clearly label the jacks that the microphone and headphones are plugged into.
The jacks might even be on the front of the CPU so they're easily accessed. |
| Use good connectors. Bad ones cause noise or static when the microphone
is tugged slightly. |
| (Optional) Ultra-quiet fans reduce the ambient noise in the room and
improve accuracy. |
| (Optional) The CPU or monitor should come with built-in sound so the
user doesn't have to do extra work to plug in speakers. |
VU Test Application
The VU test application displays a VU meter and shows the latest
peak signal level in dB.
If a wave-in device is working well, the test application will
produce the following results on a clean machine (one with a new install of the sound
driver). These tests assume that a Telex Nomad microphone is worn, or that a Telex desktop
microphone is about 12" away from the speaker's mouth, and that the speaker is talking at
a comfortable volume:
| First of all, the VU meter applet must start up, and its VU meter
must show that it hears the user. |
| The volume control slider must appear. If it does not, the speech recognizer
cannot control the volume of the source. Moving the slider up or down controls the volume
of the microphone. |
| When the user is quiet, the peak level will be no more than -33 dB.
For dictation to work well it needs to be around -45 dB. (The volume is assumed to be set
to 0x4000). |
| When the user speaks, the peak level is between -9 dB and -3 dB. It
should not clip. (The volume is assumed to be set to 0x4000.) |
| (Optional) A tester can bring up Sound Recorder (or some other
application) and play a digital audio sound while the VU meter is running. In a
low-priority system the VU audio will temporarily pause while the sound is being played.
In a full-duplex system recording will continue uninterrupted while the sound is played.
|
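The peak-level checks above can be sketched in code. This is not the VU test application itself: it assumes that the meter reports the peak of 16-bit PCM samples in dB relative to full scale (32768), and the function names are hypothetical. The thresholds are the ones listed above.

```python
import math

FULL_SCALE = 32768  # magnitude of the largest 16-bit sample (assumed 0 dB reference)


def peak_db(samples) -> float:
    """Peak level of 16-bit PCM samples in dB relative to full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # digital silence
    return 20.0 * math.log10(peak / FULL_SCALE)


def quiet_ok(samples, dictation: bool = False) -> bool:
    """Ambient noise check: at most -33 dB, or about -45 dB for dictation."""
    limit = -45.0 if dictation else -33.0
    return peak_db(samples) <= limit


def speech_ok(samples) -> bool:
    """Speech check: peak between -9 dB and -3 dB, so it does not clip."""
    return -9.0 <= peak_db(samples) <= -3.0


if __name__ == "__main__":
    speech = [0, 12000, -16384, 9000]
    print(f"speech peak: {peak_db(speech):.1f} dB, ok: {speech_ok(speech)}")
```

Both checks assume the recording volume is set to 0x4000, as the test conditions above state.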
Voice-Modem Wave-in/out Functionality
This section describes the functionality that a voice-modem sound
device should have to produce good speech recognition and text-to-speech.
TAPI
The Voice Modem should support TAPI and allow the application to
acquire the wave in/out device through TAPI.
Wave Out
For text-to-speech to work well on a voice modem, the voice modem's
DAC should support:
| 16-bit, mono, 8 kHz PCM needs to be supported by the
DAC. |
| Volume control for the digital output also needs to be supported.
Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support
decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB.
0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
|
| Response time for opening, pausing, resuming, and closing the
wave-out device should be very quick. Voice modems that use a serial connection do not
provide fast enough response time. |
| Overhead for playing PCM data should be low. Voice modems that send
compressed digital audio over the serial connection are too slow. |
| The DMA ping-pong buffers should be less than or equal to 1/16th of a
second apiece to ensure fast response times for starting and stopping playback. |
Wave In
For speech recognition to work well, the voice modem's wave-in
driver should support:
| 16-bit, mono, 8 kHz PCM needs to be supported by the
ADC. |
| Volume control for the digital record also needs to be supported.
Volume should be on a linear scale from 0x0000 to 0xffff. Since most mixer chips support
decibels, the driver needs to convert between linear and logarithmic. 0xffff is -0.0 dB.
0x8000 is -6.0 dB. 0x4000 is -12.0 dB. Etc. Every halving of the volume is 6 dB less.
|
| Response time for opening and closing the wave-in device should be
very quick. Voice modems that use a serial connection do not provide fast enough response
time. |
| Overhead for recording PCM data should be low. Voice modems that send
compressed digital audio over the serial connection are too slow. |
| The DMA ping-pong buffers should be less than or equal to 1/16th of a
second apiece to ensure fast response times for starting and stopping recording. Smaller
DMA buffers also improve speech recognition response time. |
| If the user speaks too loudly and clipping occurs, make sure that the
signal is distorted as little as possible. |
| Watch out for 60 Hz hum. |
| (Optional) The wave-in and wave-out should be "full duplex"
so it's possible to record and play at the same time. If the system is full duplex,
it's necessary for echo-canceling chipsets to automatically eliminate the output
waveform from the input waveform. This ensures that the speech recognizer doesn't
listen to audio being played out. The full-duplex and echo-canceling features allow the
user to interrupt the computer while the computer is speaking a prompt. |