speech perception and technology – Flashcards

question
Acoustic analysis shows that acoustic features are characteristic of certain categories of sounds (e.g. burst for stops, high F2 for front vowels, etc.) •However, just because these regularities exist doesn't mean that they are all used by listeners. •Acoustic cues are acoustic features that are used by listeners to perceive speech. •Researchers use many techniques to discover what features listeners use
answer
acoustic cues
question
Vowels: F1 and F2 values are the most important cues to vowel identity. -Listeners also use F3, F0, duration, formant transitions, and slight changes in formants over time. •Diphthongs: Listeners use rate and direction of change in formants more than exact values. •Formant values are important to listeners, but they vary by gender, context, and speech rate, so relationships between formants and context, not just absolute values, are important
answer
acoustic cues for vowels
question
Glides: listeners use target formant frequency and presence of rapid change in formant frequency. -Change is more rapid than in diphthongs. •Liquids: listeners use F3 value and rapid change of F3 to distinguish /r/ from /l/. •Nasals: listeners use formant transitions to determine place of articulation of nasals; they use presence of strong F1 and weaker F2 and F3 to distinguish nasals from other resonants
answer
cues for resonant consonants
question
Manner of articulation: listeners use the presence of silence or near-silence and transients to distinguish stops from other consonants (transients often missing in syllable-final stops, however) •Place of articulation: Listeners use formant transitions and the frequency of strongest energy in the transient (if present) to determine place of articulation of stops. •Initial stop voicing: listeners use the time between the transient and the vowel onset and presence/absence of aspiration to determine voicing of initial stops. •Final stop voicing: Listeners use presence/absence of voicing during closure, vowel duration, and stop closure duration to determine voicing of final stops
answer
cues for stop consonants
question
Vowels are longer before voiced stops. •Context-related because the neighboring consonant alters vowel duration
answer
a context related duration difference
question
Stop closure duration is longer for voiceless syllable-final stops than for voiced ones. •Intrinsic because the phoneme itself causes the change
answer
intrinsic duration difference
question
Manner of articulation: listeners use the presence of continuous aperiodic (noise) energy to distinguish fricatives from other consonants. •Place of articulation: listeners use intensity, frequency of strongest energy and concentration/diffuseness of the noise, as well as formant transitions to determine fricative place of articulation. •Voicing: listeners use fricative duration and presence/absence of periodicity to determine voicing
answer
cues for fricatives
question
Affricates: listeners use all the cues for both stops and fricatives to determine affricate place, manner and voicing. •Context: listeners use knowledge of context to adjust the way that they use acoustic cues. -Example: a listener's decision boundary for deciding whether a fricative was /s/ or /ʃ/ may be lower in frequency when the phoneme precedes /u/ than when it precedes /i/
answer
affricate cues and context
question
Multiple acoustic cues are available for each phoneme. There is no single acoustic "feature" that absolutely identifies a given phoneme. •Listeners know about context effects and are able to adjust their "boundaries" between phonemes accordingly. •Together, these facts mean that speech perception is complicated. But it also means that speech perception is remarkably robust. •When one cue is missing or masked, listeners are able to use what is available
answer
summary of acoustic cues
question
Researchers use signal editing to remove or add portions of sounds. -Examples: remove formant transition, burst, or change duration of silence. •Researchers use speech synthesis to change acoustic features of sounds: can vary many features (F0, formants, etc.) as desired, but labor intensive. -Synthetic continuum: Acoustic properties are varied in steps from target value for one phoneme to target value for another phoneme
answer
studying speech perception
question
Goal: to understand what speech cues listeners use and whether different groups use them differently. -Understand how language is processed/learned by normal listeners. -Understand differences/disorders. -Create better training methods. •How to study? -Type of experiment (identification or discrimination) -Type of stimuli (synthesized or natural)
answer
speech perception experiments
question
Observation: formant transitions (e.g., from /d/ to /ɑ/) vary depending on place of articulation of stops. •Question: Do listeners use formant transitions to identify stop place? •Stimuli: a continuum of speech sounds is created using speech synthesis, with a range of F2 formant transitions from those appropriate for /bɑ/ (steeply rising) to those appropriate for /gɑ/ (steeply falling). -Each new transition starts at a slightly higher value but ends at same value (the F2 for /ɑ/). -Important: the acoustic difference is the same for each "step"
answer
consonant place of articulation experiment
question
Researchers synthesize speech sounds to make many steps from /ba/ to /da/ to /ga/ using only changes in F2 (everything else is the same). •For example, make 20 stimuli: in the first the F2 rises by about 250 Hz (from 900 to about 1150 Hz); the next would have F2 start about 25 Hz higher (925 to 1150), and so on; eventually flat and then falling
answer
stimuli for stop place of articulation experiments
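The arithmetic on this card can be sketched in a few lines of Python. This is only an illustration of the numbers given above (20 stimuli, first F2 onset at 900 Hz, 25 Hz steps, common endpoint of 1150 Hz); the function and variable names are hypothetical and it does not synthesize any audio.

```python
def f2_continuum(n_steps=20, first_onset_hz=900.0, step_hz=25.0, target_hz=1150.0):
    """Return one (onset, target, change) tuple per stimulus on the continuum."""
    stimuli = []
    for i in range(n_steps):
        onset = first_onset_hz + i * step_hz  # each stimulus starts 25 Hz higher
        stimuli.append((onset, target_hz, target_hz - onset))
    return stimuli


def describe(change_hz):
    """Label the direction of the F2 transition."""
    if change_hz > 0:
        return "rising"
    if change_hz < 0:
        return "falling"
    return "flat"


continuum = f2_continuum()
for number, (onset, target, change) in enumerate(continuum, start=1):
    print(f"stimulus {number:2d}: F2 {onset:.0f} -> {target:.0f} Hz ({describe(change)})")
```

Every adjacent pair differs by the same 25 Hz at onset, which is the point of the design: the physical step size is constant along the whole continuum.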
question
Category boundary: where the function of identification is at 50%. •Discrimination: play pairs of stimuli. Some two steps apart; some exactly the same. Task is to say same or different. -Typical results: good discrimination only when two stimuli identified as different phonemes (either side of category boundary). (pg. 176). •This type of result is baffling because physical differences are equal across all steps
answer
place of articulation experiment
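As a rough sketch of how a category boundary could be read off an identification function, the Python below linearly interpolates to the 50% point. The percentages are made-up illustrative data, not results from the book, and linear interpolation is just one simple choice; the function names are hypothetical.

```python
def category_boundary(percent_first_label):
    """Return the (fractional) step number where identification crosses 50%.

    percent_first_label[i] is the percentage of trials on which step i + 1
    was labeled as the first category (e.g., /b/).
    """
    for i in range(len(percent_first_label) - 1):
        a, b = percent_first_label[i], percent_first_label[i + 1]
        # A crossing lies between two steps on opposite sides of 50%.
        if (a - 50.0) * (b - 50.0) <= 0 and a != b:
            return (i + 1) + (a - 50.0) / (a - b)
    return None  # the function never crosses 50%


def pair_straddles_boundary(step_a, step_b, boundary):
    """True if the two stimuli fall on either side of the category boundary."""
    return min(step_a, step_b) < boundary < max(step_a, step_b)


# Hypothetical identification data for a 10-step continuum (percent "/b/").
percent_b = [100, 100, 98, 95, 80, 30, 5, 2, 0, 0]
boundary = category_boundary(percent_b)
print(f"category boundary near step {boundary:.1f}")
print(pair_straddles_boundary(5, 7, boundary))  # pair expected to be discriminated well
print(pair_straddles_boundary(1, 3, boundary))  # pair expected to sound the same
```

Under categorical perception, only pairs for which `pair_straddles_boundary` is true would be expected to show good discrimination, even though every two-step pair has the same physical difference.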
question
In each pair of sounds, the second formant starts at a higher frequency (by about 150 Hz) in the first sound in the pair than in the second sound in the pair. The sounds are completely the same except for the F2 transition. *The amount of difference between the sound pairs is the same but in one case it makes them sound different and in the other they sound the same
answer
same different examples
question
The combination of steep identification functions AND good discrimination only at category boundaries was labeled categorical perception by researchers. -See Figures 11.1-11.2 in book
answer
categorical perception
question
Create a continuum by varying both F1 and F2 to go from // to // to //. -Same type of identification and discrimination tasks as for consonants, but very different results. -Identification function NOT steeply sloping. -Fairly good discrimination across the whole continuum. -NOT categorical
answer
vowel experiment
question
Researchers were baffled by categorical perception. Needed a way to explain it. -Why do consonants seem to be processed one way and vowels another? -Explained it by looking at articulation: continuous changes in vowel articulation possible but not so much in consonant articulation. •Motor theory: we identify phonemes through access to the underlying motor gestures that produced them, not directly through acoustic features (innate and special for humans)
answer
motor theory of speech perception
question
Support: infants apparently born with categorical perception for some sounds. -Supports the idea that categorical perception is a "special" (innate) skill. •Support: sort of avoids problem of variability (gestures are consistent?). -No intermediate /d/-/g/ perception because no intermediate articulatory gesture (in English) •Problem: some non-human animals seem to have categorical perception. -So is it "special" for humans after all?
answer
support and problems for motor theory
question
Auditory theory: patterns of speech perception can be explained based on auditory sensitivities to particular acoustic patterns or features. •Hypothesizes that listeners try to match sounds heard to an internal "acoustic template" (innate or learned). •Support: explains animal data (similar auditory system) •Problem: a lot of acoustic variation to explain
answer
alternative: auditory theories
question
Slit vs. Split: two cues for /p/: period of silence and rising formant transitions. -Silence is more important because if silence is long enough, you don't need the transitions. -BUT it takes a shorter duration of silence to perceive /p/ when formants are rising than when flat. -In each case, the silent "gap" between /s/ and the voiced portion increases from 0 to 130 ms in 10-ms steps. At what step does your label change? In the previous slide, people hear the stimulus on the left as "split" more often than the one on the right because of the contribution of the rising formant transition
answer
cue integration
question
The phenomenon in which the presence of a second cue may offset a deficiency in a primary cue. -Takes a shorter "gap" (primary) to hear "split" when rising transition (secondary) also present. •Because we don't hear the effect consciously, we say that the two cues are perceptually integrated
answer
phonetic trading relationship
question
Integration of speech cues isn't only auditory. We also integrate visual information about speech (lip cues) unconsciously and seamlessly into perception of a phoneme. •In the McGurk effect, conflicting visual (lip) and auditory (sound) cues fuse to create perception of a third category entirely
answer
multimodal integration
question
The visual stimulus showed you /g/ (what was produced in the video portion). -The audio of the /g/ was removed and replaced by the audio portion of a /b/. -When you watch and listen, you hear /d/, which is not the same as EITHER of the inputs you received (audio or video). -Perceptually, you've fused (or integrated) the cues into a sound that is intermediate between the two things that you received. -Without the video, you should perceive /b/
answer
the McGurk effect
question
How do you know what babies can discriminate before they can talk? -High amplitude sucking procedure: infants suck more strongly when interested; wait until they get "bored," then change the stimulus and see if they notice. -Infant head turn procedure: monkey in the box. Following a change, monkey lights up. Infant will learn to look for the monkey when there is a change. Infants as young as a few days can discriminate most speech sounds (minimal pairs). -Infants can discriminate speech sounds that adults cannot (not in the "native" language). -Infants lose the ability to detect "non-native" differences between 6 and 12 months of age. -By 9-12 months of age, infants have a preference for stress patterns of their native language and pauses that come at correct syntactic boundaries
answer
infant speech perception
question
By about age 3 or so, children can respond to tasks similar to those for adults, but with pictures and a small number of trials. •What do we know? -Children have more trouble than adults in using incomplete speech cues (e.g., part of a formant transition). -Children don't use speech cues the same way as adults (focus primary attention on different cues, e.g., fricative energy vs. formant transition). -Children with phonological disorders need more information than typically developing (TD) peers; phonological disorders involve phonetic perception differences too
answer
speech perception development
question
computer processing of speech
answer
speech processing technology
question
applications of speech processing (programs and devices for many purposes)
answer
speech technology
question
producing intelligible speech via commands to a machine. -Formant synthesis or concatenative
answer
speech synthesis
question
identifying phonemes or words via machine.
answer
automatic speech recognition
question
Purpose: (typically) take written text and convert to speech that is easily recognized by listener. -Text-to-speech (TTS). Major Tasks •morphology, syntax & prosody (affect how words are spoken: stress, phrasing, etc.) •print to phonetic symbols (spelling rules) •phonetic symbols to acoustic productions (acoustic cues & coarticulation effects)
answer
speech synthesis
question
Formant synthesis by rule •Concatenative synthesis -Diphone based -Demisyllable based •Articulatory synthesis
answer
types of speech synthesis
question
Uses source-filter theory of speech production to create a digital sound source (buzz, hiss or combination) and filters that can be changed to produce the desired acoustic properties for any phoneme. -http://www.asel.udel.edu/speech/tutorials/synthesis/index.html -http://www.phonetics.ucla.edu/course/chapter8/speechbird/speechbird.html •By hand: user must decide on frequency of F0 and formants, duration of sounds, and the timing and abruptness of changes
answer
formant synthesis
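A minimal sketch of the source-filter idea in plain Python: an impulse train stands in for the "buzz" source, and a cascade of second-order digital resonators stands in for the formant filters. The resonator coefficients follow the standard two-pole form used in cascade formant synthesizers; the formant frequencies and bandwidths below are illustrative assumptions for a roughly /ɑ/-like steady vowel, and all function names are hypothetical. This is not the synthesizer used in the linked tutorials.

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Voiced source ("buzz"): one unit impulse per pitch period."""
    n_samples = round(duration_s * sample_rate)
    period = round(sample_rate / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Second-order resonator (one formant): y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.3, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Assumed, illustrative values: F0 of 120 Hz and three formants (Hz, bandwidth in Hz).
samples = synthesize_vowel(120.0, [(730.0, 60.0), (1090.0, 90.0), (2440.0, 150.0)])
print(len(samples), "samples;", "peak amplitude", max(abs(s) for s in samples))
```

Changing the formant list changes the vowel quality while the source stays the same, which is the "by hand" control described on the card. Writing the samples to a sound file is left out to keep the sketch short.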
question
Specify what formant frequency values should be (either unchanging or must specify each point in time). •Source is created, goes through filters, output is a file. Create a new file for each stimulus
answer
using synthetic speech for research
question
"Simple" vowel synthesis: create an "isolated" vowel (no change in frequency over time). -Control F0, F1, F2 and F3. Must use female-typical formant values for higher F0 to sound good. •General interface: use a "script" to specify changing formant and F0 values over time. -Can create formant transitions to create stops. •"Extended" interface: control many parameters (voice quality, frication, bandwidths of filters, etc.)
answer
synthesis examples
question
The parameter settings (F0, formant frequencies, voicing, duration, etc.) for each speech sound are written into a computer program. -Rules needed for individual phonemes. -Many sounds must have different rules for different syllable positions. -Rules for phoneme to phoneme transitions (coarticulation/assimilation, formant transitions) -Example: "saw me" (book). /a/ to /m/; /m/ to /i/
answer
formant synthesis by rule
question
Small storage needs (1 computer program) for any number of voices (pro) •Requires a lot of background knowledge (con) -Must develop rules for each phoneme and transition to every other possible neighboring phoneme -Must develop rules for prosody, phrasing, stress rules, etc. •Has become better over the years: http://sal.shs.arizona.edu/~asaspeechcom/Contents.html •You can speed it up a lot and still understand it (pro)
answer
formant synthesis by rule: pros and cons
question
Uses natural speech segmented at areas of "less variability" •Diphone-based: phoneme center to phoneme center (see demo) -Example: "tags" vs. "tugs." Minimal pair; each word has four phonemes. But each has five diphones and they share only three. •Demisyllable-based: syllable onset to nucleus or nucleus to end (see demo) -"simpler" vs. "sampler": each has two syllables and four demisyllables (two shared)
answer
concatenative synthesis
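The "tags" vs. "tugs" count can be checked with a short Python sketch. It assumes the transcriptions /tægz/ and /tʌgz/ and uses "#" as a hypothetical symbol for the silence at each word edge; a diphone is then each pair of adjacent units.

```python
def diphones(phonemes):
    """Split a phoneme sequence into diphones, padding with silence ("#")."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


tags = diphones(["t", "æ", "g", "z"])
tugs = diphones(["t", "ʌ", "g", "z"])
shared = [unit for unit in tags if unit in tugs]

print("tags:", tags)
print("tugs:", tugs)
print("shared:", shared)
```

Four phonemes give five diphones once the word edges are included, and only the units that do not touch the vowel are shared between the two words.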
question
Greater storage need: must store every possible two-phoneme combination as a separate file for each voice used (con). •Prosody may be too unvarying and there may be breaks where joining is poor (con).-To get better prosody, need new set of files for each variation (may be back to rules) •Hard to speed up appropriately (con). •Relatively easy to create (pro). Requires much less knowledge of speech acoustics
answer
concatenative synthesis pros and cons
question
Alternative approach to formant synthesis. •Parameters are based on acoustic consequences of articulatory positions. •Impossible combinations are not allowed, unlike for formant synthesis (pro) •Not enough knowledge about articulatory to acoustic mapping for it to work well yet (need more imaging of tongue and vocal tract and study of acoustic relations) (con)
answer
articulatory synthesis
question
Speech synthesis applications •AAC (augmentative & alternative communication) for speech impaired (autistic, dysarthric, etc.); many types •Screen readers for the blind: formant synthesis preferred because users like rates up to 600 wpm •Voice response systems (phone/car/etc.) •Other automated/repetitive tasks -Example: weather reader •Interactive toys (Speak-n-Spell) •Create stimuli for speech research
answer
speech synthesis applications
question
Use of computer program to take acoustic input and identify words/phonemes. -Different from speech understanding •Major steps -Digitize speech input -Identify acoustic features in input (may correspond to different phonemes). -Select word with most matching features. •Requires all our knowledge about how listeners identify speech sounds. Still not as good as human listeners
answer
automatic speech recognition
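The last of the major steps, selecting the word with the most matching features, can be illustrated with a deliberately tiny Python sketch. The lexicon, the feature names, and the scoring rule are all hypothetical, and the digitizing and feature-extraction steps are assumed to have already happened; real recognizers are far more elaborate.

```python
# Hypothetical lexicon: each word is listed with the acoustic features expected for it.
LEXICON = {
    "see": {"fricative_noise", "high_F2"},
    "tea": {"silence", "burst", "high_F2"},
    "two": {"silence", "burst", "low_F2"},
}


def recognize(observed_features, lexicon=LEXICON):
    """Return the word whose expected features overlap most with the input."""
    def score(word):
        return len(lexicon[word] & observed_features)
    return max(lexicon, key=score)


print(recognize({"silence", "burst", "low_F2"}))  # all cues present
print(recognize({"silence", "low_F2"}))           # burst masked by noise
```

Because the score counts whichever features happen to be present, the toy matcher still picks the same word when one cue is missing, loosely mirroring the robustness described for human listeners.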
question
Easier if words are separated slightly so that system knows where they are (human listener doesn't need this). •Variability is a big challenge: must recognize "same" sounds in different contexts/different talkers as the same. •Background noise is a much bigger problem than for human listeners. •Typically needs training on new talkers / words (depends on type of system)
answer
speech recognizer issues
question
Degree of word segmenting -Isolated: words must be separated by 500 ms or more -Connected: words must be separated by only short pauses (Dragon system) -Continuous: no pauses needed; accepts normal conversational speech. •Vocabulary size -Small (200 words or less) -Large (200-1000 words) -Very large: > 1000; up to 20,000 words •Speaker requirements -Speaker dependent: needs to be trained for each new talker. -Speaker independent: can recognize any talker (constrained usually by dialect, voice quality). Much harder, esp. for high accuracy •Type of system created depends on needs (can't do all the hard things together; see pg. 339), likely user population, and how important accuracy is
answer
approaches to speech recognition
question
Voice response systems: phone system (needs speaker independent, but small vocab. often okay); menu systems. •Speech to text: typing by voice for mobility challenged or to avoid overuse injuries (Dragon NaturallySpeaking). Also for hearing impaired. •Computer-Based Speech Training Aids: decides whether talker achieved goal or not
answer
speech recognition application