speech perception and technology – Flashcards
question
            Acoustic analysis shows that acoustic features are characteristic of certain categories of sounds (e.g. burst for stops, high F2 for front vowels, etc.) •However, just because these regularities exist doesn't mean that they are all used by listeners. •Acoustic cues are acoustic features that are used by listeners to perceive speech. •Researchers use many techniques to discover what features listeners use
answer
        acoustic cues
question
            Vowels: F1 and F2 values are the most important cues to vowel identity. -Listeners also use F3, F0, duration, formant transitions, and slight changes in formants over time. •Diphthongs: listeners use rate and direction of change in formants more than exact values. •Formant values are important to listeners, but vary by gender, context, and speech rate; relationships between formants and context, not just absolute values, are important
answer
        acoustic cues for vowels
question
            Glides: listeners use target formant frequency and the presence of rapid change in formant frequency. -Change is more rapid than in diphthongs. •Liquids: listeners use F3 value and rapid change of F3 to distinguish /r/ from /l/. •Nasals: listeners use formant transitions to determine place of articulation of nasals; they use the presence of a strong F1 and weaker F2 and F3 to distinguish nasals from other resonants
answer
        cues for resonant consonants
question
            Manner of articulation: listeners use the presence of silence or near-silence and transients to distinguish stops from other consonants (transients are often missing in syllable-final stops, however). •Place of articulation: listeners use formant transitions and the frequency of strongest energy in the transient (if present) to determine place of articulation of stops. •Initial stop voicing: listeners use the time between the transient and the vowel onset and presence/absence of aspiration to determine voicing of initial stops. •Final stop voicing: listeners use presence/absence of voicing during closure, vowel duration, and stop closure duration to determine voicing of final stops
answer
        cues for stop consonants
question
            Vowels are longer before voiced stops. •Context-related because the neighboring consonant alters vowel duration
answer
        a context related duration difference
question
            Stop closure duration is longer for voiceless syllable-final stops than for voiced ones. •Intrinsic because the phoneme itself causes the change
answer
        intrinsic duration difference
question
            Manner of articulation: listeners use the presence of continuous aperiodic (noise) energy to distinguish fricatives from other consonants. •Place of articulation: listeners use intensity, frequency of strongest energy, and concentration/diffuseness of the noise, as well as formant transitions, to determine fricative place of articulation. •Voicing: listeners use fricative duration and presence/absence of periodicity to determine voicing
answer
        cues for fricatives
question
            Affricates: listeners use all the cues for both stops and fricatives to determine affricate place, manner, and voicing. •Context: listeners use knowledge of context to adjust the way that they use acoustic cues. -Example: a listener's decision boundary for deciding whether a fricative was /s/ or /ʃ/ may be lower in frequency when the phoneme precedes /u/ than when it precedes /i/
answer
        affricate cues and context
question
            Multiple acoustic cues are available for each phoneme. There is no single acoustic "feature" that absolutely identifies a given phoneme. •Listeners know about context effects and are able to adjust their "boundaries" between phonemes accordingly. •Together, these facts mean that speech perception is complicated. But they also mean that speech perception is remarkably robust. •When one cue is missing or masked, listeners are able to use what is available
answer
        summary of acoustic cues
question
            Researchers use signal editing to remove or add portions of sounds. -Examples: remove a formant transition or burst, or change the duration of silence. •Researchers use speech synthesis to change acoustic features of sounds: can vary many features (F0, formants, etc.) as desired, but labor intensive. -Synthetic continuum: acoustic properties are varied in steps from the target value for one phoneme to the target value for another phoneme
answer
        studying speech perception
question
            Goal: to understand what speech cues listeners use and whether different groups use them differently. -Understand how language is processed/learned by normal listeners. -Understand differences/disorders. -Create better training methods. •How to study? -Type of experiment (identification or discrimination). -Type of stimuli (synthesized or natural)
answer
        speech perception experiments
question
            Observation: formant transitions (e.g., from /d/ to /ɑ/) vary depending on place of articulation of stops. •Question: do listeners use formant transitions to identify stop place? •Stimuli: a continuum of speech sounds is created using speech synthesis, with a range of F2 formant transitions from those appropriate for /bɑ/ (steeply rising) to those appropriate for /gɑ/ (steeply falling). -Each new transition starts at a slightly higher value but ends at the same value (the F2 for /ɑ/). -Important: the acoustic difference is the same for each "step."
answer
        consonant place of articulation experiment
question
            Researchers synthesize speech sounds to make many steps from /ba/ to /da/ to /ga/ using only changes in F2 (everything else is the same). •For example, make 20 stimuli: in the first the F2 rises by about 250 Hz (from 900 to about 1150 Hz); the next would have F2 start about 25 Hz higher (925 to 1150), and so on; eventually flat and then falling
answer
        stimuli for stop place of articulation experiments
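The stepped continuum described above can be sketched in a few lines of Python; the numbers (20 steps, 900 Hz starting onset, 25-Hz increments, 1150 Hz /a/ target) are the lecture's approximate figures:

```python
F2_TARGET = 1150  # Hz, the F2 of the following vowel /a/

def f2_continuum(n_steps=20, start=900, step=25, target=F2_TARGET):
    """Return (onset, offset) F2 pairs for each synthetic stimulus."""
    return [(start + i * step, target) for i in range(n_steps)]

stimuli = f2_continuum()
rising = [s for s in stimuli if s[0] < F2_TARGET]    # /b/-like transitions
falling = [s for s in stimuli if s[0] > F2_TARGET]   # /g/-like transitions
```

Note that the acoustic step size (25 Hz) is identical everywhere on the continuum, which is what makes the perceptual results in the next card surprising.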
question
            Category boundary: where the identification function is at 50%. •Discrimination: play pairs of stimuli, some two steps apart, some exactly the same. The task is to say "same" or "different." -Typical results: good discrimination only when the two stimuli are identified as different phonemes (on either side of the category boundary) (pg. 176). •This type of result is baffling because the physical differences are equal across all steps
answer
        place of articulation experiment
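One way to estimate the category boundary is to interpolate where the identification function crosses 50%. A minimal sketch, using made-up illustrative percentages rather than data from any actual study:

```python
def category_boundary(steps, percent_ga):
    """Interpolate the continuum step at which /g/ responses cross 50%."""
    for i in range(len(steps) - 1):
        p1, p2 = percent_ga[i], percent_ga[i + 1]
        if p1 < 50 <= p2:
            frac = (50 - p1) / (p2 - p1)
            return steps[i] + frac * (steps[i + 1] - steps[i])
    return None

# Hypothetical identification data: % of trials labeled /ga/ at each step.
boundary = category_boundary([1, 2, 3, 4, 5, 6, 7],
                             [2, 5, 10, 40, 90, 97, 99])
```

The steep jump between adjacent steps (40% to 90% here) is what a "steep identification function" looks like numerically.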
question
            In each pair of sounds, the second formant starts at a higher frequency (by about 150 Hz) in the first sound in the pair than in the second sound in the pair. The sounds are completely the same except for the F2 transition. *The amount of difference between the sound pairs is the same but in one case it makes them sound different and in the other they sound the same
answer
        same different examples
question
            The combination of steep identification functions AND good discrimination only at category boundaries was labeled categorical perception by researchers. -See Figures 11.1-11.2 in the book
answer
        categorical perception
question
            Create a continuum by varying both F1 and F2 to go from // to // to //. -Same type of identification and discrimination tasks as for consonants, but very different results: -Identification function is NOT steeply sloping. -Fairly good discrimination across the whole continuum. -NOT categorical
answer
        vowel experiment
question
            Researchers were baffled by categorical perception and needed a way to explain it. -Why do consonants seem to be processed one way and vowels another? -Explained it by looking at articulation: continuous changes in vowel articulation are possible, but not so much in consonant articulation. •Motor theory: we identify phonemes through access to the underlying motor gestures that produced them, not directly through acoustic features (innate and special for humans)
answer
        motor theory of speech perception
question
            Support: infants are apparently born with categorical perception for some sounds. -Supports the idea that categorical perception is a "special" (innate) skill. •Support: sort of avoids the problem of variability (gestures are consistent?). -No intermediate /d/-/g/ perception because there is no intermediate articulatory gesture (in English). •Problem: some non-human animals seem to have categorical perception. -So is it "special" for humans after all?
answer
        support and problems for motor theory
question
            Auditory theory: patterns of speech perception can be explained based on auditory sensitivities to particular acoustic patterns or features. •Hypothesizes that listeners try to match sounds heard to an internal "acoustic template" (innate or learned). •Support: explains the animal data (similar auditory system). •Problem: a lot of acoustic variation to explain
answer
        alternative: auditory theories
question
            Slit vs. split: two cues for /p/ -- a period of silence and rising formant transitions. -Silence is more important because if the silence is long enough, you don't need the transitions. -BUT it takes a shorter duration of silence to perceive /p/ when the formants are rising than when they are flat. -In each case, the silent "gap" between /s/ and the voiced portion increases from 0 to 130 ms in 10-ms steps. At what step does your label change? People hear the stimulus with rising formant transitions as "split" more often than the one with flat transitions because of the contribution of the rising transition
answer
        cue integration
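The gap series above is easy to reproduce; the two "split" thresholds below are assumed values purely to illustrate the trading relation (the lecture does not give exact thresholds):

```python
# Gap continuum from the demonstration: 0-130 ms of silence in 10-ms steps.
gaps_ms = list(range(0, 131, 10))        # 14 stimuli per series

# Hypothetical category boundaries, for illustration only: a shorter gap
# suffices for "split" when rising transitions (the secondary cue) are present.
threshold_rising_ms = 50                 # assumed
threshold_flat_ms = 80                   # assumed

labels_rising = ["split" if g >= threshold_rising_ms else "slit" for g in gaps_ms]
labels_flat = ["split" if g >= threshold_flat_ms else "slit" for g in gaps_ms]
```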
question
            the phenomenon in which the presence of a second cue may offset a deficiency in a primary cue. -Takes a shorter "gap" (primary) to hear "split" when rising transition (secondary) also present. •Because we don't hear the effect consciously, we say that the two cues are perceptually integrated
answer
        phonetic trading relationship
question
            Integration of speech cues isn't only auditory. We also integrate visual information about speech (lip cues) unconsciously and seamlessly into the perception of a phoneme. •In the McGurk effect, conflicting visual (lip) and auditory (sound) cues fuse to create the perception of a third category entirely
answer
        multimodal integration
question
            The visual stimulus showed you /g/ (what was produced in the video portion). -The audio of the /g/ was removed and replaced by the audio portion of a /b/. -When you watch and listen, you hear /d/, which is not the same as EITHER of the inputs you received (audio or video). -Perceptually, you've fused (or integrated) the cues into a sound that is intermediate between the two things that you received. -Without the video, you should perceive /b/
answer
        the McGurk effect
question
            How do you know what babies can discriminate before they can talk? -High-amplitude sucking procedure: infants suck more strongly when interested; wait until they get "bored," then change the stimulus and see if they notice. -Infant head-turn procedure: a monkey in a box lights up following a change, so the infant learns to look for the monkey when there is a change. -Infants as young as a few days can discriminate most speech sounds (minimal pairs). -Infants can discriminate speech sounds that adults cannot (not in the "native" language). -Infants lose the ability to detect "non-native" differences between 6 and 12 months of age. -By 9-12 months of age, infants have a preference for the stress patterns of their native language and for pauses that come at correct syntactic boundaries
answer
        infant speech perception
question
            By about age 3 or so, children can respond to tasks similar to those for adults, but with pictures and a small number of trials. •What do we know? -Children have more trouble than adults in using incomplete speech cues (e.g., part of a formant transition). -Children don't use speech cues the same way as adults (they focus primary attention on different cues -- e.g., fricative energy vs. formant transition). -Children with phonological disorders need more information than TD peers; phonological disorders involve phonetic perception differences too
answer
        speech perception development
question
            computer processing of speech
answer
        speech processing technology
question
            applications of speech processing (programs and devices for many purposes)
answer
        speech technology
question
            producing intelligible speech via commands to a machine. -Formant synthesis or concatenative
answer
        speech synthesis
question
            identifying phonemes or words via machine.
answer
        automatic speech recognition
question
            Purpose: (typically) take written text and convert it to speech that is easily recognized by a listener. -Text-to-speech (TTS). •Major tasks: -morphology, syntax & prosody (affect how words are spoken -- stress, phrasing, etc.) -print to phonetic symbols (spelling rules) -phonetic symbols to acoustic productions (acoustic cues & coarticulation effects)
answer
        speech synthesis
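The "print to phonetic symbols" step can be sketched as a greedy rule matcher. The rules below are a hypothetical toy set, nowhere near a real English spelling-rule inventory:

```python
# Toy spelling-to-phoneme rules (hypothetical, for illustration only):
# longest-match digraphs first, falling back to the letter itself.
RULES = [("sh", "ʃ"), ("ch", "tʃ"), ("ee", "i"), ("oo", "u"), ("th", "θ")]

def letters_to_phones(word):
    """Greedily convert spelling to phonetic symbols using RULES."""
    phones, i = [], 0
    while i < len(word):
        for spelling, phone in RULES:
            if word.startswith(spelling, i):
                phones.append(phone)
                i += len(spelling)
                break
        else:
            phones.append(word[i])   # no rule matched: keep the letter
            i += 1
    return phones
```

Real TTS systems combine rules like these with large exception dictionaries, since English spelling is far too irregular for rules alone.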
question
            Formant synthesis by rule •Concatenative synthesis -Diphone-based -Demisyllable-based •Articulatory synthesis
answer
        types of speech synthesis
question
            Uses the source-filter theory of speech production to create a digital sound source (buzz, hiss, or a combination) and filters that can be changed to produce the desired acoustic properties for any phoneme. -http://www.asel.udel.edu/speech/tutorials/synthesis/index.html -http://www.phonetics.ucla.edu/course/chapter8/speechbird/speechbird.html •By hand: the user must decide on the frequency of F0 and the formants, the duration of sounds, and the timing and abruptness of changes
answer
        formant synthesis
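A minimal source-filter sketch in this spirit: an impulse-train "buzz" source passed through a cascade of two-pole resonators, one per formant. The /ɑ/ formant values (730, 1090, 2440 Hz) are typical male figures, and the bandwidths are assumed round numbers:

```python
import math

def resonator(signal, freq, bw, fs):
    """Two-pole resonant filter: one formant with the given center
    frequency and bandwidth (standard digital-resonator recursion)."""
    r = math.exp(-math.pi * bw / fs)
    b = 2 * r * math.cos(2 * math.pi * freq / fs)
    c = -r * r
    a = 1 - b - c                     # rough gain normalization
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                     dur=0.3, fs=8000):
    """Impulse-train source filtered through a cascade of formant resonators."""
    n = int(dur * fs)
    period = int(fs / f0)             # samples per glottal pulse
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        source = resonator(source, freq, bw, fs)
    return source

samples = synthesize_vowel()          # steady /ɑ/-like vowel, 0.3 s at 8 kHz
```

Changing the formant frequency arguments moves the vowel around the vowel space, which is exactly the control a researcher exploits when building a synthetic continuum.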
question
            Specify what the formant frequency values should be (either unchanging, or specified at each point in time). •A source is created and goes through the filters; the output is a file. Create a new file for each stimulus
answer
        using synthetic speech for research
question
            "Simple" vowel synthesis: create an "isolated" vowel (no change in frequency over time). -Control F0, F1, F2, and F3. Must use female-typical formant values for a higher F0 to sound good. •General interface: use a "script" to specify changing formant and F0 values over time. -Can create formant transitions to create stops. •"Extended" interface: control many parameters (voice quality, frication, bandwidths of filters, etc.)
answer
        synthesis examples
question
            The parameter settings (F0, formant frequencies, voicing, duration, etc.) for each speech sound are written into a computer program. -Rules are needed for individual phonemes. -Many sounds must have different rules for different syllable positions. -Rules for phoneme-to-phoneme transitions (coarticulation/assimilation, formant transitions). -Example: "saw me" (book): /a/ to /m/; /m/ to /i/
answer
        formant synthesis by rule
question
            Small storage needs (one computer program) for any number of voices (pro). •Requires a lot of background knowledge (con). -Must develop rules for each phoneme and its transition to every other possible neighboring phoneme. -Must develop rules for prosody, phrasing, stress, etc. •Has become better over the years: http://sal.shs.arizona.edu/~asaspeechcom/Contents.html •You can speed it up a lot and still understand it (pro)
answer
        formant synthesis by rule: pros and cons
question
            Uses natural speech segmented at areas of "less variability." •Diphone-based: phoneme center to phoneme center (see demo). -Example: "tags" vs. "tugs." A minimal pair; each word has four phonemes, but each has five diphones and they share only three. •Demisyllable-based: syllable onset to nucleus, or nucleus to end (see demo). -"simpler" vs. "sampler": each has two syllables and four demisyllables (two shared)
answer
        concatenative synthesis
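The "tags" vs. "tugs" diphone count works out as follows; ARPAbet-style phoneme symbols are assumed here for readability:

```python
def diphones(phones):
    """Diphone units: center-to-center spans, including word-edge silence '#'."""
    seq = ["#"] + phones + ["#"]
    return [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]

tags = diphones(["t", "ae", "g", "z"])   # "tags": 4 phonemes, 5 diphones
tugs = diphones(["t", "ah", "g", "z"])   # "tugs": 4 phonemes, 5 diphones
shared = set(tags) & set(tugs)           # (#,t), (g,z), (z,#)
```

The count explains the storage cost mentioned in the next card: an inventory must hold every attested phoneme-to-phoneme span, not just every phoneme.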
question
            Greater storage need: must store every possible two-phoneme combination as a separate file for each voice used (con). •Prosody may be too unvarying, and there may be breaks where the joining is poor (con). -To get better prosody, you need a new set of files for each variation (and may be back to rules). •Hard to speed up appropriately (con). •Relatively easy to create (pro); requires much less knowledge of speech acoustics
answer
        concatenative synthesis pros and cons
question
            An alternative approach to formant synthesis. •Parameters are based on the acoustic consequences of articulatory positions. •Impossible combinations are not allowed, unlike in formant synthesis (pro). •Not enough knowledge about the articulatory-to-acoustic mapping for it to work well yet (need more imaging of the tongue and vocal tract, and more study of acoustic relations) (con)
answer
        articulatory synthesis
question
            Speech synthesis applications: •AAC (augmentative & alternative communication) for the speech impaired (autistic, dysarthric, etc.); many types •Screen readers for the blind -formant synthesis preferred because it remains intelligible at rates up to 600 wpm •Voice response systems (phone/car/etc.) •Other automated/repetitive tasks -Example: weather reader •Interactive toys (Speak & Spell) •Create stimuli for speech research
answer
        speech synthesis applications
question
            Use of a computer program to take acoustic input and identify words/phonemes. -Different from speech understanding. •Major steps: -Digitize the speech input. -Identify acoustic features in the input (which may correspond to different phonemes). -Select the word with the most matching features. •Requires all our knowledge about how listeners identify speech sounds. Still not as good as human listeners
answer
        automatic speech recognition
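The "select the word with the most matching features" step can be sketched as a best-overlap search; the lexicon entries and feature names below are hypothetical illustrations, not a real ASR feature set:

```python
# Hypothetical feature lexicon: each word maps to the acoustic features
# a recognizer might expect to detect for it.
LEXICON = {
    "bee": {"burst", "low_f2_onset", "high_front_vowel"},
    "dee": {"burst", "mid_f2_onset", "high_front_vowel"},
    "do":  {"burst", "mid_f2_onset", "back_rounded_vowel"},
}

def best_match(detected_features):
    """Return the lexicon word sharing the most features with the input."""
    return max(LEXICON, key=lambda word: len(LEXICON[word] & detected_features))

word = best_match({"burst", "mid_f2_onset", "high_front_vowel"})
```

Even this toy shows why variability is hard: if noise masks one detected feature, two lexicon words can tie, and the system has no human-like context to break the tie.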
question
            Easier if words are separated slightly so that the system knows where they are (a human listener doesn't need this). •Variability is a big challenge: the system must recognize "same" sounds in different contexts and from different talkers as the same. •Background noise is a much bigger problem than for human listeners. •Typically needs training on new talkers/words (depends on the type of system)
answer
        speech recognizer issues
question
            Degree of word segmenting: -Isolated: words must be separated by 500 ms or more. -Connected: words must be separated by only short pauses (Dragon system). -Continuous: no pauses needed; accepts normal conversational speech. •Vocabulary size: -Small (200 words or less). -Large (200-1000 words). -Very large: >1000, up to 20,000 words. •Speaker requirements: -Speaker dependent: needs to be trained for each new talker. -Speaker independent: can recognize any talker (usually constrained by dialect, voice quality); much harder, especially for high accuracy. •The type of system created depends on needs (can't do all the hard things together -- see pg. 339), the likely user population, and how important accuracy is
answer
        approaches to speech recognition
question
            Voice response systems: phone systems (need speaker-independent recognition, but a small vocabulary is often okay) -menu systems. •Speech to text: typing by voice for the mobility challenged or to avoid overuse injuries (Dragon NaturallySpeaking); also for the hearing impaired. •Computer-based speech training aids: decide whether the talker achieved a goal or not
answer
        speech recognition application
