speech perception and technology – Flashcards
question
            Acoustic analysis shows that acoustic features are characteristic of certain categories of sounds (e.g. burst for stops, high F2 for front vowels, etc.) •However, just because these regularities exist doesn't mean that they are all used by listeners. •Acoustic cues are acoustic features that are used by listeners to perceive speech. •Researchers use many techniques to discover what features listeners use
answer
        acoustic cues
question
            Vowels: F1 and F2 values are the most important cues to vowel identity. -Listeners also use F3, F0, duration, formant transitions, and slight changes in formants over time. •Diphthongs: listeners use rate and direction of change in formants more than exact values. •Formant values are important to listeners, but vary by gender, context, and speech rate; relationships between formants and context, not just absolute values, are important
answer
        acoustic cues for vowels
question
            Glides: listeners use target formant frequency and the presence of rapid change in formant frequency. -Change is more rapid than in diphthongs. •Liquids: listeners use F3 value and rapid change of F3 to distinguish /r/ from /l/. •Nasals: listeners use formant transitions to determine place of articulation of nasals; they use the presence of a strong F1 and weaker F2 and F3 to distinguish nasals from other resonants
answer
        cues for resonant consonants
question
            Manner of articulation: listeners use the presence of silence or near-silence and transients to distinguish stops from other consonants (transients are often missing in syllable-final stops, however). •Place of articulation: listeners use formant transitions and the frequency of strongest energy in the transient (if present) to determine place of articulation of stops. •Initial stop voicing: listeners use the time between the transient and the vowel onset and presence/absence of aspiration to determine voicing of initial stops. •Final stop voicing: listeners use presence/absence of voicing during closure, vowel duration, and stop closure duration to determine voicing of final stops
answer
        cues for stop consonants
question
            Vowels are longer before voiced stops. •Context-related because the neighboring consonant alters vowel duration
answer
        a context related duration difference
question
            Stop closure duration is longer for voiceless syllable-final stops than for voiced ones. •Intrinsic because the phoneme itself causes the change
answer
        intrinsic duration difference
question
            Manner of articulation: listeners use the presence of continuous aperiodic (noise) energy to distinguish fricatives from other consonants. •Place of articulation: listeners use intensity, frequency of strongest energy, and concentration/diffuseness of the noise, as well as formant transitions, to determine fricative place of articulation. •Voicing: listeners use fricative duration and presence/absence of periodicity to determine voicing
answer
        cues for fricatives
question
            Affricates: listeners use all the cues for both stops and fricatives to determine affricate place, manner, and voicing. •Context: listeners use knowledge of context to adjust the way that they use acoustic cues. -Example: a listener's decision boundary for deciding whether a fricative was /s/ or /ʃ/ may be lower in frequency when the phoneme precedes /u/ than when it precedes /i/
answer
        affricate cues and context
question
            Multiple acoustic cues are available for each phoneme. There is no single acoustic "feature" that absolutely identifies a given phoneme. •Listeners know about context effects and are able to adjust their "boundaries" between phonemes accordingly. •Together, these facts mean that speech perception is complicated. But they also mean that speech perception is remarkably robust. •When one cue is missing or masked, listeners are able to use what is available
answer
        summary of acoustic cues
question
            Researchers use signal editing to remove or add portions of sounds. -Examples: remove a formant transition or burst, or change the duration of silence. •Researchers use speech synthesis to change acoustic features of sounds: can vary many features (F0, formants, etc.) as desired, but labor intensive. -Synthetic continuum: acoustic properties are varied in steps from the target value for one phoneme to the target value for another phoneme
answer
        studying speech perception
question
            Goal: to understand what speech cues listeners use and whether different groups use them differently. -Understand how language is processed/learned by normal listeners. -Understand differences/disorders. -Create better training methods. •How to study? -Type of experiment (identification or discrimination). -Type of stimuli (synthesized or natural)
answer
        speech perception experiments
question
            Observation: formant transitions (e.g., from /d/ to /ɑ/) vary depending on place of articulation of stops. •Question: do listeners use formant transitions to identify stop place? •Stimuli: a continuum of speech sounds is created using speech synthesis, with a range of F2 formant transitions from those appropriate for /bɑ/ (steeply rising) to those appropriate for /gɑ/ (steeply falling). -Each new transition starts at a slightly higher value but ends at the same value (the F2 for /ɑ/). -Important: the acoustic difference is the same for each "step."
answer
        consonant place of articulation experiment
question
            Researchers synthesize speech sounds to make many steps from /ba/ to /da/ to /ga/ using only changes in F2 (everything else is the same). •For example, make 20 stimuli: in the first the F2 rises by about 250 Hz (from 900 to about 1150 Hz); the next would have F2 start about 25 Hz higher (925 to 1150), and so on; eventually flat and then falling
answer
        stimuli for stop place of articulation experiments
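The stepped continuum described above can be sketched in a few lines of Python; the numbers (20 steps, 900 Hz starting onset, 25-Hz increments, 1150 Hz /a/ target) are the lecture's approximate figures:

```python
F2_TARGET = 1150  # Hz, the F2 of the following vowel /a/

def f2_continuum(n_steps=20, start=900, step=25, target=F2_TARGET):
    """Return (onset, offset) F2 pairs for each synthetic stimulus."""
    return [(start + i * step, target) for i in range(n_steps)]

stimuli = f2_continuum()
rising = [s for s in stimuli if s[0] < F2_TARGET]    # /b/-like transitions
falling = [s for s in stimuli if s[0] > F2_TARGET]   # /g/-like transitions
```

Note that the acoustic step size (25 Hz) is identical everywhere on the continuum, which is what makes the perceptual results in the next card surprising.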
question
            Category boundary: where the identification function is at 50%. •Discrimination: play pairs of stimuli, some two steps apart, some exactly the same. The task is to say "same" or "different." -Typical results: good discrimination only when the two stimuli are identified as different phonemes (on either side of the category boundary) (pg. 176). •This type of result is baffling because the physical differences are equal across all steps
answer
        place of articulation experiment
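One way to estimate the category boundary is to interpolate where the identification function crosses 50%. A minimal sketch, using made-up illustrative percentages rather than data from any actual study:

```python
def category_boundary(steps, percent_ga):
    """Interpolate the continuum step at which /g/ responses cross 50%."""
    for i in range(len(steps) - 1):
        p1, p2 = percent_ga[i], percent_ga[i + 1]
        if p1 < 50 <= p2:
            frac = (50 - p1) / (p2 - p1)
            return steps[i] + frac * (steps[i + 1] - steps[i])
    return None

# Hypothetical identification data: % of trials labeled /ga/ at each step.
boundary = category_boundary([1, 2, 3, 4, 5, 6, 7],
                             [2, 5, 10, 40, 90, 97, 99])
```

The steep jump between adjacent steps (40% to 90% here) is what a "steep identification function" looks like numerically.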
question
            In each pair of sounds, the second formant starts at a higher frequency (by about 150 Hz) in the first sound in the pair than in the second sound in the pair. The sounds are completely the same except for the F2 transition. *The amount of difference between the sound pairs is the same but in one case it makes them sound different and in the other they sound the same
answer
        same different examples
question
            The combination of steep identification functions AND good discrimination only at category boundaries was labeled categorical perception by researchers. -See Figures 11.1-11.2 in the book
answer
        categorical perception
question
            Create a continuum by varying both F1 and F2 to go from // to // to //. -Same type of identification and discrimination tasks as for consonants, but very different results: -Identification function is NOT steeply sloping. -Fairly good discrimination across the whole continuum. -NOT categorical
answer
        vowel experiment
question
            Researchers were baffled by categorical perception and needed a way to explain it. -Why do consonants seem to be processed one way and vowels another? -Explained it by looking at articulation: continuous changes in vowel articulation are possible, but not so much in consonant articulation. •Motor theory: we identify phonemes through access to the underlying motor gestures that produced them, not directly through acoustic features (innate and special for humans)
answer
        motor theory of speech perception
question
            Support: infants are apparently born with categorical perception for some sounds. -Supports the idea that categorical perception is a "special" (innate) skill. •Support: sort of avoids the problem of variability (gestures are consistent?). -No intermediate /d/-/g/ perception because there is no intermediate articulatory gesture (in English). •Problem: some non-human animals seem to have categorical perception. -So is it "special" for humans after all?
answer
        support and problems for motor theory
question
            Auditory theory: patterns of speech perception can be explained based on auditory sensitivities to particular acoustic patterns or features. •Hypothesizes that listeners try to match sounds heard to an internal "acoustic template" (innate or learned). •Support: explains the animal data (similar auditory system). •Problem: a lot of acoustic variation to explain
answer
        alternative: auditory theories
question
            Slit vs. split: two cues for /p/ -- a period of silence and rising formant transitions. -Silence is more important because if the silence is long enough, you don't need the transitions. -BUT it takes a shorter duration of silence to perceive /p/ when the formants are rising than when they are flat. -In each case, the silent "gap" between /s/ and the voiced portion increases from 0 to 130 ms in 10-ms steps. At what step does your label change? People hear the stimulus with rising formant transitions as "split" more often than the one with flat transitions because of the contribution of the rising transition
answer
        cue integration
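The gap series above is easy to reproduce; the two "split" thresholds below are assumed values purely to illustrate the trading relation (the lecture does not give exact thresholds):

```python
# Gap continuum from the demonstration: 0-130 ms of silence in 10-ms steps.
gaps_ms = list(range(0, 131, 10))        # 14 stimuli per series

# Hypothetical category boundaries, for illustration only: a shorter gap
# suffices for "split" when rising transitions (the secondary cue) are present.
threshold_rising_ms = 50                 # assumed
threshold_flat_ms = 80                   # assumed

labels_rising = ["split" if g >= threshold_rising_ms else "slit" for g in gaps_ms]
labels_flat = ["split" if g >= threshold_flat_ms else "slit" for g in gaps_ms]
```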
question
            the phenomenon in which the presence of a second cue may offset a deficiency in a primary cue. -Takes a shorter "gap" (primary) to hear "split" when rising transition (secondary) also present. •Because we don't hear the effect consciously, we say that the two cues are perceptually integrated
answer
        phonetic trading relationship
question
            Integration of speech cues isn't only auditory. We also integrate visual information about speech (lip cues) unconsciously and seamlessly into the perception of a phoneme. •In the McGurk effect, conflicting visual (lip) and auditory (sound) cues fuse to create the perception of a third category entirely
answer
        multimodal integration
question
            The visual stimulus showed you /g/ (what was produced in the video portion). -The audio of the /g/ was removed and replaced by the audio portion of a /b/. -When you watch and listen, you hear /d/, which is not the same as EITHER of the inputs you received (audio or video). -Perceptually, you've fused (or integrated) the cues into a sound that is intermediate between the two things that you received. -Without the video, you should perceive /b/
answer
        the McGurk effect
question
            How do you know what babies can discriminate before they can talk? -High-amplitude sucking procedure: infants suck more strongly when interested; wait until they get "bored," then change the stimulus and see if they notice. -Infant head-turn procedure: a monkey in a box lights up following a change, so the infant learns to look for the monkey when there is a change. -Infants as young as a few days can discriminate most speech sounds (minimal pairs). -Infants can discriminate speech sounds that adults cannot (not in the "native" language). -Infants lose the ability to detect "non-native" differences between 6 and 12 months of age. -By 9-12 months of age, infants have a preference for the stress patterns of their native language and for pauses that come at correct syntactic boundaries
answer
        infant speech perception
question
            By about age 3 or so, children can respond to tasks similar to those for adults, but with pictures and a small number of trials. •What do we know? -Children have more trouble than adults in using incomplete speech cues (e.g., part of a formant transition). -Children don't use speech cues the same way as adults (they focus primary attention on different cues -- e.g., fricative energy vs. formant transition). -Children with phonological disorders need more information than TD peers; phonological disorders involve phonetic perception differences too
answer
        speech perception development
question
            computer processing of speech
answer
        speech processing technology
question
            applications of speech processing (programs and devices for many purposes)
answer
        speech technology
question
            producing intelligible speech via commands to a machine. -Formant synthesis or concatenative
answer
        speech synthesis
question
            identifying phonemes or words via machine.
answer
        automatic speech recognition
question
            Purpose: (typically) take written text and convert it to speech that is easily recognized by a listener. -Text-to-speech (TTS). •Major tasks: -morphology, syntax & prosody (affect how words are spoken -- stress, phrasing, etc.) -print to phonetic symbols (spelling rules) -phonetic symbols to acoustic productions (acoustic cues & coarticulation effects)
answer
        speech synthesis
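The "print to phonetic symbols" step can be sketched as a greedy rule matcher. The rules below are a hypothetical toy set, nowhere near a real English spelling-rule inventory:

```python
# Toy spelling-to-phoneme rules (hypothetical, for illustration only):
# longest-match digraphs first, falling back to the letter itself.
RULES = [("sh", "ʃ"), ("ch", "tʃ"), ("ee", "i"), ("oo", "u"), ("th", "θ")]

def letters_to_phones(word):
    """Greedily convert spelling to phonetic symbols using RULES."""
    phones, i = [], 0
    while i < len(word):
        for spelling, phone in RULES:
            if word.startswith(spelling, i):
                phones.append(phone)
                i += len(spelling)
                break
        else:
            phones.append(word[i])   # no rule matched: keep the letter
            i += 1
    return phones
```

Real TTS systems combine rules like these with large exception dictionaries, since English spelling is far too irregular for rules alone.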
question
            Formant synthesis by rule •Concatenative synthesis -Diphone-based -Demisyllable-based •Articulatory synthesis
answer
        types of speech synthesis
question
            Uses the source-filter theory of speech production to create a digital sound source (buzz, hiss, or a combination) and filters that can be changed to produce the desired acoustic properties for any phoneme. -http://www.asel.udel.edu/speech/tutorials/synthesis/index.html -http://www.phonetics.ucla.edu/course/chapter8/speechbird/speechbird.html •By hand: the user must decide on the frequency of F0 and the formants, the duration of sounds, and the timing and abruptness of changes
answer
        formant synthesis
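A minimal source-filter sketch in this spirit: an impulse-train "buzz" source passed through a cascade of two-pole resonators, one per formant. The /ɑ/ formant values (730, 1090, 2440 Hz) are typical male figures, and the bandwidths are assumed round numbers:

```python
import math

def resonator(signal, freq, bw, fs):
    """Two-pole resonant filter: one formant with the given center
    frequency and bandwidth (standard digital-resonator recursion)."""
    r = math.exp(-math.pi * bw / fs)
    b = 2 * r * math.cos(2 * math.pi * freq / fs)
    c = -r * r
    a = 1 - b - c                     # rough gain normalization
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(f0=120, formants=((730, 90), (1090, 110), (2440, 170)),
                     dur=0.3, fs=8000):
    """Impulse-train source filtered through a cascade of formant resonators."""
    n = int(dur * fs)
    period = int(fs / f0)             # samples per glottal pulse
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        source = resonator(source, freq, bw, fs)
    return source

samples = synthesize_vowel()          # steady /ɑ/-like vowel, 0.3 s at 8 kHz
```

Changing the formant frequency arguments moves the vowel around the vowel space, which is exactly the control a researcher exploits when building a synthetic continuum.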
question
            Specify what the formant frequency values should be (either unchanging, or specified at each point in time). •A source is created and goes through the filters; the output is a file. Create a new file for each stimulus
answer
        using synthetic speech for research
question
            "Simple" vowel synthesis: create an "isolated" vowel (no change in frequency over time). -Control F0, F1, F2, and F3. Must use female-typical formant values for a higher F0 to sound good. •General interface: use a "script" to specify changing formant and F0 values over time. -Can create formant transitions to create stops. •"Extended" interface: control many parameters (voice quality, frication, bandwidths of filters, etc.)
answer
        synthesis examples
question
            The parameter settings (F0, formant frequencies, voicing, duration, etc.) for each speech sound are written into a computer program. -Rules are needed for individual phonemes. -Many sounds must have different rules for different syllable positions. -Rules for phoneme-to-phoneme transitions (coarticulation/assimilation, formant transitions). -Example: "saw me" (book): /a/ to /m/; /m/ to /i/
answer
        formant synthesis by rule
question
            Small storage needs (one computer program) for any number of voices (pro). •Requires a lot of background knowledge (con). -Must develop rules for each phoneme and its transition to every other possible neighboring phoneme. -Must develop rules for prosody, phrasing, stress, etc. •Has become better over the years: http://sal.shs.arizona.edu/~asaspeechcom/Contents.html •You can speed it up a lot and still understand it (pro)
answer
        formant synthesis by rule: pros and cons
question
            Uses natural speech segmented at areas of "less variability." •Diphone-based: phoneme center to phoneme center (see demo). -Example: "tags" vs. "tugs." A minimal pair; each word has four phonemes, but each has five diphones and they share only three. •Demisyllable-based: syllable onset to nucleus, or nucleus to end (see demo). -"simpler" vs. "sampler": each has two syllables and four demisyllables (two shared)
answer
        concatenative synthesis
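The "tags" vs. "tugs" diphone count works out as follows; ARPAbet-style phoneme symbols are assumed here for readability:

```python
def diphones(phones):
    """Diphone units: center-to-center spans, including word-edge silence '#'."""
    seq = ["#"] + phones + ["#"]
    return [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]

tags = diphones(["t", "ae", "g", "z"])   # "tags": 4 phonemes, 5 diphones
tugs = diphones(["t", "ah", "g", "z"])   # "tugs": 4 phonemes, 5 diphones
shared = set(tags) & set(tugs)           # (#,t), (g,z), (z,#)
```

The count explains the storage cost mentioned in the next card: an inventory must hold every attested phoneme-to-phoneme span, not just every phoneme.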
question
            Greater storage need: must store every possible two-phoneme combination as a separate file for each voice used (con). •Prosody may be too unvarying, and there may be breaks where the joining is poor (con). -To get better prosody, you need a new set of files for each variation (and may be back to rules). •Hard to speed up appropriately (con). •Relatively easy to create (pro); requires much less knowledge of speech acoustics
answer
        concatenative synthesis pros and cons
question
            An alternative approach to formant synthesis. •Parameters are based on the acoustic consequences of articulatory positions. •Impossible combinations are not allowed, unlike in formant synthesis (pro). •Not enough knowledge about the articulatory-to-acoustic mapping for it to work well yet (need more imaging of the tongue and vocal tract, and more study of acoustic relations) (con)
answer
        articulatory synthesis
question
            Speech synthesis applications: •AAC (augmentative & alternative communication) for the speech impaired (autistic, dysarthric, etc.); many types •Screen readers for the blind -formant synthesis preferred because it remains intelligible at rates up to 600 wpm •Voice response systems (phone/car/etc.) •Other automated/repetitive tasks -Example: weather reader •Interactive toys (Speak & Spell) •Create stimuli for speech research
answer
        speech synthesis applications
question
            Use of a computer program to take acoustic input and identify words/phonemes. -Different from speech understanding. •Major steps: -Digitize the speech input. -Identify acoustic features in the input (which may correspond to different phonemes). -Select the word with the most matching features. •Requires all our knowledge about how listeners identify speech sounds. Still not as good as human listeners
answer
        automatic speech recognition
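The "select the word with the most matching features" step can be sketched as a best-overlap search; the lexicon entries and feature names below are hypothetical illustrations, not a real ASR feature set:

```python
# Hypothetical feature lexicon: each word maps to the acoustic features
# a recognizer might expect to detect for it.
LEXICON = {
    "bee": {"burst", "low_f2_onset", "high_front_vowel"},
    "dee": {"burst", "mid_f2_onset", "high_front_vowel"},
    "do":  {"burst", "mid_f2_onset", "back_rounded_vowel"},
}

def best_match(detected_features):
    """Return the lexicon word sharing the most features with the input."""
    return max(LEXICON, key=lambda word: len(LEXICON[word] & detected_features))

word = best_match({"burst", "mid_f2_onset", "high_front_vowel"})
```

Even this toy shows why variability is hard: if noise masks one detected feature, two lexicon words can tie, and the system has no human-like context to break the tie.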
question
            Easier if words are separated slightly so that the system knows where they are (a human listener doesn't need this). •Variability is a big challenge: the system must recognize "same" sounds in different contexts and from different talkers as the same. •Background noise is a much bigger problem than for human listeners. •Typically needs training on new talkers/words (depends on the type of system)
answer
        speech recognizer issues
question
            Degree of word segmenting: -Isolated: words must be separated by 500 ms or more. -Connected: words must be separated by only short pauses (Dragon system). -Continuous: no pauses needed; accepts normal conversational speech. •Vocabulary size: -Small (200 words or less). -Large (200-1000 words). -Very large: >1000, up to 20,000 words. •Speaker requirements: -Speaker dependent: needs to be trained for each new talker. -Speaker independent: can recognize any talker (usually constrained by dialect, voice quality); much harder, especially for high accuracy. •The type of system created depends on needs (can't do all the hard things together -- see pg. 339), the likely user population, and how important accuracy is
answer
        approaches to speech recognition
question
            Voice response systems: phone systems (need speaker-independent recognition, but a small vocabulary is often okay) -menu systems. •Speech to text: typing by voice for the mobility challenged or to avoid overuse injuries (Dragon NaturallySpeaking); also for the hearing impaired. •Computer-based speech training aids: decide whether the talker achieved a goal or not
answer
        speech recognition application
