You don’t need to submit anything for this lab
Spectrograms give us a plot of which frequencies are present in a recording through time. The vertical axis represents frequency, the horizontal axis represents time, and the darkness of each (time, frequency) point in Praat represents the magnitude of that frequency in the input signal (waveform). The darker the spot, the higher the amplitude of that specific frequency (you can think of this as the volume or loudness of that frequency played as a pure tone).
As discussed in class, applications like Praat generate spectrograms by applying the (Discrete) Fourier Transform to short windows in time. Changing the size of that window changes what information you get out.
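To make this concrete, here is a simplified short-time analysis sketch in Python: slide a window along the signal and take the magnitude of the DFT of each windowed frame. This is just the core idea, not Praat's actual implementation (Praat uses Gaussian windows and other refinements); the signal, sampling rate, and window choices below are made up for illustration.

```python
import numpy as np

def simple_spectrogram(x, fs, window_s=0.005):
    """A toy spectrogram: magnitude DFT of overlapping windowed frames."""
    frame_len = int(window_s * fs)          # samples per analysis window
    hop = frame_len // 2                    # 50% overlap between windows
    window = np.hanning(frame_len)          # taper to reduce edge artefacts
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len, hop)]
    # one magnitude spectrum (one column of the spectrogram) per frame
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])

# toy input: 0.1 s of a 1 kHz tone at a 16 kHz sampling rate
fs = 16000
t = np.arange(int(0.1 * fs)) / fs
spec = simple_spectrogram(np.sin(2 * np.pi * 1000 * t), fs)
print(spec.shape)   # (number of time frames, number of frequency bins)
```

Each row of `spec` is one "slice" in time; stacking them and mapping magnitude to darkness gives the picture you see in Praat.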
View the seashells Sound object (and the TextGrid annotations from last week, if you have them)
You should see something like this:
Open the Spectrogram menu and go to Spectrogram settings.... This brings up the Spectrogram settings pop-up window with various parameters you can change:
- View range (Hz): which frequencies are shown
- Window length (s): the size of the windows we apply the Fourier Transform to
- Dynamic range (dB): this basically tells you what loudness the colour white represents in the Praat spectrogram (frequencies with amplitude lower than a certain threshold all appear white, i.e., too quiet to notice)
We’re going to focus on the window length.
Change the Window length value to 0.05 (i.e., a bigger window) and click Apply.
Question: What changes do you see in the spectrogram?
Let’s look at some other (extreme) window lengths:
Question: What happens when you change the window length to 0.001?
Question: What happens when you change the window length to 0.5? Can you distinguish different manners of articulation?
Question: Looking at all of these different settings, what seems to be the trade-off between time and frequency resolution?
What you should see is that if the window is too big, we don’t see any detail of how the sound is changing in time: the spectrogram is blurred on the horizontal (time) axis. If the window is too small, we don’t see the frequency details: the spectrogram is blurred on the vertical (frequency) axis. This is an unavoidable issue in generating spectrograms: if we increase the time resolution, we reduce the frequency resolution (and vice versa).
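We can put rough numbers on this trade-off: for a Fourier analysis window of length L seconds, neighbouring frequency bins are 1/L Hz apart, while events closer together than about L seconds get smeared into the same window. A minimal Python sketch using the window lengths from this lab (not tied to any particular recording):

```python
# Frequency resolution of a DFT over a window of L seconds is ~1/L Hz.
# Time resolution is ~L seconds: shorter windows resolve time, not frequency.
for window_s in (0.001, 0.005, 0.05, 0.5):
    freq_resolution_hz = 1.0 / window_s   # spacing between DFT frequency bins
    print(f"window = {window_s:>5} s -> "
          f"frequency resolution ~ {freq_resolution_hz:>6.1f} Hz, "
          f"time resolution ~ {window_s} s")
```

Notice that at 0.005 s the bins are ~200 Hz apart, too coarse to separate individual harmonics of a typical voice (F0 around 100 to 250 Hz), which is exactly why a wide-band spectrogram blurs them.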
When we talk about narrow-band and wide-band spectrograms, we are talking about the amount of blurriness in frequency: a narrow-band spectrogram (longer window) has fine frequency resolution, while a wide-band spectrogram (shorter window) trades frequency resolution for time resolution.
Perceived pitch of (voiced) human speech corresponds to the rate of vocal fold vibration. You can think of every opening and closing cycle of the vocal folds as shooting a “pulse” of air through the vocal tract. The rate of these pulses determines the fundamental frequency (F0) of the speech at that point in time.
Terminology warning: We call F0 the acoustic correlate of perceived pitch. That is, F0 is the thing we measure from the waveform and it correlates pretty well with what we hear as changes in pitch. The mapping isn’t perfect though as speech perception is quite complicated in reality (the brain does a lot of complicated things with the information it gets from your ears!). You may see pitch and F0 used interchangeably in speech technology papers, but for the most part when we talk about pitch from an acoustic phonetics perspective, we’re actually talking about F0.
We can measure F0 directly from the waveform (i.e., the time domain) and from the frequency spectra that make up a spectrogram (i.e., the frequency domain).
We can measure the fundamental frequency (F0) of a waveform by counting the number of cycles it makes in a given period of time.
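As a quick sanity check on the arithmetic, here is a minimal Python sketch of this calculation. The cycle counts and durations below are made-up placeholders, not values from the lab recording:

```python
def f0_from_cycles(n_cycles, duration_s):
    """F0 in Hz = number of complete cycles / elapsed time in seconds."""
    return n_cycles / duration_s

def f0_from_period(period_s):
    """Equivalently, from the period T of one cycle: F0 = 1 / T."""
    return 1.0 / period_s

# e.g., 10 complete cycles in 0.04 s of waveform -> 250 Hz
print(f0_from_cycles(10, 0.04))   # 250.0
# e.g., one cycle lasting 0.004 s -> 250 Hz
print(f0_from_period(0.004))      # 250.0
```

In Praat you can select a stretch of waveform covering a whole number of cycles, read the selection duration from the display, and plug both numbers into the first formula.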
Task: Calculate the F0 at the middle of the [i] vowel of the word “She” in the seashells
recording.
If the frequency resolution is good enough, you can also measure F0 directly from the spectrogram: since F0 should be a prominent frequency component of the complex sound wave, you should be able to see it as the first dark frequency band in a narrow-band spectrogram.
Task: Estimate F0 from a spectrogram and a spectral slice.
This is also known as the first harmonic of the voice source. But it’s a bit hard to get accurate frequency measures from the spectrogram itself, so let’s take a spectral slice instead.
Click Spectrogram > View spectral slice
This is the frequency spectrum at the time corresponding to your cursor selection (more correctly, a window of time around that point, where the length of that window is what you chose in the Spectrogram settings). You should see it’s quite spiky!
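To illustrate the frequency-domain route, here is a hedged Python sketch using NumPy: it builds a synthetic voice-like signal with a known F0 of 200 Hz (a sum of decaying harmonics, standing in for the pulse train), takes an FFT as a crude stand-in for Praat's spectral slice, and reads off the strongest spectral peak. This is a toy illustration, not how Praat computes things:

```python
import numpy as np

fs = 16000                      # sampling rate in Hz (arbitrary choice)
n = 8000                        # 0.5 s of signal
t = np.arange(n) / fs
f0 = 200.0
# voice-like signal: energy at F0 and its whole-number multiples,
# with the fundamental strongest
x = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))

spectrum = np.abs(np.fft.rfft(x))               # one "spectral slice"
freqs = np.fft.rfftfreq(n, d=1 / fs)            # Hz value of each bin
peak_hz = freqs[np.argmax(spectrum)]            # strongest component
print(peak_hz)   # 200.0
```

On real speech the strongest peak is not always the fundamental (real F0 trackers are considerably more careful), but for this clean synthetic signal the first and largest spike sits exactly at F0.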
Question: Does this method give the same F0 value as you get from measuring the period of the waveform?
In the spectral slice itself, you should also see more spikes (i.e., boosted frequencies). These come from the voice source, and we call them harmonics.
Question: What is the relationship (roughly) between the F0 of the vocal source and its harmonics?
We associate F0 with the speech source, specifically vocal fold vibrations. We can see this directly from the waveform, but we can also see the effect of these vibrations in the frequency spectrum.
As you may have noticed last week, we can also check our estimate against Praat’s pitch (i.e., F0) tracker:
Open the Pitch menu and turn on the pitch track (Show pitch).
We’re not going to go over F0 tracking methods in this course, but you can find more in the Speech Synthesis (sem 2) course materials on speech.zone
Narrow-band spectrograms allow us to see the frequency harmonics due to the voice source. But when it comes to identifying vowels in English, we want to focus on the resonances of the vocal tract, i.e., the filter.
In this section, we’ll look at how changes in vocal tract shape are visible in the spectrogram. We’ll do this by looking at how the estimated formants (i.e., vocal tract resonances) change for different vowels in a recording of a US English speaker. You can then record yourself and see how your vowel space compares!
Since we want to focus on the resonances of the filter, we’re interested in the overall shape of the frequency spectrum (i.e., the spectral envelope) rather than the fine frequency detail of the source harmonics. This means it can be more useful to use a wider band spectrogram (losing some harmonic detail).
It turns out the Praat defaults are pretty much tuned to this sort of analysis, so let’s change the spectrogram window settings back to the default.
Task: Extract the first and second formant values (F1 and F2) for a series of English words which differ only by vowel.
Download and open the following sound file in Praat: speechproc_phonlab2_rvowels.wav
Create a TextGrid for that sound object with a single Point Tier called vowel:
Open the corresponding Sound and TextGrid objects together (View & Edit)
Change the spectrogram window length back to default: 0.005 seconds
Turn on the automatic formant tracker: Formants > Show formants
Now let’s get some vowel formant measurements.
The words in the recording are:
You’ll want to zoom in enough that you can see the spectrogram of the word clearly.
Here’s an example for the first word:
Question: Does taking the middle point of the vowel really make sense if the vowel is a diphthong (e.g. in the word “hide”)? What might be a better way of representing this?
For now, let’s just take the middle points, but if you have extra time, you can think about how you might better measure and plot diphthongs.
Click on Formants > Formant Listing
This will pop up a window with estimated formant values. The first row is the variable name (Time_s, F1_Hz, F2_Hz, F3_Hz, F4_Hz). The second row shows the corresponding values. Here’s an example:
The formant values should correspond to the spectral envelope peaks when you look at a Spectral Slice at the same time point. Check this for yourself!
Copy and paste the values into a spreadsheet (e.g., Excel, Google Sheets)
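If you prefer to script this step, here is a small Python sketch that parses a pasted formant listing into rows you could write out as CSV. The listing text below is a made-up example in the format described above (a header row of variable names, then a row of values), not output from the lab recording:

```python
# A made-up Praat formant listing pasted as a string (header + value rows).
listing = """Time_s   F1_Hz   F2_Hz   F3_Hz   F4_Hz
0.831     342     2280    2980    4100"""

lines = listing.splitlines()
header = lines[0].split()                      # variable names
rows = [dict(zip(header, line.split()))        # one dict per measurement
        for line in lines[1:]]

print(rows[0]["F1_Hz"])   # 342
```

From here, `csv.DictWriter` (or just pasting into a spreadsheet) gets you a table with one row per vowel.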
Task: now that you have formant values for each of the words in the recording, you can make a plot of the vowel space!
You can use whatever method you like for this: e.g., Excel, R, python, online plotting tools (e.g., plotly chart studio), or pen and paper.
If you want to give the plot the same orientation as the IPA vowel chart, you’ll need to:
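As an illustration, here is one possible way to do this in Python with matplotlib. The key point is that the IPA chart puts front vowels (high F2) on the left and close vowels (low F1) at the top, so both axes need to be reversed. The vowel labels and formant values below are made-up placeholders; substitute your own measurements:

```python
import matplotlib
matplotlib.use("Agg")          # render without a display
import matplotlib.pyplot as plt

# Placeholder (F1, F2) values in Hz -- replace with your Praat measurements.
vowels = {"i": (300, 2300), "a": (750, 1300), "u": (320, 900)}

fig, ax = plt.subplots()
for label, (f1, f2) in vowels.items():
    ax.scatter(f2, f1)              # F2 on the x-axis, F1 on the y-axis
    ax.annotate(label, (f2, f1))
ax.invert_xaxis()   # high F2 (front vowels) on the left, as in the IPA chart
ax.invert_yaxis()   # low F1 (close vowels) at the top
ax.set_xlabel("F2 (Hz)")
ax.set_ylabel("F1 (Hz)")
fig.savefig("vowel_space.png")
```

With both axes flipped, [i] should land in the top-left corner, [u] in the top-right, and [a] near the bottom, mirroring the familiar vowel quadrilateral.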
Task: Plot F1 and F2 measures for your own speech.
Questions:
One of the major issues for speech technologies is that pronunciation varies a lot! Depending on your language background, you may find that certain words are pronounced the same or differently. Your ability to hear differences in others’ pronunciation may also be affected by your own pronunciation.
Task: Compare your pronunciation of the words “marry”, “Mary”, and “merry” and see whether they are the same or different.
Extra: Compare your marry/Mary/merry to some other speakers. Here are some examples to start you off:
The goal of this lab was to start exploring how differences in speech acoustics can reflect how human speech is actually produced. We can get a lot of information from spectrograms about both the speech source (e.g., vocal fold vibrations) and the vocal tract resonances (e.g., the shapes you make with your articulators). We saw two types of frequency structure in speech that interact: F0 and its harmonics, which are driven by the voice source, and formants (resonances), which are driven by the vocal tract shape.
We can see the effects of both on the spectrogram. However, there is a trade-off between time and frequency resolution: when we apply the Fourier transform to shorter analysis windows, we lose frequency resolution; when we use longer windows, we get better frequency resolution but lose our ability to see the details in time.
We’ll go into more detail about what the Fourier Transform actually is in Module 3. We’ll then come back to look at the source filter model of speech through a more computational viewpoint in module 4.
The marry-merry-Mary merger is an example of tense-lax neutralisation. I first learned about this from Aaron Dinkin; you can read more in his paper (Dinkin 2004). No doubt, though, a lot can change in people’s speech patterns in 20 years! This isn’t a course about language variation and change, but if you are going to go further in speech technology, you will need to deal with individual and group variation in speech.
This lab builds on a previous lab designed by Rebekka Puderbaugh.