Recording the Verdi Requiem in
Surround Sound and High-Definition
David Griesinger
Harman Specialty Group – (till Nov. 21)
The Task:
• To make a first-rate recording of the Verdi Requiem from a
performance in a 1200 seat hall that was packed with people.
– The natural reverberation was not useful.
– The setup time was very limited.
– There were limits to the number of microphones that could be used, and
to their placement.
– The microphones had to be largely invisible to the video camera.
• For all these reasons a “natural” microphone technique could
not be used.
– The purpose of this talk is in part to show that with a thoughtful capture
of the direct sound, an entirely natural sound can be created.
– even when this sound did not exist during the performance.
Audio Goals
• 1. Clear reproduction of direct sound (low muddiness)
– The instruments and the chorus should not sound far-away
– “Sonic Distance” should be relatively low
• 2. Convincing sense of depth
– The chorus should sound behind the orchestra,
– The soloists should sound in the middle of the orchestra –where they
appear in the video.
– The orchestra should sound behind the loudspeakers – not close to the
• 3. High hall envelopment
– The low frequencies especially should surround the listener
– “Conductor’s perspective” draws the listener into the performance.
– The hall sound should be LARGE – matching the scale of the piece.
• 4. No “sweet-spot”
– sound should be excellent and nearly the same throughout the room.
– This requires use of the center channel and a LOW degree of
correlation between the channels at all frequencies.
– Leakage and panning always reduce the listening area.
• The goals of high envelopment and a large sweet spot
have similar requirements:
– The correlation between output channels needs to be low at
all frequencies.
• Thus we should avoid panning signals between channels.
– Closely-spaced microphone arrays nearly always produce
correlated signals – panning is inherent in the acoustic pickup.
• Reverberation is correlated, particularly at low frequencies.
– A large sweet spot demands high separation:
• Sounds produced on the left should NOT be reproduced by speakers
on the right!
• This is a frequent problem with so-called “main microphone arrays”
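The low-correlation requirement above can be checked numerically. The following is a minimal sketch, not a method from the talk: it measures the zero-lag normalized correlation between two channels and contrasts an amplitude-panned signal with two independent signals. The function name `correlation`, the test signals, and the pan gains are all illustrative assumptions.

```python
import math
import random

def correlation(a, b):
    """Normalized zero-lag cross-correlation of two equal-length signals."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

rng = random.Random(1)  # fixed seed so the sketch is repeatable
source = [rng.gauss(0, 1) for _ in range(20000)]
other = [rng.gauss(0, 1) for _ in range(20000)]

# Amplitude panning: the same signal in both channels at different gains
# (the gains 0.8 and 0.3 are arbitrary assumptions).
panned_left = [0.8 * s for s in source]
panned_right = [0.3 * s for s in source]

panned_corr = correlation(panned_left, panned_right)  # fully correlated
independent_corr = correlation(source, other)         # close to zero
```

A pan pot leaves the two channels fully correlated whatever the gains are, which is why panning and low inter-channel correlation pull in opposite directions.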
Example: Time delay panning outside
the sweet spot.
Record the orchestra with a
“Decca Tree” - three omni
microphones separated by one
meter. A source on the left will
give three outputs identical in
level and differing by time delay.
On playback, a listener on the far right
will hear this instrument coming from the
right loudspeaker. This listener will hear
every instrument coming from the right.
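The time-delay behavior described above can be sketched with simple geometry. This is an illustration under assumed positions, not the layout used in any real session: three omnis in a line one meter apart, and a source well to the left. The names `mic_arrivals` and `SPEED_OF_SOUND` and all coordinates are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

def mic_arrivals(source, mics):
    """Return (delay_ms, level_db) per microphone, relative to the nearest one."""
    dists = [math.dist(source, m) for m in mics]
    nearest = min(dists)
    return [((d - nearest) / SPEED_OF_SOUND * 1000.0,
             20.0 * math.log10(nearest / d)) for d in dists]

# Hypothetical layout: left, center and right omnis in a line, 1 m apart.
mics = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
source = (-8.0, 3.0)  # an instrument far to the left (assumed position)

arrivals = mic_arrivals(source, mics)
for (delay_ms, level_db), name in zip(arrivals, ("left", "center", "right")):
    print(f"{name:6s} delay {delay_ms:5.2f} ms, level {level_db:5.2f} dB")
```

With a distant source the level differences stay small while the delays reach several milliseconds, so localization rests on timing alone, and that breaks down for a listener who sits closer to one loudspeaker.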
Amplitude panning outside the sweet spot
If you record with three widely spaced
microphones, an instrument on the left
will have high amplitude and time
differences in the output signals.
A listener on the far right will hear the
instrument on the left. Now the
orchestra spreads out across the entire
loudspeaker basis, even when the
listener is not in the sweet spot.
Training to hear envelopment
• To test for envelopment it is essential that you move
around the room, and that you face different directions!
• You must fill the WHOLE room with the sound of the
original hall, and it must work when you face all directions.
• Reproducing the original hall only in front, or only in the
rear, will not do the job.
• The ability to reproduce the original hall acoustics in a
small space is one of the biggest advantages of 3/2
3/0 versus 3/2
It is obvious that using three speakers in the front is better than two speakers,
particularly if we use amplitude panning.
Why do we need two additional speakers and channels for the rear, particularly
if we are only reproducing reverberation?
Mono sounds poor because it
does not reproduce the spatial
properties of the original
recording space.
With decorrelated
reverberation a few spatial
properties come through, but
only if the listener faces
forward. And the sense of
space is stronger in the front.
We need at least four speakers
to reproduce a two
dimensional spatial sensation
that is uniform through the
The Polyhymnia Pentangle
• The Polyhymnia engineers employ a surround array of spaced omni
microphones, at a spacing similar to the ITU playback array.
• The technique works well in spaces where the reverberation radius is
equal to or greater than the microphone spacing.
• In this case the direct sound picked up by the rear microphones is
perceived as an early lateral reflection and adds distance to the
front image.
• Caution!! In a small hall this array will be TOO MUDDY!!!
Video Goals
• 1. Minimalist Videography: No arbitrary selection of video
– If an instrument is playing, it should be visible on screen.
– The viewer should be able to decide which performer to watch,
– and can watch a different performer every time the video is seen.
2. Achieving the first goal requires sufficient resolution that
each performer can be seen at the same time!
– A 1280 by 720 pixel image is sufficient to convey the emotion from
over 100 performers at the same time.
– But if this goal is to be achieved with current video equipment great
care must be used!
• 3. A screen size appropriate to the audio image.
– The Verdi Requiem is a LARGE piece.
– It needs a Large screen if the video is to be effective.
• Ideally as large as the front loudspeaker basis.
Ground rules
• For this recording the hall and the musicians
union had a number of requirements:
– 1. There could be only ONE video recording.
– 2. There could be only ONE camera position.
– 3. There could be a maximum of 10 microphone
lines from the ceiling.
– 4. There were only 2 hours available for setup, both
for the audio and the video.
• At the last minute, my assistant could not come.
Concert Hall
Note the large, reverberant stage house in Jordan Hall, Boston
Stage house reverberation
• Reverberation in Jordan Hall (when fully occupied) is
dominated by the stage house.
• Reverberation radius (the distance at which
reverberation and direct sound are equal) is under 3
• Microphones MUST be placed close to performers or
the sound is muddy!!!
• Directional microphones are necessary, and
hypercardioid or supercardioid are helpful.
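As a rough guide to the numbers involved, the reverberation radius can be estimated from the hall volume and reverberation time with the usual Sabine-based approximation. The sketch below is an illustration only: the hall volume and reverberation time are hypothetical values, not measurements of Jordan Hall, and the distance factors are the figures commonly quoted for ideal first-order patterns.

```python
import math

def reverberation_radius(volume_m3, rt60_s):
    """Sabine-based estimate of the reverberation radius in meters
    for an omnidirectional source and microphone."""
    return 0.057 * math.sqrt(volume_m3 / rt60_s)

# Commonly quoted distance factors relative to an omni (ideal patterns).
DISTANCE_FACTOR = {
    "omni": 1.0,
    "cardioid": 1.7,
    "supercardioid": 1.9,
    "hypercardioid": 2.0,
}

def equivalent_distance(radius_m, pattern):
    """Distance at which a directional microphone picks up the same
    direct-to-reverberant ratio as an omni at radius_m."""
    return radius_m * DISTANCE_FACTOR[pattern]

# Hypothetical hall, chosen only to show the scale of the result.
radius = reverberation_radius(volume_m3=10000.0, rt60_s=2.0)
print(f"omni: {radius:.1f} m, "
      f"hypercardioid: {equivalent_distance(radius, 'hypercardioid'):.1f} m")
```

The estimate shows why pattern choice matters: a tight pattern roughly doubles the working distance, but with a radius of a few meters the microphones still have to be close to the performers.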
Why use a Main Microphone?
• Most engineers are taught that for classical music they MUST
use a main microphone.
• It is a PRIMARY rule in both science and art to always ask
WHY such a technique contributes to the artistic goals …
– What is the artistic or psychoacoustic reason to employ this device?
– For a small group – where the microphone is close to the musicians
compared to the reverberation radius – a main array can give good
– But it is also required that the natural acoustics are appropriate for the
piece being performed.
• When these conditions are not met… We must do something
more effective!
Main Microphone - NOT
• When the critical distance (hall radius) is smaller than the
microphone-to-source distance a main microphone does more
harm than good!
– It is NOT possible to record the direct sound from a large group with a single
microphone array!
– Main microphone arrays typically use omnidirectional microphones
• Omnis are only beneficial when the reverberation is both low in level and
• This is almost never true when a large group occupies a small hall.
• Omni microphone arrays cause the low frequencies to be monaural unless
they are widely spaced.
• Monaural bass is anathema to good sound.
• Using such an array would waste precious microphone lines.
• In this case we need to record the direct sound as best we can,
and use technology to give the sound both depth and
Directional Microphones only
4x – Schoeps CCM 40 Cardioid
4x – Schoeps CCM 41 Supercardioid
(note the simple cable adaptor)
2x – Neumann KMF-4 Cardioid
with stand adaptor
2x – Schoeps Colette
Cardioid/Omni – set to Cardioid
Notice all microphones are small – easily concealed from the video camera.
Microphone placement
All microphones were on stage – none in the audience. All were hanging except
the soloist microphones, which used the Neumann stand adaptor on a short stand.
Where possible the microphones point toward the audience.
Audio Equipment
• The author believes in the validity of the sampling theorem:
– Which states that 44.1kHz is an adequate sampling rate to record
• The author also believes that frequencies above 20kHz have
NO musical importance.
– Please try to prove me wrong…
• Mixing 12 or 16 tracks together to create 5 channels does
NOT decrease the signal to noise ratio of the final product.
– If we want a final product with a 16 bit S/N, we can achieve this result
with a 16 bit multitrack recording – if the original tracks are correctly
recorded and mixed.
• So – with no apologies, the author’s ancient Yamaha O3D and
two Tascam machines were used for the recording.
– These machines are reliable and quick to set up
Mixing setup
The original 12 tracks were played on two Tascam machines, mixed with the
O3D, and recorded on a third Tascam. Reverb used a Lexicon 480L for early
reflections, and a Lexicon MC-12 “Live” program for the main hall.
• Mix was done in real-time, using punch-in on the recording Tascam
– Synchronized punch-in allows for correction of the mix
– You can re-do individual sections until perfect
• Monitoring was on Infinity Prelude MTS speakers, with a Revel center
• The sound of these speakers in this room is fabulous.
• After mixing the sound was transferred digitally to a computer, where some
level adjustments and equalization were done.
– Most wind noise was removed by careful filtering at this stage.
• When the pitch of a soloist needed correction, a separate mix was made of
all the microphones without the soloist, with the soloist solo on a separate
– The pitch was then corrected in the computer, and the soloist was replaced into
the mix, while adding pitch-corrected early reflections.
Mixing Goals
• A mix has THREE basic elements:
– The direction and balance of the Direct Sound
– The perception of distance or depth in the sound
– The perception of the surrounding hall
• All three must be correct to make a great
– And all three can be separately adjusted by mixing.
Direct Sound
• Good directional localization over a large listening
area requires good separation between channels.
– We want to avoid leakage between microphones, and we
want to avoid pan pots where possible.
• In this case the orchestra microphones were aimed toward the
audience to avoid leakage from the chorus.
• This also avoided pickup of the nasty stage house reverberation.
– For this recording there were not enough microphone lines
to use dedicated center channel microphones
• So left-center and center-right panning was used for the soloists and
the center microphone pairs on the orchestra and chorus.
– Otherwise microphones were mixed into only a single
• The two microphones at the outside front of the orchestra were
directed to the surround channels
Center Channel
• All front microphones are panned center/left or center/right.
– No phantom image
– The center channel is vital to achieving a large listening area
– The center chorus microphone pair (CCM 41) and the soloist
microphones are panned in this way.
• The result is an even spread of the chorus from left front to right front
• The soloists are clearly localized half-way between the center and the left
or right loudspeakers.
• Beware a bug in Sony Vegas!
– The center surround channel is mixed equally to left and right front as
well as the center!
• A second bug causes clipping unless all channels are reduced in level.
• To fix both bugs I set the center channel to 0dB into the center output, and
all others to -6dB. The center is then added with negative phase into the
left and right front channels at a level of -6dB. This cancels the cross-talk.
Surround Channels
• To give added excitement and envelopment a
“conductor’s perspective” was used.
– The outer orchestra microphones (CCM 40) are directed to
rear left and rear right.
– The orchestra then surrounds the listener, with the chorus in
the front only.
– The woodwind microphones are panned to the front.
– This arrangement is particularly effective during the “Tuba
Mirum” section where the offstage and onstage trumpets
sound all around the listener.
• This passage is the trumpet call that announces the end
of the world and the beginning of the last judgment.
• The recording evokes this emotion very well – and the
video strongly enhances the emotional power.
• Depth perception in a recording comes from early
reflections – both in the medial direction (mono) and
in other directions.
– In recording it is always preferable to use reflections from
other directions, as these add depth without muddiness.
– Ideally we want the early reflection field to be uniform
through the room
• Then the depth perception will be equal and natural for all the
• Thus we want similar reflection amplitude in all the outside
• Some commercial equipment generates this pattern by default.
– Use it! If you don’t have such a device, cobble it together from what
you have!
– You can use a pair of echo sends to separately control the
perceived depth of each element of the mix.
Early reflections and muddiness
Early reflections that come from a different direction from the direct sound add
depth and perspective.
– They can also add muddiness if there is too much
Early reflections that come from the same direction cause muddiness.
– Leakage – for example of the chorus into the orchestra mikes – adds muddiness because
the leakage is identical to an early reflection from the same direction.
– Thus we use close-miking to reduce the reflected energy in each microphone.
• And try to reduce leakage by microphone orientation and placement.
– We add early reflections electronically into the outer loudspeakers.
In practice, we control the depth perspective by adding early reflections using the
echo sends (in stereo) to the 480L running “large surround” with the reverb level
off and the early reflection level at maximum.
– The returns are routed to front L&R and to rear L&R
– The center channel is unused for reverberation and early reflections because it only adds
Muddiness: Dry Speech + 40ms
Mono speech:
The sound is clear,
but much too close
to the loudspeaker.
Speech with ~40ms
allpass reflections
and no direct sound.
Note both the mono
and the stereo
version sound
muddy and distant.
There is no phantom
image in the stereo
Reflections used in these experiments
The reflections used in these experiments form a decaying burst which peaks about
25ms after the direct sound, and has largely decayed away by 50ms.
The reflections are different in the two channels, and have a flat frequency response.
Depth without Muddiness
Dry speech
– Note the sound is uncomfortably close
Mix of dry with early reflections at -5dB.
– The mix has distance (depth), and is not muddy!
– Note there is no apparent reverberation, just depth.
Same but with the reflections delayed 20ms at -5dB.
– Note also that with the additional delay the reflections begin to be heard as discrete
• But the apparent distance remains the same.
Same but with the reflections delayed 50ms at -3dB
– Now the sound is becoming garbled. These reflections are undesirable!
– If the speech were faster it would be difficult to understand.
Same but with reflections delayed 150ms at -12dB
– I also added a few reflections between 20 and 80ms at a level of -8dB to
smooth the decay.
– Note the strong hall sense, and the lack of muddiness.
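The levels quoted in these demonstrations refer to the total energy of the reflection burst relative to the direct sound. A minimal sketch of how a set of reflection taps could be scaled to such a target is shown below. The tap pattern is a made-up shape that peaks near 25 ms and is gone by 50 ms, and the helper names are hypothetical; this is not the pattern generated by the Lexicon hardware.

```python
import math

def scale_reflections(taps, target_db):
    """Scale (delay_ms, gain) taps so that their total energy equals
    target_db relative to a unit-gain direct sound."""
    energy = sum(g * g for _, g in taps)
    k = math.sqrt(10.0 ** (target_db / 10.0) / energy)
    return [(d, g * k) for d, g in taps]

def energy_db(taps):
    """Total energy of the taps in dB relative to a unit-gain direct sound."""
    return 10.0 * math.log10(sum(g * g for _, g in taps))

# Hypothetical burst shape: rises to a peak near 25 ms, decayed by 50 ms.
burst = [(10, 0.3), (15, 0.6), (20, 0.9), (25, 1.0),
         (30, 0.8), (35, 0.6), (40, 0.4), (45, 0.2)]

reflections = scale_reflections(burst, target_db=-5.0)
delayed = [(d + 20, g) for d, g in reflections]  # the "delayed 20 ms" case
```

Keeping level and delay as separate steps mirrors the demonstrations: the same burst can be shifted later in time without changing its energy.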
Demo Depth in the Mix
• Solo mikes alone
• Solo mikes with leakage
• Solo mikes with early reflections added
• Full mix
Hall and Envelopment
• Hall reflections (late reverberation) also need to
come equally from all directions in the mix.
– Ideally the reverberation level and decay profile should be
the same in all the outer speakers.
– Once again – this type of reverberation output is available
in some commercial equipment by default.
– In a good hall, such a reverberation pattern is available
from the “Polyhymnia Pentangle” or the “Hamasaki
• Demo Decorrelated bass
The Ideal Reverberation
– has 20ms to 50ms reflections with a total energy -4dB
to -6dB
– has relatively little energy from 50 to 150ms.
Measured Early reflection amplitude
Impulse response of the direct sound and
early reflections as generated by the
Lexicon equipment. Impulse recorded
during the mix by sending a pulse into the
right soloist microphone channel.
Note the early reflections appear to be at a
very low level. This appearance is
If we integrate this picture with a 22.5ms
window (which is how the ear hears it) we
see the direct sound dominates the early
reflections, but not by much.
Experiments show the ideal level for the
total energy in the early reflections is -6dB
to -4dB. We see the levels used here are
close to this ideal.
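The integration step can be sketched as a sliding energy sum over the squared impulse response. The example below uses a synthetic response, not the measured one: a unit direct sound followed by many weak reflections. The rectangular window, the 2 kHz sample rate, and the tap values are assumptions made for illustration.

```python
import math

def windowed_energy_db(ir, sample_rate, window_ms=22.5):
    """Energy of the impulse response inside a sliding rectangular window,
    in dB relative to the maximum windowed energy."""
    n = max(1, int(round(window_ms * sample_rate / 1000.0)))
    squared = [v * v for v in ir]
    energies = []
    acc = 0.0
    for i, e in enumerate(squared):
        acc += e
        if i >= n:
            acc -= squared[i - n]  # drop the sample leaving the window
        energies.append(acc)
    peak = max(energies)
    return [10.0 * math.log10(max(e, 1e-12) / peak) for e in energies]

# Synthetic response: direct sound, then 50 weak reflections from 20 to 45 ms.
fs = 2000
ir = [0.0] * 120
ir[0] = 1.0
for i in range(40, 90):
    ir[i] = 0.08  # each tap alone is roughly 22 dB below the direct sound

curve = windowed_energy_db(ir, fs)
```

Each reflection looks negligible on its own, yet the windowed energy of the burst comes out only a few dB below the direct sound, which is the point the measurement makes.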
Hall Reverberation
• The hall reverberation should be primarily LATE
• The ideal reverberation has high values of very early
reflected energy, followed by a strong late decay.
• Using the 480L, a “spread” value over 100 is
• The MC-12 “LIVE” program sounds better. I used a
“shape” value of 3, and a “spread” value of over 100.
The “size” was set to 32, and the RT was 1.9 seconds.
• Demo – best to hear it!!
Audio Secrets 1
Directional microphones
roll off the low frequency
response predictably.
The response of each
microphone is measured
at a distance of ~3
Each microphone is
equalized using the
measured data at the time
of the recording. This
curve is for the CCM 40.
The result is excellent bass response with directional microphones
Audio Secrets 2
This is the equalization
applied to the Hall reverb
return for the rear channels.
Note that the Low
frequencies are boosted
below about 150Hz, and the
high frequencies are reduced
above 4kHz.
This equalization keeps
envelopment high, while
preventing localization of the
reverb to the rear.
The reverb return to the front
left and right boosts the bass,
but the treble is flat
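A bass boost of this kind is commonly realized as a low-shelf filter. The sketch below computes biquad coefficients in the widely used Audio EQ Cookbook form and evaluates the response. The 150 Hz corner comes from the slide; the +4 dB gain, the 44.1 kHz rate, and the function names are assumptions, since the actual curve is shown only as a plot.

```python
import math

def low_shelf(f0, gain_db, fs, slope=1.0):
    """Low-shelf biquad coefficients (Audio EQ Cookbook form), with a0 = 1."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    cw, sw = math.cos(w0), math.sin(w0)
    alpha = sw / 2.0 * math.sqrt((A + 1.0 / A) * (1.0 / slope - 1.0) + 2.0)
    k = 2.0 * math.sqrt(A) * alpha
    b = [A * ((A + 1) - (A - 1) * cw + k),
         2 * A * ((A - 1) - (A + 1) * cw),
         A * ((A + 1) - (A - 1) * cw - k)]
    a = [(A + 1) + (A - 1) * cw + k,
         -2 * ((A - 1) + (A + 1) * cw),
         (A + 1) + (A - 1) * cw - k]
    return [v / a[0] for v in b], [v / a[0] for v in a]

def gain_db_at(b, a, f, fs):
    """Magnitude response of the biquad at frequency f, in dB."""
    w = 2.0 * math.pi * f / fs
    z1 = complex(math.cos(w), -math.sin(w))  # z^-1 on the unit circle
    h = (b[0] + b[1] * z1 + b[2] * z1 * z1) / (a[0] + a[1] * z1 + a[2] * z1 * z1)
    return 20.0 * math.log10(abs(h))

# Assumed settings: +4 dB shelf below 150 Hz at a 44.1 kHz sample rate.
b, a = low_shelf(f0=150.0, gain_db=4.0, fs=44100.0)
```

The treble cut for the rear returns would be the matching high-shelf filter with a negative gain.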
Audio Secrets 3
Typical loudspeaker
response rolls off the
Since the microphones are
flat, it is useful to boost
the high frequencies in the
front channels.
A small amount of bass
boost is also added –
beyond what is needed to
correct for the directional
• Western classical music consists of many musical lines of
equal importance.
• Conventional video assumes the viewer has a tiny, low
resolution video screen, and a short attention span.
– The result is brief close-up pictures of a single violin, alternating
with the conductor’s nose, or the tongue of an opera singer.
• We take another approach:
– What resolution is needed to convey all the musical lines?
– What screen size?
– We assume the viewer is interested in all the music, and may want
to view the performance several times.
– We need to see all the performers, all the time, and let the viewer
decide which performers to watch.
• Demo – PCP Shostakovich. Four performers on stage,
DVD quality.
• With 100+ people on stage, we need HIGH DEFINITION
High Definition
• US HDTV broadcast is “1080i”
– Typically this means a 16x9 picture with 1440 horizontal pixels, and
1080 vertical pixels.
• This implies a rectangular pixel, with greater resolution vertically than
– The horizontal lines are interlaced, with 540 lines per field, and 60
– Interlaced fields only work well with CRT projectors, and not with
any digital display.
• All digital displays must “deinterlace” the picture.
• The most common digital display format is “720p”
– 720p has 1280 horizontal pixels by 720 vertical pixels, using a square
pixel for a 16x9 picture.
• HD cameras come in two types – 1080i, and 720p. In practice
both yield about the same resolution.
• Resolution – the number of lines a camera (or
display) can reproduce – depends on intrinsic
resolution and on contrast. Some factors are:
The sharpness of the lens and the accuracy of focus
The number of lines in the sensor
The bandwidth of the video readout circuits
The method of video compression
The noise level in the sensor
How the sensor data is read out.
• All these factors affect resolution and contrast!
HD professional vs HDV consumer
• Professional HD cameras use three sensor chips, typically 2/3”
– These chips are expensive, and require large, relatively expensive
– The advantage is that a large lens gathers a lot of light. Each pixel in
the sensor gets a healthy number of photons.
• The result is low video noise.
• HDV (consumer) cameras use 3/8” sensors.
– The lenses are smaller, lighter, and less expensive.
– But the video noise is higher.
– The more pixels on a chip (for high resolution) the higher the noise and
the lower the effective film speed.
– HDV cameras attempt to overcome video noise by delivering lower
resolution than they claim.
– HDV cameras use MPEG video compression in the camera to allow
storing a HD image on standard DV cassettes – but this degrades
Sony HVR-Z1U
• The “professional” Sony HDV camera claims to be 1080i.
– The sensor is 960 pixels horizontal by 1080 vertical, with one sensor for each
of 3 colors.
– The green sensor is offset from the other two by ½ a horizontal pixel, giving a
theoretical maximum resolution on a black and white image of 1440x1080.
• But the edge contrast is poor.
• To reduce video noise, adjacent vertical pixels are averaged together to
form each field.
– The result is low edge contrast in the vertical direction.
• MPEG compression further reduces resolution and contrast.
• The electronic image stabilization “steady shot” reduces the theoretical
resolution to ¼ of the available pixels.
– But you can (and must) turn it off.
• In practice, the resolution is about 1200 pixels horizontally by 800
vertically – similar to 720p – but the edge contrast is very low.
• The Sony camera delivers a low-contrast interlaced image.
– Viewing the image without deinterlace is quite unpleasant.
Note the “jaggies” on the
conductor’s hands.
Most digital displays
deinterlace by blending
fields, reducing the
vertical resolution to 540
pixels at best.
For best results, we must
use “smart”
deinterlacing, which only
blends pixels that are
different in each field.
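The idea of blending only where the fields disagree can be sketched in a few lines. This is a deliberately simplified, single-frame illustration and not the algorithm of any particular plug-in: a pixel on an odd line is replaced by the average of its vertical neighbors only when it differs from them by more than a threshold. The function name and the threshold value are assumptions.

```python
def smart_deinterlace(frame, threshold=16):
    """Motion-adaptive deinterlace of a woven frame (list of rows of luma values).

    Even rows are treated as the reference field. A pixel on an odd row is
    kept when it agrees with its vertical neighbors (static area, so full
    vertical resolution is preserved) and replaced by their average when it
    does not (motion, so the two fields differ).
    """
    out = [row[:] for row in frame]
    for y in range(1, len(frame) - 1, 2):
        above, below = frame[y - 1], frame[y + 1]
        for x, value in enumerate(frame[y]):
            interpolated = (above[x] + below[x]) / 2
            if abs(value - interpolated) > threshold:
                out[y][x] = interpolated
    return out

static = [[10, 10], [10, 10], [10, 10], [10, 10]]
combed = [[0, 0], [100, 0], [0, 0], [100, 0]]  # comb artifact in column 0

unchanged = smart_deinterlace(static)
repaired = smart_deinterlace(combed)
```

A single-frame test like this also flags fine static vertical detail as motion; practical filters usually compare against the previous frame as well.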
Smart Deinterlace
• Sony Vegas editor does not include smart de-interlace.
But Mike Crash in
Czechoslovakia has
written a nice one (free).
Here is the same picture
de-interlaced by Mike
The picture has also been
sharpened by two unsharp
mask plug-ins.
Some increase in video
noise is visible – this is
less problematic when the
image moves.
• The low edge contrast in the Sony camera can be improved by
using two “unsharp mask” plug-ins – at 1440x1080 pixels.
The right side of
this image is
deinterlaced and
sharpened, the left
side is direct from
the camera.
I use two masks in
series, the first
with amount 1, the
second with
amount 0.5.
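An unsharp mask adds back a scaled difference between the image and a blurred copy of it. The sketch below chains two passes with amounts 1 and 0.5 as described above. The 3x3 box blur stands in for whatever blur the plug-in actually uses, and the function names are hypothetical.

```python
def box_blur(img):
    """3x3 box blur with edge replication."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out

def unsharp_mask(img, amount):
    """sharpened = original + amount * (original - blurred), clipped to 0..255."""
    blurred = box_blur(img)
    return [[min(255.0, max(0.0, p + amount * (p - b)))
             for p, b in zip(row, brow)]
            for row, brow in zip(img, blurred)]

def sharpen_twice(img):
    """Two masks in series: amount 1, then amount 0.5."""
    return unsharp_mask(unsharp_mask(img, 1.0), 0.5)

edge = [[100.0, 100.0, 150.0, 150.0] for _ in range(3)]
sharpened = sharpen_twice(edge)
```

Flat areas pass through unchanged, while the step across an edge grows, which is the gain in edge contrast the slides show.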
Test Patterns
[Figure: test patterns at 500 and 800 lines horizontal]
Raw data from camera, expanded 2x. Note the pixels are not square.
Same data after sharpening with two unsharp masks, amount 1 and amount 0.5. Notice the excess sharpening at 500 lines.
The sharpening takes time…
• The cost of sharpening the video is computer time.
• The unsharp mask works in two dimensions, and increases the
contrast between adjacent pixels, one pixel at a time.
• When combined with smart deinterlace, the calculations can
take more than two seconds per frame.
• It takes about a week of computer time to sharpen the Verdi
requiem on a 3GHz Pentium 4.
• Sony Vegas must be set for the native camera resolution of
1440x1080 for the sharpening operation.
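The order of magnitude of the rendering time follows from simple arithmetic. The sketch below assumes a 90-minute performance and 30 frames per second; both are illustrative assumptions, since the slides give only the per-frame cost and the overall result.

```python
def render_days(duration_min, frames_per_second, seconds_per_frame):
    """Total processing time in days for a given per-frame cost."""
    frames = duration_min * 60 * frames_per_second
    return frames * seconds_per_frame / 86400.0

# Assumed: 90 minutes of video at 30 frames per second.
at_two_seconds = render_days(90, 30, 2.0)    # 3.75 days
at_four_seconds = render_days(90, 30, 4.0)   # 7.5 days
```

At exactly two seconds per frame the job takes just under four days, so a cost somewhat above that figure is consistent with about a week of computer time.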
• USA HDTV and the Sony camera are 1080i
– This uses a rectangular pixel, 1.33x1
– And has 1440 pixels horizontal, 1080 vertical
• Most displays use a square pixel, and have lower
– Most of the best current displays are 1280x720, with a
square pixel.
• Thus the image must be scaled before it can be
• Scaling is similar to a sample rate conversion.
– The sample rate is the number of lines.
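The pixel geometry can be checked with exact fractions. This sketch uses the figures from the slides, with the 1.33x1 pixel taken as exactly 4:3; the variable names are illustrative.

```python
from fractions import Fraction

def display_width(stored_width, pixel_aspect):
    """Width in square pixels of an image stored with non-square pixels."""
    return stored_width * pixel_aspect

src_w, src_h = 1440, 1080
par = Fraction(4, 3)                     # the 1.33x1 pixel, taken as 4:3

square_w = display_width(src_w, par)     # 1920 square pixels wide
picture_aspect = Fraction(square_w, src_h)

dst_w, dst_h = 1280, 720
h_scale = Fraction(dst_w, src_w)         # stored columns: 8/9
v_scale = Fraction(dst_h, src_h)         # lines: 2/3
```

The two scale factors differ, so the conversion to 720p is a separate resampling in each direction with its own ratio.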
Sample rate conversion
• We know how to sample rate convert audio:
– Up-sample to a multiple of the current sample rate
– Low pass filter to smoothly fill in the missing samples
– Interpolate to a multiple of the new sample rate
– Low pass filter and down-sample.
• This process is far too complex to use for video.
– Standard video scaling introduces many artifacts
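For reference, the audio-style procedure listed above can be sketched for a single column of pixels. Going from 1080 to 720 lines is a ratio of 2/3: up-sample by 2, low-pass filter, keep every third sample. The filter length, the Hann window, and the function names are assumptions chosen for brevity, not a recommendation for production scaling.

```python
import math

def windowed_sinc(cutoff, taps):
    """Low-pass FIR, cutoff as a fraction of the sample rate (0 to 0.5),
    Hann window, normalized to unity gain at DC. taps should be odd."""
    mid = (taps - 1) / 2
    h = []
    for n in range(taps):
        x = n - mid
        ideal = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (taps - 1))
        h.append(ideal * window)
    total = sum(h)
    return [v / total for v in h]

def resample(x, up, down, taps=61):
    """Rational resampling by up/down: zero-stuff, low-pass, decimate."""
    stuffed = []
    for v in x:
        stuffed.append(v * up)              # restore the level lost by zero-stuffing
        stuffed.extend([0.0] * (up - 1))
    h = windowed_sinc(0.5 / max(up, down), taps)
    half = taps // 2
    out = []
    for i in range(0, len(stuffed), down):  # compute only the samples that are kept
        acc = 0.0
        for k, c in enumerate(h):
            j = i + half - k
            if 0 <= j < len(stuffed):
                acc += c * stuffed[j]
        out.append(acc)
    return out

column = [1.0] * 108                        # a flat column of pixels
scaled = resample(column, up=2, down=3)     # 108 lines become 72
```

Even this short filter costs dozens of multiplications per output pixel in each direction, which gives a sense of why simpler scalers are common.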
Scaling artifacts
Down-sample the sharpened data
from 1080i to 720p
The same down-sampling, but
adding a single unsharp mask
Notice that the loss of resolution and edge contrast is partly or mostly restored
by adding a single minimum radius unsharp mask.
The scalers in displays (and in Sony Vegas) do not do this. We have to do it in
a second pass during the rendering process. This adds more computer time!
Screen size
• A high resolution image is of little use if the screen is
small and far away.
– Even if the viewer can perceive the detail, the emotional power
may be lost.
• For audio/video a minimum screen size fills the distance
between the front loudspeakers, or ±30 degrees.
– Research by Kimio Hamasaki at NHK shows that larger screens
can be even better.
• Alas – such a large screen may be uncomfortable to
watch with standard DVDs.
– Current cinematography assumes a smaller screen.
• It is not clear how to resolve this dilemma.
– But with larger screens at lower prices, we may see a shift in
how movies are made and viewed.
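The ±30 degree criterion translates directly into a screen width for a given viewing distance. A minimal sketch follows; the 3 m viewing distance is only an example value and the function name is hypothetical.

```python
import math

def screen_width(viewing_distance_m, half_angle_deg=30.0):
    """Screen width that subtends +/- half_angle_deg at the viewer."""
    return 2.0 * viewing_distance_m * math.tan(math.radians(half_angle_deg))

width_at_3m = screen_width(3.0)  # about 3.46 m
```

The width comes out at roughly 1.15 times the viewing distance, far wider than a typical living-room display.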
• It is possible to make videos of music performances that re-create the
excitement and involvement of a live performance.
– The sonic goal is to capture the direct sound of each instrument clearly and
with low leakage between sections.
– Video with minimalist cinematography can be very effective when the
resolution is sufficient to capture the emotions of the performers.
– For a string quartet DVD quality video is adequate
• But for large forces much higher resolution – and larger screen sizes – are needed.
– Current technology is just barely able to do the job – but care must be taken
with every step – including:
