Originally published in Sound and Image in Nov/Dec 2008, v.22#2, pp.86-88
This problem no longer exists and we can put it down to early wrinkles in new technology. But a new version of it has reared its head, due to modern video technology. This affects each of us differently. Let us see why, and what can be done about it.
Let's say that I am standing next to you. If you speak to me, the sound of your voice, relative to my view of your lips, will be delayed by around one millisecond (one thousandth of a second). If I am standing in one corner of my office and you are standing in the other corner, about seven metres away, your voice will be delayed by about twenty milliseconds. Now twenty thousandths of a second may seem like too short a time for a person to notice any difference, but I have just this moment made an audio file with two clicks separated by twenty milliseconds, and they are clearly distinguishable to any ear.
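If you'd like to check those figures yourself, here is a small Python sketch. It assumes sound travels at roughly 343 metres per second in air (the exact speed varies a little with temperature):

```python
# Delay for sound to travel a given distance, at an assumed ~343 m/s.

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 degrees C

def acoustic_delay_ms(distance_m: float) -> float:
    """Milliseconds for sound to cover distance_m."""
    return distance_m / SPEED_OF_SOUND * 1000.0

for d in (0.34, 7.0, 50.0):
    print(f"{d:5.2f} m -> {acoustic_delay_ms(d):6.1f} ms")

# Output:
#  0.34 m ->    1.0 ms   (standing next to someone)
#  7.00 m ->   20.4 ms   (across the office)
# 50.00 m ->  145.8 ms   (the wood-chopping field below: ~0.15 s)
```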
Yet here's the odd thing: my brain will delay my vision of the movement of your lips so that it matches the sound of your voice. Likewise, your brain will do the same to your vision of my lips when I reply.
How far away we can get from each other before that capability evaporates depends upon each of us individually. Me? I am very insensitive to lip sync issues. I can be a very long way away indeed. For each of us there is some threshold of difference between sound and vision. As we approach that individual threshold, the two remain in sync in our brain, even though they are in reality further and further apart in time. Once each of us reaches our threshold, the whole alignment of the two simply collapses and they become distinct, unrelated sensations.
The capability also varies depending on what you are hearing. If I am in a field chopping wood and you walk to within, say, fifty metres of me, we can have a (somewhat shouted) conversation. Even though the sound of my voice will be some fifteen hundredths (0.15) of a second behind the vision, chances are your brain will temporally slide one with respect to the other so that my lips match the sound. But once we stop talking and I resume chopping wood, you will see the axe strike, and hear the distinct sound a short moment later.
That's the background on what our mental A/V processing circuits do in the real world, and it's a pretty impressive job. How does this relate to the pretend world of home theatre?
First, you have to understand that with home theatre the problem is actually the reverse of the real-world situation. Out in that field, my voice follows the sight of my lips moving. With home theatre, the delay is usually imposed by the display device. That is, the sound is often actually ahead of the vision!
This is due to the design of modern displays. So let us briefly compare them with older displays.
A traditional CRT TV was an immensely complicated beast. It took an electrical signal, amplified it, and used it to modulate a beam of electrons shooting down an evacuated glass tube, spraying them onto the coloured phosphors painted on its face. Complicated, but fast. There was almost no delay inherent in this process: within a few microseconds of a section of the picture signal entering the TV, it would appear as a glowing phosphor. The topmost, leftmost dot of phosphor on the screen would be illuminated with the start of a particular film frame. A row of dots would be painted horizontally across the screen, then another row just below that, and so on. Eventually the bottommost, rightmost phosphor would light up -- nearly one 25th of a second after the frame began to be painted on the display. But even that dot would still appear just the smallest instant after the signal describing it was received.
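To put some illustrative numbers on that raster scanning, here is a little Python sketch. It assumes a PAL-style display of 25 frames per second and a nominal 625 lines, and it ignores interlacing and blanking intervals, so treat the figures as rough:

```python
# Rough CRT raster timing: 25 frames/s, nominally 625 lines per frame
# (PAL-style figures, assumed; interlacing and blanking ignored).

FRAME_RATE = 25        # frames per second
LINES_PER_FRAME = 625  # nominal PAL line count

line_time_us = 1_000_000 / (FRAME_RATE * LINES_PER_FRAME)  # 64 us per line

for line in (1, 312, 625):
    ms_into_frame = line * line_time_us / 1000
    print(f"line {line:3d} painted about {ms_into_frame:4.1f} ms into the frame")

# line   1 painted about  0.1 ms into the frame  (top of screen)
# line 312 painted about 20.0 ms into the frame  (halfway down)
# line 625 painted about 40.0 ms into the frame  (bottom: nearly 1/25 s)
#
# The key point: each line still lags its own bit of signal by mere
# microseconds; only the complete frame takes 1/25 s to finish painting.
```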
That was then. Now we use 'panel' displays. I use 'panel' to refer not just to plasma and LCD displays, but also to front projectors, since these also use panels, just rather small ones (whether they be LCD, DMD or LCoS panels). Panels do not lend themselves to having the image painted sequentially across them as a series of lines. This is for reasons that would take another page or two to explain, so please just take my word for it!
So instead of the image appearing virtually instantly, it must be delayed, even in the most basic of panel displays. What happens is the same kind of thing that happens in computers. Each pixel of the panel has a piece of memory associated with it: a brightness value is poked into the byte (or more) allocated to that pixel, and the pixel responds. There are two banks of this memory, and only one is 'driving' the panel at a time. The bank connected to the panel holds the previous frame, while the current frame, arriving at the display's input terminals, is still filling the other bank. The display won't switch over to the bank containing the new frame until it is full. For a full progressive scan frame, that means the memory is filled nearly one 25th of a second after the first bit of signal for that frame was received. So, on average, the picture will display one fiftieth of a second later than it would on a CRT (the first pixel received will be displayed one 25th of a second late, while the last pixel will appear at roughly the same time as it would have on a CRT).
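Here is a minimal sketch of that double-buffering arrangement, in Python. The class and names are purely illustrative -- no real display's firmware looks like this -- but the latency arithmetic at the end is the point:

```python
# Two banks of pixel memory: one drives the panel while the other fills.

FRAME_PERIOD = 1 / 25  # seconds to receive one full progressive frame

class DoubleBufferedPanel:
    def __init__(self, num_pixels: int):
        self.banks = [[0] * num_pixels, [0] * num_pixels]
        self.front = 0                     # bank currently driving the panel

    def receive_frame(self, pixels: list) -> None:
        back = 1 - self.front
        self.banks[back] = list(pixels)    # in reality this takes ~1/25 s
        self.front = back                  # swap only once the bank is full

    def displayed(self) -> list:
        return self.banks[self.front]

# The first pixel of a frame waits in the back bank for a whole frame
# period before being shown; the last pixel is shown almost at once.
# On average, then, the picture lags the signal by half a frame period:
print(f"average added delay: {FRAME_PERIOD / 2 * 1000:.0f} ms")  # 20 ms
```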
Have you stuck with me so far? If you've kind of glided over all that, don't worry. The net result is that at its very quickest, a panel display will hang the video up with respect to the sound by one fiftieth of a second, or twenty milliseconds.
But few displays are so fast. Some build their frames using motion-adaptive algorithms. That means they have to examine sequential fields (a field is half a frame) to see which bits of the picture have moved over time. Each additional field examined adds at least another fiftieth of a second -- twenty milliseconds -- of delay. That is simply the time required to receive the extra video field. There is also the time involved in performing the millions of computations required.
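A quick back-of-envelope tally, assuming PAL's 50 fields per second (so each buffered field costs 20 milliseconds); the field counts are illustrative, not measurements of any particular TV:

```python
# Delay added just by waiting for extra fields to arrive, at 50 fields/s.

FIELD_PERIOD_MS = 1000 / 50  # 20 ms per field (PAL, assumed)

def buffering_delay_ms(extra_fields: int) -> float:
    return extra_fields * FIELD_PERIOD_MS

for fields in (1, 2, 4):
    print(f"{fields} extra field(s) examined -> at least "
          f"+{buffering_delay_ms(fields):.0f} ms")

# 1 extra field(s) examined -> at least +20 ms
# 2 extra field(s) examined -> at least +40 ms
# 4 extra field(s) examined -> at least +80 ms
```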
Then there are the even more advanced processing systems offered by some displays, such as frame interpolation. These go under various names -- 'Natural Motion', '100Hz Motion Flow' and so forth -- but they all work in much the same way. They take two sequential frames from the video, compare them, and calculate an intermediate frame half-way between the two. They insert this newly created frame between the actual received frames, thereby doubling the apparent rate at which the frames were shot. This can smooth the appearance of motion considerably in many scenes.
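To make the idea concrete, here is a toy Python illustration. Real 'Natural Motion'-style systems use motion-compensated interpolation; the plain per-pixel average below is only a crude stand-in, but it shows where the created frame sits in the sequence:

```python
# Toy frame interpolation: insert a synthetic frame midway between each
# pair of received frames, doubling the apparent frame rate. Frames are
# represented here as flat lists of pixel brightness values.

def midpoint_frame(a, b):
    """A frame 'half-way' between a and b (here: a per-pixel average)."""
    return [(pa + pb) / 2 for pa, pb in zip(a, b)]

def interpolate_stream(frames):
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)                     # the real, received frame
        out.append(midpoint_frame(a, b))  # the newly created frame
    out.append(frames[-1])
    return out

# Two received 2-pixel frames in, three frames out; the middle one is new.
print(interpolate_stream([[0.0, 0.0], [1.0, 1.0]]))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```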
But all that processing takes time.
To deal with that, TVs incorporate delay mechanisms that buffer the sound long enough to bring it back into synchronisation with the video. All good modern home theatre receivers do as well, and usually support a delay of up to 200 milliseconds, or one fifth of a second.
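That audio delay is, conceptually, just a first-in, first-out buffer. A minimal Python sketch, assuming 48kHz PCM audio (the sample rate and names are illustrative):

```python
# Delay audio by holding samples in a FIFO: at 48,000 samples per second,
# a 200 ms delay means buffering 9,600 samples per channel.

from collections import deque

SAMPLE_RATE = 48_000  # samples per second (assumed)

class AudioDelay:
    def __init__(self, delay_ms: float):
        n = int(SAMPLE_RATE * delay_ms / 1000)
        self.fifo = deque([0.0] * n, maxlen=n)  # primed with silence

    def process(self, sample: float) -> float:
        """Push one sample in; get back the sample from delay_ms ago."""
        delayed = self.fifo[0]    # oldest sample in the buffer
        self.fifo.append(sample)  # newest in; maxlen discards the oldest
        return delayed

delay = AudioDelay(200)  # the 200 ms maximum mentioned above
print(len(delay.fifo), "samples buffered per channel")  # 9600
```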
Version 1.3 of HDMI introduced an automatic lip sync feature. If implemented, it allows the TV to tell a home theatre receiver how much delay is inherent in its video processing so that the receiver can automatically apply a suitable amount of delay to the sound.
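In effect, the negotiation amounts to something like the following sketch. To be clear, this is not the actual HDMI protocol or any real API -- just the logic, with illustrative names and numbers:

```python
# Auto lip sync, in spirit: the display reports its video latency and the
# receiver matches it with audio delay, up to whatever it can provide.

RECEIVER_MAX_DELAY_MS = 200  # a typical receiver ceiling (see above)

def audio_delay_for(reported_video_latency_ms: int) -> int:
    """Choose the audio delay that best re-aligns sound with picture."""
    return min(reported_video_latency_ms, RECEIVER_MAX_DELAY_MS)

for latency in (40, 120, 250):
    print(f"TV reports {latency} ms video latency -> "
          f"delay audio by {audio_delay_for(latency)} ms")

# TV reports 40 ms video latency -> delay audio by 40 ms
# TV reports 120 ms video latency -> delay audio by 120 ms
# TV reports 250 ms video latency -> delay audio by 200 ms (the limit)
```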
That can be vitally important for the enjoyment of your home theatre system. One TV that we look at in this issue imposes so much delay on its video that it requires the home theatre receiver to respond with a full 200 milliseconds of delay to the sound.
If you speak at a normal speed and say the word 'Cat', it will take you about 200 milliseconds to utter the word. That is a long, long delay.
The almost unbelievable wonders that a modern display can perform in improving video quality demand that you use a home theatre receiver capable of applying commensurate delay to the sound.