WGDC 2000 - DirectPlay Voice

Voice comms for multiplayer games

Feature by Gestalt Contributor

Published on 10 Jul 2000

With the recent boom in internet gaming, it's perhaps no surprise that Microsoft have seen fit to overhaul "DirectPlay" for the new version of their DirectX API.

DirectPlay has, until now, been one of the less used aspects of DirectX, with the developers of most hardcore games choosing to code their own multiplayer support from the ground up rather than using DirectPlay. But with DirectPlay 8 Microsoft are hoping all that will change, improving support while adding several new features, including DirectPlay Voice.

The Basics

DirectPlay Voice (DPV) is based on the old Battlefield Communicator software, whose creators Shadow Factor have since been absorbed into Microsoft's evil empire. In fact, the man presenting the DirectPlay lectures at WGDC 2000 was none other than Shadow Factor's Paul Newson.

DPV is an API to allow voice communications over the internet to be used in multiplayer games, and it integrates with DirectPlay and DirectSound to work seamlessly with games using them. One of the benefits of this is that, for example, you can use the new DirectSound effects on your voice communications in some circumstances. But we'll get into that later...

Voice communications with DPV are activated in one of two ways - "Voice Activation" (VA), where the computer picks up the sound of your voice and only transmits data when you are talking, and "Push To Talk" (PTT), where you have to press or hold down a button when you want to talk. You can also combine the two methods to get the best of both worlds - the user can control when other players will be able to hear them, but they only transmit data when they are actually saying something.
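
To make the combination concrete, here is a minimal Python sketch of the decision a client might make for each captured audio frame. It is not DPV's API - the function name, the `frame_level` measurement and the `threshold` value are all assumptions made up for illustration.

```python
def should_transmit(ptt_held, frame_level, threshold=0.1,
                    use_ptt=True, use_va=True):
    """Decide whether one captured audio frame should be sent.

    ptt_held    -- True while the (hypothetical) talk button is down
    frame_level -- loudness of the frame, 0.0 (silence) to 1.0 (full scale)
    threshold   -- assumed level above which we treat the frame as speech
    """
    # Push To Talk: nothing goes out unless the player is holding the button.
    if use_ptt and not ptt_held:
        return False
    # Voice Activation: nothing goes out unless the frame is loud enough.
    if use_va and frame_level < threshold:
        return False
    return True
```

With both switches on, a frame is only sent when the button is held and the player is actually speaking, which is the "best of both worlds" case described above.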

DPV also sports "Automatic Gain Control" (AGC - is that enough acronyms for you yet?), which simply means that the computer tries to keep the volume of your speech more or less constant - if you start shouting it will turn down the gain on your microphone, and then return it to normal when you calm down again.
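
As a rough illustration of the idea, the Python sketch below nudges a gain value towards whatever would bring the measured level to a target. This is a generic feedback loop, not the algorithm DPV uses; `target`, `rate` and the gain limits are invented values.

```python
def agc_step(gain, frame_peak, target=0.5, rate=0.1,
             min_gain=0.1, max_gain=4.0):
    """Return the updated microphone gain after one audio frame.

    frame_peak -- peak level of the raw (pre-gain) frame, 0.0 to 1.0
    target     -- assumed level we would like the amplified voice to sit at
    rate       -- how far to move towards the ideal gain on each frame
    """
    if frame_peak <= 0.0:
        return gain  # silence tells us nothing, so leave the gain alone
    desired = target / frame_peak           # gain that would hit the target exactly
    new_gain = gain + rate * (desired - gain)
    return max(min_gain, min(max_gain, new_gain))
```

Feed it a run of loud frames and the gain drifts down; feed it quieter frames afterwards and it drifts back up again.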

Compression

Just sending your voice over the internet as a streaming wave file is obviously not on, as it would rapidly consume your bandwidth. So DPV comes with a whole range of audio "codecs" to compress your speech, and the developer can choose which fits their game best.

Compression means that your speech will take up anything from 64kbit/s down to a measly 1.2kbit/s, depending on the available bandwidth and how important the sound quality is. This means that even if you are playing over an old-fashioned analogue modem, you can still take advantage of voice comms without worrying about it bringing your game stuttering to a stand-still.

In a worst case scenario though (where everybody in the game is shouting at each other at once), you could still be facing one incoming and one outgoing stream for every other player in the game. In a normal (peer to peer) game, you have to transmit your voice to all of the other players when you speak. Similarly, you will be receiving your opponents' voices separately, with each player taking up some of your precious bandwidth whenever they talk.
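
The arithmetic is easy to sketch. The Python below works out that worst case for a peer to peer game, using the codec rates quoted above; the function name is made up, and it ignores packet headers and the game's own traffic, so real figures would be higher.

```python
def peer_to_peer_worst_case_kbps(players, codec_kbps):
    """Return (upstream, downstream) voice bandwidth in kbit/s for one player.

    Assumes everyone is talking at once and every player sends a
    separate copy of their voice to every other player.
    """
    others = max(players - 1, 0)
    upstream = others * codec_kbps     # one outgoing copy per other player
    downstream = others * codec_kbps   # one incoming stream per other player
    return upstream, downstream
```

For an eight player game that is seven streams each way: roughly 8.4kbit/s in each direction with the 1.2kbit/s codec, but 448kbit/s with the 64kbit/s one, which is far beyond any analogue modem.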

In some cases this isn't a problem. If you only have a few players in the game, or if the game itself doesn't use much bandwidth and isn't affected by lag (eg, a card game), then peer to peer works fine. It can also work fine in team games such as flight sims and other military games, where players are only talking to their team-mates, and don't tend to engage in random radio chatter.

Forwarding

For larger games this is clearly not a reasonable approach though, however much you compress the players' voices. Luckily DPV offers some alternative ways of doing things...

The most general is the "Forwarding Server". Instead of transmitting your speech separately to every other player in the game, you now have a central server (eg, the dedicated server running the game in the case of a first person shooter) which all the players transmit their voices to. This central server then forwards the voice streams to the appropriate players.

The advantage of this method is that you now only send a single stream when you speak, although you still receive separate streams every time other players talk to you. But in most cases this shouldn't be too much of a problem, as if you had more than a few players talking at once you wouldn't be able to make out a word they were saying anyway...

The disadvantage is that you now need a central server, with lots of bandwidth to transmit voice streams to all the players. Sending separate streams for each player's voice does have its advantages though - as the user's computer receives the voices separately, it can also process them separately for positional audio and other effects.

Special Effects

As we pointed out earlier, DPV can be integrated with DirectSound and DirectPlay. This means that you can get a player's location from DirectPlay, and then take the voice stream and apply 3D positional audio effects to it with DirectSound.

For example, if another player runs in front of you while they are talking, you can hear the sound of their voice moving from left to right as they pass by you. You can also use attenuation effects, so that the further away from you a player is, the quieter their voice sounds to you.
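
To give a feel for what is going on under the hood, here is a simplified 2D Python sketch that turns two positions into left and right speaker volumes. It is not how DirectSound does it: the listener is assumed to face along the +y axis, voices in front and behind are treated alike, and the pan law and `ref_distance` fall-off are illustrative choices.

```python
import math

def voice_gains(listener_xy, speaker_xy, ref_distance=1.0):
    """Return (left_gain, right_gain) for a talking player.

    The listener is assumed to face +y, so +x is to their right.
    Inside ref_distance the voice plays at full volume; beyond it the
    volume falls off in proportion to 1/distance.
    """
    dx = speaker_xy[0] - listener_xy[0]
    dy = speaker_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    if distance == 0.0:
        pan, attenuation = 0.0, 1.0
    else:
        pan = dx / distance                      # -1 hard left, +1 hard right
        attenuation = min(1.0, ref_distance / distance)
    angle = (pan + 1.0) * math.pi / 4.0          # 0 (left) to pi/2 (right)
    # Constant-power pan, scaled by the distance attenuation.
    return attenuation * math.cos(angle), attenuation * math.sin(angle)
```

Calling this every frame as the other player runs past would sweep their voice from one speaker to the other, and dropping further back would make it quieter in both.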

More complex effects can also be added. Imagine playing some sort of multiplayer special forces game, where your team is equipped with radio head-sets. You could start by applying a static effect to the incoming radio communications, and then play them back to a single speaker to simulate the effect of having a headphone in one ear.

If the player who is talking is close to you, you could also use a positional audio effect so that you can hear their character talking into their microphone. Depending on the kind of location you are in, you might also want to add a reverb or echo effect to that speech, whilst leaving the radio sound unchanged.

All of this is possible whether you are using a peer to peer or a forwarding server arrangement for your voice communications, and it could potentially add a lot of extra realism to many types of hardcore multiplayer game, as well as adding a whole new dimension to teamplay games and military style simulations.

In The Mix

The third and final way of handling voice comms in DPV is using a "mixing server". Again, you only send one stream to the central server, but in this case the server also mixes all the appropriate streams together for each user, and sends it out as a single stream.

The obvious advantage of this is that now each player is only sending and receiving one stream at a time, greatly reducing the potential bandwidth requirements for them, as well as reducing the bandwidth required from the server compared to the forwarding server method.
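
A quick Python sketch shows how the two server arrangements compare when everybody talks at once. The function and its dictionary keys are invented for illustration, and it counts streams rather than bytes.

```python
def worst_case_streams(topology, players):
    """Count simultaneous voice streams when every player is talking.

    Returns a dict with the streams each client sends and receives,
    and the streams the server receives and sends.
    """
    others = max(players - 1, 0)
    if topology == "forwarding":
        # The server relays each voice separately to every other player.
        return {"client_up": 1, "client_down": others,
                "server_in": players, "server_out": players * others}
    if topology == "mixing":
        # The server sends each player one pre-mixed stream.
        return {"client_up": 1, "client_down": 1,
                "server_in": players, "server_out": players}
    raise ValueError("unknown topology: " + topology)
```

With 16 players, a forwarding server could be pushing out 240 streams in the worst case, against just 16 for a mixing server.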

The downside is that mixing all those audio streams on the server requires it to have a lot of processing power. And because you are now only sending a single stream to each player, the game can no longer use positional audio and other sound effects on the voice comms.

In theory you could do the processing on the server, but DPV doesn't allow this as not only would it require even more CPU power from the server, but also when the server compressed the stream to send it out to the player it would ruin many of the effects you were trying to achieve.

The only other problem is that you are compressing and decompressing the voice streams twice. The player compresses their voice stream, sends it to the server, the server decompresses it, mixes it with other voice streams, compresses it again, and then sends it out. This can obviously reduce audio quality if you are using a low bandwidth audio codec.
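
The mixing step in the middle of that chain can be sketched in a few lines of Python. This is a generic sum-and-clip mixer working on already decompressed 16-bit samples, not DPV's implementation; leaving the listener's own voice out of their mix is an assumption, and the decompress and recompress stages either side are omitted.

```python
def mix_for_listener(frames, listener):
    """Mix one frame of decoded audio for a single listener.

    frames   -- dict mapping player name to a list of 16-bit PCM samples,
                all the same length
    listener -- the player this mix is for; their own voice is left out
    """
    sources = [samples for player, samples in frames.items()
               if player != listener]
    if not sources:
        return []
    mixed = []
    for samples in zip(*sources):
        total = sum(samples)
        # Clamp to the 16-bit range so loud overlaps clip instead of wrapping.
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

The server has to run something like this once per player for every frame, which is where the CPU cost comes from.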

If you have the processing power on the server though, this is the best way of handling games with large numbers of players, particularly when the game itself requires a lot of bandwidth. The players are not mixing or transmitting multiple streams, and so their bandwidth and CPU requirements are lower than for other methods.

Buffer

That's not the end of the cleverness though... DPV also features an "adaptive buffer" which collects incoming voice streams and accumulates them for you to maintain the optimal combination of smoothness and low latency.

On a poor internet connection with bad lag spikes, the buffer will be fairly long. This delays the voice streams from being played back on your computer, introducing some added latency, but it does mean that if you suffer a short burst of lag the voice playback will continue to smoothly play back from the buffer while your connection recovers.

If you are on a LAN though, the buffer will be virtually non-existent, allowing the voice stream to play back with negligible latency, as you don't have to worry about losing your connection while you are playing.
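
One way such a buffer could size itself is sketched below in Python. It is a guess at the general approach rather than DPV's actual logic: it tracks how irregularly packets arrive and holds back enough frames to cover twice that jitter. The class, the 20ms frame length, the smoothing factor and the 2x margin are all hypothetical.

```python
import math

class AdaptiveBuffer:
    """Pick a playback buffer depth from observed packet arrival jitter."""

    def __init__(self, frame_ms=20.0, min_frames=1, max_frames=25,
                 smoothing=0.125):
        self.frame_ms = frame_ms        # assumed audio carried per packet
        self.min_frames = min_frames
        self.max_frames = max_frames
        self.smoothing = smoothing      # weight given to the newest sample
        self.jitter_ms = 0.0
        self._last_arrival_ms = None

    def on_packet(self, arrival_ms):
        """Record the arrival time of one voice packet."""
        if self._last_arrival_ms is not None:
            gap = arrival_ms - self._last_arrival_ms
            deviation = abs(gap - self.frame_ms)
            # Running average of how far arrivals stray from the ideal gap.
            self.jitter_ms += self.smoothing * (deviation - self.jitter_ms)
        self._last_arrival_ms = arrival_ms

    def target_frames(self):
        """Frames to hold before playback: enough to ride out 2x the jitter."""
        needed = math.ceil(2.0 * self.jitter_ms / self.frame_ms)
        return max(self.min_frames, min(self.max_frames, needed))
```

Packets arriving like clockwork keep the buffer at its minimum, while a connection with regular lag spikes pushes the target up, trading latency for smooth playback.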

There is also a DPV setup routine for games, which checks that the user has a duplex soundcard (capable of playing and recording sound at the same time), and that their microphone is plugged in and switched on. It also checks the gain on the microphone, so that when the game starts the volume is at least in the right ball park, ready for the automatic gain control to take over.

Conclusion

Voice comms has a whole variety of uses, from general online chat and casual gaming through to multiplayer flight sims, online RPGs, squad based shooters, and other team based games. After all, BattleComm was originally developed by a group of Quakers who wanted a better way of communicating during Capture The Flag sessions.

Now that technology is being built into DirectX, along with a whole range of royalty free compression codecs, allowing game developers to build support for voice comms into their games more easily and more cheaply. How that affects online gaming over the next few years, and how developers get around the inherent problems with voice communications, should certainly be interesting...
