Wednesday, February 21, 2018

Age DE's latency matrix

Getting the netcode to work reliably in a peer to peer multiplayer title is tricky. Every peer must be able to quickly and reliably send and receive packets with every other peer, or the whole thing falls apart. Also, if any machine runs a turn slower than expected for any reason, the entire system will hitch while waiting for the slow peer to catch up. DE constantly monitors the pings and framerates of all connections and machines in a MP game, but it can only compensate so much for bad connections.

In DE's game lobby there's a 2D matrix of blocks that shows the systemwide roundtrip latencies between all players (circled in red):


Once you're in a lobby, the game's peer to peer multiplayer code (parts of which date back to the original game) is active, and your machine is actively communicating with all the other machines. Every 4 seconds your machine pings all the other clients, the results are sent to the host, and every few seconds the host then sends the entire matrix to all peers.

For each row of this matrix, the latency to all the other players is visualized. So the first block on row 2 represents the latency from player 2 to player 1, and the third block on row 2 is the latency from player 2 to player 3, etc. Grey means no response (yet), green is <200ms ping, yellow is <=400ms, and red is >400ms. The game won't start if there are any grey blocks (even in "dedicated server" mode). The ping matrix is not necessarily symmetrical, but usually is.
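For the curious, the color thresholds map to something like this (a hypothetical sketch in C++, not DE's actual code):

enum class BlockColor { Grey, Green, Yellow, Red };

// Map a (possibly missing) roundtrip ping to a lobby matrix block color.
BlockColor get_block_color(bool got_response, int ping_ms)
{
    if (!got_response)
        return BlockColor::Grey;   // no response (yet) - blocks game start
    if (ping_ms < 200)
        return BlockColor::Green;
    if (ping_ms <= 400)
        return BlockColor::Yellow;
    return BlockColor::Red;
}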

If a block has a thin blue rectangle around it, that means that client has to use a TURN server relay to get its packets to the other client due to NAT traversal issues. This means extra overhead.

The latencies visualized here are low pass filtered over approx. 8 pings.
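A simple moving average over the last 8 samples is one way to do this kind of filtering; here's a minimal sketch (DE's actual filter may differ):

#include <array>
#include <cstdint>

class PingFilter
{
    std::array<uint32_t, 8> m_samples{};
    uint32_t m_count = 0, m_next = 0;

public:
    void add_sample(uint32_t ping_ms)
    {
        m_samples[m_next] = ping_ms;
        m_next = (m_next + 1) % (uint32_t)m_samples.size();
        if (m_count < m_samples.size())
            m_count++;
    }

    // Average of the samples seen so far (up to the last 8).
    uint32_t get_filtered_ping_ms() const
    {
        if (!m_count)
            return 0;
        uint64_t total = 0;
        for (uint32_t i = 0; i < m_count; i++)
            total += m_samples[i];
        return (uint32_t)(total / m_count);
    }
};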

The "Ping" column shows the local roundtrip latencies to the other clients. Each player will have its own unique column of values. Apart from maybe the host, I think this column is kind of useless, because it's only showing local latencies. It would have been better if it displayed each player's worst latency.

This matrix is used to compute the turntimes used during the actual game. I believe all peer to peer titles should display something like this, to help players quickly see at a glance how healthy the connections are between peers.


Wednesday, February 14, 2018

Lessons learned while developing Age of Empires 1 Definitive Edition

In late 2016 I began helping Forgotten Empires on Age 1 DE, a UWP app shipping in the Windows Store on Feb 20th. I only helped occasionally for the first couple months or so (because I was working on Basis and an aerospace project), but as the title got closer to shipping I spent more and more of my time working on Age problems. We started with the original 20 year old Age 1 codebase. Here are some of the things I've learned:

1. Get networking and multiplayer working early.
DE supports both traditional peer to peer (with optional TURN server relaying to handle problematic NAT routers), and a new client-server-like mode ("host command forwarding") where all clients send their commands to the host, which forwards them to the other clients. Age 1 uses a lockstep simulation model, except for most AI code, which is only executed on the host (see here).

Do not underestimate the complexity of lockstep peer to peer RTS multiplayer games. If possible, choose an already debugged/shipped low-level networking library so you can focus on higher-level game-specific networking problems.

If you do use an off the shelf network library, test it thoroughly to help build a mental model of how it actually works (vs. how you think it works or how it's supposed to work). Develop a test app you can send the library developers to reproduce problems. If the library supports reliable in-order messaging then (at the minimum) put sequence numbers in all of your packets and assert if the library drops, reorders or duplicates packets in case there are bugs in the reliable layer.
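Here's the kind of sanity check I mean, as a sketch (all names hypothetical): every reliable message gets a sequence number on send, and the receiver asserts that the library delivered messages exactly in order:

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct SeqHeader { uint32_t seq; };

// Prepend a sequence number to every reliable message before handing
// it to the networking library.
std::vector<uint8_t> wrap_reliable(uint32_t& next_send_seq, const void* msg, size_t len)
{
    SeqHeader hdr{ next_send_seq++ };
    std::vector<uint8_t> buf(sizeof(hdr) + len);
    memcpy(buf.data(), &hdr, sizeof(hdr));
    memcpy(buf.data() + sizeof(hdr), msg, len);
    return buf;
}

// Call for each reliable message the library delivers, per sender.
void check_reliable(uint32_t& expected_recv_seq, const uint8_t* buf, size_t len)
{
    assert(len >= sizeof(SeqHeader));
    SeqHeader hdr;
    memcpy(&hdr, buf, sizeof(hdr));
    // If this fires, the "reliable" layer dropped, reordered or duplicated a packet.
    assert(hdr.seq == expected_recv_seq);
    expected_recv_seq++;
}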

For debugging purposes, make sure all timeouts can be increased by a factor of 10x or more. Debugging real-time network code in a debugger is sometimes impossible (pausing at a breakpoint stalls the whole session), so be prepared to do a lot of printf()-style debugging on multiple machines.

If you're taking an old codebase and changing it to use a new networking library or API, try to (at first) minimize the amount of changes you make to the original code. No matter how ugly it is, the original code worked, was bug fixed, and shipped; don't underestimate the value of that.

If you develop your own reliable messaging system, develop a network simulator testbed (which simulates packet loss, etc.) to automate the validation of this layer whenever it's modified and always keep it working.
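A minimal sketch of such a simulator (a real one should also simulate bandwidth limits, corruption, latency spikes, etc.):

#include <cstdint>
#include <functional>
#include <queue>
#include <random>
#include <vector>

class NetSimulator
{
    struct Pending { double deliver_time; std::vector<uint8_t> data; };
    struct Later { bool operator()(const Pending& a, const Pending& b) const
        { return a.deliver_time > b.deliver_time; } };

    std::priority_queue<Pending, std::vector<Pending>, Later> m_queue;
    std::mt19937 m_rng{ 12345 }; // fixed seed so test runs are repeatable

public:
    float drop_rate = .05f, dup_rate = .02f;
    double min_latency = .03, max_latency = .25; // seconds - reordering falls out of the jitter

    void send(double now, const std::vector<uint8_t>& pkt)
    {
        std::uniform_real_distribution<float> unit(0.0f, 1.0f);
        if (unit(m_rng) < drop_rate)
            return; // simulated packet loss
        std::uniform_real_distribution<double> lat(min_latency, max_latency);
        int copies = (unit(m_rng) < dup_rate) ? 2 : 1; // simulated duplication
        for (int i = 0; i < copies; i++)
            m_queue.push({ now + lat(m_rng), pkt });
    }

    // Deliver everything whose simulated latency has elapsed.
    void update(double now, const std::function<void(const std::vector<uint8_t>&)>& deliver)
    {
        while (!m_queue.empty() && m_queue.top().deliver_time <= now)
        {
            deliver(m_queue.top().data);
            m_queue.pop();
        }
    }
};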

Trust nothing and verify everything at multiple levels. CRC your packets, CRC the uncompressed data if you use packet compression, use session nonces in your connection-oriented layer to validate connections, validate that your reliable layer is actually reliable, etc. Make sure the initial connection process is well defined and completely understood. Everything needs a timeout of some sort, and when sending unreliable messages remember that any packet can get lost. Gaffer on Games is a great guide to this domain of problems.
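As a sketch of the idea (field sizes are illustrative, and crc32() is assumed to exist):

#include <cstdint>
#include <cstring>

uint32_t crc32(const void* data, size_t len); // assumed available

struct PacketHeader
{
    uint32_t payload_crc32;  // CRC of the (uncompressed) payload
    uint64_t session_nonce;  // random value agreed during connection setup
    uint16_t payload_size;
};

// Reject anything that isn't a well-formed packet from this exact session.
bool validate_packet(const uint8_t* pkt, size_t len, uint64_t expected_nonce)
{
    PacketHeader hdr;
    if (len < sizeof(hdr))
        return false;
    memcpy(&hdr, pkt, sizeof(hdr));
    if (hdr.session_nonce != expected_nonce)
        return false; // stale connection or stray traffic
    if (len != sizeof(hdr) + hdr.payload_size)
        return false;
    return crc32(pkt + sizeof(hdr), hdr.payload_size) == hdr.payload_crc32;
}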

Getting the game to run smoothly with X random machines across a variety of network conditions is difficult. Plan on spending a lot of time tuning the system which controls the game's turntime (command latency and sim tick rate). There are multiple sources of MP hitches (which players hate): Turntime too low (so one or more machines can't keep up with the faster ones), random CPU spikes caused by AI/pathing/etc., reliable messaging retransmit delays, random client latency spikes, AI's sending too much command data, etc. Develop strong tools to track these problems down when they occur in the field and not in your test lab.
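As a purely hypothetical illustration of the constraints involved (this is not DE's actual formula), a turntime has to satisfy both a network bound and a CPU bound:

#include <algorithm>
#include <cstdint>

uint32_t compute_turntime_ms(uint32_t worst_filtered_ping_ms,
                             uint32_t slowest_sim_tick_ms,
                             uint32_t sim_ticks_per_turn)
{
    const uint32_t safety_margin_ms = 50; // illustrative padding for jitter/retransmits
    // Commands must have time to reach every peer before the turn executes...
    uint32_t network_bound = worst_filtered_ping_ms + safety_margin_ms;
    // ...and the slowest machine must be able to finish simulating the turn.
    uint32_t cpu_bound = slowest_sim_tick_ms * sim_ticks_per_turn;
    return std::max(network_bound, cpu_bound);
}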

Add cheat commands to the game to help simulate a wide range of various networking and framerate conditions.

If you send unreliable ping/pong packets to measure roundtrip client latency, filter the results because some routers are quite noisy. The statistics that go into computing the sim tick rate and turntimes should be well filtered.

Establishing the initial connections between two random machines behind NAT's is still a challenging problem - test this early.

Identify your most important packets and consider adding some form of forward error correction to them to help insulate the system from packet loss. In lockstep designs like Age, the ALL_DONE packets sent by each client to every other client to indicate end of turn are the most important and currently sent twice for redundancy. (Excluding AI's, there are no commands from the player on most turns!)
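The simplest form of FEC is plain repetition, which is essentially what the doubled ALL_DONE send is. A sketch (helpers assumed):

#include <cstdint>
#include <vector>

std::vector<uint8_t> make_all_done_packet(int turn_id);        // assumed helper
void queue_unreliable_packet(const std::vector<uint8_t>& pkt); // assumed helper

void send_all_done(int turn_id)
{
    // Repetition coding: a single lost datagram no longer stalls the turn.
    for (int copy = 0; copy < 2; copy++)
        queue_unreliable_packet(make_all_done_packet(turn_id));
    // Receivers key off turn_id and harmlessly ignore the duplicate.
}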

Internal testing doesn't mean much. You must run MP betas to discover the real problems. It's virtually impossible to simulate network conditions as they occur in the wild, or how the game behaves on customer machines. Make sure you get valuable test data back from MP betas to help diagnose problems.

Age DE's reliable messaging system is based on Brownlow's "A Reliable Messaging Protocol" in GPG 5. This is an elegant and simple NACK-based reliable protocol, except the retransmit method described in the article isn't powerful enough and is sensitive to network latency (supporting only 1 packet retransmit request per roundtrip). We had to modify the system to support retransmit packets containing 64-bit bitmasks indicating which specific packets needed to be resent.
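The modified request looks something like this (a sketch; field names are illustrative, not the shipping struct):

#include <cstdint>

void resend_packet(uint32_t seq); // assumed helper that re-queues the original packet

struct RetransmitRequest
{
    uint32_t base_seq;     // first sequence number covered by the mask
    uint64_t missing_mask; // bit i set => packet (base_seq + i) was never received
};

// One request can now NACK up to 64 packets per roundtrip instead of 1.
void handle_retransmit_request(const RetransmitRequest& req)
{
    for (uint32_t i = 0; i < 64; i++)
        if (req.missing_mask & (1ULL << i))
            resend_packet(req.base_seq + i);
}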

2. Develop strong out of sync (OOS) detection tools early, and learn how to use them.
As a lockstep RTS codebase is modified you will introduce many mysterious and horrifying OOS problems. Don't let them smolder in the codebase, fix them early and fix new ones ASAP.

Functions which are not safe to use in the lockstep simulation should be marked as such. We had an accessor function which returned true if the entire map was visible, which accidentally got used in some code that determines whether a building can be placed at a location. This caused OOS's whenever one player resigned (which locally exposes the entire map) and another client built walls. This little OOS took 2 days to track down.
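One cheap way to do the marking (a sketch, not DE's actual mechanism) is to assert at runtime if the function is ever called from inside the sim:

#include <cassert>

bool g_in_sim_update = false; // set/cleared around the lockstep sim tick

#define NOT_SIM_SAFE() \
    assert(!g_in_sim_update && "non-deterministic function called from the lockstep sim")

// Example: the accessor that caused our 2-day OOS hunt, now marked.
bool is_entire_map_visible()
{
    NOT_SIM_SAFE(); // local-only state: resigning reveals the map on this client only
    // ... actual visibility lookup here ...
    return false;
}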

If you are getting mysterious OOS's, you need to identify the initial cause of divergence and fix that, then repeat the OOS debugging process until no more divergences remain. Don't waste time looking at downstream effects (such as out of sync random number generators) - identify and fix that first divergence.

In Age, the original developers logged virtually everything they could in the lockstep sim. Some important events (such as where objects were being created) were left out, so we had to add unique "origin" parameters to all object creations to record exactly where in the code each object was created.
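The origin parameter idea, sketched (all names hypothetical):

#include <cstdint>
#include <cstdio>

struct Object;
Object* alloc_object(int type_id); // assumed engine helper

// One enum value per creation site in the codebase.
enum class CreateOrigin : uint16_t
{
    PlayerBuildCommand,
    AIBuildOrder,
    MapGeneration,
    UnitTrained,
    ProjectileSpawn,
    // ...
};

Object* create_object(int type_id, CreateOrigin origin)
{
    // Goes into the sync log, so an OOS diff shows exactly which code
    // path created the diverging object.
    printf("create type=%d origin=%u\n", type_id, (unsigned)origin);
    return alloc_object(type_id);
}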

3. Do not underestimate the complexity and depth of UWP and Xbox Live development.
Your team will need at least 1-2 developers who live and breathe these platforms. These individuals are rare, so you'll just have to bite the bullet and invest in these technologies.

4. Develop clean and defensive coding practices early on. Use static analysis, use debug heaps, pay attention to warnings, etc. Being sloppy here will increase your OOS rate and cause player and developer pain. Be smart and use every tool at your disposal.

5. Do not disable or break "old" logging code. Make sure it always compiles.
This logging code is invaluable for tracking down mysterious/rare problems and OOS's. The original developers put all this logging code in there for a reason.

6. Add debug primitives if the engine doesn't have any
This is a basic quality of life thing: You need the ability to efficiently render 2D text, debug primitives in the world, etc. If the engine doesn't support them then get them in early.

7. Profile early and make major engine architectural decisions based off actual performance metrics.
If your new renderer design relies on a specific way of rendering the game in a non-mainstream manner, then verify that your design will actually work in a prototype before betting the farm on it. Be willing to pivot to an alternate renderer design with better performance if your initial design is too slow.

Get perf. up early: Lockstep RTS multiplayer games can only tick the simulation at the rate of the slowest machine in the game. So if one machine is a dog and can only handle 20Hz, the game will feel choppy for everyone. Other major sources of perf problems like pathing or AI spikes will be obscured if rendering is running slow.

8. Figure out early on how to split up a single-threaded engine to be multithreaded.
Constraining an RTS to live on only a single thread is a recipe for performance disaster, especially if you are massively increasing the max map size and pop caps vs. the original title.

9. Many RTS systems rely on emergent behavior and are interdependent.
If you modify one of these systems, you MUST test the hell out of it before committing, and then be prepared to deal with the unpredictable downstream effects.

For example, modifying the movement code in subtle ways can break the AI, or cause it to behave suboptimally. The movement code in Age1 DE is like Starcraft's: an unholy mess. To be successful modifying code like this you must deeply understand the game and the entire system's emergent behavior.

Carelessly hacking the movement or path finding code in an RTS is akin to hacking the kernel in an OS: expect chaos.

10. Automated regression testing
The more you automate testing of movement, AI, etc. and make its results objective, the happier your life will be and the better you'll sleep at night.

11. Playtest constantly and with enough variety
It's not enough to just play against AI's on the same map over and over. Vary it up to exercise different codepaths. You MUST playtest constantly to understand the true state of the title.

12. Assume the original developers knew what they were doing.
The old code shipped and was successful. If you don't understand it, most likely the problem is you, not the code.

For example, Age 1's original movement system has some weird code that accelerates objects as they move downhill. This code doesn't have a max velocity cap, so on very long hills units can move very quickly. We resisted modifying this code because it turns out to be a subtle but important aspect of combat on hills.

13. Don't waste time developing new templated containers and switching the engine to use them, but do reformat and clean up the old code.
Nobody will have the time to figure out your fancy new custom container classes; they'll just use std, because we all know how it works.

Instead, spend that time making the old code readable so it can be enhanced without the developers going crazy trying to understand it: fix its formatting, add "m_" prefixes, etc.

Sunday, February 4, 2018

10 abusive company types

These categories were originally about abusive men, but my friend Stephanie noticed these categories could be adapted to describe abusive companies, too. From the book "Why Does He Do That?":

1. Drill Sergeant: Micromanages you, wants to control everything.

2. Mr. Sensitive: Builds up a public image of being a great company so people think you're crazy if you criticize them.

3. The Water Torturer: Is an expert at not doing anything OBVIOUSLY wrong; you feel wronged but can't pinpoint why, and wonder if you're crazy.

4. The Demand Man (or Company): Everything seems fine if you never ask for anything, like a raise. If you do that, you're suddenly painted as ungrateful and treated poorly.

5. Mr. Right: Everything is fine so long as you don't question the company's actions or say anything critical about them.

6. The Player: Never lets you feel like the job is stable. Acts interested in you only to hook you in, then you're neglected and treated poorly again.

7. Rambo: Treats everyone like shit, but tells you you're special and an exception.

8. The Victim: You caused the company so much trouble, you really messed up that one time, any mistreatment happening to you now is making up for that.

9. The Terrorist: Reminds you of the power they have to ruin your career or life, so you better not go against them.

10. Bipolar: The company oscillates between being angry and then happy with you depending on the state of your current project. They become angry when a problem is identified, and when you fix it they are temporarily happy.


Friday, November 24, 2017

Universal GPU texture codec update

I've reduced the size of the ETC1->DXT1 lookup table to around 85KB, vs. the previous 3.75MB. There's a slight loss in quality (around .1 - .3 dB), but it's worth it. The larger table can still be used. The worst artifacts occur on very high contrast blocks. The size of this table is a baseline tax (especially on web) for using this codec, so it must be lightweight.

The previous conversion table was 4D, one dimension for each component of the ETC1 base color (5:5:5 bits) and a final dimension for the intensity value (3 bits). The new method is 2D: one dimension for the 5-bit component, and another for the intensity. There are two tables, one for R/B and another for G, because in DXT1 G is 6 bits and R/B are 5. There are some additional complexities, but that's the gist of it. The transcoder has to do a tiny bit of per-block work in this scheme to determine how to map the ETC1 selectors to DXT1 selectors, but it all boils down to some table lookups and adds.
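My reconstruction of the rough table shape from this description (the entry contents are illustrative guesses, not the real layout):

#include <cstdint>

struct DXTConvEntry
{
    uint8_t lo, hi;          // DXT1 endpoint component values
    uint8_t selector_remap;  // info used to map ETC1 selectors to DXT1 selectors
    // (plus whatever error/variant data the real entries carry)
};

// Indexed by [5-bit ETC1S base component][3-bit intensity].
DXTConvEntry g_etc1_to_dxt1_rb[32][8]; // R and B: DXT1 components are 5 bits
DXTConvEntry g_etc1_to_dxt1_g[32][8];  // G: DXT1 component is 6 bits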

The 85KB table can be precomputed, computed on the fly, or computed once at init.

[Image comparison: Original, ETC1 near-optimal, ETC1S (the universal texture), DXT1, and DXT5A versions.]

Thursday, November 23, 2017

More universal GPU texture format examples

I've improved the quality of the ETC1S->DXT1 conversion process. All of these images come from the same exact ETC1 data. Only a straightforward transform is required on the compressed texture bits to derive the DXT1/DXT5A version. It's simple/fast enough to do in a Javascript transcoder.

[Seven image sets follow, each showing the ETC1, DXT1, and DXT5A versions transcoded from the same data.]

Universal GPU texture format: DXT5 support

Got grayscale ETC1 to DXT5A conversion working, using a small 32*8*3 entry table. This work is for DXT5 support in the universal texture format. Now that this is working I can proceed to finishing the full universal encoder. 

The groundwork is laid and it's all downhill from here. My main worry now is the ETC1S->DXT1 lookup table's size, which is currently around 3-4MB. It can be quickly computed dynamically at startup or on the fly as needed, or it can be precomputed into the executable.

Note that none of these images were created with my best ETC1 encoder. They use an early prototype from late 2016 that has so-so quality. The main point of these experiments is to prove that efficiently converting ETC1 data to DXT1/5 is practical and looks reasonable. The encoder is not yet aware of DXT5A transcoding, but it is aware of the ETC1S->DXT1 transcoding (which helps a lot).

All stats are dB vs. the original image. This image's subtle gradients are hard to handle; you can see this in the DXT1 version.

To those who argue that a universal GPU texture format based on ETC1/DXT1 isn't high enough quality: You would be amazed at the low quality levels teams use with crunch/Basis. This tech isn't about achieving the highest texture quality. It's about enabling easy distribution of supercompressed GPU texture data. It's a "JPEG-like format for GPU texture data", usable on mobile or desktop.

[Two image sets follow; PSNR in dB:]

Set 1: Original; ETC1 near-optimal 48.903; ETC1S 46.322 (universal format base image in ETC1 mode); ETC1S->DXT1 45.664; ETC1S green channel converted to DXT5A 43.878

Set 2: Original; ETC1 near-optimal 51.141; ETC1S 46.461; ETC1S->DXT1 44.865; ETC1S green channel converted to DXT5A 46.107

Wednesday, November 22, 2017

"Universal" GPU texture/image format examples

The DXT1 images were directly converted from the ETC1 (really "ETC1S" - a compatible subset with no subblocks) data using a straightforward lookup table to convert the ETC1 base color to the DXT1 low/high colors, and the selectors were remapped appropriately using a byte from the lookup table. The ETC1->DXT1 lookup table is currently 3.75MB, and can be computed on the fly very quickly (using a variant of ryg_dxt) or (for higher conversion quality) precomputed offline.
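Here's roughly what that transcode looks like per block. The struct layouts and table shape are my guesses from the description above, not the shipping code:

#include <cstdint>

struct ETC1SBlock { uint8_t base_r5, base_g5, base_b5, intensity3; uint32_t selectors; };
struct DXT1Block  { uint16_t color0, color1; uint32_t selectors; };

struct ConvEntry { uint16_t color0, color1; uint8_t selector_remap; };

// Indexed by the 5:5:5 ETC1S base color and 3-bit intensity;
// precomputed offline or computed quickly at init.
ConvEntry g_etc1s_to_dxt1[32][32][32][8];

DXT1Block transcode_block(const ETC1SBlock& src)
{
    const ConvEntry& e =
        g_etc1s_to_dxt1[src.base_r5][src.base_g5][src.base_b5][src.intensity3];
    DXT1Block dst{ e.color0, e.color1, 0 };
    // Remap each of the 16 2-bit ETC1 selectors to its DXT1 equivalent.
    for (int i = 0; i < 16; i++)
    {
        uint32_t s = (src.selectors >> (i * 2)) & 3;    // ETC1 selector
        uint32_t d = (e.selector_remap >> (s * 2)) & 3; // remapped DXT1 selector
        dst.selectors |= d << (i * 2);
    }
    return dst;
}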

The encoder in these examples is still my old prototype from 2016. I'm going to be replacing it with Basis's much better ETC1S encoder next. This format can also support alpha/grayscale data.

This format is a tradeoff: for slightly reduced quality, you can distribute GPU textures to most GPU's on the planet. Encode once, use anywhere is the goal. We are planning on distributing free encoders for Linux and Windows (and eventually OSX but it's not my preferred dev platform).

The current intermediate format design supports doing none, some, or all of the transcoding on the GPU. Full transcoding on the GPU will only work on GPU's that support LZ in hardware (or possibly a compute shader). The process of converting the ETC1S data to DXT1, and the block unpack to either ETC1 or DXT1, can also be done in a shader or on the CPU. By comparison, crunch's .CRN design is 100% CPU oriented. We'll be releasing the transcoder and format as open source. It's an LZ RDO design, so it's compatible with any LZ (or whatever) lossless codec, including GPU hardware LZ codecs. It'll support bitrates around .75-2.5 bpp for RGB data (using zlib).

All PSNR figures are luma PSNR. The ETC1 images were software decoded from the ETC1 block texture data (really "ETC1S", because all 4x4 pixel blocks use 5:5:5 base colors with no subblocks, so the differential color is 0,0,0).

[Five image pairs follow:]

ETC1 41.233 / DXT1 40.9
ETC1 45.964 / DXT1 45.322
ETC1 46.461 / DXT1 44.865
ETC1 43.785 / DXT1 43.406
ETC1 33.516 / DXT1 33.339