Optimization algorithms for the rendering pipeline
This section covers the optimizations that meshoptimizer provides for the rendering pipeline.
Data processing pipeline
When optimizing a mesh, you should typically feed it through a set of optimizations (the order is important!):
1. Indexing
2. (optional) Simplification
3. Vertex cache optimization
4. Overdraw optimization
5. Vertex fetch optimization
6. Vertex quantization
7. (optional) Vertex/index buffer compression
Indexing
Most algorithms in this library assume that a mesh has a vertex buffer and an index buffer. For the algorithms to work well, and for the GPU to render your mesh efficiently, the vertex buffer must contain no redundant vertices; you can generate an index buffer from an unindexed vertex buffer, or reindex an existing (potentially redundant) index buffer, as shown below.
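The snippets in this section assume a vertex layout roughly like the one below; the exact fields are an assumption, chosen to be consistent with the position and normal accesses used later on this page:

struct Vertex
{
    float x, y, z;    // position (read as a float3 by the overdraw optimizer)
    float nx, ny, nz; // normal
    float tu, tv;     // texture coordinates (hypothetical)
};

// unindexed input: face_count * 3 vertices, three per triangle
std::vector<Vertex> unindexed_vertices;

With unindexed_vertices and face_count filled in, the remap table is generated like so: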
size_t index_count = face_count * 3;
std::vector<unsigned int> remap(index_count); // allocate temporary memory for the remap table
size_t vertex_count = meshopt_generateVertexRemap(&remap[0], NULL, index_count, &unindexed_vertices[0], index_count, sizeof(Vertex));
Note that in this case we only have an unindexed vertex buffer; the remap table is generated based on binary equivalence of the input vertices, so the resulting mesh will render the same way. Binary equivalence considers all input bytes, including padding which should be zero-initialized if the vertex structure has gaps.
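For example, if Vertex contains implicit padding, clearing each element before filling its fields keeps the padding bytes deterministic; this is a small sketch rather than a library requirement (memset is declared in <cstring>, and px/py/pz etc. stand in for your source data):

Vertex v;
memset(&v, 0, sizeof(v)); // zero every byte, including padding, so binary comparison behaves as expected
v.x = px; v.y = py; v.z = pz;
v.nx = nx; v.ny = ny; v.nz = nz;
unindexed_vertices.push_back(v);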
After generating the remap table, you can allocate space for the target vertex buffer (vertex_count elements) and index buffer (index_count elements) and generate them:
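A possible allocation, assuming std::vector storage as in the remap-table snippet above (the calls below take raw pointers, so pass these as &vertices[0] and &indices[0]; they are written as bare vertices/indices for brevity):

std::vector<Vertex> vertices(vertex_count); // vertex_count as returned by meshopt_generateVertexRemap
std::vector<unsigned int> indices(index_count);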
meshopt_remapIndexBuffer(indices, NULL, index_count, &remap[0]);
meshopt_remapVertexBuffer(vertices, &unindexed_vertices[0], index_count, sizeof(Vertex), &remap[0]);
You can then further optimize the resulting buffers by calling the other functions on them in-place.
Vertex cache optimization
When the GPU renders the mesh, it has to run the vertex shader for each vertex; usually GPUs have a built-in fixed-size cache that stores the transformed vertices (the result of running the vertex shader), and use this cache to reduce the number of vertex shader invocations. This cache is usually small, 16-32 vertices, and can have different replacement policies; to use this cache efficiently, you have to reorder your triangles to maximize the locality of reused vertex references, like so:
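meshopt_optimizeVertexCache(indices, indices, index_count, vertex_count); // reorders indices in place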
Overdraw optimization
After transforming the vertices, the GPU sends the triangles for rasterization, which generates pixels that are usually first run through the depth test; pixels that pass it get the pixel shader executed to generate the final color. As pixel shaders get more expensive, it becomes more and more important to reduce overdraw. While improving overdraw generally requires view-dependent operations, this library provides an algorithm to reorder triangles to minimize the overdraw from all directions, which you should run after vertex cache optimization like this:
meshopt_optimizeOverdraw(indices, indices, index_count, &vertices[0].x, vertex_count, sizeof(Vertex), 1.05f);
The overdraw optimizer needs to read vertex positions as a float3 from the vertex; the code snippet above assumes that the vertex stores position as float x, y, z.

When performing the overdraw optimization you have to specify a floating-point threshold parameter. The algorithm tries to maintain a balance between vertex cache efficiency and overdraw; the threshold determines how much the algorithm can compromise the vertex cache hit ratio, with 1.05 meaning that the resulting ratio should be at most 5% worse than before the optimization.
Vertex fetch optimization
After the final triangle order has been established, we can still optimize the vertex buffer for memory efficiency. Before running the vertex shader, the GPU has to fetch the vertex attributes from the vertex buffer; the fetch is usually backed by a memory cache, so optimizing the data for locality of memory access is important. To optimize the index/vertex buffers for vertex fetch efficiency, call:
meshopt_optimizeVertexFetch(vertices, indices, index_count, vertices, vertex_count, sizeof(Vertex));
This will reorder the vertices in the vertex buffer to try to improve the locality of reference, and rewrite the indices in place to match; if the vertex data is stored using multiple streams, you should use meshopt_optimizeVertexFetchRemap instead. This optimization has to be performed on the final index buffer, since the optimal vertex order depends on the triangle order.

Note that the algorithm does not try to model cache replacement precisely and instead just orders vertices in the order of use, which generally produces results that are close to optimal.
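For multi-stream vertex data, a possible sketch looks like the following; the Position and TexCoord stream types are hypothetical, positions and texcoords are assumed to be the original per-vertex streams (vertex_count elements each), and the remapped streams are written to fresh buffers:

struct Position { float x, y, z; }; // hypothetical stream 0
struct TexCoord { float u, v; };    // hypothetical stream 1

std::vector<unsigned int> remap(vertex_count);
size_t unique_vertex_count = meshopt_optimizeVertexFetchRemap(&remap[0], indices, index_count, vertex_count);

meshopt_remapIndexBuffer(indices, indices, index_count, &remap[0]);

std::vector<Position> positions_out(unique_vertex_count);
std::vector<TexCoord> texcoords_out(unique_vertex_count);
meshopt_remapVertexBuffer(&positions_out[0], &positions[0], vertex_count, sizeof(Position), &remap[0]);
meshopt_remapVertexBuffer(&texcoords_out[0], &texcoords[0], vertex_count, sizeof(TexCoord), &remap[0]);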
Vertex quantization
To further optimize memory bandwidth when fetching the vertex data, and to reduce the amount of memory required to store the mesh, it is often beneficial to quantize the vertex attributes to smaller types. While this optimization can technically run at any point in the pipeline (and sometimes doing quantization as the first step can improve indexing by merging almost identical vertices), it is generally easier to run it after all other optimizations, since some of them require access to float3 positions.
Quantization is usually domain specific; it's common to quantize normals using 3 8-bit integers but you can use higher-precision quantization (for example using 10 bits per component in a 10_10_10_2 format), or a different encoding to use just 2 components. For positions and texture coordinate data the two most common storage formats are half precision floats, and 16-bit normalized integers that encode the position relative to the AABB of the mesh or the UV bounding rectangle.
The number of possible combinations here is very large but this library does provide the building blocks, specifically functions to quantize floating point values to normalized integers, as well as half-precision floats. For example, here's how you can quantize a normal:
unsigned int normal =
    (meshopt_quantizeUnorm(v.nx, 10) << 20) |
    (meshopt_quantizeUnorm(v.ny, 10) << 10) |
    meshopt_quantizeUnorm(v.nz, 10);
and here's how you can quantize a position:
unsigned short px = meshopt_quantizeHalf(v.x);
unsigned short py = meshopt_quantizeHalf(v.y);
unsigned short pz = meshopt_quantizeHalf(v.z);
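And here is one possible way to store positions as 16-bit normalized integers relative to the mesh AABB, as mentioned above; minx/miny/minz and extent (the largest, non-zero AABB dimension) are assumed to be computed over all vertices beforehand, and the decoder needs them to reconstruct the original positions:

// meshopt_quantizeUnorm clamps its input to [0, 1] before scaling
unsigned short qx = (unsigned short)meshopt_quantizeUnorm((v.x - minx) / extent, 16);
unsigned short qy = (unsigned short)meshopt_quantizeUnorm((v.y - miny) / extent, 16);
unsigned short qz = (unsigned short)meshopt_quantizeUnorm((v.z - minz) / extent, 16);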
Vertex/index buffer compression
In case storage size or transmission bandwidth is important, you might want to additionally compress vertex and index data. While several mesh compression libraries, like Google Draco, are available, they are typically designed to maximize the compression ratio at the cost of disturbing the vertex/index order (which makes the meshes inefficient to render on the GPU) or of decompression performance. They also frequently don't support custom game-ready quantized vertex formats, and thus require re-quantizing the data after loading it, which introduces extra quantization error and makes decoding slower.
Alternatively, you can use general purpose compression libraries like zstd or Oodle to compress vertex/index data; however, these compressors aren't designed to exploit redundancies in vertex/index data, so compression rates can be unsatisfactory.
To that end, this library provides algorithms to "encode" vertex and index data. The result of the encoding is generally significantly smaller than the initial data, and remains compressible with general purpose compressors, so you can either store the encoded data directly (for modest compression ratios and maximum decoding performance) or further compress it with zstd/Oodle to maximize the compression ratio.
Note: this compression scheme is available as a glTF extension EXT_meshopt_compression.
To encode, you need to allocate target buffers (preferably using the worst case bound) and call encoding functions:
std::vector<unsigned char> vbuf(meshopt_encodeVertexBufferBound(vertex_count, sizeof(Vertex)));
vbuf.resize(meshopt_encodeVertexBuffer(&vbuf[0], vbuf.size(), vertices, vertex_count, sizeof(Vertex)));
std::vector<unsigned char> ibuf(meshopt_encodeIndexBufferBound(index_count, vertex_count));
ibuf.resize(meshopt_encodeIndexBuffer(&ibuf[0], ibuf.size(), indices, index_count));
You can then either serialize vbuf/ibuf as is, or compress them further. To decode the data at runtime, call decoding functions:
int resvb = meshopt_decodeVertexBuffer(vertices, vertex_count, sizeof(Vertex), &vbuf[0], vbuf.size());
int resib = meshopt_decodeIndexBuffer(indices, index_count, &ibuf[0], ibuf.size());
assert(resvb == 0 && resib == 0);
Note that vertex encoding assumes that vertex buffer was optimized for vertex fetch, and that vertices are quantized; index encoding assumes that the vertex/index buffers were optimized for vertex cache and vertex fetch. Feeding unoptimized data into the encoders will produce poor compression ratios. Both codecs are lossless - the only lossy step is quantization that happens before encoding.
To reduce the data size further, it's recommended to use meshopt_optimizeVertexCacheStrip instead of meshopt_optimizeVertexCache when optimizing for vertex cache, and to use the new index codec version (meshopt_encodeIndexVersion(1)). This trades off some efficiency in vertex transform for smaller vertex and index data.

Decoding functions are heavily optimized and can directly target write-combined memory; you can expect both decoders to run at 1-3 GB/s on modern desktop CPUs. Compression ratios depend on the data; vertex data compression ratio is typically around 2-4x (compared to already quantized data), and index data compression ratio is around 5-6x (compared to raw 16-bit index data). General purpose lossless compressors can further improve on these results.
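A minimal sketch of that setup, assuming the same indices/index_count/vertex_count as in the earlier snippets:

// strip-friendly triangle order: slightly less efficient vertex transforms, but index data that encodes smaller
meshopt_optimizeVertexCacheStrip(indices, indices, index_count, vertex_count);

// opt in to version 1 of the index codec for subsequent encode calls
meshopt_encodeIndexVersion(1);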
The index buffer codec only supports triangle list topology; when encoding triangle strips or line lists, use meshopt_encodeIndexSequence/meshopt_decodeIndexSequence instead. This codec typically encodes indices into ~1 byte per index, but compressing the results further with a general purpose compressor can improve the results to 1-3 bits per index.
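For example, a line list might be encoded and decoded roughly like this; line_indices (an unsigned int array) and line_index_count are placeholders for your own data:

std::vector<unsigned char> sbuf(meshopt_encodeIndexSequenceBound(line_index_count, vertex_count));
sbuf.resize(meshopt_encodeIndexSequence(&sbuf[0], sbuf.size(), line_indices, line_index_count));

int ress = meshopt_decodeIndexSequence(line_indices, line_index_count, &sbuf[0], sbuf.size());
assert(ress == 0);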
The following guarantees on data compatibility are provided for point releases (no guarantees are given for the development branch):
- Data encoded with older versions of the library can always be decoded with newer versions;
- Data encoded with newer versions of the library can be decoded with older versions, provided that encoding versions are set correctly; if binary stability of encoded data is important, use meshopt_encodeVertexVersion and meshopt_encodeIndexVersion to 'pin' the data versions.
Due to its very high decoding performance and compatibility with general purpose lossless compressors, the compression is a good fit for use on the web. To that end, meshoptimizer provides both vertex and index decoders compiled into WebAssembly and wrapped into a module with a JavaScript-friendly interface, js/meshopt_decoder.js, that you can use to decode meshes that were encoded offline:
// ready is a Promise that is resolved when (asynchronous) WebAssembly compilation finishes
await MeshoptDecoder.ready;
// decode from *Data (Uint8Array) into *Buffer (Uint8Array)
MeshoptDecoder.decodeVertexBuffer(vertexBuffer, vertexCount, vertexSize, vertexData);
MeshoptDecoder.decodeIndexBuffer(indexBuffer, indexCount, indexSize, indexData);
A usage example is available, with source in demo/index.html; this example uses .GLB files encoded using gltfpack.