Tensor Comparison #150

Open
GregHattJr opened this issue Oct 10, 2024 · 0 comments

GregHattJr commented Oct 10, 2024

Description

Enable reading binary tensor files on systems that did not produce them by addressing platform-specific issues in the current serialization process. A number of issues (detailed below) currently prevent the visualizer from consuming these .bin files. To move forward, we'll need to implement a solution in the source repository that generates these files.

Link to Existing POC PR

Issues

  1. Device-Specific Storage
    The current serialization system for DeviceStorage and MultiDeviceStorage depends on device-specific memory (e.g., GPU, FPGA). This is reflected in the following part of the code:

    TT_THROW("Device storage isn't supported");

    Serialization rejects device-resident tensors outright (the code throws), meaning that device-resident data cannot be serialized on one system and deserialized on another that lacks the same hardware.

  2. Memory Configuration
    The MemoryConfig serialized data (e.g., TensorMemoryLayout, BufferType) assumes that the target system can reconstruct the memory architecture. This occurs when writing the memory configuration to the file:

    output_stream.write(reinterpret_cast<const char*>(&layout), sizeof(Layout));
    output_stream.write(reinterpret_cast<const char*>(&storage_type), sizeof(StorageType));

    This layout may depend on the original system’s memory architecture, leading to potential deserialization issues on different systems.

  3. DistributedTensorConfig
    The DistributedTensorConfig specifies how tensors are distributed across multiple devices. The code handles this in the multi-device storage logic:

    std::size_t num_buffers = storage.num_buffers();
    output_stream.write(reinterpret_cast<const char*>(&num_buffers), sizeof(std::size_t));

    This configuration is system-dependent and cannot be easily reproduced on another system without a similar device setup.

  4. Device-Dependent Code Paths
    The MeshDevice configuration depends on the system's multi-device setup. For instance:

    if (device != nullptr) {
        tensor = tensor.to(device, memory_config);
    }

    This code assumes the presence of specific devices (e.g., MeshDevice), making it impossible to deserialize and map tensor data properly if such devices are absent.

  5. Data Types
    The code supports several data types, including hardware-specific ones like BFLOAT16. These types may not be available on all systems:

    DataType data_type;
    input_stream.read(reinterpret_cast<char*>(&data_type), sizeof(DataType));

    System dependency arises here, as some platforms may lack support for certain data types, leading to deserialization errors.

  6. Tensor Layout
    The code serializes tensor layouts (e.g., ROW_MAJOR, TILE), which may be optimized for certain hardware architectures. This layout is read and written as:

    input_stream.read(reinterpret_cast<char*>(&layout), sizeof(Layout));

    If the target system has a different memory architecture, it may not be able to reconstruct the tensor layout correctly.

  7. Device Context During Deserialization
    Deserialization depends on a device context: after the tensor data is read, the code checks for an available device and moves the tensor onto it:

    tensor = tensor.to(device, memory_config);

    Without the necessary devices, this part of the code cannot function properly, leading to deserialization failures on systems without similar hardware.

  8. Version-Specific Serialization
    The code includes version checks to ensure compatibility between different serialization versions:

    if (version_id >= 2) {
        input_stream.read(reinterpret_cast<char*>(&has_memory_config), sizeof(bool));
    }

    Mismatched versions between the writing and reading systems could result in failed or incorrect deserialization.

  9. Custom Buffers and Memory Management
    The custom buffer types OwnedBuffer and BorrowedBuffer manage memory during serialization. The buffer sizes are system-dependent:

    output_stream.write(reinterpret_cast<const char*>(&size), sizeof(size));

    These custom buffers may not translate well across systems with different memory architectures, leading to issues during deserialization.

  10. Endianness and Platform-Specific Binary Formats
    The binary format relies on system-specific properties like endianness, which are not handled explicitly in the current code:

    output_stream.write(reinterpret_cast<const char*>(&size), sizeof(size));

    This could cause byte-swapping issues when reading binary files on systems with a different endianness; see the sketch after this list for one way to pin the byte order.
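
To make items 8 and 10 concrete: the writer can commit to a fixed byte order and always emit a version field, instead of dumping raw structs in the host's native layout. The following is a minimal sketch of that idea in Python (the actual writer is C++, and the write_header/read_header names and field layout here are hypothetical, not the real .bin format):

import struct

# Hypothetical header: version_id (uint32) followed by num_buffers (uint64).
# The "<" prefix pins little-endian with no padding, so the bytes are
# identical regardless of the writing host's native byte order.
HEADER_FORMAT = "<IQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 bytes

def write_header(stream, version_id: int, num_buffers: int) -> None:
    stream.write(struct.pack(HEADER_FORMAT, version_id, num_buffers))

def read_header(stream) -> tuple[int, int]:
    version_id, num_buffers = struct.unpack(HEADER_FORMAT, stream.read(HEADER_SIZE))
    return version_id, num_buffers

A reader that sees an unknown version_id can then fail with a clear error instead of misinterpreting the remaining bytes.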

Proposal

Write All Tensors in a Host-Independent Format

Given that we cannot read the .bin file on a different host system, we need to store the tensor in a host-independent format. Currently the database logic has a conditional that writes the tensor either in the custom TTNN tensor format (producing a .bin file) or with PyTorch's save method (producing a .pt file):

def store_tensor(report_path, tensor):
    import torch

    tensors_path = report_path / TENSORS_PATH
    tensors_path.mkdir(parents=True, exist_ok=True)
    if isinstance(tensor, ttnn.Tensor):
        tensor_file_name = tensors_path / f"{tensor.tensor_id}.bin"
        if tensor_file_name.exists():
            return
        ttnn.dump_tensor(
            tensor_file_name,
            ttnn.from_device(tensor),
        )
    elif isinstance(tensor, torch.Tensor):
        tensor_file_name = tensors_path / f"{tensor.tensor_id}.pt"
        if tensor_file_name.exists():
            return
        torch.save(torch.Tensor(tensor), tensor_file_name)
    else:
        raise ValueError(f"Unsupported tensor type {type(tensor)}")

Unfortunately, simply saving the tensors as .pt is not enough to allow reading them on a different host. The tensors also need to be detached from the autograd graph and moved to host memory via the cpu() method before saving:

def store_tensor(report_path, tensor):
    import torch

    DETACH_SAVED_TENSORS = True  # TODO Read from a configuration

    tensors_path = report_path / TENSORS_PATH
    tensors_path.mkdir(parents=True, exist_ok=True)
    if isinstance(tensor, ttnn.Tensor):
        if DETACH_SAVED_TENSORS:
            tensor_file_name = tensors_path / f"{tensor.tensor_id}.pt"
        else:
            tensor_file_name = tensors_path / f"{tensor.tensor_id}.bin"

        if tensor_file_name.exists():
            return

        if DETACH_SAVED_TENSORS:
            torch_tensor = ttnn.to_torch(tensor)
            torch_tensor = torch_tensor.detach().cpu()
            torch.save(torch_tensor, tensor_file_name)
        else:
            ttnn.dump_tensor(
                tensor_file_name,
                ttnn.from_device(tensor),
            )
    elif isinstance(tensor, torch.Tensor):
        tensor_file_name = tensors_path / f"{tensor.tensor_id}.pt"
        if tensor_file_name.exists():
            return
        torch_tensor = tensor
        if DETACH_SAVED_TENSORS:
            # detach from the autograd graph and move to host memory;
            # torch.Tensor(tensor) is avoided here because it silently casts to float32
            torch_tensor = tensor.detach().cpu()
        torch.save(torch_tensor, tensor_file_name)
    else:
        raise ValueError(f"Unsupported tensor type {type(tensor)}")
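
With tensors written as detached CPU .pt files, the visualizer can read them on any host without a device context. A minimal sketch of the reading side (load_tensor is a hypothetical helper, not part of the existing code):

import torch

def load_tensor(tensor_file_name):
    # map_location="cpu" remaps any serialized storage onto host memory,
    # so the file loads even on machines without the original device.
    return torch.load(tensor_file_name, map_location="cpu")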