Conversion error when using posit2 #359

Open
rkriemann opened this issue Sep 15, 2023 · 6 comments

@rkriemann
Hi,

I'm experimenting with the posit2 implementation in an approximate storage setup (no arithmetic needed) and the error when converting from double/float seems to be way off.

Example:

#include <iostream>
#include <universal/number/posit2/posit.hpp>
int main ( int, char ** ) {
    auto  d1 = float( 1.0 );
    auto  p1 = sw::universal::posit< 16, 2 >( d1 );
    auto  d2 = float( p1 );
    std::cout << d1 << " / " << p1 << " / " << d2 << std::endl;
}

with output

rhs = 1 : significant = 0.5
real fbits : 23
 >>> (+,   0, 0b0'0.100'0000'0000'0000'0000'0000) : 0.5 vs 1
1 / 1.125 / 1.125

I'm using Universal v3.72.1.e6ef6d76 with g++ 13.2.

RGDS
Ronald

@ghost commented Sep 16, 2023

@rkriemann Yes, that is to be expected: posit2 is still in development and not functional.

It is meant to replace posit when it is done; it was never intended to be a separate 'native' number system, I just needed a non-clashing name so both could coexist in the same tree. posit2 is a multi-limb implementation, so it should be a lot faster than the bit-level implementation of the current posit class. All the other number systems in Universal are now limb-based, but posits are showing their age, as they were the first number system implemented way back in 2017. In the posit tree we used specializations to provide fast implementations of the standard posits so that they would be useful in actual application codes, but that left the non-standard posits two orders of magnitude slower, and that is what the new limb-based implementation is trying to fix.

I don't have time to complete the limb-based posit implementation, and I am looking for somebody interested in completing the work. Would you be interested in helping out?

@rkriemann (Author)

> Yes, that is to be expected: posit2 is still in development and not functional.

I thought so, but had hoped that it would at least work reasonably well for representing floating-point numbers, which is all I need (for now). In fact, I'm interested in the optimal storage of arrays of floats at a given precision, i.e., with a bit size that is not necessarily a multiple of 8. Right now I have an IEEE 754 derived scheme, but I was also experimenting with posits.
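
For context, a toy illustration (not the actual scheme referenced above) of what storing values at a bit size that is not a multiple of 8 looks like: N codes of b bits each are packed into ceil(N*b/8) bytes.

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack the low b bits of each code into a contiguous bit stream, so that
// N values occupy ceil(N*b/8) bytes instead of a full byte-aligned word each.
std::vector<uint8_t> pack_bits(const std::vector<uint32_t>& codes, unsigned b) {
    std::vector<uint8_t> buf((codes.size() * b + 7) / 8, 0);
    std::size_t pos = 0;                                // running bit position
    for (uint32_t c : codes)
        for (unsigned i = 0; i < b; ++i, ++pos)
            if (c & (1u << i))
                buf[pos / 8] |= uint8_t(1u << (pos % 8));
    return buf;
}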

> I don't have time to complete the limb-based posit implementation, and I am looking for somebody interested in completing the work. Would you be interested in helping out?

Time is scarce on my side as well, but I may be inclined to look into the conversion part and make sure that the resulting bits are identical to those of the standard posit implementation (assuming that makes sense). However, I cannot guarantee anything :-(.

RGDS
Ronald

@ghost commented Sep 19, 2023

The cfloat<> type is limb-based and represents classical floating-point formats. If all you need is floating-point numbers, it will do the trick.
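
For reference, a minimal round-trip sketch that mirrors the posit2 example from the issue, just with cfloat<>; the <16, 5> configuration is arbitrary, and relying on the defaulted remaining template parameters is an assumption to verify against the cfloat headers.

#include <iostream>
#include <universal/number/cfloat/cfloat.hpp>

int main() {
    using Real = sw::universal::cfloat<16, 5>;  // 16 bits total, 5 exponent bits
    float d1 = 1.0f;
    Real  c1(d1);                               // float -> cfloat conversion
    float d2 = float(c1);                       // cfloat -> float conversion
    std::cout << d1 << " / " << c1 << " / " << d2 << std::endl;  // expect 1 / 1 / 1
}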

It is sad to say, but posit is the only number system that isn't optimized to be limb-based; all the other types are.

The community has been asking for types like FP8, FP16, TensorFloat, and BF16, as well as the more advanced lns and dbns, but nobody has been asking for posits.

@rkriemann (Author)

cfloat<> is basically what I already have, with additional storage optimizations for arrays of floats down to the bit level (minimal memory being the main objective!). But thanks anyway. I had not thought about using it before (I had always only used posits from the Universal library ;-)).

@ghost commented Sep 19, 2023

@rkriemann if you have a reference to compare cfloat<> against, I would love to hear what you think of that implementation. The complete enumeration of subnormal and supernormal configurations for the non-saturating case is fully tested. Non-saturating basically means that the encoding uses +-inf to clip on overflow. I am still working on fully qualifying the saturating configurations, where we do not have overflow but instead 'saturate' to maxpos/maxneg. This is behavior that seems to be advantageous for DL and DSP applications.
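
A sketch of the two overflow behaviors just described, under the assumption that the full template signature is cfloat<nbits, es, BlockType, hasSubnormals, hasSupernormals, isSaturating> (worth verifying against the current header):

#include <cstdint>
#include <iostream>
#include <universal/number/cfloat/cfloat.hpp>

int main() {
    using namespace sw::universal;
    using NonSat = cfloat<8, 4, uint8_t, true, true, false>;  // overflow clips to +-inf
    using Sat    = cfloat<8, 4, uint8_t, true, true, true>;   // overflow saturates to maxpos/maxneg

    NonSat a(1.0e6);   // far outside the 8-bit dynamic range
    Sat    b(1.0e6);
    std::cout << a << " vs " << b << std::endl;   // expected: inf vs maxpos
}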

I'd love to hear if you can leverage cfloat<> productively.

I was planning to add a fast bfloat16 to the mix that leverages the underlying FP32 hardware, but haven't finished that yet.

@rkriemann (Author)

I implemented a storage scheme based on cfloats and compared it to the closest existing implementation (aflp). Source code can be found at

https://gitlab.mis.mpg.de/rok/hlr/-/blob/master/include/hlr/utils/detail/cfloat.hh
https://gitlab.mis.mpg.de/rok/hlr/-/blob/master/include/hlr/utils/detail/aflp.hh

As for the application (hierarchical matrices): a given dense matrix is partitioned into many blocks (of different sizes), and for most blocks a low-rank approximation is computed and the dense data is replaced by the approximation, thereby already introducing an error. The data in all blocks is then represented (independently) via cfloat/aflp, with the (minimal) number of precision bits chosen such that the overall error is not increased. The mantissa bits are then increased so that the total number of bits is a multiple of 8, for byte-aligned storage. The result is a data representation with a very small memory footprint that still permits full matrix arithmetic (the arithmetic is still done in FP64!).
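
To make the bit-budget step concrete, here is a hypothetical helper (the names and the half-ulp error criterion are assumptions, not the hlr code) that picks a mantissa width for a target eps and rounds the total size up to a byte boundary:

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Smallest mantissa width m whose relative rounding error 2^-(m+1) does not exceed eps.
unsigned mantissa_bits(double eps) {
    return unsigned(std::ceil(-std::log2(eps) - 1.0));
}

// Total bit size (sign + exponent + mantissa) rounded up to a multiple of 8.
unsigned byte_aligned_bits(unsigned exponent_bits, unsigned m) {
    unsigned total = 1 + exponent_bits + m;
    return ((total + 7) / 8) * 8;
}

int main() {
    for (double eps : { 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8 }) {
        unsigned m = mantissa_bits(eps);
        std::printf("eps = %.0e : %2u mantissa bits -> %2u stored bits\n",
                    eps, m, byte_aligned_bits(8, m));   // 8 exponent bits as a placeholder
    }
}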

The implementation in both cases is very similar: first determine the dynamic range for the exponent bits, then scale the data to fit into the chosen exponent range, and finally write the truncated results into memory.
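
A condensed sketch of that pipeline (the names are made up and the actual hlr code differs; cfloat<16, 5> merely stands in for the chosen storage format):

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>
#include <universal/number/cfloat/cfloat.hpp>

using storage_t = sw::universal::cfloat<16, 5>;  // placeholder storage format

std::pair<double, std::vector<storage_t>> compress_block(const std::vector<double>& block) {
    // 1. dynamic range: the largest magnitude in the block
    //    (the full scheme also looks at the smallest magnitude to pick the exponent width)
    double vmax = 0.0;
    for (double v : block) vmax = std::max(vmax, std::fabs(v));

    // 2. scale factor so the data fits the exponent range of the storage type
    double scale = (vmax > 0.0) ? 1.0 / vmax : 1.0;

    // 3. truncate by converting the scaled values into the reduced-precision type
    std::vector<storage_t> out;
    out.reserve(block.size());
    for (double v : block) out.push_back(storage_t(v * scale));

    return { scale, out };
}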

I picked a standard model problem with a matrix size of 524,288 x 524,288 and different errors (overall uncompressed memory is between 7.5 GB and 20 GB). Hardware is a 2-CPU AMD Epyc 9554 system. The compiler is GCC 12.1 (full optimization activated). Timings are the median of 10 runs.

The compression ratio is identical in both cases, as the chosen precision/exponent bits are equal.

Results

eps defines the error.

double -> aflp/cfloat speed in sec.

eps     aflp        cfloat
1e-3    3.085e-02   7.318e-02
1e-4    4.090e-02   9.313e-02
1e-5    5.075e-02   1.129e-01
1e-6    6.320e-02   1.322e-01
1e-7    7.846e-02   1.573e-01
1e-8    9.948e-02   1.976e-01

cfloat is about two times slower than aflp. I did not optimize aflp too much, but I tried to arrange everything such that the compiler is able to auto-vectorize. Maybe this explains the difference.

aflp/cfloat -> double speed in sec.

eps     aflp        cfloat
1e-3    3.011e-02   2.087e-01
1e-4    3.965e-02   6.189e-01
1e-5    4.808e-02   8.056e-01
1e-6    6.246e-02   1.068e+00
1e-7    7.157e-02   1.324e+00
1e-8    8.928e-02   2.208e+00

Decompression time should be similar to (or smaller than) compression time. However, cfloat is much slower here. This seems strange.

overall error vs uncompressed data

eps     aflp        cfloat
1e-3    3.8288e-04  1.6673e-04
1e-4    6.8759e-06  1.1229e-05
1e-5    1.7750e-06  7.6453e-07
1e-6    3.2525e-07  3.6779e-07
1e-7    1.1324e-08  1.0254e-08
1e-8    9.9146e-09  1.0200e-08

Only minor differences in the error (as was expected).
