
Intrinsic support #109

Open
akielaries opened this issue Feb 13, 2024 · 4 comments

akielaries (Owner) commented Feb 13, 2024

So far intrinsics are only seen in mtx.cpp and vector.cpp. In the latter, look at the pieces of duplicated code and possibly create functions for them. Notice that loops are blocked by a specific number that takes register width and data type into account for each supported ISA; preprocessor macros (defines) or even typedefs could probably be created for all of these "magic numbers", although they are mostly intuitive. For example:

#ifdef __AVX2__

// instruction set specific int (256-bit integer register)
typedef __m256i iss_int;

// instruction set specific iteration sizes

// signed 8 bit ints per register
#define ISS_I8_ITER 32

// signed 16 bit ints per register
#define ISS_I16_ITER 16

#elif defined(__AVX__)

// 128-bit integer register
typedef __m128i iss_int;

#define ISS_I8_ITER 16
#define ISS_I16_ITER 8

#endif

etc?

Overall there's a lot of conditional compilation in the two files, so make it as clean as possible with less duplication.
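For illustration, a minimal sketch of how a blocked loop might consume these macros, assuming AVX2 (vec_add_i8 is a hypothetical name, not the repo's actual code):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// fallback so the sketch stands alone; normally taken from the header above
#ifndef ISS_I8_ITER
#define ISS_I8_ITER 32
#endif

// Blocked addition of signed 8-bit ints: the main loop advances by
// ISS_I8_ITER (32 lanes per __m256i) and a scalar tail handles the rest.
void vec_add_i8(const int8_t *a, const int8_t *b, int8_t *c, size_t n) {
    size_t i = 0;
    for (; i + ISS_I8_ITER <= n; i += ISS_I8_ITER) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_add_epi8(va, vb));
    }
    for (; i < n; ++i)
        c[i] = a[i] + b[i]; // remaining n % ISS_I8_ITER elements
}

With the typedef and iteration macros in one place, the loop body stays the same shape across ISAs and only the register type and block size change.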

akielaries added the v1.x Goals for v1.0.0 stable release label Feb 13, 2024
akielaries self-assigned this Feb 13, 2024
akielaries (Owner, Author) commented

This has been somewhat fixed: files now exist for specific types and intrinsic ISAs.

Next, look into why the functions we have are so embarrassingly slow. Comparisons of our intrinsic functions against naive implementations with 3 nested loops sometimes show no performance increase, and in some cases the naive function performs better. Beyond just blocking and stuffing registers with values, there have to be better ways to optimize this code.
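For reference, a minimal sketch of the naive three-loop baseline being compared against (hypothetical signature, assuming row-major n x n doubles):

#include <cstddef>

// Naive i-j-k matrix multiply: C = A * B, all row-major n x n.
void naive_mm(const double *A, const double *B, double *C, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j]; // strides a column of B
            C[i * n + j] = sum;
        }
}

Note the inner k loop strides down a column of B with stride n, so the memory access pattern is already cache-hostile; an intrinsic version that keeps the same pattern won't gain much over it.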

akielaries (Owner, Author) commented

The reason for this could be a few things. Cache alignment has only been monitored on some functions, but it must be a contributor, along with memory access patterns in general. Here is the new layout for the matrix/vector operations:

BY DEFAULT:
Routines that are BLAS inspired, using BLAS naming conventions (e.g. DGEMM = Double precision GEneral Matrix-Matrix product). These will most likely be big enough for their own files, where we will have some of our own naming conventions. We want to make sure there is support for arrays and vectors to start.
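As one concrete example of the memory-access point, a sketch (hypothetical name, assuming row-major storage) that reorders the loops to i-k-j so the inner loop streams through B and C contiguously instead of striding down a column:

#include <cstddef>

// i-k-j ordering: the inner j loop reads a row of B and writes a row of C
// sequentially, which is often a large win from memory access alone.
void ikj_mm(const double *A, const double *B, double *C, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            C[i * n + j] = 0.0;
        for (size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k]; // reused across the inner loop
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}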

akielaries (Owner, Author) commented

There are Double, Float, and int implementations of the GEMM routines under the linalg/ module. There is a lot of reused code, while some is actually different depending on the type. Look into eliminating this code duplication.
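A minimal sketch of one way to cut the duplication, assuming the shared loop structure really is type-independent (hypothetical names, not the repo's actual code): keep one templated kernel and specialize only the type-specific intrinsic parts.

#include <cstddef>

// One templated kernel instead of per-type copies; the intrinsic
// loads/stores and lane widths could live in per-type specializations
// while this shared structure exists once.
template <typename T>
void gemm_generic(const T *A, const T *B, T *C, std::size_t n) {
    for (std::size_t i = 0; i < n * n; ++i)
        C[i] = T(0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const T a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}

// explicit instantiations mirroring the existing double/float/int split
template void gemm_generic<double>(const double *, const double *, double *, std::size_t);
template void gemm_generic<float>(const float *, const float *, float *, std::size_t);
template void gemm_generic<int>(const int *, const int *, int *, std::size_t);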

akielaries (Owner, Author) commented

The SGEMM implementation for single precision (float) mismatches the naive implementation by quite a bit, causing the test cases to fail for being outside of a 0.01 threshold.
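One likely factor worth checking: the SIMD version accumulates partial sums in a different order than the naive version, and float addition is not associative, so for large dot products the two results legitimately diverge, and a fixed absolute threshold of 0.01 is easy to blow past. A sketch of a relative-tolerance check (hypothetical helper, not the repo's test code):

#include <cmath>
#include <algorithm>

// Error measured relative to the operands' magnitude, with a small
// absolute floor for values near zero.
inline bool approx_equal(float a, float b, float rel = 1e-4f, float floor = 1e-6f) {
    return std::fabs(a - b) <=
           std::max(floor, rel * std::max(std::fabs(a), std::fabs(b)));
}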
