# Questions tagged [avx2]

222 questions

1

votes

1

answer

54

Views

### Same AVX2 program yields different result in gcc & msvc

I'm trying to increase the throughput of MD5 hashing using AVX2. I have used the simd_md5 library provided by a GitHub user.
On MSVC 2013 I get the desired result for all 8 buffers, but on Linux, when I run the same code, the results match only for the first 4 buffers; for the next four buffers the result is somehow shifted.
I ha...

1

votes

1

answer

45

Views

### Xcode Apple Clang enable avx512

In Xcode (Version 10.1 (10B61)), I used the macros below to detect AVX512 support.
#ifdef __SSE4_1__
#error 'sse4_1'
#endif
#ifdef __AVX__
#error 'avx'
#endif
#ifdef __AVX2__
#error 'avx2'
#endif
#ifdef __AVX512__
#error 'avx512'
#endif
In default Build Settings, SSE4_1 is active, but avx, avx2 and is...

1

votes

1

answer

93

Views

### Vectorised addition for 2 short int vectors using AVX2

I am facing a problem performing an addition on 2 short (16-bit integer) vectors using the AVX2 instruction set.
I have built the code but am getting an error in the addition command, probably because of wrong syntax.
I am creating 2 vectors with the following code:
short int si1[16...

1

votes

0

answer

188

Views

### How to run Keras models with AVX2 support

I need to measure the execution time for prediction per image by running models such as mobilenet2 and densenet121.
First, I just ran my Python code with stock TensorFlow.
start = time.time()
out = m.predict(val)
end = time.time()
print('model: densenet')
print ('time: ', (end - start)/batch_size)
where...

1

votes

0

answer

228

Views

### tensorflow-1.12.0rc1-cp27-cp27mu-linux_x86_64.whl is not a supported wheel on this platform

I installed TensorFlow on an Intel NUC with pip3
pip3 install --upgrade tensorflow
But got below error
2018-10-25 20:14:31.685641: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
cannot open X server
Af...

1

votes

0

answer

69

Views

### Convert array of eight bytes to eight integers

I am working with the Xeon Phi Knights Landing. I need to do a gather operation from an array of doubles. The list of indices comes from an array of chars. The gather operations are either _mm512_i32gather_pd or _mm512_i64gather_pd. As I understand it, I either need to convert eight chars to eig...

1

votes

1

answer

138

Views

### What's the performance impact of exporting registers to stack?

I am working on some code that is meant to run on x86 in 32-bit mode. In that mode, I understand that I've got only 8 SIMD/AVX2 registers (YMM0-7) to freely work with. However, some of my vector subroutines alone sometimes use more than that number of registers simultaneously (meaning that they are...

1

votes

1

answer

659

Views

### AVX2 shift (16-bit) integers [duplicate]

This question already has an answer here:
Emulating shifts on 32 bytes with AVX
3 answers
Are there built-in instructions to perform both right and left shift operations on 16-bit integer elements in AVX2?
Like the following examples:
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] --> [16,0,0,0,0,0,0...

1

votes

1

answer

70

Views

### how to vectorize a[i] = a[i-1] +c with AVX2

I want to vectorize a[i] = a[i-1] + c with AVX2 instructions. It seems unvectorizable because of the dependencies. I've vectorized it and want to share the answer here to see whether there is a better answer to this question or whether my solution is good.
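One common way to break this kind of dependency, sketched here in scalar C rather than with intrinsics: because every step adds the same c, the elements of a block of 8 equal a base value plus the offsets {1c, 2c, ..., 8c}, which is what a single 256-bit add of 32-bit lanes would compute. The helper name `fill_arith` and the block width of 8 are assumptions made for illustration; this is not the asker's solution. Integers are used because, with floats, the blocked form can round differently from the sequential loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a[0], fills a[1..n-1] so that a[i] = a[i-1] + c.
   Inside each block of 8 the lanes do not depend on each other. */
void fill_arith(int *a, size_t n, int c)
{
    assert(a != NULL || n == 0);
    if (n == 0)
        return;

    int offs[8];                        /* models one vector {1c, 2c, ..., 8c} */
    for (int k = 0; k < 8; ++k)
        offs[k] = (k + 1) * c;

    int base = a[0];                    /* element just before the current block */
    size_t i = 1;
    for (; i + 8 <= n; i += 8) {
        for (int k = 0; k < 8; ++k)     /* independent lanes: one vector add */
            a[i + k] = base + offs[k];
        base = a[i + 7];                /* last lane becomes the next base */
    }
    for (; i < n; ++i)                  /* scalar tail for the leftover elements */
        a[i] = a[i - 1] + c;
}
```

In an intrinsics version the inner loop would presumably become one broadcast and one add per block; the only value carried between blocks is `base`.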

1

votes

1

answer

70

Views

### Remaining part calculations with modulus “%” and bitwise and “&” causes different results on AVX2

I am trying to convert float values to integer values by using Intel intrinsics for AVX2. My simple code is as follows:
void convert_f2i(float *fin, int *iout, int iLen)
{
int i, index, iDiv8, iLeft;
int *iin1;
__m256 v0;
__m256i vi0;
iDiv8 = iLen/8;
for(i=0; i

1

votes

1

answer

185

Views

### AVX2 Streaming Stores Do Not Improve Performance

I have an AVX2 implementation of some workload.
I have determined that the vast majority of the execution time is occupied
by the memory loads and stores.
In an attempt to improve performance, I tried to change the conventional stores
to streaming (non-temporal) stores.
However, this change had litt...

1

votes

1

answer

1.1k

Views

### Saving the XMM register before function call

Is it required to save/push any XMM registers to the stack before an assembly function call?
I ask because I am observing a crash in my code in release mode for 64-bit development (using AVX2). In debug mode it's working fine. I tried saving the content of the XMM8 register and restoring i...

1

votes

2

answer

556

Views

### AVX2 1x mm256i 32bit to 2x mm256i 64bit

Is there a normal way to convert 1x __m256i of 32-bit ints into 2x __m256i's filled with 64-bit ints? I'm averaging data and my 32-bit ints are overflowing, so I'd like to split the accumulator register into two 64-bit registers.

1

votes

1

answer

433

Views

### Why my AVX2 horizontal addition function is not faster than non-SIMD addition?

I implemented an inline function for adding all elements of a vector, but it's not faster than non-SIMD addition.
Declarations :
#define N 128
#define M N
int __attribute__(( aligned(32)))temp8[8];
__m256i vec;
int __attribute__(( aligned(32))) c_result[N][M];
These are my two ways for adding all i...

1

votes

2

answer

439

Views

### convert array of uint64_t to __m256i

I have four uint64_t numbers and I wish to combine them as parts of a __m256i, however, I'm lost as to how to go about this.
Here's one attempt (where rax, rbx, rcx, and rdx are uint64_t):
uint64_t a [4] = {rax,rbx,rcx,rcx};
__m256i t = _mm256_load_si256((__m256i *) &a);

1

votes

1

answer

167

Views

### C++ AVX2: Seg fault when accessing address within array of arrays

I am using AVX2 instructions to take a bitwise AND between an array inside a 2D array called test and a separate array called joined_pos. This is my code:
#include
#include
#include
#include
#include
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
int main()
{
// Create two aligned a...

1

votes

1

answer

228

Views

### What is the fastest way for adding the vector elements horizontally in odd order?

According to this question I implemented the horizontal addition, this time 5 by 5 and 7 by 7. It does the job correctly but it is not fast enough.
Can it be made faster? I tried to use hadd and other instructions but the improvement is limited. For example, when I use _mm256_bsrli_epi...

1

votes

1

answer

72

Views

### How to convert from 32-bit to 16-bit unsigned integers in AVX2?

I use _mm256_cvtps_epi32() to convert from 8 floats to 8x32-bit integers. But the goal is to get to 16-bit unsigned integers. I have 2 vectors a0 and a1, each of __m256i type. What is the fastest way to pack them so that 16-bit equivalents of a0 get into the lower 128 bits of the result, and equival...

7

votes

0

answer

42

Views

### What's the difference between the XOR instructions “VPXORD”, “VXORPS” and “VXORPD” in Intel's AVX2

I see that in the AVX2 instruction set, Intel distinguishes the XOR operations on integer, double and float with different instructions. For integer there's 'VPXORD', for double 'VXORPD', and for float 'VXORPS'.
However, per my understanding, they should all be the same XOR operation on binary data. E.g., XOR...

6

votes

2

answer

182

Views

### Fastest precise way to convert a vector of integers into floats between 0 and 1

Consider a randomly generated __m256i vector. Is there a faster precise way to convert it into a __m256 vector of floats between 0 (inclusive) and 1 (exclusive) than division by float(1ull

0

votes

3

answer

46

Views

### SIMD __m256i to __m256d cast results

I am trying to cast a SIMD integer variable into a double. But I can't see what the result of this operation will be.
Example:
int arr[8]={12345678,12333333,12344444,12355555,12366666,12377777,12388888,12399999};
__m256i temp = _mm256_load_si256((__m256i *) arr);
__m256d temp2 = _mm256_castsi256_pd...

0

votes

2

answer

21

Views

### Any chance to accelerate recurrent code with SIMD?

Consider the following code where a is a parameter array of float and s is an initially uninitialized result array of float:
s[n - 1] = mu * a[n - 1];
for (int j = n - 2; j >= 0; j--)
s[j] = mu * (a[j] + s[j + 1]);
return s;
Is there any chance to improve the performance of such recurrent code with...
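One direction that is sometimes tried for recurrences like this is algebraic unrolling. Substituting the recurrence into itself gives s[j] = mu*a[j] + mu^2*(a[j+1] + s[j+2]), so s[j] and s[j+1] can both be computed from s[j+2] without waiting on each other. The sketch below shows only that two-step unrolling in scalar C; the function names are hypothetical, and whether it pays off depends on the hardware. Results can differ from the original loop in the last bits because the floating-point operations are reassociated.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the loop from the question, with the same meaning. */
void scan_ref(const float *a, float *s, size_t n, float mu)
{
    assert(n > 0);
    s[n - 1] = mu * a[n - 1];
    for (size_t j = n - 1; j-- > 0; )
        s[j] = mu * (a[j] + s[j + 1]);
}

/* Unrolled by two: both stores in the loop body depend only on s[j]. */
void scan_unrolled2(const float *a, float *s, size_t n, float mu)
{
    assert(n > 0);
    const float mu2 = mu * mu;
    s[n - 1] = mu * a[n - 1];
    if (n == 1)
        return;
    s[n - 2] = mu * (a[n - 2] + s[n - 1]);

    size_t j = n - 2;                    /* s[j] is already known */
    while (j >= 2) {
        float t = a[j - 1] + s[j];
        s[j - 1] = mu * t;                    /* the original step */
        s[j - 2] = mu * a[j - 2] + mu2 * t;   /* same value as mu * (a[j-2] + s[j-1]) */
        j -= 2;
    }
    if (j == 1)                          /* one element is left when n is odd */
        s[0] = mu * (a[0] + s[1]);
}
```

The same substitution can be repeated to reach a stride of 8, at the cost of more multiplies per element.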

1

votes

2

answer

1.3k

Views

### _mm256_loadu2_m128i intrinsic not available under g++?

I'm trying to use the AVX2 intrinsic _mm256_loadu2_m128i, but it seems g++ 4.8.2 doesn't have it.
Is there any way to get it?

1

votes

1

answer

1.5k

Views

### Assemble Error for AVX2

I've tried to compile an AVX2 program with gcc (g++), but it didn't work.
#include
....
__m256i _vector256 = _mm256_loadu_si256((__m256i*)pin);
__m256i _vectorMask = _mm256_loadu_si256((__m256i*)mask_hbits);
_vector256 = _mm256_slli_epi32 (_vector256, AVX_LOGDESC); // AVX_LOGDESC == 4
__m256i...

1

votes

1

answer

436

Views

### Convert SSE matrix-vector multiplication code to AVX

I'm trying to convert my SSE function to AVX. The function does vector-matrix multiplication, here's my working SSE code:
void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
size_t i, j;
for (i = 0; i < vector_dims; ++i)
{
__m128 acc = _mm_setzero_...

3

votes

1

answer

378

Views

### How to use AVX2 in MASM/VS15?

problem: I write something like this (inside proc):
.CODE
myProc PROC
vpmovsxbd ymm0, qword ptr [rdx] ; rdx is ptr to array of 8 bytes
vcvtdqps ymm0, ymm0
ret
myProc ENDP
and MASM complains with 'invalid instruction operands' for the first and 'syntax error : ymm0' for the second.
I'm compiling for x64 using VS...

3

votes

1

answer

249

Views

### Intel broadwell uop fusion for AVX load/store instructions

I'm trying to identify a performance baseline for memory-bound vectorized loops. I'm doing this on an Intel Broadwell chip with AVX2 instructions in a 32byte aligned environment.
A baseline loop uses 8 YMM registers at a time to load from one location and nontemporally store to another:
%define ptr...

2

votes

1

answer

114

Views

### Why do those two high(64bx64b) functions give different results?

static __inline__ uint64_t mulhilo64(uint64_t a, uint64_t b, uint64_t* hip) {
__uint128_t product = ((__uint128_t)a)*((__uint128_t)b);
*hip = product>>64;
return (uint64_t)product;
}
I am trying to write the above using MULX intrinsics on AVX2 (more specifically BMI2). But they do not give the...

7

votes

1

answer

852

Views

### AVX2, How to Efficiently Load Four Integers to Even Indices of a 256 Bit Register and Copy to Odd Indices?

I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256 bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just...

19

votes

1

answer

2.9k

Views

### In what situation would the AVX2 gather instructions be faster than individually loading the data?

I have been investigating the use of the new gather instructions of the AVX2 instruction set. Specifically, I decided to benchmark a simple problem, where one floating point array is permuted and added to another. In C, this can be implemented as
void vectortest(double * a,double * b,unsigned int *...

6

votes

1

answer

1.2k

Views

### AVX2 sparse matrix multiplication

I'm trying to leverage the new AVX2 GATHER instructions to speed up a sparse matrix - vector multiplication. The matrix is in CSR (or Yale) format with a row pointer that points to a column index array which in turn holds the columns.
The C code for such a mat-vec mul does look like this:
for (int...

6

votes

2

answer

1.2k

Views

### Why doesn't Intel design its SIMD ISAs in a more compatible or universal way?

Intel has several SIMD ISAs, such as SSE, AVX, AVX2, AVX-512 and IMCI on Xeon Phi. These ISAs are supported on different processors. For example, AVX-512 BW, AVX-512 DQ and AVX-512 VL are only supported on Skylake, but not on Xeon Phi. AVX-512F, AVX-512 CDI, AVX-512 ERI and AVX-512 PFI are supported...

6

votes

2

answer

680

Views

### Shuffle elements of __m256i vector

I want to shuffle elements of __m256i vector.
There is an intrinsic, _mm256_shuffle_epi8, which does something like that, but it doesn't perform a cross-lane shuffle.
How can I do it using AVX2 instructions?

19

votes

2

answer

1.8k

Views

### Haswell memory access

I was experimenting with the AVX/AVX2 instruction sets to see the performance of streaming over consecutive arrays. So I have the example below, where I do a basic memory read and store.
#include
#include
#include
#include
const uint64_t BENCHMARK_SIZE = 5000;
typedef struct alignas(32) data_t {
double a[B...

6

votes

3

answer

828

Views

### Fastest Implementation of Exponential Function Using AVX

I'm looking for an efficient (Fast) approximation of the exponential function operating on AVX elements (Single Precision Floating Point). Namely - __m256 _mm256_exp_ps( __m256 x ) without SVML.
Relative Accuracy should be something like ~1e-6, or ~20 mantissa bits (1 part in 2^20).
I'd be happy if...

19

votes

4

answer

4.4k

Views

### AVX2 what is the most efficient way to pack left based on a mask?

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?
I've seen in SSE where it was done like this:
(From:https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksso...

6

votes

1

answer

1.5k

Views

### Fastest way to unpack 32 bits to a 32 byte SIMD vector

Having 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte.
Edit: to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1. Obviously all other bits within the b...
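To pin down the mapping described above, here is a scalar reference that a vectorized version could be checked against. The function name is hypothetical, and placing each bit in the least significant position of its byte is just one choice, since the question allows any position within the byte.

```c
#include <stdint.h>

/* Scalar reference: bit i of `bits` becomes byte i of `out`, stored as 0 or 1. */
void unpack_bits_ref(uint32_t bits, uint8_t out[32])
{
    for (int i = 0; i < 32; ++i)
        out[i] = (uint8_t)((bits >> i) & 1u);
}
```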

5

votes

3

answer

1.1k

Views

### Is it really efficient to use Karatsuba algorithm in 64-bit x 64-bit multiplication?

I work on AVX2 and need to calculate a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part in the fastest manner. Since AVX2 has no such instruction, is it reasonable for me to use the Karatsuba algorithm for efficiency and speed?
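For comparison, the baseline that Karatsuba would compete with is the schoolbook decomposition into four 32x32 -> 64-bit partial products, which is the operation `_mm256_mul_epu32` provides per 64-bit lane. The scalar sketch below shows that decomposition only; the function name is hypothetical. Karatsuba would replace one of the four multiplies with extra additions and a middle term whose operands no longer fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper: returns the high 64 bits of a*b and stores the low
   64 bits in *lo, using four 32x32 -> 64-bit partial products. */
uint64_t mulhi64_schoolbook(uint64_t a, uint64_t b, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* Middle column: three values below 2^32, so the sum fits in 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```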

2

votes

1

answer

518

Views

### Is it ok to create big array of AVX/SSE values

I am parallelizing a certain dynamic programming problem using AVX2/SSE instructions.
In the main iteration of my calculation, I calculate a column of a matrix where each cell is a structure of AVX2 registers (__m256i). I use values from the previous matrix column as input values for calculating the curr...

5

votes

1

answer

386

Views

### Extract bits with SIMD

I want to extract 8 bits from a register variable __m256i src, with the 8 positions specified by another __m256i offset which is composed of 8 integers.
For example: if offset is [1,3,5,21,100,200,201,202], I want to get the 1st, 3rd, 5th, 21st, 100th, 200th, 201st and 202nd bits from src and pack them into an int8.
This que...
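A scalar reference makes the intended result concrete and can serve as a check for any SIMD version. The sketch below rests on assumptions the excerpt does not settle: positions are treated as 0-based, src is viewed as four 64-bit words with bit 0 being the least significant bit of word 0, and result bit i comes from offset[i]. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference under the stated assumptions: gathers the bit at each
   offset[i] (0..255) of a 256-bit value into bit i of the result. */
uint8_t extract_bits_ref(const uint64_t src[4], const uint32_t offset[8])
{
    uint8_t out = 0;
    for (int i = 0; i < 8; ++i) {
        assert(offset[i] < 256);
        uint64_t word = src[offset[i] / 64];            /* which 64-bit word */
        uint64_t bit = (word >> (offset[i] % 64)) & 1u; /* which bit inside it */
        out |= (uint8_t)(bit << i);
    }
    return out;
}
```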