Using the Keccak Code Package on Visual Studio


I downloaded the Keccak Code Package (now XKCP) because I'm interested in the features it offers and would like to use them in my project.

The problem is that I'm developing the project in Microsoft Visual Studio Community 2017. Let me explain:

Specifically, I'm trying to use Lake Keyak in my project, in particular the implementation that uses SIMD acceleration. However, the code that ships with the KCP (Keccak Code Package) is written for GCC, not Visual Studio, so getting it to compile there is a struggle, mostly because of the (__m128d) and (__m128i) casts in the macros used by the SIMD implementation. GCC allows these kinds of casts but Visual Studio doesn't, so the code doesn't work as is; you have to rework it using _mm_castpd_si128 and similar intrinsics.
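To illustrate the kind of rewrite involved, here is a minimal sketch. The helper names `swap_lanes64` and `low_lane_after_swap` are made up for this example and do not come from the KCP. Under GCC the body could be written as `(__m128i)_mm_shuffle_pd((__m128d)a, (__m128d)a, 1)`; the version below uses the cast intrinsics instead, which both GCC and Visual Studio accept.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not from the KCP): swaps the two 64-bit lanes of an
 * integer vector by borrowing the double-precision shuffle. The cast
 * intrinsics only reinterpret the bits; they generate no instructions. */
static __m128i swap_lanes64(__m128i a)
{
    __m128d d = _mm_castsi128_pd(a);   /* replaces the GCC-only (__m128d)a */
    d = _mm_shuffle_pd(d, d, 1);       /* result = { d[1], d[0] } */
    return _mm_castpd_si128(d);        /* replaces the GCC-only (__m128i)d */
}

/* Small wrapper so the result can be inspected as plain integers. */
static uint64_t low_lane_after_swap(uint64_t lo, uint64_t hi)
{
    uint64_t in[2] = { lo, hi };
    uint64_t out[2];
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, swap_lanes64(v));
    return out[0];
}
```

Calling `low_lane_after_swap(1, 2)` should return 2, since the high lane has been moved into the low position.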

Here is what I've tried: I expanded all the macros by running GCC with -E, which gave me the code as it looks after preprocessing, with every macro fully written out. I then replaced all the casts that Visual Studio rejects with their intrinsic equivalents. With that, the code finally compiles in Visual Studio, but it still doesn't run correctly.

Here is the code where the error occurs:

void KeccakP1600_Permute_12rounds(void *state)
{    
    //All the variables like Abae, Cae, Akimo etc... are ALL __m128i variables
    //state is an unsigned char[200]

    UINT64 *stateAsLanes = (UINT64*)state;
    Abae = _mm_load_si128((const __m128i *)&(stateAsLanes[0]));
    Aba = Abae;
    Abe = _mm_unpackhi_epi64(Abae, Abae);
    Cae = Abae;
    Abio = _mm_load_si128((const __m128i *)&(stateAsLanes[2]));
    Abi = Abio;
    Abo = _mm_unpackhi_epi64(Abio, Abio);
    Cio = Abio;
    Abu = _mm_loadl_epi64((const __m128i *)&(stateAsLanes[4]));
    Cua = Abu;
    Agae = _mm_loadu_si128((const __m128i *)&(stateAsLanes[5]));
    Aga = Agae;
    Abuga = _mm_unpacklo_epi64(Abu, Aga);
    Age = _mm_unpackhi_epi64(Agae, Agae);
    Abage = _mm_unpacklo_epi64(Aba, Age);
    Cae = _mm_xor_si128(Cae, Agae);
    Agio = _mm_loadu_si128((const __m128i *)&(stateAsLanes[7]));
    Agi = Agio;
    Abegi = _mm_unpacklo_epi64(Abe, Agi);
    Ago = _mm_unpackhi_epi64(Agio, Agio);
    Abigo = _mm_unpacklo_epi64(Abi, Ago);
    Cio = _mm_xor_si128(Cio, Agio);
    Agu = _mm_loadl_epi64((const __m128i *)&(stateAsLanes[9]));
    Abogu = _mm_unpacklo_epi64(Abo, Agu);
    Cua = _mm_xor_si128(Cua, Agu);
    Akae = _mm_load_si128((const __m128i *)&(stateAsLanes[10]));
    Aka = Akae;
    Ake = _mm_unpackhi_epi64(Akae, Akae);
    Cae = _mm_xor_si128(Cae, Akae);
    Akio = _mm_load_si128((const __m128i *)&(stateAsLanes[12]));
    Aki = Akio;
    Ako = _mm_unpackhi_epi64(Akio, Akio);
    Cio = _mm_xor_si128(Cio, Akio);
    Akuma = _mm_load_si128((const __m128i *)&(stateAsLanes[14]));
    Cua = _mm_xor_si128(Cua, Akuma);
    Ame = _mm_loadl_epi64((const __m128i *)&(stateAsLanes[16]));
    Akame = _mm_unpacklo_epi64(Aka, Ame);
    Cae = _mm_xor_si128(Cae, _mm_unpackhi_epi64(Akuma, Akame));
    Amio = _mm_loadu_si128((const __m128i *)&(stateAsLanes[17]));
    Ami = Amio;
    Akemi = _mm_unpacklo_epi64(Ake, Ami);
    Amo = _mm_unpackhi_epi64(Amio, Amio);
    Akimo = _mm_unpacklo_epi64(Aki, Amo);
    Cio = _mm_xor_si128(Cio, Amio);
    Amu = _mm_loadl_epi64((const __m128i *)&(stateAsLanes[19]));
    Akomu = _mm_unpacklo_epi64(Ako, Amu);
    Cua = _mm_xor_si128(Cua, Amu);
    Asase = _mm_load_si128((const __m128i *)&(stateAsLanes[20]));
    Cae = _mm_xor_si128(Cae, Asase);
    Asiso = _mm_load_si128((const __m128i *)&(stateAsLanes[22]));
    //Error here, last line. Access violation reading location.

With compiler optimizations turned off, the code runs fine with no error, but as soon as I turn on full speed optimization, the access violation (reading location) appears. The same code runs under GCC regardless of the optimization level.

I'd welcome any solution for the access violation, or any advice on how to turn this GCC-only code into code that compiles in Visual Studio.


1 Answer


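Presumably state is not actually aligned by 16 (a plain unsigned char[200] carries no such guarantee). Lane 22 sits at byte offset 176, a multiple of 16, so _mm_load_si128((const __m128i *)&(stateAsLanes[22])); is an alignment-required 128-bit load from a misaligned address whenever the base pointer itself is misaligned.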
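One way to test that hypothesis is to check the pointer directly, and, if you want to keep the aligned loads, to force the alignment of the state buffer. The following is only a sketch: the macro name `ALIGN16`, the function `is_aligned_16`, and the buffer `aligned_state` are made up for this example and are not part of the KCP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portability macro: 16-byte alignment for MSVC and GCC/clang. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* Returns 1 if p lies on a 16-byte boundary, 0 otherwise. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* A 200-byte Keccak-p[1600] state forced onto a 16-byte boundary.
 * With this declaration, lane 22 (byte offset 22 * 8 = 176) is also
 * 16-byte aligned, while odd lanes such as 5, 7 and 17 are not. */
static ALIGN16 unsigned char aligned_state[200];
```

Calling `is_aligned_16(state)` at the top of the permutation (for example inside an `assert`) would show whether the caller's buffer is the culprit.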

Are you sure that was in the original source? Either way, it needs to be a loadu (_mm_loadu_si128), not a load, to tell the compiler that the address may not be aligned.
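As a rough sketch of that change, the load can go through a small helper that always uses the unaligned form. The names `load_lane_pair` and `low_lane` are hypothetical and only serve this example; in the real code you would simply replace `_mm_load_si128` with `_mm_loadu_si128` on the affected lines.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: loads lanes [laneIndex] and [laneIndex + 1] of a
 * Keccak state (8 bytes per lane) without assuming any alignment. */
static __m128i load_lane_pair(const void *state, unsigned laneIndex)
{
    const unsigned char *bytes = (const unsigned char *)state;
    return _mm_loadu_si128((const __m128i *)(bytes + 8u * laneIndex));
}

/* Extracts the low 64-bit lane so the result can be inspected. */
static uint64_t low_lane(__m128i v)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[0];
}
```

Because the helper never promises alignment, the compiler cannot turn it into an alignment-required memory operand, so it works even when the state starts at an odd address.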

You aren't using _mm_castpd_si128 anywhere in this code, so it's not clear what you changed or why you had to change it. A misaligned address here would break with GCC/clang as well, since they use movdqa even in un-optimized code.

With MSVC, it presumably breaks only with optimization because MSVC folds the load into a memory operand of some later ALU instruction, and legacy-SSE memory operands require 16-byte alignment. IIRC, when MSVC and ICC have to emit a stand-alone load, they usually use the unaligned movdqu. That would explain the behaviour you see, even though movdqu makes the code run slower than necessary on Core 2.
