discovered with clang's -Wshorten-64-to-32
clang was getting tripped up calculating the tail half-rounds for the 128-bit hashes, so streamline the round functions to help it cope a little better. Plus, this is better code anyway.
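
For context, a minimal sketch of the kind of narrowing that -Wshorten-64-to-32 reports. fold64 is a hypothetical helper made up for illustration only; it is not one of the round functions touched by this change.

```c
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not code from
 * this change): folds a 64-bit value down to 32 bits. */
static uint32_t fold64(uint64_t v)
{
    /* Writing `return v ^ (v >> 32);` here would implicitly convert
     * a uint64_t to uint32_t, which is the pattern that
     * -Wshorten-64-to-32 flags. */
    return (uint32_t)(v ^ (v >> 32)); /* explicit narrowing, no warning */
}
```

The explicit cast keeps the truncation intentional and visible, which is the general way to quiet this warning without changing the result.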