Numpy/Cupy do explicit coercing to float64. There is no documentation for why this is done, but since GEMM is used to compute the covariates (summation over data points), it makes sense to increase the precision.
You could get away using a double to only accumulate the sum, but it's a pain to write such mixed-precision (slow) function in C and then wrap it in Python, esp. for something like this.
You could get away using a double to only accumulate the sum, but it's a pain to write such mixed-precision (slow) function in C and then wrap it in Python, esp. for something like this.
(See, https://github.com/numpy/numpy/blob/v1.17.0/numpy/lib/functi...)