Speeding up Computation of Pearson Correlation
Main take-away:
The simple but effective trick for computing the Pearson correlation faster is to
- Scale (z-score) the data
- Use matrix algebra (a dot product) to compute the correlation coefficient.
In principle, this trick can make the computation 3-4 times faster, and it is useful when dealing with large vectors and/or when you need to compute r for many vector pairs at once. I'll first demonstrate the trick and then briefly explain why it works.
1. Demo
Let’s generate two random vectors, each of length 100,000,000:
import time
import numpy as np
from scipy.stats import pearsonr, zscore

sim_v1 = np.random.rand(100_000_000)
sim_v2 = np.random.rand(100_000_000)
To compute the Pearson correlation between the two vectors, we can use the function pearsonr from scipy.stats:
start = time.time()
print(f'The Pearson correlation coefficient is {pearsonr(sim_v1, sim_v2)[0]}')
end = time.time()
print(f'Time elapsed: {end - start}')
The Pearson correlation coefficient is 0.00014023607618081493
Time elapsed: 2.6575210094451904
Or we could z-score the two vectors first, and then compute the dot product between them:
sim_v1_scale = zscore(sim_v1)
sim_v2_scale = zscore(sim_v2)
N = len(sim_v1_scale)
start = time.time()
print(f'The Pearson correlation coefficient is {np.dot(sim_v1_scale, sim_v2_scale) / N}')
end = time.time()
print(f'Time elapsed: {end - start}')
The Pearson correlation coefficient is 0.00014023607618081355
Time elapsed: 1.0209388732910156
The two approaches give essentially the same result (identical up to floating-point rounding), but the second approach is more than twice as fast.
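The dot-product formulation also extends naturally to computing many correlations at once: after z-scoring, a single matrix-vector product yields the correlation of one vector with every row of a matrix. A minimal sketch (with much smaller vectors so it runs quickly; the variable names here are my own, not from the demo above):

```python
import numpy as np
from scipy.stats import pearsonr, zscore

rng = np.random.default_rng(0)
X = rng.random((5, 10_000))   # 5 vectors, each of length 10,000
y = rng.random(10_000)

# z-score along the sample axis; one matrix-vector product then
# gives the correlation of y with every row of X at once
N = X.shape[1]
r_fast = zscore(X, axis=1) @ zscore(y) / N

# check against scipy's pearsonr, computed row by row
r_ref = np.array([pearsonr(x, y)[0] for x in X])
print(np.allclose(r_fast, r_ref))  # True
```

This works because scipy's zscore uses the population standard deviation (ddof=0) by default, which matches the 1/N normalization used above.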
2. Why this works
The textbook formula for computing Pearson's r is