Ingap.dev - A Personal Dev Blog

I would spend 55 minutes defining the problem and then five minutes solving it. (Albert Einstein)

Published on Wednesday 23 March 2022

Tags: python, code, data

Calculate correlation coefficient between arrays of different length

Can the Pearson correlation coefficient be computed between two numeric arrays if they have different sizes? Let's see what that means.


Scenario

If we look at the definition of the Pearson correlation coefficient for discrete series of data, we see that it is mathematically meaningless to apply it to series of different lengths. Indeed, one should keep in mind that we shouldn't create missing data out of nothing. BUT sometimes, in real-world problems, we may want to use simple "good enough" procedures instead of perfect, impossible ones, so...
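For reference, the textbook definition for two paired samples x and y of the same length n is:

    r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

Every term pairs an x_i with a y_i, which is exactly what two series of different lengths cannot give us directly.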

Our scenario is the following: we have two pandas time series. We want to understand how they correlate with each other, but their sizes or sampling rates differ and some data may be missing.
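To make this concrete, here is a minimal made-up example (the names a and b, the dates and the frequencies are just placeholders):

import pandas as pd

# Daily samples vs. weekly samples taken at noon: different lengths
# and no shared timestamps.
a = pd.Series(range(10), index=pd.date_range("2022-03-01", periods=10, freq="D"))
b = pd.Series(range(4), index=pd.date_range("2022-03-02 12:00", periods=4, freq="7D"))

print(len(a), len(b))  # 10 4
# numpy.corrcoef(a, b) would raise here, since the two arrays have different lengths.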

Procedure

  • Select a startTime and endTime index in order to match the overlapping time period, and reduce both vectors to that range. This must be done because we do NOT want to extrapolate any data outside that range, only interpolate missing data (...it's less error prone!). Let's call the resulting data vectors sMin and sMax, where len(sMax) >= len(sMin).
  • We choose to interpolate the longer vector, i.e. sMax. That's because we assume it's denser, i.e. it has more elements over the same time range, which generally makes interpolation errors less relevant. So we turn the DatetimeIndex (of both series!) into absolute integer values t (see the short sketch after this list), and finally create an interpolation function f(t) based on sMax.
  • Use f(t) to evaluate sMax at the timestamps of sMin. The resulting values are put into a new vector, which we call sInt.
  • Now sMin and sInt have the same length, so we can compute the correlation coefficient between them.
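As a minimal sketch of the index conversion mentioned above (the dates here are just placeholders), turning a DatetimeIndex into absolute integers of the form YYYYMMDDHHMMSS looks like this:

import pandas as pd

idx = pd.date_range("2022-03-01", periods=3)  # hypothetical daily index
print([int(t) for t in idx.strftime("%Y%m%d%H%M%S")])
# [20220301000000, 20220302000000, 20220303000000]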

Python code

from scipy import interpolate
from numpy import corrcoef
 
def corr(s1, s2):
    # 1. Restrict both series to the overlapping time period.
    startDate = max(min(s1.index), min(s2.index))
    endDate = min(max(s1.index), max(s2.index))
    s1 = s1.loc[(s1.index >= startDate) & (s1.index <= endDate)]
    s2 = s2.loc[(s2.index >= startDate) & (s2.index <= endDate)]
    # 2. Turn both DatetimeIndex into absolute integer values (YYYYMMDDHHMMSS).
    s1.index = [int(t) for t in s1.index.strftime("%Y%m%d%H%M%S")]
    s2.index = [int(t) for t in s2.index.strftime("%Y%m%d%H%M%S")]
    # 3. Build an interpolation function from the longer (denser) series.
    sMin, sMax = (s2, s1) if len(s1) >= len(s2) else (s1, s2)
    f = interpolate.interp1d(sMax.index, sMax.values)
    # 4. Keep only the sMin points that fall inside sMax's range: by default
    #    interp1d raises on values outside the range it was built on, and we
    #    don't want to extrapolate anyway.
    minBound = min(sMax.index)
    maxBound = max(sMax.index)
    sMin = sMin[(sMin.index >= minBound) & (sMin.index <= maxBound)]
    # 5. Evaluate sMax at sMin's timestamps and correlate the two vectors.
    sInt = f(sMin.index)
    return corrcoef(sMin, sInt)[0, 1]

Testing with Mock Data

import datetime
import pandas as pd
 
def corr(s1, s2):
    ...  # same function as defined above
 
if __name__ == '__main__':
    # Two daily series over partially overlapping date ranges.
    i1 = pd.date_range(start=datetime.datetime(2019, 1, 1), end=datetime.datetime(2022, 1, 1))
    i2 = pd.date_range(start=datetime.datetime(2020, 2, 3), end=datetime.datetime(2022, 3, 3))
    s1 = pd.Series(index=i1, data=[i for i in range(len(i1))])      # linear trend
    s2 = pd.Series(index=i2, data=[i**2 for i in range(len(i2))])   # quadratic trend
    print(corr(s1, s2)) # Output: 0.9681593362286723