Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
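Because the noise above is drawn unseeded, every run of the notebook produces a slightly different y. A minimal variant of the same data-generating step, seeded with NumPy's Generator API (the seed value is an assumption for illustration), makes the sample reproducible:

```python
import numpy as np

# Seeded variant of the data-generating step above; the seed 12345 is an
# arbitrary choice so reruns give identical draws.
rng = np.random.default_rng(12345)

nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = np.column_stack((np.ones(nsample), X))  # same effect as sm.add_constant
beta = np.array([5.0, 0.5, 0.5, -0.02])
y_true = X @ beta
y = y_true + sig * rng.standard_normal(nsample)
```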

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     928.9
Date:                Sat, 06 Feb 2021   Prob (F-statistic):           3.80e-41
Time:                        16:48:16   Log-Likelihood:                 2.3691
No. Observations:                  50   AIC:                             3.262
Df Residuals:                      46   BIC:                             10.91
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1219      0.082     62.458      0.000       4.957       5.287
x1             0.4837      0.013     38.243      0.000       0.458       0.509
x2             0.5293      0.050     10.646      0.000       0.429       0.629
x3            -0.0191      0.001    -17.187      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        0.352   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.839   Jarque-Bera (JB):                0.364
Skew:                          -0.184   Prob(JB):                        0.834
Kurtosis:                       2.802   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.64477526  5.12701216  5.56837103  5.94000426  6.22347521  6.41378725
  6.52020475  6.56473091  6.57849274  6.59662697  6.65250716  6.77226056
  6.97047544  7.24780431  7.59085668  7.97439911  8.36550098  8.72894489
  9.0330128   9.25469719  9.38348115  9.42306627  9.39076448  9.31465374
  9.22896494  9.16845781  9.16271006  9.23125779  9.38038579  9.60209496
  9.87541657 10.16985728 10.45040922 10.6833016  10.84154918 10.90938589
 10.88485513 10.78013122 10.61951923 10.43546169 10.26320841 10.13502501
 10.07489175 10.09456371 10.1916394  10.34995533 10.54224166 10.7346025
 10.89208497 10.98442182]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.97961106 10.836072   10.57456268 10.24238798  9.90181781  9.6148412
  9.42798927  9.36094275  9.40171351  9.50957959]

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
[Figure: scatter of the data with the true curve and the OLS in-sample and out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any formula-expansion magic applied to the **2 term.

[9]:
res.params
[9]:
Intercept           5.121904
x1                  0.483672
np.sin(x1)          0.529322
I((x1 - 5) ** 2)   -0.019085
dtype: float64
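To confirm what I() buys us: wrapping (x1 - 5)**2 in I() makes the formula evaluate it as plain arithmetic, which is equivalent to precomputing that column yourself. A quick check (seeded data generation is an assumption for reproducibility; the column name x1sq is hypothetical):

```python
import numpy as np
from statsmodels.formula.api import ols

# Fit once with I() and once with a precomputed quadratic column.
np.random.seed(0)
x1 = np.linspace(0, 20, 50)
y = (5 + 0.5 * x1 + 0.5 * np.sin(x1) - 0.02 * (x1 - 5) ** 2
     + 0.25 * np.random.normal(size=50))

data = {"x1": x1, "y": y, "x1sq": (x1 - 5) ** 2}
res_i = ols("y ~ x1 + np.sin(x1) + I((x1 - 5)**2)", data=data).fit()
res_pre = ols("y ~ x1 + np.sin(x1) + x1sq", data=data).fit()

# The coefficient values agree term by term; only the labels differ.
print(np.allclose(res_i.params.values, res_pre.params.values))  # True
```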

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.979611
1    10.836072
2    10.574563
3    10.242388
4     9.901818
5     9.614841
6     9.427989
7     9.360943
8     9.401714
9     9.509580
dtype: float64