Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
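As a quick sanity check (not in the original notebook), note that sm.add_constant prepends a column of ones, so X has four columns in the same order as beta:

# Sketch: inspect the design matrix built above.
# add_constant prepends a column of ones, so the columns are
# const, x1, sin(x1), (x1-5)**2 -- matching beta's ordering.
print(X.shape)   # (50, 4)
print(X[:2])     # first two rows; the leading column is all ones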
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     928.9
Date:                Sat, 06 Feb 2021   Prob (F-statistic):           3.80e-41
Time:                        16:48:16   Log-Likelihood:                 2.3691
No. Observations:                  50   AIC:                             3.262
Df Residuals:                      46   BIC:                             10.91
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1219      0.082     62.458      0.000       4.957       5.287
x1             0.4837      0.013     38.243      0.000       0.458       0.509
x2             0.5293      0.050     10.646      0.000       0.429       0.629
x3            -0.0191      0.001    -17.187      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        0.352   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.839   Jarque-Bera (JB):                0.364
Skew:                          -0.184   Prob(JB):                        0.834
Kurtosis:                       2.802   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
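Beyond the printed summary, the results object exposes the same quantities programmatically; a minimal sketch using standard statsmodels result attributes:

# Coefficient estimates, their standard errors, and R-squared,
# as reported in the summary table above
print(olsres.params)     # estimated beta
print(olsres.bse)        # standard errors of the estimates
print(olsres.rsquared)   # 0.984 in this run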
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.64477526 5.12701216 5.56837103 5.94000426 6.22347521 6.41378725
6.52020475 6.56473091 6.57849274 6.59662697 6.65250716 6.77226056
6.97047544 7.24780431 7.59085668 7.97439911 8.36550098 8.72894489
9.0330128 9.25469719 9.38348115 9.42306627 9.39076448 9.31465374
9.22896494 9.16845781 9.16271006 9.23125779 9.38038579 9.60209496
9.87541657 10.16985728 10.45040922 10.6833016 10.84154918 10.90938589
10.88485513 10.78013122 10.61951923 10.43546169 10.26320841 10.13502501
10.07489175 10.09456371 10.1916394 10.34995533 10.54224166 10.7346025
10.89208497 10.98442182]
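These in-sample predictions are also stored on the results object as fittedvalues; a one-line check (numpy is already imported above):

# predict(X) at the estimation sample equals the stored fitted values
print(np.allclose(ypred, olsres.fittedvalues))   # True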
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.97961106 10.836072 10.57456268 10.24238798 9.90181781 9.6148412
9.42798927 9.36094275 9.40171351 9.50957959]
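If you also need uncertainty estimates for these out-of-sample points, get_prediction returns standard errors along with confidence and prediction intervals; a minimal sketch:

pred = olsres.get_prediction(Xnew)
# summary_frame includes the point prediction (mean), its standard
# error, a confidence interval for the mean (mean_ci_*), and a
# prediction interval for new observations (obs_ci_*)
print(pred.summary_frame(alpha=0.05))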
Plot comparison¶
[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
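A possible extension, not in the original example: shade the 95% prediction interval for the out-of-sample region on the same axes.

# Shade the out-of-sample prediction interval (reuses the fig/ax above)
pred_int = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, pred_int["obs_ci_lower"], pred_int["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");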
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any formula expansion magic applied to **2.
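To see what I() does, you can ask patsy (the formula engine behind statsmodels) for the design matrix directly; a sketch:

import patsy
# Inside I(...), the expression is evaluated as plain Python/numpy,
# yielding a single column equal to (x1 - 5)**2
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(dm.design_info.column_names)
# ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']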
[9]:
res.params
[9]:
Intercept 5.121904
x1 0.483672
np.sin(x1) 0.529322
I((x1 - 5) ** 2) -0.019085
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.979611
1 10.836072
2 10.574563
3 10.242388
4 9.901818
5 9.614841
6 9.427989
7 9.360943
8 9.401714
9 9.509580
dtype: float64
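The transforms carry through to interval estimates as well; a sketch combining the formula interface with get_prediction, assuming the default transform=True applies the formula to the dict just as predict does:

# Confidence and prediction intervals for the new x1 values;
# the formula's transforms are applied to the dict automatically
pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))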