Try to Predict S&P500

Posted on Sun 23 September 2018 in Machine Learning

Predict the S&P500 Index

The goal is to predict the S&P500 index using indicators derived from its historical prices.

I will train a model on data from 1950 to 2012 and make predictions for 2013 to 2015.

In [44]:
import pandas as pd
import numpy as np
from datetime import datetime
In [45]:
df = pd.read_csv('YAHOO-INDEX_GSPC.csv')

Reading Data

In [46]:
df["Date"] = pd.to_datetime(df["Date"])
# Sort chronologically (oldest first); the file is stored newest first.
df.sort_values(by="Date", ascending=True, inplace=True)
df.head()
Out[46]:
            Date   Open   High    Low  Close     Volume  Adjusted Close
16847 1950-01-03  16.66  16.66  16.66  16.66  1260000.0           16.66
16846 1950-01-04  16.85  16.85  16.85  16.85  1890000.0           16.85
16845 1950-01-05  16.93  16.93  16.93  16.93  2550000.0           16.93
16844 1950-01-06  16.98  16.98  16.98  16.98  2010000.0           16.98
16843 1950-01-09  17.08  17.08  17.08  17.08  2520000.0           17.08

Indicators

I want to teach the model to predict the current price from historical prices.

The indicators must not include the current price; otherwise the model would be trained on information that is not available when predicting a future value of the index.

In [78]:
# rolling() includes the current row, so shift(1) moves each mean down one row;
# every indicator then depends only on earlier prices.
# Windows are counted in rows (trading days), not calendar days.
df['mean_5day'] = df['Close'].rolling(window=5).mean().shift(1)
df['mean_30day'] = df['Close'].rolling(window=30).mean().shift(1)
df['mean_365day'] = df['Close'].rolling(window=365).mean().shift(1)
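
To see why the shift matters, here is a minimal sketch on a made-up five-value series (the values and the names `toy`, `leaky` and `safe` are illustrative, not taken from the dataset). Without the shift, the rolling mean at a given row already contains that row's price; with it, each row only sees earlier prices.

```python
import pandas as pd

# Toy prices, purely illustrative (not from the S&P500 file).
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Rolling mean over 2 rows: row i averages rows i-1 and i (includes the current price).
leaky = toy.rolling(window=2).mean()

# Shifted by one row: row i averages rows i-2 and i-1 (past prices only).
safe = toy.rolling(window=2).mean().shift(1)

print(pd.DataFrame({"price": toy, "leaky": leaky, "safe": safe}))
```

In this toy case the row with price 30 gets a leaky mean of 25 (which uses the 30 itself) and a safe mean of 15 (which uses only 10 and 20).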

Filter Data

In [82]:
# Row 365 is the first row with a 365-row mean; show only the original 7 columns.
print(df.iloc[365, :7])
Date              1951-06-19 00:00:00
Open                            22.02
High                            22.02
Low                             22.02
Close                           22.02
Volume                        1.1e+06
Adjusted Close                  22.02
Name: 16482, dtype: object
In [83]:
filtered_df = df[df["Date"] >= datetime(year=1951, month=6, day=19)]
filtered_df.head()
Out[83]:
            Date       Open       High        Low      Close     Volume  Adjusted Close  mean_5day  mean_30day  mean_365day
16482 1951-06-19  22.020000  22.020000  22.020000  22.020000  1100000.0       22.020000     21.800   21.703333    19.447726
16481 1951-06-20  21.910000  21.910000  21.910000  21.910000  1120000.0       21.910000     21.900   21.683000    19.462411
16480 1951-06-21  21.780001  21.780001  21.780001  21.780001  1100000.0       21.780001     21.972   21.659667    19.476274
16479 1951-06-22  21.549999  21.549999  21.549999  21.549999  1340000.0       21.549999     21.960   21.631000    19.489562
16478 1951-06-25  21.290001  21.290001  21.290001  21.290001  2440000.0       21.290001     21.862   21.599000    19.502082

1951-06-19 is the first date for which the 365-day mean is available, so earlier rows are dropped.
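
Instead of reading the cutoff date off the data by hand, the first usable row can also be looked up programmatically. The sketch below is an assumption about how one might do it, not what the original notebook does: it uses a small synthetic frame and a 3-row window standing in for the real data and the 365-row window. `first_valid_index` is the pandas method that returns the index label of the first non-NaN entry.

```python
import pandas as pd

# Synthetic stand-in for the real data: 6 business days of made-up closes.
toy = pd.DataFrame({
    "Date": pd.date_range("2000-01-03", periods=6, freq="B"),
    "Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

window = 3  # stands in for the 365-row window used in the post
toy["mean_3day"] = toy["Close"].rolling(window=window).mean().shift(1)

# Label of the first row whose indicator is not NaN.
first_label = toy["mean_3day"].first_valid_index()
first_date = toy.loc[first_label, "Date"]

# Keep only rows from that date onward.
filtered = toy[toy["Date"] >= first_date]
print(first_label, first_date.date(), len(filtered))
```

With a window of w rows followed by a one-row shift, the first valid indicator sits at position w, which is why position 365 is the one inspected for the 365-row mean.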

Split Data for Training and Test

In [111]:
df_drop_na = filtered_df.dropna(axis=0)

# Build the masks from df_drop_na itself so their index matches the frame being filtered.
split_date = datetime(year=2013, month=1, day=1)
train = df_drop_na[df_drop_na["Date"] < split_date]
test = df_drop_na[df_drop_na["Date"] >= split_date]

Prediction

In [112]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
x = ['mean_5day', 'mean_30day', 'mean_365day']
y = ['Close']

model.fit(train[x], train[y])
predictions = model.predict(test[x])

# Mean absolute error: average distance between the predictions and the actual index.
mae = np.abs(predictions - test[y]).mean()
print(mae)
Close    16.621037
dtype: float64
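
For reference, the mean absolute error is the average of the absolute differences between predicted and actual values. The sketch below works it out by hand on four made-up numbers (the arrays `actual` and `predicted` are illustrative and unrelated to the S&P500 data).

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
actual = np.array([100.0, 102.0, 101.0, 105.0])
predicted = np.array([98.0, 103.0, 104.0, 105.0])

# Absolute errors are 2, 1, 3 and 0, so their mean is 1.5.
mae_toy = np.mean(np.abs(predicted - actual))
print(mae_toy)
```

Read the same way, the MAE above means the predictions were off by about 16.6 index points on average over the test period.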
In [101]:
print(model.score(train[x], train[y]))
0.999550132408
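
`LinearRegression.score` returns the coefficient of determination R^2, and here it is evaluated on the training data, not on the 2013-2015 test data. As a rough sketch of what that number measures, the example below computes R^2 by hand on made-up values (the arrays are illustrative only).

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
actual = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 2.0, 3.0, 3.5])

# Residual sum of squares: 0.25 + 0 + 0 + 0.25 = 0.5
ss_res = np.sum((actual - predicted) ** 2)

# Total sum of squares around the mean (2.5): 2.25 + 0.25 + 0.25 + 2.25 = 5.0
ss_tot = np.sum((actual - actual.mean()) ** 2)

# R^2 = 1 - SS_res / SS_tot
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

A value close to 1 means the residuals are small relative to the overall variation of the target.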

Adding a Standard Deviation Indicator to Improve the Prediction

In [113]:
# Same shift(1) as for the means, so the indicator uses past prices only.
df_drop_na['std_5day'] = df['Close'].rolling(window=5).std().shift(1)

# Build the masks from df_drop_na itself so their index matches the frame being filtered.
train = df_drop_na[df_drop_na["Date"] < datetime(year=2013, month=1, day=1)]
test = df_drop_na[df_drop_na["Date"] >= datetime(year=2013, month=1, day=1)]

x = ['mean_5day', 'mean_30day', 'mean_365day', 'std_5day']
y = ['Close']

model.fit(train[x], train[y])
predictions = model.predict(test[x])

# Mean absolute error: average distance between the predictions and the actual index.
mae = np.abs(predictions - test[y]).mean()
print(mae)
Close    16.617805
dtype: float64