During my COVID downtime, I decided to investigate the concept of economic undercurrents. I wanted to see if there were underlying relationships between any two stocks that could be used to *cough, cough* make smart stock trades. I realize that this sounds like a cross between a fool’s errand and pure alchemy. Here’s where I went with it.
By the time I started this in April, I had already taken several classes in data science and data analysis (on top of having two degrees in number-monkeying), but I had yet to tackle a “big data” project in my free time. In the grand scheme of big data, this seemed like a reasonable way to get my feet wet. The first step was data acquisition – normally the hardest part for engineers. I wanted to grab a “meaningful” and diverse set of stock values over as large a time span as possible, and I settled on the NASDAQ 100 index stocks. While I was scraping the NASDAQ repositories, I figured I’d grab some commodities values as well. I wrote two pieces of code in Python, “Commer” and “Stocker”, to go out and grab the information from the NASDAQ site.
After grabbing all of the sweet, sweet stock data, I now had a real fun data science question on my hands: how do you find patterned correlations between two constantly changing values? If you look at my graph above, you’ll see the basic model for what I wanted. In reality, stock prices behave somewhat like this:
As you can see, the stock values go up and down, but generally rise over time. This is good for an investor, but bad for a data analyst – a $5 increase 10 years ago can have a different meaning than a $5 increase 10 days ago. If I was going to have any luck with this, I would need to perform clever feature extraction and normalization on the values. Here’s the basic algorithm I used:
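In condensed form, the normalization boils down to the sketch below (the function name and signature are just for illustration; the full SC_Clean_avg.py script appears at the end of this post). A weighted blend of the three most recent opening prices is compared against a trailing 20-day average, so the score reads as a fractional move relative to the stock’s recent “local” level.

import pandas as pd

def local_score(prices, row, window=20):
    # prices: a pandas Series of one stock's opening prices, oldest first
    local_avg = prices.iloc[row - window:row].mean()   # trailing 20-day "local" level
    if prices.iloc[row] == 0 or local_avg <= 0.5:      # skip missing or junk data
        return 0.0
    recent = (0.6 * prices.iloc[row]                   # weighted blend of the last three days
              + 0.3 * prices.iloc[row - 1]
              + 0.1 * prices.iloc[row - 2])
    return (recent - local_avg) / local_avg            # fractional move vs. the local level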
Now I’ve got a somewhat normalized “local” value for each stock that allows me to compare a single growth trend in 2010 to a completely different growth trend in 2020. In this context, I use “local” to mean a value that appears in the same time period as the original stock value.
The next step was to create a correlation matrix between every stock at an “activation” time and every other stock at a later “response” time. To ease the computational load, I set thresholds to decide which values were actually worth running this computation on. One important realization I had: it doesn’t matter how much the “activation” stock value increased or decreased – I just needed to know whether it tripped a threshold (think of a neural network). Here’s what the algorithm looked like:
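In stripped-down form it looks something like the sketch below; the names and thresholds are illustrative, and the full ComStock_avg.py script at the end of the post is the real thing. Each day contributes one “layer”: an outer product between a binary “did the activator trip its threshold on day t” vector and the responders’ normalized values on day t + delta_t.

import numpy as np

def correlation_matrix(scores, delta_t, act_thresh=0.1, resp_thresh=0.05):
    # scores: 2-D array, rows are trading days, columns are stocks (normalized local values)
    days, n = scores.shape
    corr = np.zeros((n, n))   # summed correlation matrix for this time window
    for t in range(days - delta_t):
        activated = (scores[t] > act_thresh).astype(float)                    # 1 if the activator tripped its threshold
        response = scores[t + delta_t] * (scores[t + delta_t] > resp_thresh)  # responder's later move, if meaningful
        corr += np.outer(activated, response)                                 # one correlation layer, added to the stack
    return corr

Entry corr[i, j] then accumulates how strongly stock j tends to move delta_t days after stock i fires.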
Here’s what the result of the algorithm actually looks like in table format:
Each table of values (AKA correlation layer) represents a single “activation” day’s effects on another “response” day’s stock values. For each time window (delta_t), those layers were summed into a full correlation matrix.
Here is where I had a major “a ha!” moment. All this time I had only been thinking about finding responses from positive or negative stimuli from a purely scientific perspective. I had failed to account for the dreaded FALSE POSITIVES. Here’s how I amended my method:
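The real script keeps separate false-positive tallies and folds them into a success rate; boiled down to the positive-activation case (illustrative names again, not the exact production code), the amendment looks roughly like this:

import numpy as np

def success_rate(scores, delta_t, act_thresh=0.1, resp_thresh=0.05):
    # scores: days x stocks array of normalized local values
    days, n = scores.shape
    hits = np.zeros((n, n))    # activation followed by a real response
    misses = np.zeros((n, n))  # activation followed by... nothing (the dreaded false positives)
    for t in range(days - delta_t):
        activated = scores[t] > act_thresh
        responded = scores[t + delta_t] > resp_thresh
        hits += np.outer(activated, responded)
        misses += np.outer(activated, ~responded)
    return np.nan_to_num(hits / (hits + misses)), hits   # pairs that never activated come out as 0

Only pairs whose success rate clears a cutoff (and that fired more than a handful of times) make it into the results table.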
Now that I had a robust method for finding incidents of correlation and aggregating them (or the lack of correlation – stupid false positives!), I just needed a figure of merit for how well-behaved each correlation relationship was.
Here is a snapshot of the results and an explanation of a few other things that I wanted to capture as outputs of the process.
Strength – How big the response from the 2nd stock was after the 1st stock tripped the threshold
Instances – Did this relationship happen once over the course of 10 years, or is it a regularly occurring pattern?
InstDev – This is the standard deviation of instances; it lets you know how far apart the occurrences of the pattern happened (there’s a short illustration just below). That might sound useless, but read on for why it’s critical.
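To make that last column concrete, here is a toy example with made-up day indices for a single activator-responder pair:

import numpy as np

scattered = np.array([120, 610, 1140, 1700, 2300])    # hits spread across ~10 years of trading days
clustered = np.array([1501, 1503, 1506, 1508, 1511])  # hits packed into a single rally
print(scattered.std())   # large InstDev: the pattern recurs independently
print(clustered.std())   # small InstDev: probably one trend counted five times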
Before you go and empty your bank account to throw all of your money into Dover’s magical mystery math machine, take a look at the plot below. I’ve plotted out where the activation threshold was met (gold diamonds) and the corresponding response (black diamonds). My code told me that I had 25 instances of correlation, but I can clearly see that we’ve got something closer to 3 coinciding upward trend sessions.
If I want to take this program somewhere truly useful, I will need some mechanism to keep two coinciding upward trends from artificially inflating the instance count. The simplest ways to do this would be to either implement a “cool off” period after each paired activation-response occurrence, or to capture the activation-response days and build “clusters” that would then become the figure of merit instead of a pure instance count (a rough sketch of the cool-off idea is below). If you want to take a crack at it yourself, here is my code.
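Here is a rough sketch of what that cool-off approach might look like; none of this exists in the scripts below, and the 10-day cool-off length is a made-up placeholder:

import numpy as np

def count_with_cooloff(hit_days, cooloff=10):
    # hit_days: day indices on which one activator-responder pair fired
    counted = 0
    last = -np.inf
    for day in np.sort(hit_days):
        if day - last >= cooloff:   # ignore hits still riding the same trend as the last counted hit
            counted += 1
            last = day
    return counted

print(count_with_cooloff(np.array([1501, 1503, 1506, 1508, 1511])))   # 2 instead of 5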
I use Spyder 4.1.1 with Python 3.7 as part of an Anaconda environment.
HINT: You need a CSV file of the NASDAQ 100 index stock symbols
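The scripts expect that file to have a SYMBOL column; the first few rows might look like this (tickers shown purely as an example):

SYMBOL
AAPL
MSFT
AMZN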
"""
Stocker.py
Created on Wed Apr 1 18:46:11 2020
@author: DOVER
"""
import pandas as pd
stocks_pd = []
stocks_pd2 = []
NDX = pd.read_csv(r'C:\Users\Dover\Documents\Python Scripts\NDX_Symbols.csv')
num_stocks = NDX.size
stock_name = NDX['SYMBOL'][0]
stock_file = 'https://www.nasdaq.com/api/v1/historical/'+stock_name+'/stocks/2010-04-01/2020-04-02'
stocks_pd = pd.read_csv(stock_file)
stocks_pd.rename(columns={' Open':stock_name}, inplace = True)
stocks_pd = stocks_pd.iloc[:,[0,3]]
# loop over the remaining symbols, joining each one's 'Open' column onto the running table by date
for row in NDX['SYMBOL'][1:]:
    stock_name = row
    try:
        stock_file = 'https://www.nasdaq.com/api/v1/historical/'+stock_name+'/stocks/2010-04-01/2020-04-02'
        stocks_pd2 = pd.read_csv(stock_file)
        stocks_pd2.rename(columns ={' Open':stock_name}, inplace = True)
        stocks_pd2 = stocks_pd2.iloc[:,[0,3]]
        stocks_pd = stocks_pd.join(stocks_pd2.set_index('Date'),on='Date')
    except Exception: # some symbols return no data for the full date range
        print('No data for '+stock_name)
print(stocks_pd)
stocks_pd.to_csv(r'C:\Users\Dover\Documents\Python Scripts\stocks_pd.csv',index=False)
"""
SC_Clean_avg.py
Created on Sun Apr 5 16:48:51 2020
@author: Dover
"""
import pandas as pd
del_price = []
av_scores = pd.DataFrame()
stx_pd = pd.read_csv(r'C:\Users\Dover\Documents\Python Scripts\stocks_pd.csv')
stx_pd['Date'] = pd.to_datetime(stx_pd['Date'])
stx_pd = stx_pd.sort_values('Date') # assign back, otherwise the sort is discarded
cmd_pd = pd.read_csv(r'C:\Users\Ben\Documents\Python Scripts\cmds_pd.csv')
cmd_pd = cmd_pd.interpolate(method='linear',limit_direction = 'forward',limit = 3)
cmd_pd['Date'] = pd.to_datetime(cmd_pd['Date'])
cmd_pd = cmd_pd.sort_values('Date')
stxcmd_pd = stx_pd.merge(cmd_pd,left_on='Date',right_on='Date')
del stx_pd
del cmd_pd
dates_pd = stxcmd_pd.iloc[:,0]
stxcmd_pd = stxcmd_pd.fillna(0)
# score each column: weighted blend of the last three prices vs. the trailing 20-day average
for col in range(1,stxcmd_pd.shape[1]):
    col_name = stxcmd_pd.columns[col]
    for row in range(20,stxcmd_pd.shape[0]):
        local_avg = stxcmd_pd.iloc[row-20:row,col].mean()
        if stxcmd_pd.iloc[row,col] == 0 or local_avg <= 0.5:
            av_scores.loc[stxcmd_pd.index[row],col_name+'_avg'] = 0
        else:
            av_scores.loc[stxcmd_pd.index[row],col_name+'_avg'] = ((0.6*stxcmd_pd.iloc[row,col]+
                                                                    0.3*stxcmd_pd.iloc[row-1,col]+
                                                                    0.1*stxcmd_pd.iloc[row-2,col])-local_avg)/local_avg
av_scores.to_csv(r'C:\Users\Dover\Documents\Python Scripts\avg_scores.csv',index=False)
"""
ComStock_avg.py
Created on Thu Apr 2 14:04:10 2020
@author: Dover
"""
#run Commer, Stocker, and SC_Clean_avg first to ensure that CSV files are populated
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
startTime = datetime.now()
#scores = pd.read_csv(r'C:\Users\Dover\Documents\Python Scripts\z_scores.csv') # scores using Z-score method
scores = pd.read_csv(r'C:\Users\Dover\Documents\Python Scripts\avg_scores.csv')
results = pd.DataFrame({'Activator':[],'Responder':[],'Days':[],'Strength':[],'SuccessRate':[],'Instances':[],'InstDev':[]})
thresh = 0.05 # % increase over local price (z_score will use ~1-5 stdevs, avgs will use 0.1 [10%])
gthresh = 0.1 # % change used for activation
pgm = scores[scores > (0+thresh)] +1*(scores > (0+thresh)) #positive growth matrix
pgi = (scores>(0+gthresh))*1 #positive growth incidents
ngi = (scores<(0-gthresh))*-1 #negative growth incidents
fpm = np.ones(np.shape(pgm))-1*(pgm>0.9) #false positive matrix for positives
hs_range = range(10,65)
for Hist_Stride in hs_range:
    hs_time = datetime.now()
    print('Current Iteration: ',Hist_Stride)
    if Hist_Stride != min(hs_range):
        print('Approx. Time Remaining: ',(max(hs_range)-Hist_Stride+1)*(hs_time-startTime)/(Hist_Stride-min(hs_range)))
    #%%
    pgm = pgm.fillna(0)
    pgi = pgi.fillna(0)
    ngi = ngi.fillna(0)
    fpm = fpm.fillna(0)
    day_pp = [] # day past, positive
    day_pn = [] # day past, negative
    pcl = [] # positive correlation layer
    ncl = [] # negative correlation layer
    p_stack = np.zeros((pgm.shape[1],pgm.shape[1])) #stack of positive correlations
    n_stack = np.zeros((ngi.shape[1],ngi.shape[1])) #stack of negative correlations
    ps_inst = np.zeros((pgm.shape[1],pgm.shape[1])) # instances of positive correlations
    ns_inst = np.zeros((ngi.shape[1],ngi.shape[1])) # instances of negative correlations
    fpp_inst = np.zeros((pgm.shape[1],pgm.shape[1])) # instances of false positives due to positive activation
    fpn_inst = np.zeros((ngi.shape[1],ngi.shape[1])) # instances of false positives due to negative activation
    pdays_np = [[None,None,None]] #stores activation days for each successful match
    ndays_np = [[None,None,None]]
    #%%
    for layer in range(Hist_Stride,pgm.shape[0]):
        day = pgm.iloc[layer,:]
        day_fp = fpm.iloc[layer,:]
        day_pp = pgi.iloc[layer-Hist_Stride,:]
        day_pn = ngi.iloc[layer-Hist_Stride,:]
        pcl = day.to_numpy()*day_pp[:,None] # positive correlation for layer
        ncl = day.to_numpy()*day_pn[:,None] # negative correlation for layer
        fpp = day_fp.to_numpy()*day_pp[:,None] # false positive due to positive activation for layer
        fpn = day_fp.to_numpy()*day_pn[:,None] # false positive due to negative activation for layer
        p_stack = p_stack + pcl - 1.*(pcl>1.)
        n_stack = n_stack + ncl + 1.*(ncl<-1.)
        ps_inst = ps_inst + (pcl>0.001)*1.
        ns_inst = ns_inst + (ncl<-0.001)*1.
        fpp_inst = fpp_inst + (fpp>0.1)*1
        fpn_inst = fpn_inst + (fpn<-0.1)*1
        ppairs = np.array(np.transpose(pcl.nonzero()))
        npairs = np.array(np.transpose(ncl.nonzero()))
        layer_p_np = np.array(np.ones([np.size(pcl.nonzero(),1),1])).astype(int)*layer
        layer_n_np = np.array(np.ones([np.size(ncl.nonzero(),1),1])).astype(int)*layer
        pdays_np = np.concatenate((pdays_np,np.hstack((ppairs,layer_p_np))))
        ndays_np = np.concatenate((ndays_np,np.hstack((npairs,layer_n_np))))
    #%%
    sr_p = np.nan_to_num(ps_inst/(ps_inst+fpp_inst)) #success rate of positive correlation
    sr_n = np.nan_to_num(ns_inst/(ns_inst+fpn_inst)) #success rate of negative correlation
    suc_thresh = 0.85 # success rate threshold
    inst_thresh = 3 # instances threshold
    p_csm = np.nan_to_num(p_stack/ps_inst) * (sr_p > suc_thresh) * (ps_inst > inst_thresh) # positive strength correlation matrix, filtered by success rate
    n_csm = np.nan_to_num(n_stack/ns_inst) * (sr_n > suc_thresh) * (ns_inst > inst_thresh) # negative strength correlation matrix, filtered by success rate
    num_stds = 0 #number of standard deviations to filter out
    p_fsm = p_csm #*(p_csm>(p_csm.mean() + num_stds*p_csm.std())) # filter of strength correlation matrix
    n_fsm = n_csm #*(n_csm<(n_csm.mean() - num_stds*n_csm.std())) # filter of strength correlation matrix
    print(Hist_Stride)
    activ_p, resp_p = p_fsm.nonzero()
    inst_dev_p = [] # standard deviation of activation days for each activator-responder pair
    for a,r in zip(activ_p,resp_p):
        inst_dev_p = np.append(inst_dev_p, pdays_np[(pdays_np[:,0] == a) & (pdays_np[:,1] == r)][:,2].std())
    frame_p = pd.DataFrame({'Activator':scores.columns[activ_p],
                            'Responder':scores.columns[resp_p],
                            'Days':np.ones(np.shape(ps_inst[p_fsm.nonzero()]))*Hist_Stride,
                            'Strength':p_fsm[p_fsm.nonzero()],
                            'SuccessRate':sr_p[p_fsm.nonzero()],
                            'Instances':ps_inst[p_fsm.nonzero()],
                            'InstDev':inst_dev_p
                            })
    activ_n, resp_n = n_fsm.nonzero()
    inst_dev_n = [] # standard deviation of activation days for each activator-responder pair
    for a,r in zip(activ_n,resp_n):
        inst_dev_n = np.append(inst_dev_n, ndays_np[(ndays_np[:,0] == a) & (ndays_np[:,1] == r)][:,2].std())
    frame_n = pd.DataFrame({'Activator':scores.columns[activ_n],
                            'Responder':scores.columns[resp_n],
                            'Days':np.ones(np.shape(ns_inst[n_fsm.nonzero()]))*Hist_Stride,
                            'Strength':n_fsm[n_fsm.nonzero()],
                            'SuccessRate':sr_n[n_fsm.nonzero()],
                            'Instances':ns_inst[n_fsm.nonzero()],
                            'InstDev':inst_dev_n
                            })
    results = results.append(frame_p,ignore_index=True)
    results = results.append(frame_n,ignore_index=True)
results = results[results['InstDev']>15]