Time-series are a classical application area for Artificial Intelligence (AI) technologies and methods. Due to the current hype around FinTech and AI, financial time-series of the global financial markets in particular, such as indexes, foreign exchange rates, gold prices, interest rates, public equities, futures and many others, are increasingly analysed and predicted using modern AI models such as Deep Learning (DL).
Source: NoSimpler Analytics
So, not surprisingly, a growing number of specialised and non-specialised firms and major financial institutions (banks, brokers, funds, hedge funds, AI firms, FinTech firms etc.) are now deploying, or at least experimenting with, AI models and algorithms in the hope of forecasting these real-world time-series better than with the classical mathematical models that have been around for decades, such as linear or higher-order regression models (see picture above), ARIMA, ARFIMA, VAR models and so on.
In this short series of posts I will discuss which modern AI models and algorithms seem fit for the task and promise measurable progress in the analysis and forecasting of such time-series, and, on the other side, which models will probably not bring any statistically significant progress in the long run.
I will try to give an objective assessment of the use, the advantages and the disadvantages of the various AI models discussed, and of the issues associated with them.
This post is meant for readers with some background in finance and in AI, but who have little or no practical experience with such AI models yet. I will therefore avoid detailed mathematical discussions and complicated statistical arguments and rather focus on sharing general insights I have gained over many years of applying all sorts of AI models (including my own) to complex financial applications.
The obvious first choice: Deep Learning?
Many companies are now getting involved in AI by using standard Deep Learning networks (see picture above) and related models such as Convolutional Nets, Reinforcement Learning, Adversarial Learning models etc. This is hardly surprising: it is rare to find any article or review about AI today that does not mention these techniques and models, and praise them.
Since there is no need to hype these models up any further, I will focus here mostly on the problems one faces when trying to use DL and related models in finance, and especially in forecasting applications.
If you want to understand the problems and issues that come with DL, whether to gain a more balanced view or to make a more sophisticated investment decision, I hope to give you some first insights into what to really expect and what you will need to deal with:
Operational Problems: Batch Processing
Conventional DL models need to be trained in batch mode (some companies are working on variants that avoid this, but the vast majority of DL implementations still require it). This means the models are trained, then validated and tested, and the training process is then halted so they can be deployed.
While deployed, the DL models cannot adjust or learn anymore. If problems come up with the performance of these models, they have to be taken offline and need to be re-trained. The problem with this is that not only does this process usually disturb operations (unless you have a hot stand-by version of the DL net that keeps performing while the master model is being re-trained), but there is also no guarantee whatsoever that the newly retrained models perform better than the replaced ones.
Correcting an existing DL model by re-training is no easy fix (see most of the issues discussed below), and it restarts the whole deployment procedure from scratch. Re-training a DL model is not like patching or fixing errors in a classical software project; it is more like a completely new development project in which you do not even know exactly where the error is.
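For contrast, here is a minimal sketch of the operational difference between a frozen batch-trained model and one that supports incremental updates, using scikit-learn's linear SGDRegressor (a simple linear model, not a DL net; all data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Batch-mode workflow: the model is fit once, then frozen for deployment.
# Any later correction means a full retrain-and-redeploy cycle.
batch_model = SGDRegressor(max_iter=1000, tol=1e-4, random_state=0)
batch_model.fit(X[:150], y[:150])

# Incremental workflow: partial_fit keeps updating as new data arrives.
online_model = SGDRegressor(random_state=0)
for i in range(0, 150, 10):          # simulate data arriving in chunks
    online_model.partial_fit(X[i:i + 10], y[i:i + 10])

# A new chunk arrives later: only the online model can absorb it in place.
online_model.partial_fit(X[150:160], y[150:160])
```

Most mainstream DL frameworks follow the batch workflow above; true online learning remains the exception.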
Back-Propagation: Classical Gradient Descent Algorithm
Source: Lazy Programmer website
DL models are trained with an algorithm (called back-propagation) that is essentially a classical mathematical (steepest) gradient descent method. Such methods have been known and studied in mathematics for decades. So DL is nothing really new; it is rather an old mathematical optimisation technique in new PR and marketing clothes. Their key and well-known problem is that they are not guaranteed to find the best global setting of their parameters (weights): while searching, they can easily get stuck in local optima instead of finding globally optimal parameter sets. Applied to time-series analysis and forecasting, this means they are in no way guaranteed to generate optimal forecasting results.
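A tiny one-dimensional toy example (not a neural net, but the same steepest-descent mechanics) shows how the starting point alone can decide whether the search lands in the global minimum or a merely local one:

```python
import numpy as np

def f(x):  return x**4 - 3*x**2 + x      # a function with two minima
def df(x): return 4*x**3 - 6*x + 1       # its derivative (the "gradient")

def gradient_descent(x0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)                  # classical steepest-descent update
    return x

left  = gradient_descent(-2.0)   # converges to the deeper, global minimum
right = gradient_descent(+2.0)   # gets stuck in the shallower local minimum
```

Both runs follow the gradient faithfully, yet only one finds the global optimum; in a DL net the same thing happens in a space of millions of dimensions.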
Costs of Learning and Computation
Source: Victor Lavrenko, 2014
The training of these networks is extremely computationally expensive. Updating every single weight in a DL network requires the algorithm to evaluate nested derivatives of various complex transfer functions (via the chain rule). A serious DL network, however, may easily have a million or more such adaptable weights, and each weight needs to be re-adjusted for every single training example.
For example, if a time series has say 1,000 values/data points (roughly a 3+ year stock chart with one daily closing price per trading day), the DL network needs to calculate weight adjustments 1,000 times (once per training example) for a million weights, so a billion (!) adjustments, and that is just for a single pass over all training points. In reality the network may easily need to be shown all training examples hundreds or thousands of times more before all weights are adjusted well enough that the forecasts stay below an expected and tolerated error rate on a test set. To complete such massive amounts of calculation in a reasonable timeframe (days or hours of training instead of months or years), DL networks usually need to be trained on massive server farms, on a large number of parallel computers, or on a combination of both.
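The arithmetic behind this back-of-envelope estimate (all figures are the illustrative numbers from the text, not measurements of any real system):

```python
# Rough cost of training, using the illustrative figures from the text.
n_weights = 1_000_000      # weights in a modest "serious" DL net
n_samples = 1_000          # ~3 years of daily closing prices
n_epochs  = 500            # passes over the data until convergence (assumed)

updates_per_epoch = n_weights * n_samples         # one billion per pass
total_updates     = updates_per_epoch * n_epochs  # 500 billion overall
```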
To mitigate these training-time issues, several practical tricks and techniques are often used. For example: pre-training the nets for a few cycles with very simple algorithms, training only part of the weights in each cycle, using genetic algorithms to pre-set the weights rather than starting from random weight values, or approximating the net structure with some algorithm. But none of these are principled, scientific approaches; they are mostly unguided trial and error. The tricks might work in certain situations, or they might not. Success and time/effort reduction are not guaranteed. And even if these or similar tricks somehow work, you may not understand in the end how and why they actually work, which is not exactly helpful either (see also below).
Structure and Design of the DL Networks - plenty of trial and error
Source: Github, CS231n Convolutional Neural Networks for Visual Recognition
Another major issue with DL networks, for any application including time-series forecasting, is how to find the best overall design and structure of the network. There is no known theory to decide how to design the best network. This is a big problem in practice because there is no guidance on, for example, how many layers the network should have, how many neurons per layer, which transfer function the neurons should use, which activation threshold values to pick, or how the layers of neurons should be connected with each other. The best help here will probably come from engineers who already have lots of experience with different DL networks across various applications and can "guess" how to start and how to progress over time to improve the models.
Obviously, the overall structure of the DL network also has a major impact on computing costs and efforts. You can use a small network or a large one with millions of weights. But there is no guarantee at all that the larger or more complex network performs better than a simpler and smaller one with far fewer neurons and layers (see the discussion of over- and under-fitting below).
The most practical design approach seems to be to start with a reasonably small network, train it, see what kind of results can be achieved, and then slowly enlarge it with additional neurons, layers, connections etc. until the prediction results reach a more acceptable level.
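This grow-slowly strategy can be sketched as a simple loop over candidate architectures, here with scikit-learn's small MLPRegressor on synthetic data (the candidate sizes, the stopping threshold and the data are all illustrative assumptions, not a recipe):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + rng.normal(scale=0.05, size=300)
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

best_size, best_err = None, np.inf
for hidden in [(4,), (16,), (64,), (64, 64)]:   # enlarge the net step by step
    net = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    err = np.mean((net.predict(X_va) - y_va) ** 2)   # validation MSE
    if err < best_err:
        best_size, best_err = hidden, err
    if err < 0.01:        # an acceptable level was reached: stop enlarging
        break
```

Note that the loop is still trial and error; it merely organises the trials from cheap to expensive.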
To stress this point again: there are many parameters that need to be set and/or learned in DL networks for which there is neither theoretical nor practical guidance; it is mostly a massive trial-and-error effort to find the best way to work with these nets. There are some approximation methods (for example, using genetic algorithms to find a good starting configuration for a DL network architecture), but they usually add further computing time and resources to the already very expensive and time-consuming processing of the nets.
If you have deep pockets and massive server farms available, you may run several learning trials with different DL nets concurrently, select the best-performing net, and then run the training again with the next-best nets as competitors (a kind of survival-of-the-fittest approach to DL net design).
Source: Silver, D. et al., Nature Vol. 529, 2016; Bob van den Hoek, 2016
This was the method used by Google/DeepMind (AlphaGo) to finally arrive at a network able to beat the human Go champion. Such an evolutionary approach to finding the best network architectures and parameter sets can in principle be used for any application of DL nets and may yield decent results, but bear in mind the potentially gigantic computing costs and efforts associated with it if you are not Google.
Source: (CMP305) Deep Learning on AWS Made Easy
Even when a good starting network structure and a good set of initial weights for a problem have been found, the best matching overall training approach for the DL net still has to be identified.
The standard and intuitive training approach for a DL net to handle the forecasting of a time-series is to define a "gliding window" over the data points (see picture below). A fixed time window of data points (say 20 sequential closing prices of a certain stock, representing roughly one month of trading data) cuts the training set into fixed chunks that are fed into the input layer of the network, while the expected forecast value (say the 21st day's closing price) is presented at the output neurons. This is repeated over the whole time series until the end of the available training data is reached. The expectation behind this training approach is that whenever any future window of 20 consecutive days is input to the net, the DL will have learned how to predict the 21st day that follows from these inputs.
It is obvious that the width of the window is crucial here. Should it be 20 consecutive days, or 30, or 60, or 180, or any other number? The width of the window is limited only by the length of the available data set; other than that, it can take any size. Again, there is no theoretical guidance on how to select the optimal window width except in rare cases. Also, why use a window of consecutive dates for the training at all? One could also use non-consecutive periods, or a combination of sequential dates plus some other dates added to the input window. There is no reason why this should not work equally well, or even better, or worse for that matter.
And in what sequence should the training samples and the sliding window be applied during training? Some sequence has to be chosen, but which one? We have simply assumed the most obvious solution: the window slides smoothly from the beginning of the time-series to its end. But it is not clear at all that this is necessary or even helpful. The window could also be applied in a random sequence, jumping from one start date to another. Yet for the adjustment of the weights, the exact order in which the training samples are applied can have a substantial effect on the weight values and hence on the prediction results!
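The gliding-window preparation itself is straightforward to sketch; the window width and the presentation order remain free (and consequential) choices:

```python
import numpy as np

def make_windows(series, width):
    """Cut a 1-D series into (input window, next-value target) pairs."""
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]                # the value directly after each window
    return X, y

# A synthetic "price" series standing in for ~1 year of daily closes.
prices = np.cumsum(np.random.default_rng(2).normal(size=250)) + 100.0
X, y = make_windows(prices, width=20)   # 20-day windows, 21st day as target

# Presentation order is a free choice: random instead of sequential.
order = np.random.default_rng(3).permutation(len(X))
X_shuffled, y_shuffled = X[order], y[order]
```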
Over- and under-fitting - or when to stop Training?
Assuming the sequence in which the training data is presented and the data-window problem have been resolved, it is still very difficult to determine when to stop the training process for the DL network.
Source: J. Braz. Soc. Mech. Sci. & Eng., Vol. 32, No. 2, Rio de Janeiro, Apr./June 2010
How can we decide when the training of the DL network can and should be stopped and is sufficient for a certain task? At first sight this seems an easy problem to resolve: just follow the development of the overall output (forecasting) error of the DL network. As long as the error keeps going down, continue learning; once it goes back up, stop the training. But this simple rule does not work reliably because, as I have already explained above, the DL network follows a gradient-descent scheme that can easily get stuck in local optima; the error curve can rise temporarily and then drop again later.
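In practice, this naive stop rule is usually softened into "patience"-based early stopping, which tolerates a few bad validation checks before giving up. A minimal, framework-free sketch (the error values below are made up purely for illustration):

```python
class EarlyStopping:
    """Stop training only after the validation error has failed to improve
    for `patience` consecutive checks, so that a single temporary bump in
    a noisy descent path does not end the training prematurely."""
    def __init__(self, patience=5):
        self.patience, self.best, self.bad_checks = patience, float("inf"), 0

    def should_stop(self, val_error):
        if val_error < self.best:
            self.best, self.bad_checks = val_error, 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

stopper = EarlyStopping(patience=3)
errors = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63]   # noisy but plateauing
stopped_at = next(i for i, e in enumerate(errors) if stopper.should_stop(e))
```

Even this is only a heuristic: patience does not distinguish a genuine plateau from a local optimum that a longer run might still escape.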
Another approach would be to stop learning once the network reaches a certain absolute success level, say an error rate of less than 10% or 5%, or whatever seems reasonable and acceptable for the problem at hand. However, this is not a good approach either. If a DL net correctly solves a problem in 90% or more of the test cases, this could simply mean that the network is too big and has too many weights.
Source: Erickson, Bradley James, et al. "Machine Learning for Medical Imaging." RadioGraphics 37(2) (2017): 505-515.
When a network has too many weights, it can pretty much just store all relevant features of the input data in a distributed manner in its weights (this effect is called "overfitting", see picture). In effect, the DL network has "learned the data by heart".
The key problem with this is that such networks may not generalise well. They mostly recognise, with high confidence, inputs that are exactly like or very similar to the training samples seen during training, but they may perform badly on inputs that deviate more strongly from that data set.
So, over-fitting means the DL net cannot generalise well from what it has learned. Generalisation, however, is exactly the feature you want and need for most applications, and what would make an AI model "intelligent" in the first place; otherwise you could just use a very simple Excel sheet with a simple allowed tolerance (say 1% divergence from the training data) to make predictions.
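Over-fitting is not specific to neural nets; any model with too many free parameters can memorise its training data. A small illustrative sketch with polynomial fits on synthetic data (the data and degrees are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=10)
x_test  = np.linspace(0.02, 0.98, 50)
y_test  = np.sin(2 * np.pi * x_test)

# A degree-9 polynomial through 10 points has enough parameters to
# memorise them exactly: near-zero training error ("learned by heart").
over = np.polyfit(x_train, y_train, 9)
train_err = np.mean((np.polyval(over, x_train) - y_train) ** 2)
test_err  = np.mean((np.polyval(over, x_test)  - y_test)  ** 2)
```

The training error is essentially zero while the test error is not: the fit has stored the noise in the samples instead of the underlying curve.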
The opposite effect, called "under-fitting", is also possible. The network might not have the capacity to classify the input patterns with the needed precision.
A typical symptom of underfitting DL nets, often visible in tests and even in published papers about DL forecasting successes, is a forecast curve that looks like a copy of the time-series itself, just shifted to the right by a delay (see the blue line in the picture above).
This effect usually arises when the DL network "detects" an implicit rule that at first sight produces good predictions for many time-series: the (n+1)th data point is most likely equal or very similar to the nth data point before it. This reflects the statistical fact that in many real-world time series (including many financial ones) a substantial change between neighbouring data points is less likely than the neighbours staying within the same narrow range.
For example, one can easily generate a weather forecast by predicting that certain weather patterns (temperature, rain, humidity etc.) of two consecutive days in one location stay similar. This can easily yield prediction success rates of 70+%, since drastic weather changes are relatively rare in most locations: the weather tomorrow is more often like the weather today than not. We are usually unaware of this simple statistical rule because we only consciously notice the weather when it does change, not when it stays stable. But the rule is useless as a basis for predictions, since the whole point of most weather forecasting is to predict in advance exactly those days and times when the weather actually changes noticeably!
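This "persistence" rule is trivial to implement and, on slowly moving series, scores deceptively well against naive baselines while never anticipating the turning points that actually matter. A small sketch on a synthetic random walk:

```python
import numpy as np

rng = np.random.default_rng(5)
# A random walk: like many financial series, tomorrow tends to be near today.
series = np.cumsum(rng.normal(scale=1.0, size=1000)) + 100.0

persistence = series[:-1]                    # predict y[t+1] = y[t]
mean_model  = np.full(999, series.mean())    # predict the long-run mean
actual      = series[1:]

mae_persistence = np.mean(np.abs(persistence - actual))  # looks very good
mae_mean        = np.mean(np.abs(mean_model - actual))   # looks terrible
```

An underfitting net that has implicitly learned nothing more than this rule will post similarly flattering error numbers without any forecasting value.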
So, over-fitting is not good, and neither is under-fitting. Where does this leave us? Under-fitting is relatively easy to detect: the prediction error will often be far above acceptable levels already during testing. Over-fitting is harder to spot. At first sight the high success rate is comforting, and most companies would read it as a sign of success rather than failure. But that just shows a lack of experience with DL models. A very low error rate, especially when you have comparable data on human performance, is more a reason to be sceptical than to be satisfied. In any real-world scenario, a success rate above 90% or so indicates a methodological, conceptual or data-corruption failure, not success.
This creates a serious problem in areas where a very high success rate is needed or required before you can deploy an AI model. This is the case in many safety related applications (self-driving cars for example), in medicine (cancer diagnostics) or in time series predictions if the AI model is used for example in high frequency trading and has to act in the millisecond range without possible oversight of humans or other safety mechanisms while trading large amounts of money.
To summarise this problem area: in general, there is no theory or practical guide for deciding when to stop training a DL net. The best approach here is, again, to rely on the judgement of experienced engineers. Considering the arguments above, however, any result of 90+% correctness from a DL net should worry you rather than satisfy you!
DL Success Stories - fake News? Or even worse?
Whenever you hear that a DL network is better at a certain task than all competitors or than humans, you should be very sceptical: read such announcements with real scrutiny and look for the vested interests behind them. You had better treat such announcements as if they were issued by Donald Trump.
Source: Technology Review
A wonderful and shocking example of this is an unbelievable result recently published by some Chinese DL researchers (see picture above). They claimed that a DL network could recognise and identify criminals by their faces! Yes, by their faces, and with 90+% correctness.
The Use of biased Data
The above example of the fake face recognition features leads us to another critical issue (which also extends to other learning models): the selection, quality and use of training and testing data.
What can explain the 90+% success rate in this case (assuming the authors were not also cheating and inflating their results)? The answer lies in the compilation and collection of the training data. The researchers used two sets of pictures to train the DL net: the first a set of photos of criminals, the second of "normal" people selected at random. Now, it is easy to get pictures of everyday people; you can compile them yourself from any social networking site.
However, how do you get, say, 1,000 pictures of criminals? Pretty much the only way is to strike some kind of deal with police or government authorities. But those pictures will be highly standardised: they will all, or mostly, have been taken after the criminals were arrested, in police stations or government correctional facilities.
And this is exactly what created the problem and the high "success" rate claimed. What the DL net really detected were not unique features of the criminals' faces but general features of the photo settings that happened to be statistically significant: the lighting used (intensity, colour, angle of the light source, shadows etc.), common features of the picture backgrounds (probably always the same wall texture or colour for the criminals, photographed in government buildings, versus backgrounds that varied in every single picture of the non-criminals), or other irrelevant features such as the clothes worn by the criminals (jail uniforms).
This case shows nicely how such an "innocent" and seemingly simple task as collecting the training and test data can determine the outcome and the success rates claimed for DL. It is a classic example of the "garbage in, garbage out" principle. In many cases, however, the "garbage in" part is not as easy to spot as it was here.
Seeing where the researchers went wrong in the face-recognition example was relatively easy. But consider financial applications such as time-series forecasting, with thousands of data points that are all just plain numbers within a certain range: there, things get much trickier. Analysing such numerical training data for bias or noise (random or not) can be mathematically very difficult. In fact, finding such hidden but statistically significant features in the data set is exactly what the DL nets are supposed to do in the first place!
DL and the missing Concept of Time
Now, let's leave the data problems and turn to one of the final and most crucial, but rarely seriously considered, limitations of DL networks: the lack of a concept of time. Even experienced engineers frequently miss this issue. A DL net has no mechanism or feature to "understand" and process the concept of time!
When a set of time-related sequential data points is put into the input layer of the network (say the first 20 consecutive days of a stock's closing prices), a DL net cannot recognise them as a sequence of points in time. All the inputs the net detects at any moment are seen as data points of the same instant, simultaneously!
A DL net does not "understand" that there is a timing element represented in the input data and between the gliding-window data frames. If it sees 20 days of stock prices and then another 20 days, it cannot realise that they are to be read in sequence, even when they are always shown one after the other during training. A DL net sees each input like a single frame of a movie, but it cannot relate two different frames in time and realise that one frame is a precondition of later frames. For the DL net, every single movie frame is independent of the others.
Two different inputs are just seen by the DL net as two different training samples with two different potential correlations between the input array and the expected output value(s). The DL net will always just try to associate one set of input data (the first 20 days) with an expected output (the prediction for the 21st day), and the next set of input data with another expected output, but it never relates them to any timing framework of "before" and "after", which is crucial in time-series.
It may seem, however, that a DL net can easily learn the concept of a "successor" and hence of time. A net can, for example, be trained by showing it frame n as input and frame n+1 as the expected output, then frame n+1 as input and frame n+2 as output, and so on. In this way it seems able to learn and predict what to expect when shown a certain input frame n+k. But beware: this is very misleading. If you feed this net a frame of the movie it has not seen during training, it will not generate the true successor frame from the movie but just some statistically mashed-up correlation of pixels learned from all the previously seen frames.
This is why a DL net cannot even reliably learn the successor function of the simplest time series, the series of natural numbers: 1, 2, 3, 4, 5, 6 and so on. The training of this series must always stop at some finite maximum number N. Even if the net produces the correct successor k+1 for any number k smaller than N, it is completely unclear what it will produce as the "successor" of a number far bigger than N that it has never seen during training, say the 120,000,000th prime number P. Chances are very high that it will output some arbitrary number, but not the successor of P.
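One concrete, hand-made illustration of why extrapolation beyond the training range fails: with a saturating activation such as tanh, the network's output is bounded by its output weights no matter how large the input grows. The weights below are picked by hand and are purely hypothetical; the point is the bound, not the quality of the fit:

```python
import numpy as np

# A toy one-unit "network": tanh hidden layer, hand-picked weights
# (hypothetical values standing in for a trained net).
w_in, b_in, w_out = 0.02, 0.0, 55.0

def successor_net(k):
    # For large k the tanh saturates at 1, so the output is capped at w_out.
    return w_out * np.tanh(w_in * k + b_in)

far_right    = successor_net(10_000)     # far outside any training range
even_further = successor_net(1_000_000)  # same capped output
```

No matter how far past the training range you go, the output never exceeds the cap, so the net cannot possibly return k+1 for large k.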
This major limitation of DL networks with regard to time-related patterns is something of a killer argument against using DL for time-series analysis or forecasting in most cases. (They may still be useful for detecting patterns in time-series, for example "head-and-shoulders" formations or other, more complex patterns, but only if these patterns are well represented in the overall statistics of the input data independently of their timing relation.)
Source: Hochreiter & Schmidhuber, 1997
Several DL variations have been proposed and built to cover up or compensate for this major limitation. One of the best-known examples is the LSTM (Long Short-Term Memory, see picture on the right) model, which introduces a kind of memory for past inputs, and therewith a timing component, and also allows feedback loops as another potential time encoding. However, these alternative DL variants have problems and limitations of their own, which I do not want to discuss right now. I will evaluate some of these models in another post.
Convolutional DL Nets
Source: CUDA-ConvNet, open-source GPU-accelerated code by Krizhevsky et al., NIPS 2012
The most prominent variants and extensions of DL nets are the so-called Convolutional Nets. These are DL nets with one or more pre-processing layers attached in front, before the input is fed into the DL net itself. This makes sense in many applications.
In time-series one can usually detect some simple regularities quite easily, such as general underlying trends (up, down, sideways in stock markets, for example) or seasonality effects that always recur at the same intervals. If the raw data is fed into a DL net without preprocessing, the net may need a lot of computational resources just to detect and filter out these simple components of the series. Hence most convolutional layers act as filters on the data, so the DL net proper can focus on the patterns in the time series that are much harder to recognise (see picture below).
Source: GRAL, Java graphing
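As a sketch of this filtering idea (not a trained convolutional layer, just a fixed moving-average kernel applied to synthetic data with an assumed trend and a 52-week season):

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(260)   # ~5 years of weekly data (illustrative)
series = (0.05 * t                              # underlying trend
          + 2.0 * np.sin(2 * np.pi * t / 52)    # yearly seasonality
          + rng.normal(scale=0.3, size=260))    # noise

# A simple moving-average "filter": one full season wide, so the
# seasonal component averages out and mostly the trend remains.
kernel = np.ones(52) / 52
trend = np.convolve(series, kernel, mode="valid")

# The residual (series minus smoothed trend) is what a downstream
# net could then focus on, instead of relearning the easy parts.
residual = series[51:] - trend
```

A learned convolutional layer does conceptually the same thing, except that the kernel weights are adjusted by back-propagation rather than fixed by hand.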
The advantage of using convolution these days is that the convolution layer can itself be trained by a back-propagation derivative and thus be automated, which was not the case in earlier times, when the convolutions had to be hand-programmed and customised for every application (see picture below).
Hiroshi Kuwajima, 2014
No free Lunch for using Open-Source Packages for DL
The hype around DL has already lured many companies into using DL models, and even complete DL software packages, because these are offered for free and are now often also available as open source.
KDNuggets, Getting Started with Deep Learning
Matthew Rubashkin, Silicon Valley Data Science
But be warned: it is no accident, in my eyes, that all the big Internet, mobile and social-network companies active in this field (Microsoft, Google, Amazon, Facebook etc.) are strongly promoting and further hyping these DL networks. They give them away for free as open-source packages, but at the same time they offer cloud services on which to run those packages, and these services are NOT free!
The big players not only want you to use their products and infrastructure and become dependent on them; they also want to extend their reach into new application areas (such as FinTech) and new clients. This is similar to what happened after Google gave away its Android mobile operating system for free: a calculated move, as they now essentially own and dominate the whole mobile OS market with a huge margin over any competitor such as Apple.
With DL open-source packages there is no such thing as a free lunch: over time you will depend strongly on these systems, and it may be far too costly to switch after you have invested substantial effort and money over the years. You also always need to consider that the providers of the free tools can change their open-source licences and terms to your disadvantage at any time.
Deep Learning networks and their variants can, and should, be used in many areas, with good and valuable results that would be hard to achieve with any other technology.
However, I do not recommend their use in time-series forecasting applications, for all the reasons explained above. The most critical points for me are: the very high computing costs and efforts, the need for massive amounts of unbiased training data, the over- and under-fitting problems, the lack of mechanisms to handle and represent time dependence in DL nets, and the extremely high number of parameters that need to be set and/or learned.
To be fair to DL nets, some of the arguments above (the need for unbiased data, for example) apply equally to all other data-driven learning models in AI. However, most of the other interesting models for time-series analysis and forecasting are not yet so over-hyped and therefore, in my eyes, deserve more attention so that their full potential can be understood.
I will discuss the use of other AI models for time-series forecasting in follow up posts.
Hong Kong, April 24, 2017