
National Research University Higher School of Economics

Faculty of Business Informatics


School of Software Engineering
Software Management Department









A Program for Kolmogorov Complexity Identification of Time Series







Word count: 3036

















Moscow
2014
Student: Artem Stafeev
Group: 471SE
Argument Consultant: Prof. Mikhail V. Ulyanov, PhD
Style and Language Consultant: Tatiana A. Stepantsova


ABSTRACT

The paper proposes an approach to time series research based on Kolmogorov
complexity identification, which could potentially improve the quality of data analysis and
processing. The Kolmogorov complexity is a numeric characteristic of a time series that
represents the ratio of the length of a compressed symbolic code of the data to the length of
the original code. The key element of the proposed approach is the way the symbolic code is
constructed: the bicriterial method is used, which balances two criteria, the reliability of
encoded segments and the consistency of the empirical and histogram distribution functions.
The compression of the obtained code could be performed by standard algorithms such as
RAR and ZIP. The paper also suggests directions for future research regarding clusterization
of time series.

CONTENTS

INTRODUCTION ...................................................................................................................... 4
1. RELATED WORK ............................................................................................................. 5
2. THEORETICAL BASIS..................................................................................................... 5
2.1. BICRITERIAL METHOD.............................................................................................. 7
2.2. RELIABILITY OF ENCODED SEGMENTS .............................................................. 8
2.3. DISTRIBUTION CONSISTENCY ............................................................................... 9
3. COMPRESSION PROCEDURE ....................................................................................... 9
4. KOLMOGOROV COMPLEXITY EVALUATION ...................................................... 10
5. TIME-SERIES CLUSTERIZATION .............................................................................. 11
6. METHODOLOGY OF THE RESEARCH ..................................................................... 12
7. ANTICIPATED RESULTS ............................................................................................. 12
CONCLUSION ......................................................................................................................... 13
APPENDIX 1: Histograms constructed using the bicriterial method and the uniform method. ......... 14
APPENDIX 2: The dependency of quality criterion Q on the number of segments. ...... 14
BIBLIOGRAPHY ..................................................................................................................... 15

INTRODUCTION

During the analysis and evaluation of time series it is essential to maximize the
prediction accuracy of mathematical models such as regression, clusterization and confidence
interval construction. The research paper of Mikhail V. Ulyanov and Yuri G. Smetanin, "The
approach to the characterization of Kolmogorov complexity of time series based on the
character descriptions", suggests building a clustering space of time series in order to improve
forecasting efficiency. The proposed approach allows classifying time series in a space whose
dimensions are numerical characteristics such as the Kolmogorov complexity.
The Kolmogorov complexity is a characteristic of a string of characters which
reflects the complexity (in terms of the length of a record) of the algorithm and its input, in
other words, the length of the formal description string (Vereschagin, N. K. & Uspensky, V.A. &
Shen, A., 2013, p. 101). The forecasting level is inversely proportional to the Kolmogorov
complexity value (Smetanin, Y.G. & Ulyanov, M.V., 2013, p. 3): the lower the Kolmogorov
characteristic, the higher the forecasting opportunities.
The key issue for the Kolmogorov complexity is the way the data are segmented and
the character encoding procedure. The existing approaches (Sturges, 1926) are neither
universal nor mathematically rigorous and may lead to data manipulation. A quality criterion
for such a procedure, together with its mathematical proof, was introduced in the research by
Mikhail V. Ulyanov; the proposed method is called bicriterial.
The bicriterial method utilizes two criteria: the reliability of encoded segments and the
consistency of the empirical and histogram distribution functions. The reliability is calculated
iteratively given a confidence interval for the group mean value. The consistency value is
calculated as the type I error of the Kolmogorov criterion or, in other words, the quality of
approximation of the empirical function (Smetanin, Y.G. & Ulyanov, M.V., 2013, p. 4).
Finally, the total quality value is calculated as the product of the two criteria. The point is that
the criteria are inversely proportional: the lower the first value, the higher the second.
Consequently, the subtask of the researcher is to find Pareto-optimal values of both criteria.
The evaluation of the Kolmogorov complexity is performed as the last step of the
algorithm: the length of the compressed string is divided by the length of the original string.
The method is not sensitive to the chosen compression algorithm; consequently, the
researcher could use RAR, ZIP and other effective compression methods.

The paper is organized as follows: Section 1 reviews related work. Section 2 gives an
overview of the bicriterial method and the quality criteria. The compression procedure and the
Kolmogorov complexity evaluation are described in Sections 3 and 4, respectively.
Time-series clusterization is described in Section 5. Sections 6 and 7 present the
methodology of the research and the anticipated results.
1. RELATED WORK

The research is based on two papers by Mikhail V. Ulyanov and colleagues:
"Bicriterial method of constructing and assessing the quality of histograms" by V. N.
Petrushin, M. V. Ulyanov, I. A. Chertihina and E. V. Nikulchev, and "The approach to the
characterization of Kolmogorov complexity of time series based on the character
descriptions" by Y.G. Smetanin and M. V. Ulyanov.
In the first paper, the authors challenge the Sturges conception (Sturges, 1926) of
uniformly partitioned segments. The basic idea of the existing method is to split observations
between equal groups, so that 100 observations would be split into log₂(100) ≈ 7 equal
groups. The key issue with the uniform pattern is that it does not account for extreme and
unusual cases such as bimodal distributions (see Appendix 1).
The other approaches include Scott's formula (Scott, 1979) and the interquartile range
method (Freedman, D. & Diaconis, P., 1981). The main disadvantages of these methods are
their purely empirical basis and their violation of traditional statistical rules (Petrushin, V.N.
& Ulyanov, M.V. & Chertihina, I.A. & Nikulchev, E.V., 2012).
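For comparison, the classical bin-selection rules mentioned above can be computed directly. The sketch below uses the published formulas (Sturges: k = 1 + log₂n equal-width bins; Scott: bin width 3.49·σ·n^(−1/3); Freedman–Diaconis: bin width 2·IQR·n^(−1/3)); the function names are illustrative, not from the cited papers.

```python
import math
import statistics

def sturges_bins(n):
    # Sturges (1926): k = 1 + log2(n) equal-width bins.
    return math.ceil(1 + math.log2(n))

def scott_width(data):
    # Scott (1979): bin width h = 3.49 * sigma * n^(-1/3).
    return 3.49 * statistics.stdev(data) * len(data) ** (-1 / 3)

def freedman_diaconis_width(data):
    # Freedman & Diaconis (1981): h = 2 * IQR * n^(-1/3).
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

print(sturges_bins(100))  # -> 8
```

All three rules fix the partition before looking at segment quality, which is exactly the limitation the bicriterial method addresses.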
The core element of the proposed research is the bicriterial method, which has proven
its efficiency and applicability on real data and in experiments (see Appendix 2). As an
outcome of the graduation paper, a robust algorithmic implementation of the method will be
developed and tested on real data.
2. THEORETICAL BASIS

The algorithm of Kolmogorov complexity identification could be represented in the
following steps:
Step/Procedure — Outcome
1. Data download — Raw data.
2. Data processing by the bicriterial method — Encoded character string.
3. Evaluation of the quality of the obtained symbolic code against the quality criteria — If the quality of the obtained code is satisfactory, move further; otherwise, return to the previous step and downgrade the criteria.
4. Processing of the obtained code — The Kolmogorov complexity value, obtained as the ratio of the length of the compressed string to the length of the original string.

Time series data could be downloaded manually by the researcher or automatically
using program settings. The most common data sources include financial market data,
commodity indexes, weather forecasts, demographic numbers and other frequently changing
data. Raw arrays, which include only numeric values and the corresponding dates, are loaded
into the program at the first step.
Secondly, the encoded character string is obtained by applying the bicriterial method.
The main issue for the coding procedure is the way the outcome string is constructed. The
proposed bicriterial method uses a mathematically proven approach instead of the existing
empirical methods (Petrushin, V.N. & Ulyanov, M.V. & Chertihina, I.A. & Nikulchev, E.V.,
2012).
The obtained string is assessed using the approximation quality of the empirical
function or, in other words, the type I error for the Kolmogorov criterion. The initial numeric
value of the criterion is set manually by the researcher at the very first step of the algorithm.
After a bicriterial method iteration, the value could be downgraded if the code does not meet
the quality requirements; hence, the method can iteratively adjust the criteria based on the
particular data it is dealing with.
Finally, the obtained code is compressed using external library methods such as RAR
and ZIP. The Kolmogorov value is the ratio of the length of the compressed string to the
length of the original code, and it is calculated as the last stage of the program.
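The pipeline can be sketched end to end as follows. This is a minimal illustration under stated assumptions: plain equal-width binning stands in for the bicriterial encoding (described in Section 2.1), and zlib stands in for RAR/ZIP, which are not in the Python standard library; the function names are hypothetical.

```python
import string
import zlib

def encode(series, n_bins):
    # Stand-in for the bicriterial encoding: plain equal-width binning
    # that maps each value to one of n_bins letters (n_bins <= 26).
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0
    out = []
    for x in series:
        j = min(int((x - lo) / width), n_bins - 1)
        out.append(string.ascii_uppercase[j])
    return "".join(out)

def kolmogorov_value(code):
    # Ratio of compressed length to original length.
    raw = code.encode("ascii")
    return len(zlib.compress(raw)) / len(raw)

series = [1, 2, 1, 2] * 100  # a perfectly periodic toy series
code = encode(series, 2)     # -> "ABAB..."
print(round(kolmogorov_value(code), 3))
```

A periodic series yields a highly regular symbolic code, so its Kolmogorov value is close to zero.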
The key step of the algorithm is the bicriterial method, which is described in greater
detail in the following section.


2.1. BICRITERIAL METHOD

The bicriterial method is a procedure which involves the following steps (Petrushin,
V.N. & Ulyanov, M.V. & Chertihina, I.A. & Nikulchev, E.V., 2012, p. 6):
1) Setting the quality value Q, which is the product of two criteria: the reliability of
encoded segments and the consistency of the empirical and histogram distribution
functions. Since the two criteria are inversely proportional, the researcher can set a
weight w₁ for one criterion; the weight of the second criterion is then w₂ = 1 − w₁.
2) Calculation of the initial reliability value of the segment partitioning, based on the
total reliability value.
3) Calculation of the length of each group, based on an inequality that bounds each
group around its average value using the minimal and maximal values in the
segment, the segment number j, and an unbiased estimate of the group variance
(the exact inequality is given in the cited paper).
4) Construction of the histogram distribution and calculation of the consistency
criterion according to the Kolmogorov formula (Lagutin, M. B., 2007), i.e. from
the maximal absolute difference between the empirical and histogram distribution
functions.
5) If the quality of approximation does not meet the set requirements, the algorithm
returns to step 2 with downgraded criteria.
6) Calculation of the quality value Q of the obtained symbolic code as the product of
the two criteria.
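The trade-off behind the quality value Q can be illustrated with the experimental values reported in Appendix 2: scanning the number of segments k and keeping the one that maximizes the product of the two criteria. The data below are taken directly from that table; only the selection logic is a sketch.

```python
# Data from Appendix 2: number of segments k -> (reliability, consistency).
table = {
    10: (0.981, 0.149),
    11: (0.963, 0.407),
    12: (0.962, 0.259),
    13: (0.904, 0.202),
    14: (0.903, 0.315),
    15: (0.858, 0.270),
    16: (0.805, 0.304),
    17: (0.802, 0.421),
    18: (0.752, 0.292),
    19: (0.592, 0.407),
    20: (0.378, 0.528),
}

def quality(reliability, consistency):
    # Quality Q is the product of the two criteria.
    return reliability * consistency

best_k = max(table, key=lambda k: quality(*table[k]))
print(best_k)  # -> 11
```

As reliability falls with growing k while consistency tends to rise, the product Q peaks at an intermediate k, which is exactly the Pareto-style compromise the method seeks.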
One of the main advantages of the bicriterial method is the inverse proportionality of
the two criteria, which allows a comprehensive assessment of segment quality. In addition,
standard tools for initial data processing cannot evaluate the quality of the result, while the
bicriterial method provides the researcher with functionality for setting quality values.
Finally, the bicriterial method increases overall forecasting accuracy and requires little
computational power.

According to experiments (Petrushin, V.N. & Ulyanov, M.V. & Chertihina, I.A. &
Nikulchev, E.V., 2012), the bicriterial method improves the quality criteria by approximately
10 percent in comparison with the uniform algorithm.

K (number of segments) | Reliability | Consistency | Q
11 (uniform)           | 0.963       | 0.407       | 0.392
11 (bicriterial)       | 0.989       | 0.496       | 0.491

Table 1. Quality values for the uniform and the bicriterial method
(Petrushin, V.N. & Ulyanov, M.V. & Chertihina, I.A. & Nikulchev, E.V., 2012).

In summary, the bicriterial method combines two inversely proportional criteria: the
reliability of encoded segments and the distribution consistency. The calculation of each
criterion is performed iteratively, and the first value calculated during time series processing
is the reliability of encoded segments.
2.2. RELIABILITY OF ENCODED SEGMENTS

The first quality criterion in the bicriterial algorithm is the reliability of encoded
segments, a criterion for assessing the reliability of group mean values (Petrushin, V.N. &
Ulyanov, M.V. & Chertihina, I.A. & Nikulchev, E.V., 2012).
Given a target quality value, the reliability can be calculated from an inequality that
requires the confidence interval of the group mean to stay within the segment bounds (the
exact inequality is given in the cited paper).
If one of the conditions stated in the inequality is violated, the algorithm removes an
element from the segment; in other words, it shortens the confidence interval of the mean
value.
The main goal of the method is to obtain a correct confidence interval for the group
mean, so if the value exceeds the target value the algorithm removes extra elements to meet
the interval requirements.
However, the most sophisticated part of the procedure is the last segment and,
consequently, the last confidence interval. The algorithm has no opportunity to move
elements to a following segment, which is why it merges the whole segment with the
previous one. This approach allows the number of segments to vary in order to meet the
reliability requirements.
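The element-removal rule can be sketched as follows. This is a simplified stand-in, not the published inequality: it uses a normal-approximation confidence interval and drops the element farthest from the segment mean until the interval is narrow enough; the function name and the stopping rule are assumptions for illustration.

```python
import statistics

def trim_to_interval(segment, max_half_width, z=1.96):
    # Shrink a segment until the normal-approximation confidence
    # interval of its mean is narrow enough -- a simplified stand-in
    # for the element-removal rule of the bicriterial method.
    seg = sorted(segment)
    while len(seg) > 2:
        n = len(seg)
        half = z * statistics.stdev(seg) / n ** 0.5
        if half <= max_half_width:
            break
        mean = statistics.fmean(seg)
        # Drop the element farthest from the segment mean.
        if abs(seg[0] - mean) >= abs(seg[-1] - mean):
            seg.pop(0)
        else:
            seg.pop()
    return seg

seg = trim_to_interval([10, 11, 10, 12, 11, 40], max_half_width=1.0)
print(seg)  # the outlier 40 is removed
```

A single outlier widens the interval dramatically, so it is the first element to be removed; the remaining segment then satisfies the reliability requirement.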


Picture 1. Mean value and the bicriterial method.
The reliability of encoded segments is the first step of the bicriterial method; however,
the algorithm can step back and recalculate the value if the encoded string does not meet the
distribution consistency requirement.
2.3. DISTRIBUTION CONSISTENCY

The consistency value is calculated as the type I error of the Kolmogorov criterion or,
in other words, the approximation quality of the empirical function (Smetanin, Y.G. &
Ulyanov, M.V., 2013). The criterion is based on the Kolmogorov statistic, the supremum of
the absolute difference between the empirical and histogram distribution functions:

D = sup |F_empirical(x) − F_histogram(x)|

The formula compares two distributions in terms of the Kolmogorov metric, which is
the maximal absolute difference between the histogram or distribution functions. If the
quality of approximation does not meet the set requirements, the algorithm returns to step 2
of the main procedure with downgraded reliability criteria.
The principal advantage of the Kolmogorov metric is the opportunity to compare two
functions in terms of the maximal difference of their values, so the method accounts for
extreme cases, in contrast with the mean-difference approach.
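The Kolmogorov metric itself is straightforward to compute. The sketch below builds empirical distribution functions and takes the maximum absolute difference over an evaluation grid; the helper names are illustrative, and two empirical samples stand in for the empirical and histogram functions.

```python
import bisect

def ecdf(sample):
    # Empirical distribution function of a sample.
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n

def kolmogorov_distance(f, g, grid):
    # Kolmogorov metric: maximum absolute difference between two
    # distribution functions over the evaluation grid.
    return max(abs(f(x) - g(x)) for x in grid)

a = ecdf([1, 2, 3, 4, 5])
b = ecdf([1, 2, 3, 4, 100])
d = kolmogorov_distance(a, b, grid=[1, 2, 3, 4, 5, 100])
print(round(d, 3))  # -> 0.2
```

The two samples differ only in one extreme value, yet the metric registers the gap at x = 5 -- exactly the sensitivity to extreme cases described above, which a mean-difference measure would dilute.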
The distribution consistency calculation is performed as the final step of the bicriterial
method, and the main objective of this procedure is to verify the reliability requirements for
the encoded sequence. If the calculated value is less than required by the initial condition, the
algorithm steps back with a downgraded quality value; the procedure is thus performed
iteratively, which allows the criterion to vary as the algorithm runs.
After the distribution consistency value is calculated the bicriterial method finishes its
work and sends the outcome (encoded string) to the compression module.
3. COMPRESSION PROCEDURE

The compression procedure is performed using standard compression algorithms
embedded in external libraries. The main outcome of this stage of the method is the
compressed code.
Recommended algorithms may include:
RAR
ZIP
Although the mentioned algorithms have different approaches and principles towards
compression, the choice of method makes no difference for the Kolmogorov criterion. The
choice of the compression algorithm is also determined by the capabilities of the
development environment.
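This insensitivity to the choice of compressor can be checked empirically. The sketch below compares ratios from the standard-library compressors (zlib, bz2, lzma stand in for RAR and ZIP, which are not available in the Python standard library): on the same regular code, all of them produce a similarly small ratio.

```python
import bz2
import lzma
import zlib

def ratios(code):
    # Compression ratio (compressed/original) under several standard
    # algorithms; the complexity estimate should not depend strongly
    # on which one is chosen.
    raw = code.encode("ascii")
    return {
        "zlib": len(zlib.compress(raw)) / len(raw),
        "bz2": len(bz2.compress(raw)) / len(raw),
        "lzma": len(lzma.compress(raw)) / len(raw),
    }

regular = "ABCD" * 2500  # 10,000 characters, highly regular
for name, r in ratios(regular).items():
    print(name, round(r, 3))
```

The absolute ratios differ slightly because of per-format overhead, but the ordering of time series by complexity is preserved across compressors.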
Different time series may have different measurement accuracies; for instance, the
number of decimal places per value may differ due to formatting issues. Therefore, the
program cannot compress the original sequence directly, and the bicriterial method serves as
a preparatory step for the compression procedure, which is conceptually designed to deal
with unified types of data.
The outcome of the compression procedure is transferred to the Kolmogorov
complexity evaluation method, which returns the final characteristic of the time series.
4. KOLMOGOROV COMPLEXITY EVALUATION

The Kolmogorov complexity is determined as the ratio of the length of the
compressed character string to the length of the original string. The mathematical meaning of
this ratio is how well the character string can be compressed, so the Kolmogorov value is
inversely proportional to prediction applicability.

Kolmogorov value = L_compressed / L_original

where L_compressed is the length of the compressed string and L_original is the length of
the original string code.
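The contrast between a predictable and an unpredictable series makes the ratio concrete. In the sketch below (zlib standing in for RAR/ZIP, with an assumed helper name), a periodic symbolic code compresses to a small fraction of its length while a random one barely compresses at all.

```python
import random
import zlib

def kolmogorov_value(code):
    # Ratio of compressed length to original length.
    raw = code.encode("ascii")
    return len(zlib.compress(raw)) / len(raw)

random.seed(7)
regular = "AB" * 5000                                   # periodic code
noisy = "".join(random.choice("ABCDEFGH") for _ in range(10000))

k_regular = kolmogorov_value(regular)
k_noisy = kolmogorov_value(noisy)
# The regular series compresses far better than the noisy one,
# i.e. it has a much lower Kolmogorov value and, by the argument
# above, higher forecasting opportunities.
print(round(k_regular, 3), round(k_noisy, 3))
```

This is the sense in which the Kolmogorov value is inversely proportional to prediction applicability: a low ratio signals regular structure worth modelling.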
In fact, the Kolmogorov value demonstrates the effectiveness of compression as an
attribute of a time series. The compression feature and the symbolic code could be interpreted
as a measure of tendency; for instance, a change in the symbolic sequence signals a pattern
variation at that particular moment of the time series. That is why the researcher can diagnose
whether some percentage variation is a trend change or just a fluctuation from the mean
value.
An improved Kolmogorov criterion may include not only the symbolic code based on
time series values, but also a code which indicates pattern changes; hence, the proposed
pattern criterion could also be included in the cluster space as a dimension in order to
improve forecasting.

5. TIME-SERIES CLUSTERIZATION

One of the possible applications of the program is time series clusterization based on
the Kolmogorov complexity. Depending on the obtained Kolmogorov value, it is possible to
construct a space of time series and apply different prediction techniques and methods
(Smetanin, Y.G. & Ulyanov, M.V., 2013). For instance, possible clusters could be financial
data, social data, macroeconomic indicators, commodities and other fields.
The Kolmogorov complexity could potentially be one of the metrics for such a cluster
space, and one of the results of the graduation project is to provide data mining specialists
and researchers with a suitable tool for initial data analysis.
Future research in the clustering field may include the construction of different
coordinate axes, because the more dimensions determine the clusters, the better the overall
accuracy. Another issue might be introducing a proper coordinate function for calculating the
distance between time series; for example, the researcher may be interested in how far apart
clusters are located on the coordinate axes.
One possible use case scenario is for the researcher to load data into the program
(which will be available as open source) for processing. As a result, the particular time series
will be placed in the cluster space, providing the researcher with information on what data
mining tools might be applicable to this time series. Hence, the developed program could be
used as a tool for initial data processing.

Picture 2 Cluster space for time series
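As an illustration only, a one-dimensional clustering along the Kolmogorov axis might look like the sketch below. The cluster names, thresholds and series names are hypothetical, not taken from the paper; they merely show how a computed Kolmogorov value places a series in the cluster space.

```python
def cluster_by_complexity(series_k, boundaries=(0.35, 0.7)):
    # Hypothetical one-dimensional clustering: bucket time series by
    # their Kolmogorov value along a single coordinate axis.
    low, high = boundaries
    clusters = {"regular": [], "structured": [], "noisy": []}
    for name, k in series_k.items():
        if k < low:
            clusters["regular"].append(name)
        elif k < high:
            clusters["structured"].append(name)
        else:
            clusters["noisy"].append(name)
    return clusters

example = {"index_a": 0.21, "fx_pair": 0.52, "sensor_x": 0.88}
print(cluster_by_complexity(example))
```

A researcher would then pick forecasting tools per cluster, e.g. trend models for low-complexity series and robust or stochastic methods for high-complexity ones.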

6. METHODOLOGY OF THE RESEARCH

The research includes three main stages. The first part is the development of the
bicriterial algorithm and the integration of the compression method. The software tool for
analysis and evaluation will be created at this stage. The software includes the algorithmic
procedure, a user interface and documentation for the researcher on the program features.
Secondly, the evaluation of the program is performed using open data from different
sources such as financial indexes, commodity time series, macroeconomic indicators and
other web-based publicly accessible sources. The variety of data types is determined by the
objective of the research, which is to prove the hypothesis of the bicriterial method's
efficiency regardless of time series structure.
Cluster space construction is proposed as the final step of the research. The next steps
for clustering will depend on the performance of the program on similar time series, because
homogeneous data are supposed to have similar Kolmogorov values. If the experiments
confirm the correctness of the Kolmogorov method, the future clustering research may
continue, and the Kolmogorov criterion might become one of the dimensions of the cluster
space.
The outcome of the research will be deployed to Internet sources, providing a tool for
future research on the evaluation of time series.
7. ANTICIPATED RESULTS

As a result of the research, the following outcomes are expected:
1. The implementation of the bicriterial method with two quality criteria.
2. The implementation of the compression procedure for the obtained symbolic code.
3. The comparison of the implemented bicriterial method with the uniform algorithm.
4. Testing and efficiency evaluation of the proposed method.
5. The evaluation of time series based on the Kolmogorov complexity.
The first part of the project is the software development of the proposed bicriterial
method. The result of this stage is a robust user interface with time series download
functionality and an efficient representation of the obtained symbolic code, which can be
saved to a text file.

The compression procedure is integrated using external libraries which provide
compression capabilities. During the testing stage it is planned to compare results from
different algorithms to verify correctness.
The results of the development stage will be assessed in comparison with the uniform
method. The proposed efficiency improvement of approximately 10 percent (Smetanin, Y.G.
& Ulyanov, M.V., 2013) will be verified during this stage.
Finally, different time series from financial, commodity, demographic and other
sources will be evaluated by the program. As a result, the Kolmogorov values for
homogeneous time series are expected to be similar; although there is no measure of
similarity yet, it can be calculated after the testing stage.
CONCLUSION

Data processing and time-series analysis are key fields in mathematical statistics with
many applications for practical problem solving. That is why it is essential to manage and
maintain time series with effective tools.
The paper provides an overview of a new approach to time-series identification using
the Kolmogorov complexity value. The core element of the Kolmogorov evaluation is the
bicriterial method, which has already proved its efficiency on real data. The bicriterial
method provides the basis for a more robust and mathematically correct analysis of time
series.
As a result of this research, a software implementation of the Kolmogorov method
will be developed and tested on financial, commodity and macroeconomic data.
The anticipated results of this research might be useful for researchers who are
interested in effective data analysis. Time series could be handled using the proposed
software and then analyzed with traditional statistical methods.
Future work may focus on additional metrics of the cluster space for time series
evaluation, which would improve forecast accuracy and predictability.


APPENDIX 1: Histograms constructed using (a) the bicriterial method and (b) the
uniform method.

(a) (b)

APPENDIX 2: The dependency of quality criterion Q on the number of segments.

k  | Reliability (G) | Consistency (V,G) | Q(V,G)
10 | 0.981 | 0.149 | 0.146
11 | 0.963 | 0.407 | 0.392
12 | 0.962 | 0.259 | 0.249
13 | 0.904 | 0.202 | 0.183
14 | 0.903 | 0.315 | 0.284
15 | 0.858 | 0.270 | 0.232
16 | 0.805 | 0.304 | 0.245
17 | 0.802 | 0.421 | 0.337
18 | 0.752 | 0.292 | 0.219
19 | 0.592 | 0.407 | 0.241
20 | 0.378 | 0.528 | 0.200

BIBLIOGRAPHY

1. Freedman, D. & Diaconis, P. (1981). On the histogram as a density estimator.
Heidelberg: Springer Berlin.
2. Lagutin, M. B. (2007). Illustrative mathematical statistics. Moscow: Binom.
3. Lyubushin, A. A. (2007). Analysis of systems for geophysical and environmental
monitoring. Moscow: Science.
4. Petrushin, V.N. & Ulyanov, M.V. (2010). Information Sensitivity of Computer
Algorithms. Moscow: FIZMATLIT.
5. Petrushin, V.N. & Ulyanov, M.V. & Chertihina, I.A. & Nikulchev, E.V. (2012).
Bicriterial method of constructing and assessing the quality of histograms (pp. 1-9).
Russian Academy of Sciences Journal Information Technologies and Computing
Systems.
6. Scott, D. W. (1979). On optimal and data-based histograms (pp. 605-610). Biometrika.
7. Smetanin, Y.G. & Ulyanov, M.V. (2013). The approach to the characterization of
Kolmogorov complexity of time series based on the character descriptions.
Available from: http://ecsocman.hse.ru/hsedata/2013/08/06/1290854518/7.pdf
(Accessed 20 January 2013).
8. Sturges, H. (1926). The choice of a class-interval (pp. 65-66). Journal of the American
Statistical Association.
9. Tarasov, I. E. (2011). On the choice of histogram intervals (pp. 181-184). Control
Systems and Information Technology, 2.1 (44).
10. Vereschagin, N. K. & Uspensky, V.A. & Shen, A. (2013). Kolmogorov Complexity and
Algorithmic Randomness. Moscow: MCZNMO.
