based on the total value.
3) Calculation of the length of each group, based on an inequality involving the minimal and maximal values in segment j and an unbiased estimate of the group variance.
4) Construction of the histogram distribution and calculation of the criterion according to the Kolmogorov formula (Lagutin, M. B., 2007).
5) If the quality of the approximation does not meet the specified requirements, the algorithm returns to step 2 with a relaxed criterion.
6) Calculation of the quality value Q of the obtained symbolic code as the product of the two criteria (the reliability of encoded segments and the distribution consistency).
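The last step is a single multiplication; as a minimal illustration (the function name is an assumption, but the product reproduces the Q column of the published tables, e.g. Table 1 below):

```python
def quality(reliability: float, consistency: float) -> float:
    """Quality Q of the obtained symbolic code: the product of the two
    criteria of the bicriterial method (reliability of encoded segments
    and distribution consistency)."""
    return reliability * consistency

# For the values reported for 11 segments: 0.963 * 0.407 ~ 0.392.
```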
One of the main advantages of the bicriterial method is the inverse proportionality of its two criteria, which allows a comprehensive assessment of segment quality. In addition, standard tools for initial data processing cannot evaluate the quality of the result, whereas the bicriterial method provides the researcher with the means to set target quality values. Finally, the bicriterial method increases overall forecasting accuracy and requires little computational power.
According to experiments (Petrushin, V.N., Ulaynov, M.V., Chertihina, I.A. & Nikulchev, E.V., 2012), the bicriterial method improves the quality criteria by approximately 10 percent in comparison with the uniform algorithm.
K (number of segments)     Reliability    Consistency    Q
11 (uniform algorithm)     0.963          0.407          0.392
11 (bicriterial method)    0.989          0.496          0.491
Table 1. Quality values for the uniform and the bicriterial methods (Petrushin, V.N., Ulaynov, M.V., Chertihina, I.A. & Nikulchev, E.V., 2012)
In summary, the bicriterial method rests on two inversely proportional criteria: the reliability of encoded segments and the distribution consistency. Each criterion is calculated iteratively, and the first value computed during time series processing is the reliability of encoded segments.
2.2. RELIABILITY OF ENCODED SEGMENTS
The first quality criterion of the bicriterial algorithm is the reliability of encoded segments, a criterion for assessing the reliability of group mean values (Petrushin, V.N., Ulaynov, M.V., Chertihina, I.A. & Nikulchev, E.V., 2012).
Given a target quality value, the reliability can be calculated from the proposed inequality.
If one of the conditions stated in the inequality is violated, the algorithm removes an element from the segment; in other words, it shortens the confidence interval of the mean value.
The main goal of the method is to obtain a correct confidence interval for the group mean, so if the interval exceeds the target value, the algorithm removes extra elements until the interval requirements are met.
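The element-removal step can be sketched as follows. This is an illustrative sketch only: the normal-approximation confidence interval and the drop-the-most-extreme-element rule are assumptions, since the source does not reproduce the exact inequality.

```python
import statistics

def trim_to_target_halfwidth(segment, target, z=1.96):
    """Remove the elements farthest from the mean until the
    normal-approximation confidence interval of the segment mean
    has a half-width no larger than `target`.

    Assumptions (not from the paper): 95% normal interval (z=1.96)
    and removal of the single most extreme element per iteration.
    """
    seg = list(segment)
    while len(seg) > 2:
        mean = statistics.fmean(seg)
        half = z * statistics.stdev(seg) / len(seg) ** 0.5
        if half <= target:
            break
        # Drop the element farthest from the current mean,
        # which shortens the confidence interval.
        seg.remove(max(seg, key=lambda x: abs(x - mean)))
    return seg
```

For example, an outlier of 100 in the segment [1, 2, 3, 100] is removed before the interval requirement is met.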
However, the most delicate part of the procedure is the last segment and, consequently, the last confidence interval. Since there is no following segment to which elements could be moved, the algorithm merges the whole last segment with the previous one. This approach allows the number of segments to vary in order to meet the reliability requirements.
Picture 1. Mean value and the bicriterial method
The reliability of encoded segments is the first step of the bicriterial method; however, the algorithm can step back and recalculate the value if the encoded string does not meet the distribution consistency requirement.
2.3. DISTRIBUTION CONSISTENCY
The consistency value is calculated as the type I error of the Kolmogorov criterion or, in other words, as the approximation quality of the empirical distribution function (Smetanin, Y.G. & Ulaynov, M.V., 2013). The analytical formula of the criterion is the following.
The formula compares two distributions in terms of the Kolmogorov metric, which is the maximal absolute difference between the two histogram or distribution functions. If the quality of the approximation does not meet the specified requirements, the algorithm returns to step 2 of the main procedure with a relaxed reliability criterion.
The principal advantage of the Kolmogorov metric is that it compares two functions in terms of their maximal difference, so, unlike a mean-difference approach, the method accounts for extreme cases.
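The metric itself is simple to compute between two samples; a minimal sketch in plain Python (it evaluates both empirical distribution functions at every pooled sample point, where the maximal difference of step functions is attained):

```python
def kolmogorov_distance(sample_a, sample_b):
    """Kolmogorov metric: the maximal absolute difference between the
    empirical distribution functions of two samples."""
    def ecdf(sample, x):
        # Fraction of sample values not exceeding x.
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)
```

Identical samples give a distance of 0, while completely separated samples give 1.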
The distribution consistency calculation is performed as the final step of the bicriterial method, and the main objective of this procedure is to verify the reliability requirements for the encoded sequence. If the calculated value is less than required by the initial condition, the algorithm steps back with a downgraded quality value; the procedure is thus performed iteratively, which allows the criterion to vary as the algorithm runs.
After the distribution consistency value is calculated, the bicriterial method finishes its work and sends the outcome (the encoded string) to the compression module.
3. COMPRESSION PROCEDURE
The compression procedure is performed using standard compression algorithms embedded in external libraries. The main outcome of this stage of the method is the compressed code.
Recommended algorithms may include:
RAR
ZIP
L
Although the mentioned algorithms take different approaches to compression, the choice of method makes no difference for the Kolmogorov criterion. The choice of the compression algorithm is therefore also determined by the capabilities of the development environment.
Different time series may have different measurement accuracies; for instance, the number of decimal places per value may differ due to formatting issues. Therefore, the program cannot compress the original sequence directly, and the bicriterial method serves as a preparatory step for the compression procedure, which is conceptually designed to deal with unified types of data.
The outcome of the compression procedure is transferred to the Kolmogorov complexity evaluation method, which returns the final characteristic of the time series.
4. KOLMOGOROV COMPLEXITY EVALUATION
Kolmogorov complexity is determined as the ratio of the length of the compressed character string to the length of the original string. The mathematical meaning of this ratio is how well the character string can be compressed, so the Kolmogorov value is inversely proportional to prediction applicability:

Kolmogorov value = L_compressed / L_original

where L_compressed is the length of the compressed string and L_original is the length of the original string code.
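Under this definition the value is straightforward to compute; a minimal sketch using zlib as a stand-in for the external compression library (the concrete algorithm is left open in the text, so zlib here is an assumption):

```python
import zlib

def kolmogorov_value(symbolic_code: str) -> float:
    """Ratio of compressed length to original length: values near 0
    indicate a highly regular (well-compressible) string, values near
    or above 1 indicate a string that is close to incompressible."""
    original = symbolic_code.encode("utf-8")
    compressed = zlib.compress(original, level=9)
    return len(compressed) / len(original)
```

A constant string compresses far better, and hence gets a much smaller Kolmogorov value, than a string with no repeated structure.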
In fact, the Kolmogorov value demonstrates how effectively the time series can be compressed. The compression feature and the symbolic code can be interpreted as a measure of tendency; for instance, a change in the symbolic sequence signals a pattern variation at a particular moment of the time series. The researcher can therefore diagnose whether some percentage variation is a trend change or merely a fluctuation around the mean value.
The improved Kolmogorov criterion may include not only the symbolic code based on time series values, but also a code that indicates pattern changes; hence, the proposed pattern criterion could also be included in the cluster space as a dimension in order to improve forecasting.
5. TIME-SERIES CLUSTERIZATION
One of the possible applications of the program is time series clusterization based on Kolmogorov complexity. Depending on the obtained Kolmogorov value, it is possible to construct a space of time series and apply different prediction techniques and methods (Smetanin, Y.G. & Ulaynov, M.V., 2013). For instance, possible clusters could be financial data, social data, macroeconomic indicators, commodities and other fields.
The Kolmogorov complexity could potentially serve as one of the metrics for such a cluster space, and one of the results of the graduation project is to provide data mining specialists and researchers with a suitable tool for initial data analysis.
Future research in the clustering field may include the construction of different coordinate axes, because the more dimensions determine the clusters, the better the overall accuracy. Another issue is introducing a proper coordinate function for calculating the distance between time series; for example, the researcher may be interested in how far apart clusters are located on the coordinate axis.
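As a toy illustration of grouping time series along such a one-dimensional Kolmogorov axis, the following greedy split can be used (the gap threshold is an assumed parameter, not taken from the source):

```python
def cluster_by_value(values, threshold=0.05):
    """Greedy one-dimensional clustering of Kolmogorov values:
    sort the values and start a new cluster whenever the gap to the
    previous value exceeds `threshold` (an illustrative choice)."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > threshold:
            clusters.append([v])   # large gap: open a new cluster
        else:
            clusters[-1].append(v)
    return clusters
```

Series with similar compressibility land in the same cluster, e.g. values 0.10 and 0.12 group together while 0.50 and 0.52 form a separate cluster.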
One possible use case for the researcher is to load data into the program (which will be available as open source) for processing. As a result, the particular time series will be placed in the cluster space, providing the researcher with information on which data mining tools might be applicable to this time series. Hence, the developed program can be used as a tool for initial data processing.
Picture 2. Cluster space for time series
6. METHODOLOGY OF THE RESEARCH
The research includes three main stages. The first stage is the development of the bicriterial algorithm and the integration of the compression method. A software tool for analysis and evaluation will be created at this stage; it includes the algorithmic procedure, a user interface, and documentation for the researcher on the program features.
Secondly, the evaluation of the program is performed using open data from different sources, such as financial indexes, commodity time series, macroeconomic indicators and other publicly accessible web-based sources. The variety of data types is determined by the objective of the research, which is to prove the hypothesis that the bicriterial method is efficient regardless of time series structure.
Cluster space construction is proposed as the final step of the research. The next steps in clustering will depend on the performance of the program on similar time series, because homogeneous data are supposed to have similar Kolmogorov values. If the experiments confirm the correctness of the Kolmogorov method, the clustering research may continue, and the Kolmogorov criterion may become one of the dimensions of the cluster space.
The outcome of the research will be deployed to Internet resources, providing a tool for future research on the evaluation of time series.
7. ANTICIPATED RESULTS
As a result of the research, the following outcomes are expected:
1. The implementation of the bicriterial method with two quality criteria.
2. The implementation of the compression procedure for the obtained symbolic code.
3. The comparison of the implemented bicriterial method with the uniform algorithm.
4. Testing and efficiency evaluation of the proposed method.
5. The evaluation of time series based on Kolmogorov complexity.
The first part of the project is the software implementation of the proposed bicriterial method. The result of this stage is a robust user interface with time series download functionality and an efficient representation of the obtained symbolic code, which can be saved to a text file.
The compression procedure is integrated using external libraries that provide compression capabilities. During the testing stage it is planned to compare results from different algorithms to verify correctness.
The results of the development stage will be assessed in comparison with the uniform method. The proposed efficiency improvement of approximately 10 percent (Smetanin, Y.G. & Ulaynov, M.V., 2013) will be verified during this stage.
Finally, different time series from financial, commodity, demographic and other sources will be evaluated by the program. As a result, the Kolmogorov values for homogeneous time series are expected to be similar; although there is no measure of similarity yet, it can be calculated after the testing stage.
CONCLUSION
Data processing and time series analysis are key fields in mathematical statistics, with many applications to practical problem solving. That is why it is essential to manage and maintain time series with effective tools.
The paper provides an overview of a new approach to time series identification using the Kolmogorov complexity value. The core element of the Kolmogorov evaluation is the bicriterial method, which has already proved its efficiency on real data. The bicriterial method provides the basis for a more robust and mathematically correct analysis of time series.
As a result of this research, a software implementation of the Kolmogorov method will be developed and tested on financial, commodity and macroeconomic data.
The anticipated results of this research might be useful for researchers interested in effective data analysis. Time series can be handled using the proposed software and then processed with traditional statistical methods.
Future work may focus on additional metrics of the cluster space for time series evaluation, which would improve forecast accuracy and predictability.
APPENDIX 1: (a) Histogram constructed using the bicriterial method; (b) histogram constructed using the uniform method.
APPENDIX 2: The dependency of the quality criterion Q on the number of segments k.

k     Reliability (G)    Consistency (V,G)    Quality Q(V,G)
10    0.981              0.149                0.146
11    0.963              0.407                0.392
12    0.962              0.259                0.249
13    0.904              0.202                0.183
14    0.903              0.315                0.284
15    0.858              0.270                0.232
16    0.805              0.304                0.245
17    0.802              0.421                0.337
18    0.752              0.292                0.219
19    0.592              0.407                0.241
20    0.378              0.528                0.200
BIBLIOGRAPHY
1. Freedman, D. & Diaconis, P. (1981). On the histogram as a density estimator. Berlin, Heidelberg: Springer.
2. Lagutin, M. B. (2007). Illustrative mathematical statistics. Moscow: Binom.
3. Lyubushin, A. A. (2007). Analysis of systems for geophysical and environmental monitoring. Moscow: Science.
4. Petrushin, V.N. & Ulaynov, M.V. (2010). Information Sensitivity of Computer Algorithms. Moscow: FIZMATLIT.
5. Petrushin, V.N., Ulaynov, M.V., Chertihina, I.A. & Nikulchev, E.V. (2012). Bicriterial method of constructing and assessing the quality of histograms. Information Technologies and Computing Systems, Russian Academy of Sciences, pp. 1-9.
6. Scott, D. W. (1979). On optimal and data-based histograms. Biometrika, pp. 605-610.
7. Smetanin, Y.G. & Ulaynov, M.V. (2013). The approach to the characterization of Kolmogorov complexity of time series based on the character descriptions. Available from: http://ecsocman.hse.ru/hsedata/2013/08/06/1290854518/7.pdf (Accessed 20 January 2013).
8. Sturges, H. (1926). The choice of a class-interval. Journal of the American Statistical Association, pp. 65-66.
9. Tarasov, I. E. (2011). On the choice of histogram intervals. Control Systems and Information Technology, 2.1 (44), pp. 181-184.
10. Vereschagin, N. K., Uspensky, V.A. & Shen, A. (2013). Kolmogorov Complexity and Algorithmic Randomness. Moscow: MCZNMO.