
Project Info:

Project Title: Markovchain package

Project short title: This project aims to extend the functionality and
capabilities of the R package 'markovchain' in order to give statisticians a more
functional tool for analysing stochastic processes related to Markov chains
(MCs).

URL of project idea:
https://github.com/rstats-gsoc/gsoc2017/wiki/The-markovchain-package

Bio of Student:
I am a computer science student at the Indian Institute of Technology
(Banaras Hindu University), Varanasi, India. I have the R coding experience
needed to build this package. I have previously worked on datasets such as the
Hubway visualization challenge and MovieLens. I am also currently building a
Shiny web application using the rgbif package (GitHub link below). A similar
visualization application could be part of the proposed package. In addition to
R, I am familiar with C++ and the Rcpp package: I implemented all the
assignments and projects in my data structures and algorithms course in C++, so
I have a good command of the language. I am also familiar with version control
using git/GitHub.

Academics:
I have enrolled in several MOOCs (Massive Open Online Courses) on data science
in R, including the Data Science Specialization courses on Coursera. In total, I
have taken three courses in computer programming along with a data structures
course. I am currently attending an Algorithms course and an Artificial
Intelligence course at my college. I have also taken a statistics and
probability course at my institute. As an assignment in one of these courses I
implemented a Hidden Markov Model and used the Viterbi algorithm to perform
part-of-speech tagging, so I am familiar with stochastic processes.
Link - https://github.com/vandit15/AI-lab-codes

Previous Works References:

1. Data Analysis using R:
● Analysis of Hubway visualization dataset.
● MovieLens dataset (applying a recommender system)
● Kaggle Titanic solution in R (random forest algorithm)
● Analysis of iris dataset.
Link – https://github.com/vandit15

2. Built a website for the training and placement cell of IIT (BHU) using Django
for the backend, MySQL for database management and materializecss for front-end
design.
Link - https://github.com/vandit15/IITBHU-TPO-site

3. Built a tank game using pygame and Python 2.7.
Link - https://github.com/vandit15/tank

4. Building a signature verification system using Convolutional Neural
Networks (CNN) and then applying similarity functions to detect forgeries
using the KL-KS test (in progress).
Link - https://github.com/DeepVisionaries/Signature-Verification

Contact Information:
Student Name: Vandit Jain
Student postal address: 138-A, R K Puram, near Gauri hospital, Kota, Rajasthan,
India (pin code-324005)
Phone number: (+91) 8764340070, 7233013328
Email: jainvandit15@gmail.com, vandit.jain.cse15@itbhu.ac.in

Student affiliation:
Institution: Indian Institute of Technology (Banaras Hindu University), Varanasi,
India
Program: Bachelor of Technology (B.Tech) in Computer Science and Engineering.
Stage of completion: Part 2 (4th Semester)
Contact to verify:
Dr. Rajeev Shrivastava
Professor, Department of Computer Science and Engineering
Email: rs.cse@iitbhu.ac.in

Schedule Conflicts:
I will not be taking up any internship, part-time job or other work during the
summer of 2017, so I have no conflicts with the GSoC schedule. I am willing to
invest the whole of the three months in the success of my GSoC project.

Mentors:
Mentor-1 - Sai Bhargav Yalamanchi
Mentor-2 - Giorgio A. Spedicato
I established contact with the mentors after solving the tests for the project. We
have been in contact since then.

Coding plan and Methods:
The project at its heart is to improve the markovchain package: improve the run
time of current functions and add more functions.
Optimisation of current functions – For optimisation I would look for places in
the current package where run time can be improved. This would require reviewing
the code. R has packages such as microbenchmark, among others, that can be used
to time functions and find bottlenecks in the code. Looping in R can be quite
slow. After detecting the slow-running parts of the code using the above
methods, the task is to speed them up. If I find a slow-running loop, I would
replace it with vectorised code or the apply family of functions where that
improves the running time. Functions written in R that remain slow can be
rewritten in C++ using the Rcpp package, which can also improve running time
considerably. Fine-tuning current functions also includes updating the
documentation and unit tests according to the changes made. The package also
uses RcppParallel; I intend to use it wherever possible.

Continuous-time Markov Chain – A considerable amount of work has already been
done on CTMCs in the package. I will implement more functions, which include:
● Getting the probabilities of states at any given time t. For a CTMC object of
the S4 class already implemented, I would add a function to evaluate P(t).
● Functions to get the generator matrix and to plot the transition diagram.
● A function for the expected hitting time (from some state j to i).
● Implementing imprecise CTMCs, using ideas from
https://arxiv.org/pdf/1611.05796.pdf and https://arxiv.org/pdf/1702.07150.pdf.
This covers the basic infrastructure and methods: algorithms for computing
lower expectations of functions that depend on the state at any finite number
of time points.
These functions would be written to work with the currently implemented code.
Studying imprecise CTMCs and then implementing the functions would take
considerable time in the project.
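
As an illustration of the P(t) item above, here is a minimal sketch of one standard way to evaluate P(t) = exp(Qt) from a generator matrix Q, namely uniformization. It is written in plain Python rather than R/Rcpp purely for illustration; the function name `ctmc_transition_matrix` and its parameters are hypothetical and not part of the markovchain package, and the method finally chosen for the package may differ.

```python
import math


def ctmc_transition_matrix(Q, t, tol=1e-12, max_terms=100_000):
    """Approximate P(t) = exp(Q * t) for a generator matrix Q by uniformization.

    Q is a list of rows; off-diagonal entries are jump rates and each row sums
    to zero. Hypothetical helper: not the markovchain package API.
    """
    n = len(Q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    rate = max(-Q[i][i] for i in range(n))  # fastest exit rate over all states
    if rate == 0.0 or t == 0.0:
        return identity  # no jump can occur
    # Uniformized discrete-time chain U = I + Q / rate (each row sums to 1).
    U = [[identity[i][j] + Q[i][j] / rate for j in range(n)] for i in range(n)]
    # Note: exp(-rate * t) underflows for very large rate * t; this sketch
    # does not handle that case.
    weight = math.exp(-rate * t)  # Poisson(rate * t) probability of 0 jumps
    power = identity              # holds U^k, starting with k = 0
    P = [[weight * power[i][j] for j in range(n)] for i in range(n)]
    covered = weight              # Poisson mass accounted for so far
    k = 0
    while 1.0 - covered > tol and k < max_terms:
        k += 1
        power = [[sum(power[i][m] * U[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
        weight *= rate * t / k    # Poisson probability of exactly k jumps
        for i in range(n):
            for j in range(n):
                P[i][j] += weight * power[i][j]
        covered += weight
    return P


# Two-state example: rate 2 from state 0 to 1, rate 1 from state 1 to 0.
Q_example = [[-2.0, 2.0], [1.0, -1.0]]
P_half = ctmc_transition_matrix(Q_example, 0.5)
```

For a two-state chain the result can be checked against the closed form P00(t) = b/(a+b) + a/(a+b) exp(-(a+b)t), where a and b are the two jump rates.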

Higher Order Multivariate MCs – A considerable amount of time was spent
implementing HOMMCs in the previous year of GSoC (2016). Continuing that work, I
will write functions to generate a random sequence from a chain object and
initial conditions. This adds more features to the present work. I will follow
the methods already used there: the S4 classes for HOMMCs are maintained in R
and the fitting functions are implemented in Rcpp. The package's main vignette
will be updated as well.
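
The sequence-generation idea can be sketched as follows. This is a simplified, single-series illustration in Python (not R/Rcpp), and it assumes a mixture-of-lags form in which the next-state distribution is a weighted sum of rows from one transition matrix per lag. The function `simulate_higher_order`, the row-stochastic convention and the argument layout are all assumptions made for this sketch; the package's actual HOMMC objects may be organised differently.

```python
import random


def simulate_higher_order(lambdas, matrices, initial, length, seed=None):
    """Simulate one series from a higher-order chain in mixture-of-lags form.

    lambdas[h]  : weight of lag h + 1 (non-negative, summing to 1)
    matrices[h] : row-stochastic transition matrix used for lag h + 1
    initial     : the first len(lambdas) states, oldest first
    Hypothetical helper for illustration only.
    """
    order = len(lambdas)
    if len(initial) != order:
        raise ValueError("need exactly one initial state per lag")
    rng = random.Random(seed)
    n_states = len(matrices[0])
    seq = list(initial)
    while len(seq) < length:
        # Next-state distribution: weighted sum over lags of the matrix row
        # selected by the state observed that many steps ago.
        probs = [0.0] * n_states
        for h in range(order):
            past_state = seq[-(h + 1)]
            for j in range(n_states):
                probs[j] += lambdas[h] * matrices[h][past_state][j]
        seq.append(rng.choices(range(n_states), weights=probs)[0])
    return seq[:length]


# Second-order example on three states with a fixed seed for reproducibility.
lag1 = [[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.5, 0.25, 0.25]]
lag2 = [[0.3, 0.3, 0.4], [0.2, 0.5, 0.3], [0.6, 0.2, 0.2]]
sample = simulate_higher_order([0.7, 0.3], [lag1, lag2], [0, 2], 50, seed=1)
```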

Building important graphics features – The graphics-related functions in the
package are currently minimal. For the visualisation of large and complex Markov
chains, a plotting feature would enable users to view the transition
probability diagram for communicating classes. R packages like ggplot2 and
plotly would be helpful for this task, as would my experience in visualisation
using R.
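
Before a per-class diagram can be drawn, the communicating classes themselves have to be identified. The sketch below shows one simple way to do that from a transition matrix, using a reachability closure. It is plain Python for illustration only; `communicating_classes` is a hypothetical name, and the package may already expose its own routine for this.

```python
def communicating_classes(P):
    """Group the states of a transition matrix P into communicating classes.

    States i and j communicate when each is reachable from the other.
    Hypothetical helper for illustration only.
    """
    n = len(P)
    # reach[i][j] is True when j can be reached from i in zero or more steps.
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):  # Warshall-style transitive closure
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    classes, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        members = [j for j in range(n) if reach[i][j] and reach[j][i]]
        classes.append(members)
        assigned.update(members)
    return classes


# States 0 and 1 communicate; state 2 is absorbing and forms its own class.
P_classes = [[0.5, 0.5, 0.0], [0.3, 0.3, 0.4], [0.0, 0.0, 1.0]]
classes = communicating_classes(P_classes)
```

Each returned group could then be passed to the plotting layer as one sub-diagram.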

Joint Distributions of the number of visits for Finite-State MCs – This function,
when implemented, is expected to return the joint probability distribution of
the number of visits to the various states of the DTMC during the first N steps
or before the Nth visit.
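
To make the idea concrete, the sketch below computes a simpler, related quantity: the marginal distribution of the number of visits to a single target state during the first N steps, by dynamic programming over (current state, visit count). It is not the full joint distribution described above, it is written in Python for illustration only, and `visit_count_distribution` is a hypothetical name.

```python
def visit_count_distribution(P, start, target, n_steps):
    """Distribution of the number of visits to `target` in steps 1..n_steps.

    Returns a list whose k-th entry is the probability of exactly k visits,
    given that the chain starts in `start` (the start itself is not counted).
    Hypothetical helper for illustration only.
    """
    n = len(P)
    # dist[i][k]: probability of being in state i having made k visits so far.
    dist = [[0.0] * (n_steps + 1) for _ in range(n)]
    dist[start][0] = 1.0
    for _ in range(n_steps):
        new = [[0.0] * (n_steps + 1) for _ in range(n)]
        for i in range(n):
            for k in range(n_steps + 1):
                mass = dist[i][k]
                if mass == 0.0:
                    continue
                for j in range(n):
                    if P[i][j] == 0:
                        continue
                    visits = k + 1 if j == target else k
                    new[j][visits] += mass * P[i][j]
        dist = new
    return [sum(dist[i][k] for i in range(n)) for k in range(n_steps + 1)]


# With every transition probability equal to 0.5, the number of visits to
# state 1 in two steps follows a Binomial(2, 0.5) distribution.
visits = visit_count_distribution([[0.5, 0.5], [0.5, 0.5]], 0, 1, 2)
```

Extending the count from a single number to a vector of per-state counts gives the joint distribution, at the cost of a much larger state space.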

Stability tests for a Markov Chain – The idea presented in the paper
https://arxiv.org/pdf/1608.03257.pdf about the stability of Markov chains is
worth implementing. The paper describes a simulated annealing based approach
which can be added as a feature. For example, in queuing applications, stability
of the Markov chain offers the guarantee that service has been sufficiently
provisioned to cope with the load imposed on the network in the long run.

Markovchain Statistics - Currently only the computation of the first passage
time has been implemented. The distributions for each of the quantities below
can be obtained by solving a set of equations with similar forms but varying
initial conditions for a 'minimal' solution. I will spend time building two
functions that perform these tasks. The first extends the first passage time
distribution computation to a set of states A and adds the expected first
passage time. The second takes two disjoint sets A and B and an initial state i,
and returns the probability that A is hit before B. The functions would be
implemented using the ideas given in
http://www2.math.uu.se/~takis/L/McRw/mcrw.pdf. Proper unit testing and
documentation using roxygen2 would be an important part.
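
The 'minimal solution' idea can be sketched with a fixed-point iteration started from zero, which converges upwards to the minimal non-negative solution of the hitting equations. The Python below is for illustration only: the names `hit_before` and `expected_hitting_time` are hypothetical, and the proposed package functions would more likely solve the linear systems directly in R or Rcpp.

```python
def hit_before(P, A, B, max_iter=100_000, tol=1e-12):
    """Probability, from each state, that set A is hit before set B.

    Iterates h = 1 on A, h = 0 on B, h_i = sum_j P[i][j] * h_j elsewhere,
    starting from zero outside A. Hypothetical helper for illustration only.
    """
    n = len(P)
    h = [1.0 if i in A else 0.0 for i in range(n)]
    for _ in range(max_iter):
        new = [1.0 if i in A else 0.0 if i in B
               else sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, h)) < tol
        h = new
        if converged:
            break
    return h


def expected_hitting_time(P, A, max_iter=100_000, tol=1e-12):
    """Expected number of steps to reach set A from each state.

    Iterates k = 0 on A, k_i = 1 + sum_j P[i][j] * k_j elsewhere. If A is not
    reached with probability 1 the true value is infinite and the iteration
    simply stops at max_iter without converging.
    """
    n = len(P)
    k = [0.0] * n
    for _ in range(max_iter):
        new = [0.0 if i in A
               else 1.0 + sum(P[i][j] * k[j] for j in range(n)) for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, k)) < tol
        k = new
        if converged:
            break
    return k


# Symmetric gambler's ruin on states 0..4 with absorbing end points.
ruin = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
win_prob = hit_before(ruin, A={4}, B={0})
steps = expected_hitting_time(ruin, A={0, 4})
```

For this example the known closed forms are i/4 for the probability of reaching state 4 before state 0 and i(4 - i) for the expected absorption time, which makes it a convenient unit-test case.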

Computation of Rewards – This is something new to the markovchain package. It
basically means implementing a set of functions that give the expected reward
accumulated before a set of states is hit. Also, for Markov chains possessing a
positive recurrent state, given a bounded reward function, the time-average
reward can be easily computed.
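
Both reward quantities can be sketched in a few lines. The Python below is illustrative only and uses hypothetical names (`expected_reward_before_hit`, `time_average_reward`). The second function additionally assumes an irreducible, aperiodic chain so that simple power iteration converges to the stationary distribution, which is a narrower setting than the one described above.

```python
def expected_reward_before_hit(P, reward, A, max_iter=100_000, tol=1e-12):
    """Expected total reward collected before the chain first enters set A.

    Iterates v = 0 on A, v_i = reward[i] + sum_j P[i][j] * v_j elsewhere,
    starting from zero. Hypothetical helper for illustration only.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(max_iter):
        new = [0.0 if i in A
               else reward[i] + sum(P[i][j] * v[j] for j in range(n))
               for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, v)) < tol
        v = new
        if converged:
            break
    return v


def time_average_reward(P, reward, max_iter=100_000, tol=1e-13):
    """Long-run average reward sum_i pi_i * reward[i].

    Assumes an irreducible, aperiodic chain so that power iteration on the
    state distribution converges to the stationary distribution pi.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, pi)) < tol
        pi = new
        if converged:
            break
    return sum(p * r for p, r in zip(pi, reward))


# Reward of 2 per step in state 0 until the absorbing state 1 is reached.
total = expected_reward_before_hit([[0.5, 0.5], [0.0, 1.0]], [2.0, 0.0], {1})
# Stationary distribution of this chain is (5/6, 1/6).
average = time_average_reward([[0.9, 0.1], [0.5, 0.5]], [1.0, 7.0])
```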

Timeline:
According to the coding plan, the timeline is set so that considerable
deliverables are implemented by each of the two mid-term evaluations (June 30th
and July 28th) and the final evaluation (August 29th).

Pre community Bonding Period (April 3rd – May 4th) - I would invest this period
in improving my knowledge of Markov chains through several sources, one of them
definitely being Dobrow, Introduction to Stochastic Processes in R. I have taken
a basic course in statistics and probability and also implemented a Hidden
Markov Model as an assignment in the Artificial Intelligence course, which would
help. I would also brush up my R skills, especially Rcpp.

Community Bonding Period (May 5th – May 29th) – This period is important, as the
time would be invested in discussing the structure of the proposed functions
for the project. I would also go through the whole package, since so far I have
read only a few of its functions (while solving the tests). I would at least
write pseudo code or a summary for some functions (after studying the papers
referred to) and start implementing them if time permits. I also intend to
perform optimisation-related work in this period.

Coding Period -
Continuing from the work done in the Community Bonding Period, the coding period
would be divided as follows:
30th May - 4th June – Complete optimisation-related work carried over from the
community bonding period.
5th June - 12th June – Discuss with mentors and write pseudo code for functions
related to CTMCs. For P(t), read page 301 of the book.
13th June - 25th June – Implement the pseudo code decided on.
26th June - 28th June – Build unit tests for the functions implemented above.
29th June – Write documentation for the new functions.

30th June – Deadline for the 1st evaluation.

1st July - 4th July – Discuss with mentors and write pseudo code for implementing
stability tests.
5th July - 12th July – Implement the functions discussed in the previous week.
13th July - 16th July – Discuss with mentors and write pseudo code for functions
related to markovchain statistics.
17th July - 24th July – Implement the functions discussed in the previous week.
25th July - 26th July – Unit testing of functions implemented after the first
evaluation.
27th July – Write documentation for the implemented functions.

28th July – Deadline for the 2nd evaluation.

29th July - 30th July – Discuss with mentors and write pseudo code for
improvements to the package's graphics.
31st July - 4th August – Implement the graphics functions for the package and
update documentation.
5th August - 9th August – Discuss with mentors and write pseudo code for the
proposed functions related to HOMMCs.
10th August - 13th August – Implement the discussed functions.
14th August - 15th August – Unit testing and documentation for modified and newly
implemented functions.
16th August - 20th August – Implement miscellaneous functions related to the
computation of rewards and the number of visits for finite-state MCs.
21st August - 28th August – Revise all modifications, update documentation,
unit testing and bug fixing.

29th August – Final evaluation of the project.

Management of the coding project:
● I would maintain close contact with the mentors, discussing ongoing work and
future steps.
● I will maintain unit tests in order to check that package functionalities are
preserved.
● Documentation and vignette building would be kept up to date as a background
process.
● Code would be hosted on GitHub from the beginning of the project.

Tests:
For the markovchain package, I had to submit pull requests for issues #106 and
#115.

Issue #106 pertains to NA handling in the functions markovchainFit and
markovchainListFit.

Issue #115 pertains to a round-off error in the steadyStates function. This is
the link to my fork:

https://github.com/vandit15/markovchain
