
Recent Research Overview

Presentation

Nam Le
Email: namlehai90@gmail.com

Nam Le 09 November, 2015

Short Resume

2015, MSc (by research) in Computer Science, Le Quy Don Technical University
Research interests:
+ Natural Computing theory and applications:
- Genetic Programming, Genetic Algorithms, Simulated Annealing
+ Computational Biology:
Applying natural computing to solve NP-hard problems in biology:
- Gene mapping: recovering gene markers after cutting DNA into segments
(Double, Partial, and Simplified Partial Digest problems).
- (In addition) Phylogenetic tree reconstruction.
Research experience:
+ Network Security Lab, Le Quy Don Technical University: 2 small projects
- Adaptive operators for genetic programming
- Genetic programming for network intrusion detection
+ Hanu R&D (Professor Hoai is the director), 2 prospective research topics:
- Robustness of GP to noise
- Stochastic fitness in GP
+ Self-study: Physical mapping

Bioinformatics research:
(Independently)
1. Heuristics for physical mapping:
implementing genetic algorithms for the Simplified Partial Digest Problem.


Restriction mapping problem


For a set X of points on the line, let
$\Delta X = \{\, |x_1 - x_2| : x_1, x_2 \in X \,\}$
denote the multiset of all pairwise distances between points in X. In the restriction mapping problem, a subset $E \subseteq \Delta X$ (of experimentally obtained fragment lengths) is given and the task is to reconstruct X from E.

Full Restriction Digest


Cutting DNA at each restriction site creates multiple restriction fragments:

Is it possible to reconstruct the order of the fragments from the sizes of the fragments {3, 5, 5, 9}?

Full Restriction Digest: Multiple Solutions


Alternative ordering of restriction fragments:

vs
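
The ambiguity can be made concrete with a short sketch. It assumes only that a full digest reports fragment sizes and nothing about their order; the function name `distinct_orderings` is hypothetical, chosen for illustration.

```python
from itertools import permutations

def distinct_orderings(fragment_sizes):
    """Return every distinct left-to-right ordering of the given fragment sizes."""
    # permutations() treats the two 5s as different items, so a set removes repeats.
    return sorted(set(permutations(fragment_sizes)))

orderings = distinct_orderings([3, 5, 5, 9])
print(len(orderings))  # 12
```

Every one of these orderings yields the same multiset of sizes, so a single full digest cannot tell them apart.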

Three different problems


1. the double digest problem (DDP)
2. the partial digest problem (PDP)
3. the simplified partial digest problem (SPDP)

Double Digest Mapping

Use two restriction enzymes; three full digests:

1. a complete digest of S using A,


2. a complete digest of S using B, and
3. a complete digest of S using both A and B.


Computationally, the Double Digest problem is more complex than the Partial Digest problem

Double Digest Problem


Input:
dA: fragment lengths from the complete digest with enzyme A.
dB: fragment lengths from the complete digest with enzyme B.
dX: fragment lengths from the complete digest with both A and B.
Output:
A: locations of the cuts in the restriction map for the enzyme A.
B: locations of the cuts in the restriction map for the enzyme B.
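
A minimal sketch of what it means for a pair of cut sets to solve this formulation: it checks a candidate against the three digests. The helper names `fragments` and `is_ddp_solution`, and the numbers in the usage lines, are made up for illustration; this only verifies a candidate and is not a solver.

```python
def fragments(cuts, length):
    """Sorted fragment lengths produced by cutting [0, length] at the given sites."""
    points = [0] + sorted(cuts) + [length]
    return sorted(b - a for a, b in zip(points, points[1:]))

def is_ddp_solution(cuts_a, cuts_b, length, dA, dB, dX):
    """True if the candidate cut sites reproduce all three digests."""
    return (
        fragments(cuts_a, length) == sorted(dA)
        and fragments(cuts_b, length) == sorted(dB)
        and fragments(set(cuts_a) | set(cuts_b), length) == sorted(dX)
    )

# Hypothetical instance: a 10-unit sequence, A cuts at 3 and 7, B cuts at 5.
print(is_ddp_solution([3, 7], [5], 10, dA=[3, 4, 3], dB=[5, 5], dX=[3, 2, 2, 3]))
```

Mirroring both cut sets (replacing each site s by length - s) passes the same check, which is one simple source of non-unique solutions.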

Double Digest: Multiple Solutions

Double digest
The decision version of the DDP is NP-complete.
Existing algorithms struggle with instances that have more than 10 restriction sites for each enzyme.
A solution may not be unique, and the number of solutions grows exponentially.
DDP is a favorite mapping method since the experiments are easy to conduct.

Partial Restriction Digest

The sample of DNA is exposed to the restriction enzyme for only a limited amount of time to prevent it from being cut at all restriction sites

This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts

This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence

Multiset of Restriction Fragments

We assume that the multiplicity of a fragment can be detected, i.e., the number of restriction fragments of the same length can be determined (e.g., by observing twice as much fluorescence intensity for a double fragment as for a single fragment)

Multiset: {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}
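
As an illustration, the multiset above can be reproduced from a set of cut positions. The positions `X = [0, 5, 14, 19, 22]` are an assumption: they are one placement consistent with this multiset and are not stated on the slide. The function name `delta` is likewise hypothetical.

```python
from itertools import combinations

def delta(points):
    """Sorted multiset of all pairwise distances between the given points."""
    return sorted(abs(a - b) for a, b in combinations(points, 2))

X = [0, 5, 14, 19, 22]  # assumed cut positions, including start and end
print(delta(X))  # [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
```

With n = 5 points there are n(n-1)/2 = 10 distances, and the repeated values 5 and 14 are exactly the fragments whose multiplicity has to be detected.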

Partial Digest Fundamentals


X: the set of n integers representing the locations of all cuts in the restriction map, including the start and end

n: the total number of cuts

$\Delta X$: the multiset of integers representing the lengths of each of the fragments produced from a partial digest

Partial Digest Problem: Formulation


Goal: Given all pairwise distances between points on a line, reconstruct the positions of those points

Input: The multiset of pairwise distances L, containing n(n-1)/2 integers
Output: A set X of n integers such that $\Delta X = L$

PDP analysis
No polynomial-time algorithm is known for PDP. In fact, the complexity of PDP is an open problem.
S. Skiena devised a simple backtracking algorithm that performs well in practice, but may require exponential time.
This approach is not a popular mapping method, as it is difficult to reliably produce all pairwise distances between restriction sites.
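
The backtracking idea can be sketched as follows. This is a simplified illustration in the spirit of that approach, not a transcription of Skiena's published algorithm: it repeatedly takes the largest unexplained distance, tries to place a point that far from either end, and undoes the choice when the required distances are not available. The name `partial_digest` is hypothetical.

```python
from collections import Counter

def partial_digest(distances):
    """Return sorted positions whose pairwise distances match `distances`, or None."""
    remaining = Counter(distances)
    width = max(remaining)
    remaining[width] -= 1          # the largest distance spans both end points
    placed = [0, width]

    def solve():
        if sum(remaining.values()) == 0:
            return sorted(placed)
        largest = max(d for d, c in remaining.items() if c > 0)
        # The largest remaining distance must reach one of the two ends.
        for y in (largest, width - largest):
            needed = Counter(abs(y - x) for x in placed)
            if any(remaining[d] < c for d, c in needed.items()):
                continue           # y would need a distance we do not have
            remaining.subtract(needed)
            placed.append(y)
            result = solve()
            if result is not None:
                return result
            placed.pop()           # backtrack
            remaining.update(needed)
        return None

    return solve()

print(partial_digest([3, 5, 5, 8, 9, 14, 14, 17, 19, 22]))
```

Each step branches at most two ways, which is why the search is fast on typical inputs yet can still take exponential time when many branches fail late.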

Genetic Algorithm for PDP


Hayedeh Ahrabian et al., "Genetic algorithm solution for partial digest problem", International Journal of Bioinformatics Research and Applications.
Based on the observation that:

So they restrict the space of representation to those d(i) such that both d(i) and d(max) - d(i) belong to D.
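
A minimal sketch of this restriction, reading the condition as "a distance d is kept only when both d and its complement with respect to the largest distance occur in D". The function name `candidate_positions` is hypothetical, and the sketch covers only the filtering step, not the genetic algorithm from the cited paper.

```python
def candidate_positions(D):
    """Distances d in D whose complement d_max - d also appears in D."""
    present = set(D)
    d_max = max(D)
    return [d for d in sorted(present) if (d_max - d) in present]

D = [3, 5, 5, 8, 9, 14, 14, 17, 19, 22]
print(candidate_positions(D))  # [3, 5, 8, 14, 17, 19]
```

The filter is justified because any interior cut at position x lies x from the start and d_max - x from the end, so both values must occur among the pairwise distances. Here it drops 9 (13 is absent) and 22 itself.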


Simplified partial digest problem

Given a target sequence S and a single restriction enzyme A, two different experiments are performed on two sets of copies of S:
In the short experiment, the time span is chosen so that each copy of the target sequence is cut precisely once by the restriction enzyme.

In the long experiment, a complete digest of S by A is performed.

SPDP
Let $\Gamma = \{\gamma_1, \ldots, \gamma_{2N}\}$ be the multiset of all fragment lengths obtained by the short experiment, and
let $\Lambda = \{\lambda_1, \ldots, \lambda_{N+1}\}$ be the multiset of all fragment lengths obtained by the long experiment,
where N is the number of restriction sites in S.
Here is an example: given these (unknown) positions (in kb): 2 8 9 13 16
(the fragments below sum to 16 kb, so 16 is read here as the end of S),
we obtain $\Lambda$ = {2kb, 6kb, 1kb, 4kb, 3kb}.
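
The two experiments can be simulated with a short sketch. It assumes the example has restriction sites at 2, 8, 9 and 13 kb on a 16 kb sequence, an interpretation chosen because it matches the five long-experiment fragments listed above. The function name `spdp_experiments` is hypothetical.

```python
def spdp_experiments(sites, length):
    """Simulate the short and long experiments for the given sites.

    short: each copy is cut exactly once, so every site s yields two
           fragments, s and length - s (2N fragments in total).
    long:  a complete digest, giving the N + 1 consecutive fragments.
    """
    sites = sorted(sites)
    short = []
    for s in sites:
        short.extend([s, length - s])
    points = [0] + sites + [length]
    long_digest = [b - a for a, b in zip(points, points[1:])]
    return short, long_digest

short, long_digest = spdp_experiments([2, 8, 9, 13], 16)
print(long_digest)  # [2, 6, 1, 4, 3]
print(short)        # [2, 14, 8, 8, 9, 7, 13, 3]
```

In a real experiment both results arrive as unordered multisets; the order printed here comes only from the simulation.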

Natural Computing research:
(under Professor Hoai Nguyen Xuan)
1. Robustness of Genetic Programming to Noise.


Topic description

Statement of problem
Objectives
Research question
Research Method
Results


The Problem
Real-world data always contains noise.
Noisy data is one of the main causes of over-fitting in any learning mechanism.
Noisy data prevents EAs in general from converging well to the optimal point.


Objectives
Main objective:
To visualize the robustness of GP and the impact of noise on the over-fitting behaviour of GP.

Sub-objectives:
1. To build and contribute standard noisy data sets to
GP benchmark problems.
2. To classify the hardness / difficulty level of those
problems based on the robustness of GP to noise.


Research questions
1. Analyse and propose an effective noise-generating model.
2. How is the effectiveness of the GP learning model affected by different types and levels of noise? And which noise level causes GP to over-fit?
3. On which types of problems does GP perform well?


Research Methods - approach


Question 1:
Choose the most recent noise model that still has a sound mathematical basis.
[2010_Dav] is the most popular noise model for supervised learning.
Add noise at 3 levels (10%, 30%, 50%) in 3 ways (noise in x, in y, and in both x and y).
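
The 3 x 3 experimental grid can be sketched as below. The noise model itself is an assumption: the slide does not spell out how [2010_Dav] defines a noise level, so this sketch uses additive Gaussian noise whose standard deviation is the level times the standard deviation of the perturbed variable. The function name `add_noise`, the one-dimensional inputs and the sample data are all hypothetical.

```python
import random
import statistics

def add_noise(xs, ys, level, target="y", seed=0):
    """Return noisy copies of xs and ys.

    level:  noise level, e.g. 0.1, 0.3 or 0.5 (ASSUMED to scale the std dev)
    target: "x", "y" or "both"
    """
    rng = random.Random(seed)

    def perturb(values):
        scale = level * statistics.pstdev(values)
        return [v + rng.gauss(0.0, scale) for v in values]

    new_xs = perturb(xs) if target in ("x", "both") else list(xs)
    new_ys = perturb(ys) if target in ("y", "both") else list(ys)
    return new_xs, new_ys

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]  # made-up target function
for level in (0.1, 0.3, 0.5):
    for target in ("x", "y", "both"):
        noisy_xs, noisy_ys = add_noise(xs, ys, level, target)
```

Fixing the seed keeps each of the nine noisy data sets reproducible, which matters if they are to be shared as benchmarks.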


Research Methods - approach


Question 2:
Run experiments with the proposed noise model on the training and testing sets with different types of noise.
Analyze the results to determine the effect of noise on the robustness of GP and on its over-fitting behaviour.

Question 3:
Run experiments on BVGP (Bias / variance GP).
Rank problems (benchmarks and real-world UCI data sets) by how difficult they are for GP to solve.


Results
Question 1:
Distribution is invariant after adding noise.
Generating noisy data for experiment

Question 2:
Collect results from experiment.
Effect of noise
Over-fitted problems

Errors of the fittest individuals


Over-fitted problems


Natural Computing research:
2. Stochastic fitness in Genetic Programming


Topic description

Statement of problem
Objectives
Research question
Research Method
Results


The Problem
Real-world data always contains noise => uncertainty
Fitness functions in EAs are always accompanied by noise
A noisy fitness function could result in a high fitness being mistakenly assigned to a poor individual, and conversely.
+ f(x1) > f(x2) does not imply trueF(x1) > trueF(x2) in a noisy environment.
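
A small simulation illustrates how often a noisy comparison ranks two individuals the wrong way round. It assumes additive Gaussian noise on each fitness evaluation and a higher-is-better fitness; both are assumptions for illustration, and the name `misranking_rate` and the numbers are hypothetical.

```python
import random

def misranking_rate(true_f1, true_f2, noise_sd, trials=10000, seed=0):
    """Fraction of noisy comparisons that disagree with the true ordering."""
    rng = random.Random(seed)
    first_is_truly_better = true_f1 > true_f2
    wrong = 0
    for _ in range(trials):
        f1 = true_f1 + rng.gauss(0.0, noise_sd)  # observed fitness of x1
        f2 = true_f2 + rng.gauss(0.0, noise_sd)  # observed fitness of x2
        if (f1 > f2) != first_is_truly_better:
            wrong += 1
    return wrong / trials

print(misranking_rate(1.1, 1.0, noise_sd=0.5))
print(misranking_rate(1.1, 1.0, noise_sd=0.0))  # 0.0
```

When the noise is large compared with the true fitness gap, the comparison approaches a coin flip, which is the situation selection has to cope with.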

Objectives
Main objective:
To insert uncertainty into stochastic selection in GP.

Sub-objectives:
1. To figure out whether or not Stochastic GP can overcome the negative effects caused by noise in the 1st phase (over-fitting).


Research Methods - approach


Inserting stochasticity:
Assume the fitness function follows a distribution whose mean is the true fitness, with unknown variance.
Use the bootstrapping technique to formalize the bias/variance trade-off.
Use hypothesis testing to compare two fitness values (two distributions).
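
One way such a comparison could look is sketched below. It is an assumption, not the procedure used in this work: a paired bootstrap over per-case errors, declaring individual A better than B only when the bootstrap upper bound on the mean error difference is below zero. The name `bootstrap_better` and all parameters are hypothetical.

```python
import random

def bootstrap_better(errors_a, errors_b, n_boot=2000, alpha=0.05, seed=0):
    """True if A's mean error is significantly lower than B's (one-sided)."""
    rng = random.Random(seed)
    # Paired differences: both individuals are scored on the same fitness cases.
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    upper = means[min(n_boot - 1, int((1 - alpha) * n_boot))]
    return upper < 0

print(bootstrap_better([0.1] * 20, [1.0] * 20))  # True
print(bootstrap_better([0.5] * 20, [0.5] * 20))  # False
```

Selection could then prefer A over B only when the test succeeds and otherwise treat the two as tied, so that small differences caused by noise do not drive the search.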

Run SFGP on the problems that are over-fitted after adding noise in the 1st phase.


Results
Question 1:
Distribution is invariant after adding noise.
Generating noisy data for experiment

Question 2:
Collect results from experiment.
Effect of noise
Over-fitted problems