
Hybrid Technique for Steganography-based on DNA with N-Bits Binary Coding Rule


Ghada Hamed1, Mohammed Marey2, Safaa Amin El-Sayed3 and Mohamed Fahmy Tolba4
1,2,3,4 Faculty of Computer and Information Sciences,
Ain Shams University,
Cairo, Egypt
Email: ghadahamed@cis.asu.edu.eg1, mohammedmarey@hotmail.com2,
ahmed_andeel76@hotmail.com3 and fahmytolba@gmail.com4

Abstract—The information capacity is growing significantly, as well as its level of importance and its transformation rate. In this paper, a blind data hiding hybrid technique is introduced that uses the concepts of cryptography and steganography to achieve a double layer secured system. The proposed method consists of two phases. Phase one converts the message to DNA format using the proposed n-bits binary coding rule, which makes the probability of cracking the algorithm much lower than that of other algorithms. The Playfair cipher based on DNA and amino acids is then applied to encrypt the secret message, which generates ambiguity. Phase two hides the ciphered secret message parts together with the ambiguity that results from the first phase. The data is hidden using only the least significant base (LSBase) of each codon of a selected DNA reference sequence with a 3:1 hiding strategy. The proposed technique hides the data in DNA while preserving its biological functions as far as possible, without requiring any extra data to be sent to the receiver.

Index Terms—Hybrid Technique, Cryptography, Steganography, Double Layer Security, Playfair Cipher, DNA, Amino Acids, Binary Coding Rule, Least Significant Base, Cracking Probability, Security parameters.

I. INTRODUCTION
Information security and confidentiality are of increasing importance in this fast developing era. Consequently, a high level of security is required, as it is a critical feature for thriving networks [1][2][11]. Research on data hiding techniques has therefore increased continuously, driven by the need for powerful data protection in applications such as annotation, ownership protection, copyrighting, authentication and military use. Data hiding requires a carrier in which to hide the data, such as image, video and audio, as in [1][3][4][5][6].
To achieve maximum protection and powerful security with high capacity and a low modification rate, deoxyribonucleic acid (DNA) is explored as a new carrier for data hiding. A remarkable property of DNA is its capacity: 10^6 TB of data can be stored in 1 gm of DNA. However, like every data storage device, DNA requires protection through a secured algorithm. Various biological properties of DNA sequences can be exploited to obtain a successful and secure data embedding process [7][11]. This has led to a newly born research field based on DNA computing.
The most common and widely used techniques in the communication security and computer security fields are cryptography and steganography [1][3][8][9]. Cryptography converts data to an incomprehensible format so that an unintended recipient cannot determine its intended meaning [2], while steganography aims to hide the existence of the message in a different medium so that it does not attract attention to the fact that the data is there. So, a novel data hiding method is proposed that combines cryptography, to encrypt the secret data, with steganography, to hide the encrypted data, in order to provide a double layer security system.
The rest of the paper is organized as follows: the related work is reviewed in section II. In section III, the proposed technique is introduced. Then, the algorithm's security is measured and analyzed in section IV, followed by the experimental results in section V. Finally, the conclusion is given in section VI.

II. RELATED WORKS
In this section, we briefly review some recent DNA based steganography techniques. In [10], three data hiding methods based on DNA were proposed, and they are considered the main techniques. The first technique is the insertion method, which inserts bits from the secret message M at random, separated positions in a DNA reference sequence. This technique expands the length of the original sequence due to the insertion. The second one is the complementary pair method, in which the longest complementary pairs in a DNA reference sequence are detected. Then, the message parts are hidden before them, so it expands the original sequence's length as well. The last technique is the substitution method, which substitutes some of the DNA nucleotides with other nucleotides based on the secret message bits, with no expansion, as in [10].
In [5], a substitution based data hiding scheme was proposed that substitutes the repeated characters of a DNA sequence. This is done by establishing an injective mapping between one complementary rule and two secret bits in a message. A complementary rule is the rule that specifies the strand of DNA directly opposite to a specified sequence. This algorithm aims to minimize the modification rate by substituting only the repeated nucleotides of the DNA sequence. This substitution also leads to no expansion in the original reference sequence. However, the modification rate can be very high if the DNA sequence contains a lot of repeated characters. Also, it is not

978-1-4673-9360-7/15/$31.00 2015 IEEE 95


a blind algorithm since the original DNA sequence is required by the receiver to retrieve the secret data [5].
Another idea for data hiding was proposed in [12]: a secret message is encrypted using a DNA-based Playfair cipher and amino acids. Then, the encrypted secret message is hidden in a DNA reference sequence using the substitution technique. Substitution here is based on a two-by-two complementary rule, and the method hides the letters in pairs. It is considered a fast algorithm with high capacity, but it is not a blind algorithm [12].

III. THE PROPOSED TECHNIQUE
The proposed scheme consists of two phases. The first one converts the secret message to DNA by mapping the binary bits to DNA nucleotides using a proposed n-bits binary coding rule. The n-bits binary coding rule maps n bits of the message to m bases of DNA; in this paper, the 4-bits binary coding rule is considered. Then, the DNA and amino acids based Playfair cipher encrypts the DNA-encoded message. Then, the ciphered message is hidden in a DNA sequence selected from the NCBI database [5] using the LSBase method. So, the first contribution is a double layer security system: a hybrid technique that hides the encrypted secret data in DNA, which results in high security, as discussed further in the security analysis section. The second contribution is the use of the n-bits binary coding rule to convert the binary format of a text to DNA, which clearly makes the algorithm harder to crack. Finally, the third contribution is the 3:1 ratio used in the data hiding technique. This strategy hides the secret data in DNA using the LSBase method in a way that avoids sending extra data to the receiver.

A. Phase I: Data encryption - Sender side
On the sender's side, the encryption step is performed before the steganography step to avoid hiding the original format of the secret message in the DNA, which achieves a double layer of security: encryption and steganography.
The proposed technique uses the DNA and amino acids Playfair cipher to encrypt the secret message. The conventional Playfair cipher is a symmetric encryption technique that encrypts a text message using a 5*5 table. The table is constructed from a secret key word followed by the remaining letters of the alphabet that are not included in the key word. The Playfair cipher encrypts pairs of letters (digraphs) instead of single letters (monographs), which has the advantage of resisting attacks based on frequency analysis of monographs. However, the conventional Playfair cipher has severe drawbacks that should be taken into consideration [8][13]. It encrypts the message in digraphs, which can still be noticed by frequency analysis, so the attacker may get some information about the data. Another drawback is that it applies to English alphabet letters only, so it is unable to encode any special characters or numbers representing equations, numerical data or symbolic data. Also, it is easy to break the system's security nowadays using the advanced computer processors that appear daily.
Reference [13] introduced some modifications to the conventional Playfair cipher by utilizing biological concepts such as DNA and amino acids to strengthen the ordinary Playfair cipher.
In the proposed algorithm, the DNA and amino acids based Playfair cipher is applied as follows:
1) Input: Secret message M and secret key K.
2) Processing: The secret message is mapped into its corresponding ASCII codes and then to binary using 8-bits coding to form Mbin.
Mbin is mapped to DNA nucleotides using a binary coding rule (BCR) so that it can be encrypted using the DNA and amino acids Playfair cipher. The naive binary coding rule maps each 2 bits to one DNA nucleotide, for example (A: 00, C: 01, G: 10, T: 11). The naive rule allows 24 permutations. The first choice is for nucleotide A, which has 4 possibilities: 00, 01, 10 or 11. The next choice is for C, which has the 3 binary codes that remain after removing the binary code assigned to A. Similarly, G has two options, and finally T is assigned the last remaining binary code. So, the overall number of unrepeated permutations = 4*3*2*1 = 24. In contrast, the proposed binary coding rule shown in Table I maps each four bits of the binary message to two DNA nucleotides in order to increase the algorithm's security. Here AA can be assigned 0000 or 0001 or 0010 or .. or 1111, so it has 16 possibilities. The next choice is for AC, which has the 15 binary codes that remain after removing the binary code assigned to AA. After that, AG has 14 options once AA and AC are assigned, and so on. So, the overall number of unrepeated permutations = 16*15*14*..*3*2*1 = 16!, which makes the probability of guessing the rule very low, as discussed in the security analysis section. More generally, the binary coding rule can assign 2n bits to each n nucleotide bases to achieve the degree of security required by the system. For example, A, AA, AAA, AAAA, .... (A...A) bases, where A is repeated n times in the last term, can be assigned respectively to 00, 0000, 000000, .... (00...00) bits, where 00 is repeated n times in the last term.
The output DNA of the secret message is converted to amino acids according to the new distribution of the alphabet with their corresponding new codons in [13]. The new distribution is derived from the standard universal table of amino acids and their DNA codon representation. Since each amino acid is associated with multiple codons, as in [13], and the message is converted from DNA to amino acids, something must record the index of the DNA codon that corresponds to each amino acid, so that the receiver can retrieve the correct codon in the decryption phase when the amino acids are converted back to DNA. These indices are called AMBIG, which refers to ambiguity. So, for successful retrieval, the ambiguity of each amino acid is tracked in AMBIG.
The Playfair cipher is applied using the secret key to encrypt the amino acids form of the secret message formed from the

96 2015 Seventh International Conference of Soft Computing and Pattern Recognition (SoCPaR 2015)
last step into cipher amino acids form. The formed ciphered amino acids are converted back to DNA by selecting the first codon corresponding to each amino acid to form the cipher DNA format of the message. The overall encryption process is illustrated in Fig. 1.
3) Output: Ciphered DNA message, ambiguity (AMBIG).

TABLE I: Proposed 4-Bits Binary Coding Rule that is used to Convert Binary Format of a Message to DNA Format
DNA Nucleotides   Binary Representation   DNA Nucleotides   Binary Representation
AA                0000                    GG                1000
AC                0001                    GA                1001
AG                0010                    GC                1010
AT                0011                    GT                1011
CC                0100                    TT                1100
CA                0101                    TA                1101
CG                0110                    TC                1110
CT                0111                    TG                1111

Fig. 1: Data encryption flowchart

B. Phase II: Data hiding - Sender side
Least significant base (LSBase) is a data hiding methodology proposed in [14]. In a DNA sequence, each three adjacent nucleotides constitute a unit called a codon. The LSBase method hides the secret message bits in the least significant base of each codon of the reference sequence. Any sequence is a combination of purine bases (A & G) and pyrimidine bases (C & T). In order to hide the cipher message bits, the following steps are applied:
1) Input: Ciphered DNA message (cipherMDNA), ambiguity (AMBIG), and a DNA reference sequence.
2) Processing: The DNA (cipherMDNA) formed in the encryption phase is converted into binary again to form the cipher binary message (cipherMbin) by using the 4-bits binary coding rule. AMBIG is converted to binary as well to form AMBIGbin; since the maximum number of codons corresponding to an amino acid is 4, indexed from 0 to 3, each index can be represented in at most 2 bits, as shown in Fig. 2. A DNA reference sequence S is selected from one of the public databases such as EBI or NCBI and converted to RNA by substituting each T with U.
cipherMbin and AMBIGbin are then hidden using the LSBase method. The hiding depends on the message bits and on the LSBase of each codon in the DNA. The LSBase of each codon of S is checked: if it is a purine base (A & G), it is substituted by (G) to encode 1 of the secret message or by (A) to encode 0. If the LSBase is a pyrimidine base (C & T), it is substituted by (C) to encode 1 or by (U) to encode 0. The LSBase algorithm skips the codons UGA, UGG, AUA and AUG during the hiding process, since according to the standard distribution of DNA codons to amino acids, the Met and Trp amino acids each have a single codon, AUG and UGG respectively [14]. UGA is also skipped because the other codon reachable from it by a purine substitution is UGG, which codes for Trp. Finally, Ile is coded by three codons: AUU, AUC and AUA, so AUA is skipped and AUU and AUC are used in data hiding [14]. The complete data hiding scheme is shown in Fig. 3.
3) Output: The faked sequence S*. Phase I yields not only the cipher binary message but also the ambiguity that results from converting the DNA format of the message to amino acids. The objective of the proposed method is to hide both the secret message and the ambiguity that the receiver needs to retrieve the secret message from the DNA, without additional information. This is possible because the ratio of the length of the binary cipher message to the length of the binary ambiguity is 3:1, as will be discussed in subsection C. So, hiding the message together with the ambiguity in the DNA sequence using the 3:1 ratio avoids adding data to mark the starting position of the message in the DNA reference sequence and the starting position of the ambiguity as well. Consequently, the data required to be sent is minimized: it is the faked sequence S* only.

C. Example
For example, suppose we want to hide the following part of cipherMbin = 001011 and AMBIGbin = 00 in the DNA reference sequence S = AACTAGGGACATACGTACGGTTTA. As can be seen in Fig. 4, the process starts at the first codon AAC by hiding 0 in the LSBase C, followed by hiding 0 in the LSBase of the second codon TAG, which is G; then 1 is hidden in the LSBase A of the third codon GGA, and so on, as shown in row 4. Each three bits of the message are followed by one bit of the AMBIG. So, after 0 0 1 of the message are hidden, 1 bit of the AMBIG, which is 0, is hidden in the LSBase of the fourth codon, as presented in row 5.
Data encryption and hiding using the ratio of 3:1 is made clearer by the following illustrative example:
Assume message M = "RNA", key = "SECURITY"
$|M| = 3$ characters. (1)

Phase I: Applying data encryption - Sender side
Each character in the secret message M is converted to binary using 8-bits coding, then:
Mbin = 01010010 01001110 01000001
$|M_{bin}| = 8 \times |M| = 24$ bits. (2)
Mbin is converted to MDNA using the 4-bits binary coding rule, i.e., each 4 bits are substituted by 2 nucleotides, then:
MDNA = CAA GCC TCC CAC
$|M_{DNA}| = \frac{|M_{bin}|}{4} \times 2 = 12$ bases. (3)
MDNA is converted to MAA according to the distribution of the alphabet with their corresponding codons in [5], while saving the ambiguity, as it is necessary to retrieve the message at the other side.
MAA = QASH and AMBIG = 0 1 1 1.
$|M_{AA}| = |AMBIG| = \frac{|M_{DNA}|}{3} = 4$ AMBIG numbers. (4)
AMBIG ranges from 0 to 3, so each number can be represented in at most 2 bits, so:
$|AMBIG_{bin}| = \frac{|M_{DNA}|}{3} \times 2 = 8$ bits. (5)

Fig. 2: Conversion rule to convert ambiguity numbers (0 to 3) to binary

Encrypt MAA using the Playfair cipher by constructing a 5*5 matrix from the key and the remaining alphabet letters, as shown in Fig. 5.
By applying the Playfair cipher, CipherMAA = XIUD.
$|CipherM_{AA}| = |M_{AA}| = 4$ amino acids. (6)
Convert back to DNA and then to binary to be hidden in a DNA sequence using the LSBase method, so it will be as follows:
CipherMbin = 00101101 11000010 00100011.
$|CipherM_{bin}| = |M_{bin}| = 24$ bits. (7)

Phase II: Applying data hiding - Sender side
To hide Mbin and AMBIGbin, note that the ratio between their sizes is 3:1, since the length of Mbin in the previous example is 24 bits and the length of AMBIGbin is 8 bits. So the proposed method hides each 3 bits of Mbin followed by 1 bit of AMBIGbin, which requires no additional information to be hidden.

Fig. 3: Steganography step: hide each 3 bits of Mbin followed by 1 bit of AMBIGbin using LSB method

After Mbin and AMBIGbin are obtained from the previous steps, they are hidden by applying the LSBase method and the 3:1 methodology using the first 6000 nucleotides from the AC00527 sequence, which is a real DNA reference sequence from the NCBI (National Center for Biotechnology Information) database [15]. Only the first 96 bases from AC00527 are needed to hide Mbin and AMBIGbin, and the generated faked sequence sent to the receiver is:
CTTTCTGTCGCATTAAAGTTCATTTCTTTGTAGCTTCTG
CCTGCTGGGCTTGAATCCGATCTTTAAAGACTGCACGC
ACATACACATACGCACACG.
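The Phase I numbers of the example can be checked with a short Python sketch. It encodes Table I as a lookup and uses a classic 5*5 Playfair square built from the key. The function names are illustrative, and the square layout (25 letters, J merged into I) is an assumption, since Fig. 5 and the amino acid alphabet of [13] are not reproduced in the text. The DNA to amino acid step and the AMBIG indices depend on the codon table of [13] and are therefore not modelled here.

```python
# Table I: pairs of nucleotides in the order of their 4-bit codes 0000..1111.
PAIRS = ["AA", "AC", "AG", "AT", "CC", "CA", "CG", "CT",
         "GG", "GA", "GC", "GT", "TT", "TA", "TC", "TG"]
BITS_TO_PAIR = {format(i, "04b"): pair for i, pair in enumerate(PAIRS)}
PAIR_TO_BITS = {pair: bits for bits, pair in BITS_TO_PAIR.items()}


def text_to_bits(text):
    """8-bit code of every character, concatenated (Mbin in the paper)."""
    return "".join(format(ord(ch), "08b") for ch in text)


def bits_to_dna(bits):
    """Apply the 4-bits binary coding rule: 4 bits -> 2 nucleotides."""
    if len(bits) % 4:
        raise ValueError("bit string length must be a multiple of 4")
    return "".join(BITS_TO_PAIR[bits[i:i + 4]] for i in range(0, len(bits), 4))


def dna_to_bits(dna):
    """Inverse of bits_to_dna: 2 nucleotides -> 4 bits."""
    if len(dna) % 2:
        raise ValueError("DNA length must be a multiple of 2")
    return "".join(PAIR_TO_BITS[dna[i:i + 2]] for i in range(0, len(dna), 2))


def playfair_square(key):
    """Row-major 5*5 square: key letters first, then the rest (assumed I/J merge)."""
    square = []
    for ch in key.upper() + "ABCDEFGHIKLMNOPQRSTUVWXYZ":
        ch = "I" if ch == "J" else ch
        if ch.isalpha() and ch not in square:
            square.append(ch)
    return square


def playfair_encrypt(text, key):
    """Classic Playfair on an even-length text without doubled letters in a pair."""
    square = playfair_square(key)
    if len(text) % 2:
        raise ValueError("text length must be even in this sketch")
    out = []
    for i in range(0, len(text), 2):
        a, b = text[i], text[i + 1]
        if a == b:
            raise ValueError("doubled letters in a digraph are not handled here")
        ra, ca = divmod(square.index(a), 5)
        rb, cb = divmod(square.index(b), 5)
        if ra == rb:      # same row: take the letter to the right
            out += [square[ra * 5 + (ca + 1) % 5], square[rb * 5 + (cb + 1) % 5]]
        elif ca == cb:    # same column: take the letter below
            out += [square[((ra + 1) % 5) * 5 + ca], square[((rb + 1) % 5) * 5 + cb]]
        else:             # rectangle: keep the row, swap the columns
            out += [square[ra * 5 + cb], square[rb * 5 + ca]]
    return "".join(out)


m_bin = text_to_bits("RNA")
m_dna = bits_to_dna(m_bin)
cipher_aa = playfair_encrypt("QASH", "SECURITY")
```

Under these assumptions, `m_dna` matches MDNA = CAA GCC TCC CAC and `cipher_aa` matches CipherMAA = XIUD from the example.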
Fig. 4: Hiding sequence bits of Mbin and AMBIGbin; Row 1 is the real sequence; Row 2 is the least significant carrier base; Row 3 is the LSBase type (Pur. is for Purine, Pyr. is for Pyrimidine); Row 4 is the message bits; Row 5 is the ambiguity bits; Row 6 is the codon's LSBase of faked sequence; Row 7 is the faked sequence.

D. Data extraction - Receiver side
The sender sends the faked DNA sequence to the receiver. Then, the receiver applies the Playfair cipher to decrypt the message using the secret key. Both sender and receiver share the secret key from the beginning. But sharing a secret key may pose a problem, since it needs to be exchanged before applying the encryption process. To avoid this problem, the proposed method can be modified to hide the secret key within the faked sequence. At the receiver side, when

the faked DNA sequence S* is sent to the receiver without any additional data which enhances the algorithm's security, the receiver should apply two phases: data extraction and message decryption in order to retrieve the secret data which is contained in the faked DNA sequence.

Fig. 5: 5*5 Playfair matrix using SECURITY as a key
1) Data extraction phase: The extraction process, shown in Fig. 6, is simply the inverse of the embedding algorithm: the LSBase method is used and S* is divided into codons. The least significant base of each codon is checked to retrieve the hidden bits of the secret message. If the LSBase is either 'T' or 'A', then the embedded bit is '0'. If it is 'C' or 'G', then the embedded bit is '1' [14]. Each three consecutive bits extracted by the LSBase method are added to the secret message, and the next bit is added to the ambiguity of the secret message, until Mbin and AMBIGbin are completely extracted from S*.

Fig. 6: Extraction phase at the receiver side

2) Decryption phase: Decryption is the inverse of the encryption phase, as shown in Fig. 7. Mbin is converted to DNA using the proposed 4-bits binary coding rule. Then, the ciphered DNA format is converted to amino acids so that the Playfair cipher can be applied to it using the secret key. The decrypted amino acids form is generated by the Playfair cipher, and AMBIGbin is converted to decimal digits by mapping each two bits to a number. Each ambiguity number is used with the corresponding amino acid character to retrieve the codon associated with that character and that ambiguity number. Finally, the resulting DNA sequence is converted into binary using the 4-bits binary coding rule and then to ASCII to get the corresponding plain text, which is the original form of the secret data.
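The embedding rules of Phase II and the extraction rules above can be sketched together in Python. This is a minimal illustration, not the authors' Matlab implementation. It works directly on DNA letters (T in place of U), which is an assumption based on the faked sequence in the example being written with T. The parameter `n_bits`, which tells the extractor how many hidden bits to read, is also an assumption: the text does not state how the receiver learns where the hidden data ends.

```python
# Codons that the LSBase method skips (UGA, UGG, AUA, AUG written with T).
SKIPPED = {"TGA", "TGG", "ATA", "ATG"}
PURINES = "AG"


def interleave_3_to_1(msg_bits, ambig_bits):
    """Three message bits followed by one ambiguity bit, repeated."""
    if len(msg_bits) != 3 * len(ambig_bits):
        raise ValueError("message and ambiguity lengths must be in a 3:1 ratio")
    return "".join(msg_bits[3 * i:3 * i + 3] + a for i, a in enumerate(ambig_bits))


def lsbase_hide(reference, msg_bits, ambig_bits):
    """Return the faked sequence S* with the bits stored in codon LSBases."""
    bits = interleave_3_to_1(msg_bits, ambig_bits)
    usable = len(reference) // 3 * 3
    out, k = [], 0
    for i in range(0, usable, 3):
        codon = reference[i:i + 3]
        if k < len(bits) and codon not in SKIPPED:
            if codon[2] in PURINES:          # purine stays a purine
                base = "G" if bits[k] == "1" else "A"
            else:                            # pyrimidine stays a pyrimidine
                base = "C" if bits[k] == "1" else "T"
            codon = codon[:2] + base
            k += 1
        out.append(codon)
    if k < len(bits):
        raise ValueError("reference sequence is too short for the data")
    out.append(reference[usable:])           # leftover bases, unchanged
    return "".join(out)


def lsbase_extract(faked, n_bits):
    """Read n_bits hidden bits and split them into (message, ambiguity)."""
    bits = []
    for i in range(0, len(faked) // 3 * 3, 3):
        if len(bits) >= n_bits:
            break
        codon = faked[i:i + 3]
        if codon in SKIPPED:
            continue
        bits.append("1" if codon[2] in "CG" else "0")
    msg = "".join(b for j, b in enumerate(bits) if j % 4 != 3)
    ambig = "".join(b for j, b in enumerate(bits) if j % 4 == 3)
    return msg, ambig


faked = lsbase_hide("AACTAGGGACATACGTACGGTTTA", "001011", "00")
recovered = lsbase_extract(faked, 8)
```

With the short example of subsection C, `recovered` is the pair `("001011", "00")`. Reading 32 bits from the 96-base faked sequence of the "RNA" example gives CipherMbin and the binary form of AMBIG = 0 1 1 1.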

Fig. 7: Decryption phase at the receiver side

IV. SECURITY ANALYSIS
In this section, the security of the proposed scheme is analyzed. Then, a comparative study is presented that covers the proposed scheme and some recent DNA based steganography algorithms according to multiple security parameters.

A. Analysis of the proposed approach
In terms of security, there is fundamental information that any intruder must know in order to extract the plain text of the original secret data: the DNA reference sequence, the binary coding rule and the LSBase substitution rule. These parameters are analyzed as follows:
1) DNA reference sequence: There are 163 million DNA sequences available publicly. So, the probability of an attacker making a successful guess of the chosen reference sequence (DNARef) is:
$P(DNA_{Ref}) = \frac{1}{1.63 \times 10^{8}}$. (8)
2) Binary coding rule: The proposed 4-bits binary coding rule represents each two nucleotides by four bits. Since there are 4 bases (A, C, G and T), the possible combinations of them are 4^2 = 16 couples (AA, AC, AG, AT, CC, CA, ..). The sender is free to select any equivalent

four bits to every two nucleotides. It means that AA can be represented by '0000', '0001', '0010', '0011', ..., '1111', so it has 16 options to select from. If '0000' is selected to represent AA, AC can be represented by '0001', '0010', '0011', '0100', ..., '1111', so it has 15 options, and so on until all 16 pairs of nucleotides have been assigned a binary representation. So, the number of all binary coding rules (# of BCRs) is:
$\#\,\text{of BCRs} = 16 \times 15 \times 14 \times 13 \times 12 \times \dots \times 4 \times 3 \times 2 \times 1 = 16!$. (9)
Consequently, the likelihood of making a correct guess of the applied binary coding rule (BCR) is:
$P(BCR) = \frac{1}{16!}$. (10)
3) The least significant base substitution rule: The LSBase method is applied by substituting a pyrimidine base by 'U' to encode the secret bit '0' or by 'C' to encode '1'. But it can also encode '0' by 'C' and '1' by 'U', and the same holds for the purine base. So, briefly, the '0' secret bit can be encoded by substituting the pyrimidine base either by 'U' or by 'C'. If 'U' is selected, then 'C' is used to substitute the pyrimidine base to encode '1'. So the number of possibilities is 2*1 guesses, and the same applies to the purine base. So, the probability of making a successful guess of the substituted nucleotides N is:
$P(N) = \frac{1}{4}$. (11)
Therefore, using the proposed method, the total probability of an attacker making a successful guess (SG) is:
$P(SG) = \frac{1}{1.63 \times 10^{8}} \times \frac{1}{16!} \times \frac{1}{4}$. (12)

B. Comparative study
Some recent DNA based steganography algorithms are compared according to different parameters in Table II. The first parameter is the secret text type, which tells whether the algorithm hides different types of data format (letters, symbols or numbers); this is supported by all algorithms mentioned in Table II. The second parameter is the type of binary coding rule used in the conversion from the binary format of the message to DNA. All methods in the table use a 2-bits binary coding rule, while the algorithm introduced in this paper is 4-bits based, which increases the system's security and decreases the processing time too. The third parameter shows whether the method encrypts the secret data before hiding it or not. M1 encrypts the data by converting it to DNA and then to amino acids form. M2 and M5 encrypt the secret data using the DNA and amino acids Playfair cipher. M3 and M4 hide the original format of the data without encrypting it.
The fourth parameter determines the data hiding algorithm used. M1 and M2 hide the secret message in the DNA using the insertion method. M3 hides the secret message using complementary rules, while M4 and M5 hide the secret message by substituting DNA nucleotides based on the message bits. The fifth parameter tells whether the embedded data can be retrieved without the original reference DNA sequence in the extraction phase. M2 and the proposed scheme M5 are blind algorithms. The sixth parameter gives the cracking probability formula of each technique in Table II.
The last two parameters define the advantages and disadvantages of each technique. The advantages of the newly proposed method include the following. It is a double layer security system, as the data is encrypted before it is hidden. It has a low modification rate, as only the least significant base of each codon is changed. It preserves the biological functions of the original DNA sequence after hiding, as a result of substituting the LSBase with a purine base if it is a purine or with a pyrimidine base if it is a pyrimidine. All the receiver requires to retrieve the secret data correctly is the fake DNA sequence, without any external helper data. Finally, the message is not hidden in the DNA reference sequence in a continuous way, as it is separated by ambiguity bits.

V. EXPERIMENTAL RESULTS
This section describes a set of experiments carried out to evaluate the performance of the proposed scheme. The proposed scheme was tested on a personal computer with an Intel(R) Core(TM) i5-3230M CPU @ 2.60 GHz and 6 GB RAM. The algorithm is implemented using the Matlab bioinformatics toolbox. The eight real DNA sequences in Table III were used; they are publicly available and can be accessed from the NCBI database [15]. Three parameters are commonly used for evaluating the system's performance. The first parameter is the capacity, which is the total length of the modified DNA reference sequence after the secret message and ambiguity are hidden within it. The second parameter is the payload, which is the remaining length of the new DNA sequence after extracting out the reference DNA sequence. The third parameter is the bpn, which is the number of bits hidden per nucleotide.
The experiment was done on a secret message M in a file of size 21 kilo bytes containing letters, numbers and special characters, with the secret key 'SECURITY'. The secret message is first extracted from the file and then converted to a DNA sequence using the proposed n-bits binary coding rule, to be encrypted by the DNA and amino acids based Playfair cipher. The value of n is set to four. Then the obtained ciphered text is hidden in each one of the eight sequences in Table III using the LSBase method with ambiguity, by hiding each three bits of the message followed by one bit of ambiguity, in order to measure the capacity, payload and bpn. All eight DNA reference sequences are used by the algorithm to analyze the system. Each sequence is defined by a locus, which acts as the reference sequence ID, and by the number of nucleotides, which states how many bases the sequence contains.
Table IV displays the experimental results in terms of the capacity, payload and bpn parameters to evaluate the system's performance. In the proposed algorithm, the capacity includes hiding the secret message and its ambiguity bits as well in the sequence. The payload is zero, which means that the length of the fake DNA reference sequence is not expanded after hiding the message bits within it, which avoids drawing attention to it. This

TABLE II: Comparison Between Various Steganography Techniques
Comparison M1: Enhanced Double M2: DNA Base Data M3: Proposed M4: A New Data M5: The Proposed
Criteria Layer Security using Encryption and Steganography Hiding Scheme Scheme
RSA over DNA based Hiding using Playfair Approach using Based on DNA
Data Encryption [2] and Insertion DNA Properties [1] Sequence [4]
Techniques [7]
Secret Text Any type of data Any type of data Any type of data Binary data Any type of data
Type
Used Binary 2-Bits binary coding 2-Bits binary coding 2-Bits binary coding Binary coding rule 4-bits binary coding
Coding Rule rule rule rule independent rule
Encryption Encrypting secret data 5*5 Playfair cipher No encryption No encryption Encryption used DNA
Algorithm by mapping it to DNA based on DNA and and amino acids
and amino acids amino acids based Playfair cipher
Data Hiding Insertion Method Insertion Method Complementary rules Substitution method Substitution method
Algorithm based hiding method using repeated using least significant
which is the rule that nucleotides to hide the base of each codon in
specifies the strand of secret message bits the DNA reference
DNA directly opposite sequence
a specified sequence
Blind/ Not Not blind Blind Not blind Not blind Blind, only the faked
Blind DNA sequence is
required to be sent to
the second part
P (S) = P (S) =
1
∗ 1
∗ P (S) = P (S) = P (S) =
System’s (163∗106 )∗(24)∗(n−1) (163∗106 )∗(24)∗(n−1) 1 1 1
Cracking 1 1 (163∗106 )∗(24)∗(24) (163∗106 )∗(6)∗(24) (163∗106 )∗(16!)∗(4)
(2m−1 )∗(2 s−1 ) (2m−1 )∗(2 s−1 )
Probability
P(S)
Advantages
Column 1:
• Provides double layer security by encrypting the data using DNA and amino acids, then hiding the cipher data formed.
• High capacity.

Column 2:
• Provides double layer security by encrypting the data using the DNA and amino acids based Playfair cipher, then hiding the cipher data formed in a DNA reference sequence.
• High capacity.
• Blind algorithm.

Column 3:
• Does not expand the original DNA reference sequence after the data hiding.
• Moderate modification, as not all of the nucleotides are changed.

Column 4:
• Doesn't expand the original DNA reference sequence after the data hiding.
• Simple to implement.

Column 5:
• Double layer security.
• Low modification.
• Preserves the biological functions of the original DNA sequence.
• Only the faked DNA sequence is needed.
• The message is not hidden in a continuous way.

Disadvantages

Column 1:
• High modification due to inserting message parts within a DNA sequence.
• Expands the length of the DNA sequence due to hiding using the insertion methodology, i.e. payload ≠ 0.
• High modification rate that results in changing the main DNA biological functions.

Column 2:
• Expands the length of the DNA sequence due to hiding using the insertion methodology, i.e. payload ≠ 0.
• Doesn't preserve DNA biological functions.
• Multiple data are sent to the receiver to be able to recover the secret message, namely the two random seeds [R] and [K] used in the insertion method.

Column 3:
• Doesn't preserve DNA biological functions.
• Not blind.
• Additional data are required to be sent with the faked sequence, namely the indices of the nucleotides containing the secret data.

Column 4:
• High modification due to substituting all repeated nucleotides in a DNA sequence.
• Doesn't preserve DNA biological functions.
• Hides the original form of the secret data without encrypting it on converting it to another understood format.

Column 5:
• Moderate capacity, as not only the message is hidden but also the ambiguity bits. However, with the 3:1 ratio, no additional data is required to inform the receiver of the beginning of the message and of the ambiguity.
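The substitution-based properties listed in the table (no expansion of the reference sequence, blind extraction) can be illustrated with a minimal sketch. It is not the authors' algorithm: it omits the Playfair encryption, the ambiguity bits and the 3:1 hiding strategy, and it does not try to preserve amino acids. The fixed two-bit mapping and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical fixed mapping between bit pairs and bases (an assumption,
# not the coding rule used in the paper).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def hide_in_third_bases(reference, bits):
    """Substitute the third base of successive codons, two bits per codon."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    codons_needed = len(bits) // 2
    if codons_needed > len(reference) // 3:
        raise ValueError("reference sequence is too short")
    seq = list(reference)
    for k in range(codons_needed):
        # Index 3k + 2 is the last (least significant) base of codon k.
        seq[3 * k + 2] = BITS_TO_BASE[bits[2 * k:2 * k + 2]]
    return "".join(seq)


def extract_from_third_bases(stego, n_bits):
    """Blind extraction: only the stego sequence and the bit count are used."""
    return "".join(BASE_TO_BITS[stego[3 * k + 2]] for k in range(n_bits // 2))


reference = "ATGGCCATTGTAATGGGCCGC"   # made-up 21-base example sequence
secret_bits = "0111100011"
stego = hide_in_third_bases(reference, secret_bits)
recovered = extract_from_third_bases(stego, len(secret_bits))
```

Because bases are only substituted, `stego` has the same length as `reference`, and extraction never consults `reference`. An insertion-based scheme would instead lengthen the sequence.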

2015 Seventh International Conference of Soft Computing and Pattern Recognition (SoCPaR 2015) 101
TABLE III: Eight DNA Reference Sequences Used in the Testing Phase

Locus      Number of nucleotides   Species definition
AC166252   149,884                 Mus musculus 6 BAC RP23-100G10
AC168901   191,456                 Bos taurus clone CH240-18511
AC168907   194,226                 Bos taurus clone CH240-19517
AC153526   200,117                 Mus musculus 10 BAC RP23-383C2
AC168897   200,203                 Bos taurus clone CH240-190B15
AC167221   204,841                 Mus musculus 10 BAC RP23-3P24
AC168874   206,488                 Bos taurus clone CH240-209N9
AC168908   218,028                 Bos taurus clone CH240-195K23

TABLE IV: The Results Obtained Using the Proposed Scheme to Hide 20K Bytes of Secret Data within the Tested DNA Sequences

Locus      Capacity (bits)   Payload   bpn = (|M|+|A|)/C   Sender side time (seconds)   Receiver side time (seconds)
AC166252   46,685            0         3.7                 0.49                         0.32
AC168901   58,488            0         3.0                 0.56                         0.42
AC168907   60,032            0         2.9                 0.62                         0.42
AC153526   61,862            0         2.8                 0.70                         0.42
AC168897   60,077            0         2.9                 0.68                         0.43
AC167221   63,042            0         2.7                 0.58                         0.45
AC168874   63,271            0         2.7                 0.60                         0.44
AC168908   66,622            0         2.6                 0.8                          0.47

is achieved as a result of hiding the secret data by substituting the nucleotides. Furthermore, bpn is within [2.6, 3.7], and the proposed scheme has an acceptable embedding capacity, which is shared between the message bits and the ambiguity bits. Hiding both increases the total number of nucleotides required compared with hiding the message bits only. Finally, the execution time to encrypt and hide 20KB of data is calculated. Table IV shows that the capacity and the execution time are affected by the length of the DNA sequence used: as the DNA sequence's length increases, its hiding capacity and consequently the execution time generally increase too, and vice versa.

VI. CONCLUSION

In this paper, a data hiding method is proposed that combines cryptography and steganography, which gives the system double layer security. A new binary coding rule is proposed that assigns 2n bits to each combination of n nucleotides instead of assigning two bits to a single nucleotide. This increases the number of binary coding rules from 4! to (2^n × 2^n)!, which strengthens the algorithm's security. Because the least significant base (LSBase) method is used to hide the cipher bits of the message and the ambiguity, the proposed algorithm is still blind: the embedded data can be extracted without the original DNA reference sequence. The proposed method also doesn't expand the real DNA reference sequence. It preserves the original DNA sequence length because it depends on substitution only, so the algorithm's payload is zero, which avoids attracting attention to the faked sequence.
As future work, the proposed method can be modified to increase the hiding capacity of the DNA sequence while also increasing the algorithm's security.

REFERENCES

[1] G. Hamed et al., “DNA Based Steganography: Survey and Analysis for Parameters Optimization,” in Applications of Intelligent Optimization in Biology and Medicine, Springer, 2015, ISSN: 1868–4394, pp. 47–89.
[2] B. A. Mitras and A. K. Abo, “Proposed Steganography Approach using DNA Properties,” International Journal of Information Technology and Business Management, ISSN: 2304–0777, vol. 14, Issue No. 1, pp. 96–102, June 2013.
[3] M. Skariya and M. Varghese, “Enhanced Double Layer Security using RSA over DNA based Data Encryption System,” International Journal of Computer Science & Engineering Technology (IJCSET), ISSN: 2229–3345, vol. 4, Issue No. 06, pp. 746–750, Jun 2013.
[4] Y. A. Yunus, S. Ab Rahman and J. Ibrahim, “Steganography: A Review of Information Security Research and Development in Muslim World,” American Journal of Engineering Research (AJER), ISSN: 2320–0936, vol. 02, Issue No. 11, pp. 122–128, 2013.
[5] C. Guo, C. Change and Z. Wang, “A New Data Hiding Scheme based on DNA Sequence,” International Journal of Innovative Computing, Information and Control, ISSN: 1349–4198, vol. 8, Issue No. 1, pp. 139–149, Jan 2014.
[6] I. K. Maitra, “Digital Steganalysis: Review on Recent Approaches,” Journal of Global Research in Computer Science, ISSN: 2229–371X, vol. 2, Issue No. 1, pp. 1–5, Jan 2011.
[7] J. Taur, H. Lin, H. Lee and C. Tao, “Data Hiding in DNA Sequences based on Table Lookup Substitution,” International Journal of Innovative Computing, Information and Control, ISSN: 1349–4198, vol. 8, Issue No. 10, pp. 6585–6598, Oct. 2012.
[8] A. Atito, A. Khalifa and S. Z. Rida, “DNA-based Data Encryption and Hiding using Playfair and Insertion Techniques,” Journal of Communications and Computer Engineering, ISSN: 2090–6234, vol. 2, Issue No. 3, pp. 44–49, 2012.
[9] A. K. Kaundal and A. K. Verma, “DNA based Cryptography: A Review,” International Journal of Information and Computation Technology, ISSN: 0974–2239, vol. 04, Issue No. 7, pp. 693–698, 2014.
[10] H.J. Shiu, K.L. Ng, J.F. Fang, R.C.T. Lee and C.H. Huang, “Data Hiding Methods based upon DNA Sequences,” Journal of Information Sciences: an International Journal, vol. 180, Issue No. 11, pp. 2196–2208, June 2010.
[11] M. R. N. Torkaman, N. S. Kazazi and A. Rouddini, “Innovative Approach to Improve Hybrid Cryptography by using DNA Steganography,” International Journal of New Computer Architectures and their Applications (IJNCAA), ISSN: 2220–9085, vol. 2, Issue No. 1, pp. 224–235, 2012.
[12] A. Khalifa and A. Atito, “High-Capacity DNA-based Steganography,” in Informatics and Systems (INFOS), 8th International Conference on 2012, May 2012, pp. BIO–76.
[13] M. Sabry, M. Hashem, T. Nazmy and M. E. Khalifa, “A DNA and Amino Acids-based Implementation of Playfair Cipher,” International Journal of Computer Science and Information Security, ISSN: 1947–5500, vol. 8, Issue No. 3, pp. 129–136, 2010.
[14] A. Khalifa, “LSBase: A key Encapsulation Scheme to Improve Hybrid Crypto-systems using DNA Steganography,” in 8th International Conference on Computer Engineering & Systems (ICCES), Cairo, Egypt, Nov. 2013, pp. 105–110.
[15] NCBI Database, Bank for real DNA reference sequences, http://www.ncbi.nlm.nih.gov/
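The n-bits binary coding rule summarized in the conclusion can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: a rule is taken to be any one-to-one assignment of the 2n-bit strings to the 4^n blocks of n nucleotides, and a seeded shuffle (a hypothetical way of selecting a rule; Python's `random` module is not cryptographically secure) stands in for however the communicating parties agree on one. All function names are illustrative.

```python
import math
import random
from itertools import product

BASES = "ACGT"


def rule_count(n):
    """Number of possible rules: (2^n * 2^n)! orderings of the 4^n blocks."""
    return math.factorial(2**n * 2**n)


def make_rule(n, seed):
    """Build one rule: a bijection from 2n-bit strings to n-nucleotide blocks."""
    blocks = ["".join(p) for p in product(BASES, repeat=n)]
    codes = [format(i, f"0{2 * n}b") for i in range(4**n)]
    random.Random(seed).shuffle(codes)  # assumption: seed acts as a shared secret
    return dict(zip(codes, blocks))


def bits_to_dna(bits, rule, n):
    """Encode a bit string whose length is a multiple of 2n into DNA."""
    if len(bits) % (2 * n) != 0:
        raise ValueError("bit string length must be a multiple of 2n")
    return "".join(rule[bits[i:i + 2 * n]] for i in range(0, len(bits), 2 * n))


def dna_to_bits(dna, rule, n):
    """Decode DNA back to bits using the inverse of the same rule."""
    inverse = {block: code for code, block in rule.items()}
    return "".join(inverse[dna[i:i + n]] for i in range(0, len(dna), n))


rule = make_rule(2, seed=7)
message_bits = "0110110010110001"
dna = bits_to_dna(message_bits, rule, 2)
decoded = dna_to_bits(dna, rule, 2)
```

For n = 1 the count is 4! = 24, and for n = 2 it is 16! = 20,922,789,888,000, which shows how quickly the rule space grows with n.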
