
Real-time ASCII Art Camera on Mobile Device

Yizhe Hu, Zhilin Jiang, Zachary Hirn


University of Wisconsin-Madison
http://ascpi.weebly.com
Abstract
In this paper, we present a way to compute an ASCII representation of an input video
stream in real time on mobile devices. Mobile devices, unlike modern personal
computers, have very limited computing power. This paper presents a machine learning
approach, along with other optimization techniques, to achieve real-time ASCII art
conversion at a satisfactory speed.

1. Introduction
The motivation behind this project was that we wanted to work on something fun. We thought about several different ideas before settling on making an application that creates ASCII art. The iOS application's first step is to capture 640 × 480 video from the back camera. The application then takes the raw video frame from iOS AVFoundation (as uint8_t *, BGRA format) and converts it to grayscale using the iPhone's ARM64 NEON instructions; the grayscaled frame is then passed to a pre-trained decision tree that determines which ASCII character will represent each section of the frame; finally, we display the output string of characters on the iPhone's screen.

The decision tree that we are using is the trained decision tree provided by the paper's authors. In their implementation, they encoded each ASCII character as rendered grayscale pixels. This yields better performance when displaying the pixels on screen, but it limits our ability to manipulate colors (since we are making colored ASCII art). To account for this limitation, we decided to reverse-engineer the provided decision tree's structure and determine which ASCII character is mapped to which particular leaf node. As a result of this approach, we were able to produce a string as the ASCII representation of a frame. Since our ASCII art is stored in a string, we can change each character's color as needed. Since the paper did not cover the coloring problem, we came up with a reasonable way to summarize the color value in each 8 × 8 pixel block and determine the final color to give each corresponding character.
2. Method
To reduce the computational complexity, we used a machine learning technique to
achieve real-time results on mobile devices. The key techniques are described in this
section.

2.1 Structural Similarity (SSIM) Mapping


According to Markus et al., the best way to map a block of pixels to an ASCII character is to run the structural similarity (SSIM) algorithm on the given block against a rasterized (pixelized) version of the character. The SSIM algorithm computes an index that captures the luminance, contrast, and, most importantly, the structural similarity between two blocks of pixels. For two blocks x and y, the algorithm is defined as follows:

l(x, y) = (2 μ1 μ2 + C1) / (μ1² + μ2² + C1)
c(x, y) = (2 σ1 σ2 + C2) / (σ1² + σ2² + C2)
s(x, y) = (σ12 + C3) / (σ1 σ2 + C3)

In these equations μ1 and μ2 are the pixel mean values of the two blocks, σ1² and σ2² are their variances, and σ12 is their covariance (C1, C2, and C3 are small stabilizing constants); and finally the SSIM index is computed as the product of the three terms, which, with C3 = C2 / 2, simplifies to:

SSIM(x, y) = ((2 μ1 μ2 + C1)(2 σ12 + C2)) / ((μ1² + μ2² + C1)(σ1² + σ2² + C2))

The algorithm itself is computationally heavy, as it requires us to compute the SSIM index between a given block of pixels and every rasterized character in the codebook. In our setup, we are capturing video at 640 × 480 pixels with a pre-defined block size of 8 × 8, and our codebook (i.e. the set of candidate ASCII characters) consists of the ASCII characters ranging from 32 to 126. This means that for every frame of the video we have 4800 ((640 × 480) / (8 × 8)) blocks, and for each block we need to run the SSIM algorithm against 94 candidate characters. Evidently, it is impossible to run the full SSIM algorithm on a mobile device in real time. To solve this problem, Markus et al. proposed a machine learning approach.
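For reference, the per-block computation that has to be approximated is small but must be repeated 94 times per block. Here is a minimal sketch in C of that computation; the function name and the standard stabilizing constants C1 and C2 are our own choices, not taken from Markus et al.:

#include <stdint.h>

#define BLOCK 64   /* 8 x 8 pixels */

/* SSIM index between one 8x8 camera block x and one 8x8 rasterized
   character y, following the formula above. */
double ssim8x8(const uint8_t x[BLOCK], const uint8_t y[BLOCK]) {
    const double C1 = 6.5025, C2 = 58.5225;   /* (0.01*255)^2, (0.03*255)^2 */

    double mu1 = 0, mu2 = 0;
    for (int i = 0; i < BLOCK; i++) { mu1 += x[i]; mu2 += y[i]; }
    mu1 /= BLOCK; mu2 /= BLOCK;

    double var1 = 0, var2 = 0, cov = 0;
    for (int i = 0; i < BLOCK; i++) {
        var1 += (x[i] - mu1) * (x[i] - mu1);
        var2 += (y[i] - mu2) * (y[i] - mu2);
        cov  += (x[i] - mu1) * (y[i] - mu2);
    }
    var1 /= BLOCK; var2 /= BLOCK; cov /= BLOCK;

    return ((2 * mu1 * mu2 + C1) * (2 * cov + C2)) /
           ((mu1 * mu1 + mu2 * mu2 + C1) * (var1 + var2 + C2));
}

Choosing a character for a block then amounts to evaluating this function against all 94 rasterized characters and keeping the best score, which is exactly the cost the decision tree below avoids.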
2.2 Decision Tree Model

In its essence, the problem of assigning an ASCII character to a block of pixels is nothing more than a classification problem: our goal is to classify each block of pixels into one of the 94 ASCII characters.
Therefore, machine learning is naturally a promising substitute that can be used to classify the pixels and approximate the effect of the SSIM algorithm. Specifically, Markus et al. proposed a binary decision tree approach. The decision tree is constructed via a supervised learning technique using the results of the SSIM index mapping as the training set. During the learning phase, a threshold value is trained at every internal node; in the decision phase, this threshold value is used in a simple binary test on the block's pixel intensity to determine which branch (left or right) of the tree should be taken next. Finally, the classified ASCII characters are represented by the leaf nodes. Markus et al. determined empirically that a decision tree of height 16 gives the best marginal gain (the best tradeoff between result quality and performance). Therefore, with the decision tree approach, instead of running the SSIM algorithm 94 times per block, we can simply perform 16 binary tests to walk through the decision tree and approximate the classification.
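As an illustration of how cheap the per-block decision becomes, here is a minimal sketch of such a tree walk in C. The node layout (a tested pixel index plus a learned threshold, with negative child indices encoding leaves) is our own assumption for illustration and does not reproduce the exact binary format used by Markus et al.:

#include <stdint.h>

typedef struct {
    uint8_t pixelIndex;   /* which of the 64 pixels in the block to test  */
    uint8_t threshold;    /* learned intensity threshold at this node     */
    int16_t left, right;  /* child indices; negative values encode leaves */
} TreeNode;

/* Walk a tree of height 16: at most 16 binary tests per 8x8 block.
   leafChars maps each leaf id to the ASCII character recovered by
   reverse-engineering the provided tree. */
char classifyBlock(const uint8_t block[64],
                   const TreeNode *nodes,
                   const char *leafChars) {
    int16_t node = 0;                      /* start at the root          */
    while (node >= 0) {                    /* negative index => leaf hit */
        const TreeNode *n = &nodes[node];
        node = (block[n->pixelIndex] <= n->threshold) ? n->left : n->right;
    }
    return leafChars[-(node + 1)];         /* decode leaf id to character */
}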
To save time in our implementation, we used the trained tree provided by Markus et al. (this tree was trained on 200 five-megapixel photos). The tree is designed in such a way that a rasterized character is directly encoded at every terminal node as an 8 × 8 block of pixels, which can be displayed on screen directly. While displaying the characters in this pre-rendered format increases performance dramatically (details are discussed in section 2.4), one of our goals is to output the string representation of the camera input. Therefore, we reverse-engineered the constructed tree to determine the corresponding character at each leaf node so that the corresponding strings could be collected.
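Putting the pieces together, collecting the per-frame output string is then a simple loop over the 80 × 60 grid of blocks. This sketch reuses the hypothetical TreeNode and classifyBlock from the sketch above:

#include <stdint.h>

/* out must hold (width/8 + 1) * (height/8) + 1 bytes: 80 characters plus a
   newline per block row, 60 rows, plus the terminating NUL. */
void frameToAscii(const uint8_t *gray, int width, int height,
                  const TreeNode *nodes, const char *leafChars, char *out) {
    uint8_t block[64];
    char *dst = out;
    for (int by = 0; by < height / 8; by++) {
        for (int bx = 0; bx < width / 8; bx++) {
            /* Copy one 8x8 block out of the grayscale frame. */
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++)
                    block[8 * y + x] = gray[(8 * by + y) * width + 8 * bx + x];
            *dst++ = classifyBlock(block, nodes, leafChars);
        }
        *dst++ = '\n';   /* one text row per block row */
    }
    *dst = '\0';
}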
2.3 Neon Instructions
The decision tree provided by Markus et al. was trained on a collection of grayscaled images. However, the native output of the iOS camera system is in 32BGRA format: each pixel is represented as four uint8_t values in the order Blue, Green, Red, Alpha. The conversion from RGB to grayscale is governed by a simple formula:

Gray = 0.2126 R + 0.7152 G + 0.0722 B

Performing this conversion is trivial; the challenge is to calculate the values efficiently without going into the realm of OpenGL (which would significantly increase the coding difficulty). Since we are running the algorithm in real time on a mobile device with limited computing power, we want to reserve the computing power for walking the decision tree and rendering the pixels on screen. After some research, we found an excellent blog post demonstrating the use of the ARM architecture's NEON instructions to efficiently convert pixels from BGRA format to grayscale. As the blog author Khvedchenya explains, NEON instructions come from the SIMD technology supported by the iPhone's ARM CPU. SIMD stands for Single Instruction, Multiple Data, and as its name suggests, it provides a way to process multiple pieces of data in parallel. In this case specifically, with the help of NEON instructions, we are able to calculate the grayscale values of 8 pixels with a single instruction. Khvedchenya also concludes that a pure NEON assembly implementation of the algorithm is about 6 times faster than a pure C++ implementation. Unfortunately, the iPhone's CPU architecture has changed significantly (armv7 to arm64, 32 bit to 64 bit) since the article was originally posted in 2011, and we were unable to rewrite the original armv7 assembly into an equivalent arm64 version. However, Apple does provide a C wrapper for the NEON instructions (defined in <arm_neon.h>). This version might be slightly slower than pure assembly, but it is still significantly faster than a direct C implementation.
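The following is a minimal sketch of such an intrinsics-based conversion. The function name and the fixed-point rounding of the Rec. 709 coefficients (scaled by 256) are our own simplifications; the actual implementations in the blog post and in our App differ in detail:

#include <arm_neon.h>
#include <stdint.h>

/* Convert a 32BGRA buffer to 8-bit grayscale, 8 pixels per iteration. */
void bgraToGrayNeon(const uint8_t *bgra, uint8_t *gray, int pixelCount) {
    const uint8x8_t wR = vdup_n_u8(54);    /* 0.2126 * 256 ~ 54  */
    const uint8x8_t wG = vdup_n_u8(183);   /* 0.7152 * 256 ~ 183 */
    const uint8x8_t wB = vdup_n_u8(19);    /* 0.0722 * 256 ~ 19  */

    int i = 0;
    for (; i + 8 <= pixelCount; i += 8) {
        /* De-interleave 8 BGRA pixels into four 8-lane registers. */
        uint8x8x4_t px = vld4_u8(bgra + 4 * i);  /* [0]=B [1]=G [2]=R [3]=A */

        /* Widening multiply-accumulate into 16-bit lanes. */
        uint16x8_t acc = vmull_u8(px.val[2], wR);   /* R * 54    */
        acc = vmlal_u8(acc, px.val[1], wG);         /* + G * 183 */
        acc = vmlal_u8(acc, px.val[0], wB);         /* + B * 19  */

        /* Divide by 256 (narrowing right shift) and store 8 gray pixels. */
        vst1_u8(gray + i, vshrn_n_u16(acc, 8));
    }

    /* Scalar tail for pixel counts that are not multiples of 8. */
    for (; i < pixelCount; i++) {
        const uint8_t *p = bgra + 4 * i;
        gray[i] = (uint8_t)((54 * p[2] + 183 * p[1] + 19 * p[0]) >> 8);
    }
}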

2.4 Text Rendering vs. Rasterized Characters
Apart from mapping pixel blocks to ASCII characters, another problem that we had to solve is how to render the output on screen. As mentioned in section 2.2, Markus et al. encoded the rasterized characters into the decision tree, and we were able to use them directly for display. However, since our goal was to output the string representation, we decided to implement a way to render the string directly onto the iPhone screen. As a bonus, the rendered string is sharper, as it accounts for the iPhone's Retina display. Figure 1 shows the display modes of our App.
Text rendering is a very computationally heavy task: it involves typesetting and rendering the correct strokes for each character. We experimented with many text rendering techniques on iOS, from the high-level UITextView to CATextLayer, and finally to the low-level CoreText. We found out empirically that even with the lowest-level implementation, the iPhone cannot effectively render and color dynamic text (in this case 4800 characters) in real time. With the CoreText implementation, we were able to achieve an acceptable frame rate for black-and-white text, but the frame rate drops to one or even less than one frame per second for colored text. We believe the computational bottleneck lies within iOS's Core Foundation, resulting in an extremely slow operation when coloring the CFAttributedStrings.
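To illustrate where the per-character attribute work happens, here is a minimal sketch of drawing one row of colored characters with CoreText. The helper name, its parameters, and the assumption that the caller has already set up a text-friendly (flipped) coordinate system are ours; this is not our exact production code:

#import <CoreText/CoreText.h>
#import <UIKit/UIKit.h>

static void DrawColoredRow(CGContextRef ctx, NSString *row,
                           NSArray<UIColor *> *colors, CTFontRef font,
                           CGFloat baselineY) {
    NSMutableAttributedString *text =
        [[NSMutableAttributedString alloc] initWithString:row
            attributes:@{ (id)kCTFontAttributeName : (__bridge id)font }];

    /* Setting a foreground color attribute character by character is the
       CFAttributedString work that became our bottleneck for colored text. */
    for (NSUInteger i = 0; i < row.length; i++) {
        UIColor *c = colors[i];
        [text addAttribute:(id)kCTForegroundColorAttributeName
                     value:(__bridge id)(c.CGColor)
                     range:NSMakeRange(i, 1)];
    }

    CTLineRef line = CTLineCreateWithAttributedString(
        (__bridge CFAttributedStringRef)text);
    CGContextSetTextPosition(ctx, 0, baselineY);
    CTLineDraw(line, ctx);
    CFRelease(line);
}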

2.5 Coloring
In our naive implementation, we used the first pixel's color value in a block to color the corresponding ASCII character (we also increased the saturation of the color value, so that the color is more obvious). Evidently, such an approach does not give a very accurate color representation, and one can easily come up with other color schemes, such as using the average color of a block, or using the dominant color of a block. However, due to time limitations and the complexity of computing such values, we left out this particular feature. Adding average-color and/or dominant-color text coloring could be an improvement in the future.
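As a pointer for that future work, the average-color variant would be straightforward; here is a minimal sketch (the function name and signature are ours, not existing code):

#include <stdint.h>

/* bgra points to the top-left pixel of an 8x8 block inside the full frame;
   rowBytes is the stride of the frame (640 * 4 in our setup). */
void averageBlockColor(const uint8_t *bgra, int rowBytes,
                       uint8_t *outR, uint8_t *outG, uint8_t *outB) {
    unsigned sumB = 0, sumG = 0, sumR = 0;
    for (int y = 0; y < 8; y++) {
        const uint8_t *rowPtr = bgra + y * rowBytes;
        for (int x = 0; x < 8; x++) {
            sumB += rowPtr[4 * x + 0];
            sumG += rowPtr[4 * x + 1];
            sumR += rowPtr[4 * x + 2];
        }
    }
    *outR = (uint8_t)(sumR / 64);
    *outG = (uint8_t)(sumG / 64);
    *outB = (uint8_t)(sumB / 64);
}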
2.6 iOS Framework
The core algorithms and techniques that we used to transform the raw pixel input of the iPhone camera are summarized above. However, it is still worth mentioning the frameworks we used in the iOS environment to implement this App. For camera input, we used the AVFoundation framework, as it provides us with raw input values; for displaying the resulting image, we rendered the pixel values directly onto a CALayer, as it is lighter than the traditional UIView; finally, for text rendering we used the low-level CoreText framework and our customized subclass of CALayer to achieve the maximum performance.
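For completeness, here is a minimal sketch of the AVFoundation capture setup that delivers raw 32BGRA frames; the function name, the queue label, and the omitted error handling and canAddInput/canAddOutput checks are our own simplifications:

#import <AVFoundation/AVFoundation.h>

static AVCaptureSession *StartCaptureSession(
        id<AVCaptureVideoDataOutputSampleBufferDelegate> delegate) {
    AVCaptureSession *session = [[AVCaptureSession alloc] init];
    session.sessionPreset = AVCaptureSessionPreset640x480;

    /* Default video device (the back camera on an iPhone) as input. */
    AVCaptureDevice *camera =
        [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeVideo];
    [session addInput:[AVCaptureDeviceInput deviceInputWithDevice:camera
                                                            error:nil]];

    /* Raw 32BGRA frames are delivered to the delegate's
       captureOutput:didOutputSampleBuffer:fromConnection: callback, where
       CMSampleBufferGetImageBuffer and CVPixelBufferGetBaseAddress yield
       the uint8_t * that we pass to the grayscale conversion. */
    AVCaptureVideoDataOutput *output = [[AVCaptureVideoDataOutput alloc] init];
    output.videoSettings =
        @{ (id)kCVPixelBufferPixelFormatTypeKey : @(kCVPixelFormatType_32BGRA) };
    [output setSampleBufferDelegate:delegate
                              queue:dispatch_queue_create("ascii.frames", NULL)];
    [session addOutput:output];

    [session startRunning];
    return session;
}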

Figure 1: Screenshot of our app running in four different modes. From left to right they
are: Black and White Rasterized, Colored Rasterized, Black and White Text, and Colored
Text.
3. Results
As demonstrated in our presentation, our App ASCPI, running on an iPhone 6, can transform and display the ASCII representation of the iPhone camera input in real time. Three of our four display modes (black-and-white rasterized, colored rasterized, and black-and-white text) achieve a satisfactory frame rate. However, since we only use the color value of the first pixel of each block, the colored version is not very accurate, and this feature will be improved in a future implementation.
4. References
Markuš, N., Fratarcangeli, M., Pandžić, I., &
Ahlberg, J. (2015). Fast Rendering of
Image Mosaics and ASCII Art. Computer
Graphics Forum, 34, 251-261.
Khvedchenya, E. (2011, February 7). A very
fast BGRA to Grayscale conversion on
iPhone. Retrieved December 21, 2015,
from http://computer-vision-talks.com/2011-02-08-a-very-fast-bgra-to-grayscale-conversion-on-iphone/
(We also used Xcode's built-in documentation.)
5. Extra
Language: Objective-C
Platform: iPhone 6 (because of the ARM64
NEON code we used, it will only work on
the arm64 architecture)
Lines of code: 800+
Team members and contributions:
Zhilin came up with the initial idea of photo-to-ASCII conversion, found the Markus et al. paper and the partial C source code that our project is based on, reverse-engineered the trained decision tree provided in the code, designed the live demo part of the in-class presentation, and collaborated in the design of the project website. Zachary contributed to the initial dissection of the paper's code via some comments, wrote the first project abstract, helped design the website for the project, wrote the very rough initial progress report, and attempted to find a faster method for coloring. Yizhe did basically everything else, including almost all of the dissection of the paper's code, all of the translation between the paper's C code and our application's Objective-C code, almost all of the new coding for the iOS application (including coloring), almost all of the progress report, and almost all of the PowerPoint presentation.
