CHAPTER 13

Case Study of a Technology Manufacturer
The manufacturing line is a place where granular analysis and big
data come together. Being able to determine quickly if a device
is functioning properly and isolate the defective part, misaligned
machine, or ill-behaving process is critical to the bottom line. There
were hundreds of millions of TVs, cell phones, tablet computers, and
other electronic devices produced last year, and the projections point
to continued increases for the next several years.
The typical path of resolution for this type of modeling for this
customer is outlined in the next six steps.
1. The first step is the inspection process for the product. Each
product has a specified inspection process that is required to
ensure quality control guidelines are met and the product is
free from defects. For example, determining if a batch of silicon
wafers meets quality control guidelines is different from determining
quality control acceptance for a computer monitor. In
the case of the wafer, the result is essentially a binary state: The
wafer works or it does not. If it does not work, it is useless scrap
because there is no mechanism to repair the wafer. In the
case of the computer monitor, it could be that there is a dead
pixel, or a warped assembly, or discoloration in the plastic,
or dozens of other reasons the monitor does not pass quality
control guidelines. Some of these product defects can be remediated
by repeating the assembly process or swapping out a defective part;
others cannot be remediated, and the monitor
must be written off as waste. For these complex devices, humans
are still much better at visually inspecting a display for defects
than a computer can be taught to be (for more details, see the
Neural Networks section in Chapter 5). The human eye is highly
adept at seeing patterns and noticing anomalies. In this
manufacturing environment, the human quality control team members
find an order of magnitude more reasons for rejection than the
computer system does.
If a product batch does not meet the quality control threshold,
then an investigation is initiated. Because the investigation slows
or stops the ability to manufacture additional items, an
investigation is not done on a random basis but only after a batch
of product has failed quality control inspection; a minimal sketch
of this trigger rule appears after this list. The investigation is
expensive in terms of time, lost production, and disruption, but it
is essential to prevent the assembly of faulty products.
2. Once a batch of wafers, for example, has failed quality control,
an extract of data from the enterprise data warehouse (EDW) is
requested. This extract contains sensor data from the last few days,
drawn from a number of different source systems. Some of the data is
stored in traditional massively parallel processing (MPP) databases;
other data is stored in Hadoop; and still other data lives in neither
Hadoop nor MPP databases. Assembling the extract is a complex process
that involves joining a number of tables stored in a relational schema
in third normal form and denormalizing them into a rectangular data
table, along with matching them to MPP tables; a sketch of this
denormalization pattern appears after this list. The process to
manufacture a single wafer can include several thousand steps and
result in more than 20,000 data points along the assembly line. When
this data is transformed and combined across the millions and millions
of wafers manufactured every day, you have truly big data. The
assembled rectangular data table is of a size that would defy the laws
of physics to move for processing within the allotted time window of a
few hours.
3. The data is then clustered. With such wide data (more than
10,000 columns), it is clear that not all of the columns will be
responsible for the defects. Columns can be removed or clustered
based on several criteria. The first is whether a column is unary,
that is, holds only one value. With only one reading, the column
holds no useful information for determining the source of the
defects. The second is multicollinearity, a statistical concept for
when two or more input variables (here, the measurements from the
sensors) are very highly correlated. Multicollinearity does not
affect the final predictive power of the model, but it does
needlessly increase the computational cost. Both criteria are
sketched in code after this list.
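
As a minimal sketch of the investigation trigger rule from step 1,
consider the following Python fragment. The function name, the 2
percent threshold, and the data layout are illustrative assumptions,
not details taken from the customer's process.

# Hypothetical sketch: investigations are triggered only by a failed
# batch inspection, never on a random schedule. The 2% threshold is
# an assumed value for illustration.
def needs_investigation(batch_results, max_defect_rate=0.02):
    """batch_results: list of booleans, True if a unit passed inspection."""
    defects = sum(1 for passed in batch_results if not passed)
    return defects / len(batch_results) > max_defect_rate

# A batch of 100 units with 3 defects exceeds the 2% threshold and
# triggers the expensive investigation workflow.
print(needs_investigation([True] * 97 + [False] * 3))  # True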
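
The denormalization described in step 2, pivoting normalized sensor
readings into one rectangular row per wafer and matching attributes
from other source systems, might look like the sketch below. The table
names, column names, and use of pandas are assumptions for
illustration; the actual process runs against MPP and Hadoop tables at
a vastly larger scale.

import pandas as pd

# Illustrative only: sensor readings in third normal form, one row
# per (wafer, step, sensor) combination.
readings = pd.DataFrame({
    "wafer_id":  [101, 101, 101, 102, 102, 102],
    "step_id":   [1, 1, 2, 1, 1, 2],
    "sensor_id": ["temp", "pressure", "temp", "temp", "pressure", "temp"],
    "value":     [351.2, 0.98, 349.9, 352.0, 1.01, 350.4],
})

# Denormalize: pivot the normalized readings into one rectangular row
# per wafer, with one column per (step, sensor) measurement.
wide = readings.pivot_table(index="wafer_id",
                            columns=["step_id", "sensor_id"],
                            values="value")
wide.columns = [f"step{s}_{m}" for s, m in wide.columns]

# Match wafer-level attributes from another source system onto the
# rectangular table.
wafers = pd.DataFrame({"wafer_id": [101, 102], "lot": ["A7", "A7"]})
rect = wafers.merge(wide.reset_index(), on="wafer_id")
print(rect)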
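
The two column-reduction criteria from step 3 might be sketched as
follows; the 0.95 correlation threshold is an assumed value, as the
case study does not state the cutoff that was used.

import numpy as np
import pandas as pd

def reduce_columns(df, corr_threshold=0.95):
    # Criterion 1: drop unary columns -- a sensor that reports only
    # one value holds no information about the source of the defects.
    df = df.loc[:, df.nunique() > 1]

    # Criterion 2: multicollinearity -- for any pair of very highly
    # correlated sensor readings, keep one and drop the other; the
    # duplicate adds computational cost without adding predictive power.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=drop)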
In this case, the customer used the distributed file system of Hadoop
along with in-memory computational methods to solve the problem
within the business parameters. This process had been developed,
refined, and improved over several years, but until the adoption of
the latest distributed in-memory analytical techniques, a problem of
this size was considered infeasible. After adopting this new
infrastructure, the time to compute a correlation matrix of the
needed size went from hours down to just a few minutes, and similar
improvements were seen in the other analysis steps of an
investigation. The customer is very pleased with the opportunity it
now has to gain competitive advantage in high-tech manufacturing and
to further improve its process and its position in the market.
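
As an illustration of how a correlation matrix of this size might be
computed on a distributed, in-memory engine, the following sketch uses
Apache Spark. Spark is an assumption chosen for illustration, since
the chapter names only Hadoop and in-memory analytics rather than a
specific engine, and the file path and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("wafer-correlation").getOrCreate()

# Hypothetical location of the rectangular sensor table from step 2.
df = spark.read.parquet("hdfs:///wafers/rectangular_table")

# Pack every sensor column into a single vector column, then compute
# the Pearson correlation matrix in parallel across the cluster
# rather than on a single machine.
sensor_cols = [c for c in df.columns if c != "wafer_id"]
assembler = VectorAssembler(inputCols=sensor_cols, outputCol="features")
vectors = assembler.transform(df).select("features")
corr_matrix = Correlation.corr(vectors, "features").head()[0]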