
ACTIVE LEARNING

Navneet Goyal

Slides developed using material from:


1. Simon Tong. Active Learning: Theory and Applications. Ph.D. dissertation, Stanford University, August 2001.
2. Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.

Introduction
If I tell you that you can achieve better accuracy with less training, would you believe me?
NO!!
It is possible when the learning algorithm is:
Allowed to be curious
Allowed to choose the data from which it learns
It is possible with ACTIVE LEARNING!

Introduction
Majority of ML tasks fall under:
Supervised Learning (e.g., Classification)
Unsupervised Learning (e.g., Clustering & Model Building)
For all supervised & unsupervised learning tasks, we first need to gather a significant amount of data randomly sampled from the underlying population distribution
This is PASSIVE learning!!
So what is ACTIVE learning?

Passive Learning

Figure taken from Simon Tong's PhD thesis

Introduction
One of the most resource-intensive tasks is the gathering of data!
In most cases, we have limited resources for collecting data
Try to make the best use of these resources
Randomly collected data instances are independent & identically distributed (iid)
Can we guide the sampling process?

Introduction
In most cases, data is abundantly available
Mails, images, videos, songs, speeches, documents, ratings, tweets, etc.
Which of these are different from the others?
Mails & ratings
Labeled data is freely available
Others?
Labeled instances are very difficult, time consuming, & expensive to obtain

Introduction
Some Examples where labeled data is
hard to come by:
Speech Recognition
Document Classification
Image & Video annotation

Introduction
Speech Recognition
Accurate labeling of speech utterances is
extremely time consuming and requires
trained linguists
Annotation at the word level can take ten
times longer than the actual audio (e.g.,
one minute of speech takes ten minutes to
label), and annotating phonemes can take
400 times as long (e.g., nearly seven hours)
The problem is compounded for rare
languages or dialects

Labeling bottleneck
Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator)
The active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data
Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain

Introduction
Document classification
Large pool of unlabelled documents available
Randomly pick documents to be labeled manually
OR
Carefully choose (or query) documents from the pool that are to be labeled

Introduction
Parameter estimation and structure discovery tasks
Studying lung cancer in a medical setting
We have a preliminary list of the ages and smoking habits of possible candidates that we have the option of further examining
Ability/resources to give only a few people a thorough examination
Instead of randomly choosing a subset of the candidate population to examine, we may query for candidates that fit certain profiles (e.g., "We want to examine someone who is over 50 & who smokes")

Active Learning
We need not fix our desired queries in advance
Instead, we can choose our next query based upon the answers to our previous queries
The process of guiding the sampling process by querying for certain types of instances based upon the data that we have seen so far is called active learning

Active Learning

An active learner differs from a passive learner, which simply receives a random data set from the world and then outputs a classifier or model
Figure taken from Simon Tong's PhD thesis

Active Learning
An interesting analogy!
A passive learner is a student that gathers information by sitting and listening to a teacher, while an active learner is a student that asks the teacher questions, listens to the answers, and asks further questions based upon the teacher's response
This extra ability to adaptively query the world based upon past responses would allow an active learner to perform better than a passive learner

Active Learning
The core difference between an active learner and a passive learner is the ability to ask queries about the world based upon past queries and responses
The notion of what exactly a query is and what response it receives will depend upon the exact task at hand
The possibility of using active learning can arise naturally in a variety of domains & in several variants

Active Learning
The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns (to be "curious", if you will), it will perform better with less training
A curious student generally performs well!!
Do you agree??
You better agree and become a curious student

Active Learning
ML algorithms choose the training tuples from a large pool
What do they gain by doing so?
Improved Accuracy?
If YES, how?

Active Learning
Also called Query Learning in ML
Optimal Experiment Design in Statistics
By querying unlabelled data
What kind of queries?
How are queries formulated?
Query strategy frameworks
Active Learning provides more efficient and more accurate solutions as compared to Passive Learning

Some Motivating Examples*

Learning Threshold Functions
Consider first the task of learning a threshold function of a single variable.
A single-variable threshold function f_θ : ℝ → {−1, +1}, parametrized by a real threshold value θ ∈ ℝ, is defined by f_θ(x) = +1 if x ≥ θ, and f_θ(x) = −1 otherwise
*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Threshold Functions
Used for classifying univariate data (recall decision stump)
A passive learner will be presented with n labeled examples and will produce a predictor that minimizes the number of disagreements
That is, the learner could choose a threshold value θ ∈ ℝ such that:
|{1 ≤ i ≤ n : f_θ(x_i) ≠ y_i}| is minimized
*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Threshold Functions
For now, we assume that all of the labels actually correspond to some threshold function f_θ, so y_i = f_θ(x_i) for all 1 ≤ i ≤ n.
Therefore, the learner can easily find a threshold value with no disagreements with the given examples, i.e., a θ with |{1 ≤ i ≤ n : f_θ(x_i) ≠ y_i}| = 0
*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Threshold Functions
An active learner can also find a threshold value that has no disagreements with the (x_i, y_i), and it can do so after requesting just log₂ n of the labels!
Compare with binary search!!
For the target threshold θ:
if a requested label y_i is +1, then we can infer that θ ≤ x_i, and therefore y_j = +1 for all x_j ≥ x_i;
if y_i is −1, then θ > x_i, and therefore y_j = −1 for all x_j ≤ x_i.
Thus, one can simply choose to request the label of a point x_i at the median of the unlabeled points; this is guaranteed to result in an outcome that lets the learner label (for free) at least half of the other unlabeled points.

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010
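
A minimal Python sketch of this binary-search querying strategy (the function and helper names are illustrative, and it assumes the oracle's labels really are consistent with some threshold):

```python
# Active learning of a single-variable threshold function.
# Assumes the oracle's labels are consistent with some threshold t:
# label(x) = +1 if x >= t, else -1. Names are illustrative.

def learn_threshold(xs, query_label):
    """Find a consistent threshold with ~log2(n) label requests.

    xs          -- unlabeled points
    query_label -- callable returning +1 or -1 for a given point
    """
    xs = sorted(xs)
    lo, hi = 0, len(xs)            # threshold lies just before xs[hi]
    while lo < hi:
        mid = (lo + hi) // 2       # query the median of the ambiguous region
        if query_label(xs[mid]) == +1:
            hi = mid               # threshold is at or below xs[mid]
        else:
            lo = mid + 1           # threshold is above xs[mid]
    # All points xs[lo:] are +1 and all xs[:lo] are -1, labeled "for free"
    return xs[lo] if lo < len(xs) else float("inf")


# Example: true threshold 0.5, 16 points, only log2(16) = 4 labels requested
points = [i / 10 for i in range(-8, 8)]
print(learn_threshold(points, lambda x: +1 if x >= 0.5 else -1))  # 0.5
```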

Some Motivating Examples*

Learning Threshold Functions
The strategy for learning single-variable threshold functions represents a best-case scenario for active learning: just log₂ n label requests are needed to deduce all of the n labels
What aspects of the learning problem made this possible?
At any point in the interactive process, the active learner could always make a query (label request) that results in labeling (for free) at least half of the other unlabeled points. Viewed another way, the query eliminates at least half of the potential classifiers still in contention.
We crucially made the assumption that the labels y_i = f_θ(x_i) correspond to some threshold function f_θ
Unfortunately, these aspects do not always carry over to other learning problems: there need not always be queries that provide the information needed for a binary-search-like process, even when the labels perfectly correspond to a simple function.

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Interval Functions (Do it Yourself)
Even in the case where the labels correspond exactly to some interval function f_{a,b}, the active learner may need to request all labels in order to distinguish between intervals that include any particular x_i (i.e., one for which f_{a,b}(x_i) = +1), and an interval that includes none of the x_i (i.e., one for which f_{a,b}(x_i) = −1 for all 1 ≤ i ≤ n) [Das05].

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010
Das05: S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.

Some Motivating Examples*

Learning Interval Functions (Do it Yourself)
Consider the following two-phase strategy for learning a single-variable interval function f_{a,b}, also described in [Das05]:
Phase 1: Request the label of a randomly chosen x_i until some y_i is found such that y_i = +1. If no y_i = +1, then return the empty interval function.
Phase 2: Use the binary-search-like procedure for learning single-variable threshold functions to determine the interval boundaries a and b, and return f_{a,b}.
The crucial observation behind this algorithm is that an interval function can be described by two single-variable threshold functions (a sketch follows below)

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010
Das05: S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.
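
A minimal Python sketch of this two-phase strategy (names are illustrative; it assumes the oracle's labels are consistent with some interval function):

```python
import random

def learn_interval(xs, query_label, seed=0):
    """Two-phase sketch for an interval function f_{a,b}
    (f_{a,b}(x) = +1 iff a <= x <= b)."""
    xs = sorted(xs)
    rng = random.Random(seed)
    labeled = {}

    def ask(x):                        # cache labels: never query a point twice
        if x not in labeled:
            labeled[x] = query_label(x)
        return labeled[x]

    # Phase 1: random queries until some positive point is found
    order = xs[:]
    rng.shuffle(order)
    pos = next((x for x in order if ask(x) == +1), None)
    if pos is None:
        return None                    # empty interval function

    def first_index(points, is_match):
        """Smallest index whose point satisfies is_match, assuming the
        matching points form a contiguous suffix of the sorted list."""
        lo, hi = 0, len(points)
        while lo < hi:
            mid = (lo + hi) // 2
            if is_match(points[mid]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Phase 2: two binary searches outward from the positive point
    left = [x for x in xs if x < pos]   # labels look like -,...,-,+,...,+
    right = [x for x in xs if x > pos]  # labels look like +,...,+,-,...,-
    a_idx = first_index(left, lambda x: ask(x) == +1)
    b_idx = first_index(right, lambda x: ask(x) == -1)
    a = left[a_idx] if a_idx < len(left) else pos
    b = right[b_idx - 1] if b_idx > 0 else pos
    return a, b


# Example: target interval [0.25, 0.65] over 20 points
pts = [i / 20 for i in range(20)]
print(learn_interval(pts, lambda x: +1 if 0.25 <= x <= 0.65 else -1))
```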

Some Motivating Examples*

Learning Interval Functions (Do it Yourself)
The crucial observation behind this algorithm is that an interval function can be described by two single-variable threshold functions:
The binary search for b pretends that all points to the left of the positive point x_i have a negative label; the binary search for a is similar.
The first phase of the algorithm is certainly not like binary search, but it serves the useful purpose of identifying a starting point for binary search in the second phase.
In the worst case, the algorithm may end up querying every label before transitioning into this second phase.
But if a significant fraction of the points are labeled +1 by f_{a,b}, then the first phase ends quickly.

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Types of Active Learning

Largely falls into one of these three types:
Membership Query Synthesis
Learner constructs examples for labeling
Stream-Based Active Learning
Consider one unlabeled example at a time
Decide whether to query its label or ignore it
Pool-Based Active Learning
Given: a large unlabeled pool of examples
Rank examples in order of informativeness
Query the labels for the most informative example(s)

Active Learning Scenarios

Figure taken from Burr Settles' article

Membership Query Synthesis

One of the earliest AL scenarios (Angluin 1988)
The learner may request labels for any unlabeled instance in the input space, including (and typically assuming) queries that the learner generates de novo, rather than those sampled from some underlying natural distribution
D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.

Membership Query Synthesis

Query synthesis is reasonable for many problems
But labeling such arbitrary instances can be awkward if the oracle is a human annotator
E.g., using human oracles to train an ANN to classify handwritten characters:
Many of the query images generated by the learner contained no recognizable symbols, only artificial hybrid characters with no semantic meaning

Membership Query Synthesis

Membership queries for NLP tasks might create streams of text or speech that amount to gibberish
Proposed solutions:
Stream-based scenario
Pool-based scenario

Membership Query Synthesis

Innovative Application
A Robot Scientist executes a series of autonomous biological experiments to discover metabolic pathways in yeast
An instance is a mixture of chemical solutions that constitutes a growth medium, together with a particular yeast mutant
The label is whether or not the mutant thrived in the growth medium
All experiments were autonomously synthesized and physically performed using a laboratory robot
3-fold decrease in cost

Membership Query Synthesis

In domains where labels come not from human annotators but from experiments such as this, query synthesis may be a promising direction for automated scientific discovery

Types of Active Learning


Stream-Based Active Learning

Figure: Slides of Piyush Rai, CS5350/6350: Machine Learning

Stream-based selective sampling

Alternative to synthesizing queries
Obtaining an unlabeled instance is free or inexpensive
The instance is first sampled from the actual distribution, and the learner then decides whether or not to request its label

Stream-based selective sampling

How to decide whether or not to query an instance?
Informativeness measure or query strategy
Region of uncertainty
The part of the instance space that is still ambiguous to the learner
Query only those instances that fall in the region
Example applications: part-of-speech tagging, learning ranking functions for IR, word sense disambiguation

Types of Active Learning


Pool-Based Active Learning

Figure: Slides of Piyush Rai, CS5350/6350: Machine Learning

Pool-based Active Learning


Starts with a small labeled training set
Requests labels for 1 or more carefully selected instances
Focus on difficult-to-label tuples
Analogy with Boosting?
Focus on the most informative instances
Greedy approach?
Uses new knowledge to choose which instances to query next
Newly labeled instances are added to the labeled set

Pool-based sampling

In many real-world problems, large collections of unlabelled data, U, can be gathered at once
Small set of labeled data, L
U is assumed to be closed (static)
Instances are queried in a greedy manner according to an informativeness measure
Text classification, image/video classification and retrieval, speech recognition, and cancer diagnosis are examples of pool-based sampling

Pool-based sampling

Main difference with stream-based:
Stream-based: scans through the data sequentially and makes query decisions individually
Pool-based: evaluates and ranks the entire collection before selecting the best query
Pool-based scenarios are more common!
Settings where stream-based is more appropriate??
When memory or processing power is limited, as with mobile and embedded devices
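
A minimal sketch contrasting the two selection loops (informativeness and oracle are hypothetical stand-ins for a real query strategy and a human or automated annotator):

```python
def stream_based(stream, informativeness, oracle, threshold=0.5):
    """Scan instances one at a time; query only sufficiently informative ones."""
    labeled = []
    for x in stream:
        if informativeness(x) > threshold:     # per-instance decision, then move on
            labeled.append((x, oracle(x)))
    return labeled

def pool_based(pool, informativeness, oracle, budget=10):
    """Rank the whole pool and greedily query the single best instance per round."""
    pool, labeled = list(pool), []
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=informativeness)  # evaluate and rank the entire pool
        pool.remove(best)
        labeled.append((best, oracle(best)))
    return labeled
```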

Potential of Active Learning

An illustrative example of pool-based active learning:
(a) A toy data set of 400 instances, evenly sampled from two class Gaussians centered at (-2,0) & (2,0) with standard deviation σ = 1
(b) A logistic regression model trained with 30 labeled instances randomly drawn (iid) from the problem domain (70% accuracy)
(c) A logistic regression model trained with 30 actively queried instances using uncertainty sampling (90% accuracy)
Figure taken from Burr Settles' article
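
A rough reconstruction of this toy comparison (a sketch only, assuming scikit-learn is available; the exact accuracies will vary with the random seed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((-2, 0), 1, (200, 2)),   # class 0 Gaussian
               rng.normal((2, 0), 1, (200, 2))])   # class 1 Gaussian
y = np.array([0] * 200 + [1] * 200)

def run(active, n_queries=30):
    # seed the model with one labeled instance from each class
    labeled = [int(rng.integers(0, 200)), int(rng.integers(200, 400))]
    pool = [i for i in range(len(X)) if i not in labeled]
    model = LogisticRegression()
    while len(labeled) < n_queries:
        model.fit(X[labeled], y[labeled])
        if active:   # uncertainty sampling: posterior closest to 0.5
            probs = model.predict_proba(X[pool])[:, 1]
            pick = pool[int(np.argmin(np.abs(probs - 0.5)))]
        else:        # passive learning: random sampling
            pick = int(rng.choice(pool))
        pool.remove(pick)
        labeled.append(pick)
    model.fit(X[labeled], y[labeled])
    return model.score(X, y)        # accuracy over the full data set

print("random:", run(active=False), "uncertainty:", run(active=True))
```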

Potential of Active Learning

Active learners use uncertainty sampling to focus on instances closest to the decision boundary
Something similar to what we do in SVMs?

Figure taken from Burr Settles' article

Document Classification

Learner has to distinguish between BASEBALL & HOCKEY documents
20 Newsgroups corpus
2000 Usenet documents, equally divided among the two classes

Document Classification

Learning curves: baseball vs. hockey.
Curves plot classification accuracy as a function of the number of documents queried for two selection strategies: uncertainty sampling (active learning) and random sampling (passive learning).
Figure taken from Burr Settles' article

Learning Curves
Active learning algorithms are evaluated by constructing learning curves
Evaluation metric (e.g., accuracy) as a function of the number of new instance queries that are labeled and added to the training set
Uncertainty sampling query strategy vs. random sampling

How Active Learning Works?

Active Learning proceeds in rounds
Each round has a current model (learned using the labeled data seen so far)
The current model is used to assess the informativeness of unlabeled examples, using one of the query selection strategies
The most informative example(s) is/are selected
The labels are obtained (from the labeling oracle)
The (now) labeled example(s) is/are included in the training data
The model is re-trained using the new training data
The process repeats as long as we have budget left for getting labels, or until we have attained the desired accuracy!
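
A generic sketch of this round-based loop (model, query_strategy, oracle, and evaluate are placeholders for whatever components a real system would use):

```python
def active_learning_loop(model, labeled, pool, oracle, query_strategy,
                         budget, target_accuracy, evaluate):
    """labeled is a list of (x, y) pairs; pool is a list of unlabeled x."""
    X = [x for x, _ in labeled]
    y = [lab for _, lab in labeled]
    model.fit(X, y)                            # current model from the seed labels
    while budget > 0 and evaluate(model) < target_accuracy:
        x_star = query_strategy(model, pool)   # most informative unlabeled example
        pool.remove(x_star)
        X.append(x_star)
        y.append(oracle(x_star))               # label obtained from the oracle
        model.fit(X, y)                        # re-train on the enlarged training set
        budget -= 1
    return model
```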

Query Selection Strategies

Any Active Learning algorithm requires a query selection strategy. Some examples:
Uncertainty Sampling
Query By Committee (QBC)
Expected Model Change
Expected Error Reduction
Variance Reduction
Density-Weighted Methods

Query Strategy Frameworks

All AL scenarios involve evaluating the informativeness of unlabeled instances
Many proposed solutions for formulating such query strategies
x*_A denotes the most informative instance (i.e., the best query) according to some query selection algorithm A

Uncertainty Sampling

[Lewis & Gale, 1994]
Query the event that the current classifier is most uncertain about
If uncertainty is measured as (Euclidean) distance to the decision boundary: query the instance closest to the boundary (see figure)
Used trivially in SVMs, graphical models, etc.

Figure courtesy: Irina Rish, IBM T.J. Watson Research Center

Uncertainty sampling

The active learner queries the instances about which it is least certain how to label
For a probabilistic model on a binary classification task, uncertainty sampling queries the instance whose posterior probability of being positive is closest to 0.5
For 3 or more class labels: query the instance whose most likely label has the lowest posterior probability (the least confident strategy)
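
For reference, the usual formulation of the least confident criterion (following Settles' survey, with θ denoting the model's parameters):

```latex
x^{*}_{LC} = \operatorname*{argmax}_{x}\; 1 - P_{\theta}(\hat{y} \mid x),
\qquad \hat{y} = \operatorname*{argmax}_{y} P_{\theta}(y \mid x)
```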

Uncertainty sampling

The least confident strategy only considers information about the most probable label
It throws away information about the remaining label distribution
Enter margin sampling: query the instance with the smallest margin between the posteriors of its two most probable labels (see the formula below)
Still not a good strategy for problems with large label sets
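
The standard margin sampling criterion (again following Settles' survey), where ŷ₁ and ŷ₂ are the first and second most probable labels under model θ:

```latex
x^{*}_{M} = \operatorname*{argmin}_{x}\; P_{\theta}(\hat{y}_{1} \mid x) - P_{\theta}(\hat{y}_{2} \mid x)
```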

Uncertainty sampling

Entropy as an uncertainty measure: query the instance whose predicted label distribution has the highest entropy
Reduces to least confident and margin sampling for binary classification problems
All 3 strategies are then equivalent: querying the instance with a class posterior closest to 0.5
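
A small Python sketch computing the three measures from a model's class-posterior vector; in the binary case all three peak at a posterior of (0.5, 0.5):

```python
import numpy as np

# Higher value = more uncertain = better query candidate.

def least_confident(p):
    return 1.0 - np.max(p)

def margin(p):
    top2 = np.sort(p)[-2:]
    return -(top2[1] - top2[0])     # negated: a small margin means high uncertainty

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

for p in [np.array([0.9, 0.1]), np.array([0.6, 0.4]), np.array([0.5, 0.5])]:
    print(p, least_confident(p), margin(p), round(float(entropy(p)), 3))
```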

Uncertainty sampling

Query by Committee (QBC)

The QBC approach involves maintaining a committee of models which are all trained on the current labeled data L, but represent competing hypotheses
Each committee member is allowed to vote on the labelings of query candidates
The most informative query is the one about which they most disagree

Query by Committee (QBC)

Minimize the version space
The version space is the region that is still unknown to the overall model class, i.e., the set of hypotheses that are consistent with the current labeled training data L
In other words, if any two models of the same model class (but different parameter settings) agree on all the labeled data, but disagree on some unlabeled instance, then that instance lies within the region of uncertainty

Query by Committee (QBC)

In ML, we search for the best model in the version space
In AL, we try to constrain the size of the version space as much as possible
Why?
So that the search can be more precise with as few labeled instances as possible

Query by Committee (QBC): Version Space

Query by Committee (QBC)

To implement the QBC algorithm, we must:
Be able to construct a committee of models that represent different regions of the version space
Have some measure of disagreement among committee members

Query by Committee (QBC)

Construction of the committee of models:
Boosting & Bagging

Query by Committee (QBC)

Measure of disagreement:
Vote Entropy
a QBC generalization of entropy-based uncertainty sampling (see the sketch below)
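
A minimal query-by-bagging sketch, assuming scikit-learn: a bagged committee votes on every pool instance and the instance with the highest vote entropy is chosen as the next query:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def qbc_select(X_labeled, y_labeled, X_pool, n_members=5, seed=0):
    """Return the index of the pool instance with the highest vote entropy."""
    committee = BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=n_members, random_state=seed)
    committee.fit(X_labeled, y_labeled)
    # One row of predicted labels per committee member
    votes = np.array([m.predict(X_pool) for m in committee.estimators_])

    def vote_entropy(column):
        _, counts = np.unique(column, return_counts=True)
        p = counts / n_members
        return -(p * np.log(p)).sum()

    disagreement = [vote_entropy(votes[:, i]) for i in range(len(X_pool))]
    return int(np.argmax(disagreement))   # the most disputed instance
```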

Query by Committee

[Seung et al. 1992, Freund et al. 1997]
Prior distribution over hypotheses
Samples a set of classifiers from the distribution
Queries an example based on the degree of disagreement between the committee of classifiers

Figure courtesy: Irina Rish, IBM T.J. Watson Research Center

Query by Committee

Which unlabelled point should you choose?

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee


Yellow = valid hypotheses

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee

A point on the max-margin hyperplane does not reduce the number of valid hypotheses by much

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee

Queries an example based on the degree of disagreement between the committee of classifiers

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee

Prior distribution over classifiers/hypotheses
Sample a set of classifiers from the distribution
Natural for ensemble methods, which are already samples
Random forests, bagged classifiers, etc.
Measures of disagreement:
Entropy of predicted responses

Web Searching
A web-based company wishes to gather particular types of pages (e.g., pages containing lists of people's publications). It employs a number of people to hand-label some web pages so as to create a training set for an automatic classifier that will eventually be used to classify and extract pages from the rest of the web.
Since human expertise is a limited resource, the company wishes to reduce the number of pages the employees have to label. Rather than labeling pages randomly drawn from the web, the computer uses active learning to request targeted pages that it believes will be most informative to label.

Personalized Email Filter
The user wishes to create a personalized automatic junk email filter
In the learning phase the automatic learner has access to the user's past email files.
Using active learning, it interactively brings up a past email and asks the user whether the displayed email is junk mail or not. Based on the user's answer it brings up another email and queries the user.
The process is repeated some number of times and the result is an email filter tailored to that specific person.

Relevance feedback
The user wishes to sort through a database/website for items (images, articles, etc.) that are of personal interest; an "I'll know it when I see it" type of search
The computer displays an item and the user tells the learner whether the item is interesting or not
Based on the user's answer the learner brings up another item from the database. After some number of queries the learner then returns a number of items in the database that it believes will be of interest to the user.

Active Learning
Happy ACTIVE LEARNING from now on!!
