
ACTIVE LEARNING

Navneet Goyal

Slides developed using material from:


1. Simon Tong. Active Learning: Theory and Applications. Ph.D. dissertation, Stanford University, August 2001.
2. Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.

Introduction
If I tell you that you can achieve better accuracy with less training, would you believe me?
NO!!
It is possible when the learning algorithm is:
Allowed to be curious
Allowed to choose the data from which it learns
It is possible with ACTIVE LEARNING!

Introduction
Majority of ML tasks fall under:
Supervised Learning (e.g., Classification)
Unsupervised Learning (e.g., Clustering & Model Building)
For all supervised & unsupervised learning tasks, we first need to gather a significant amount of data randomly sampled from the underlying population distribution
This is PASSIVE learning!!
So what is ACTIVE learning?

Passive Learning

Figure taken from Simon Tong's PhD thesis

Introduction
One of the most resource-intensive tasks is the gathering of data!
In most cases, we have limited resources for collecting data
Try to make the best use of these resources
Randomly collected data instances are independent & identically distributed (iid)
Can we guide the sampling process?

Introduction
In most cases, data is abundantly available
Mails, images, videos, songs, speeches, documents, ratings, tweets, etc.
Which of these are different from the others?
Mails & ratings
Labeled data is freely available
Others?
Labeled instances are very difficult, time consuming, & expensive to obtain

Introduction
Some Examples where labeled data is
hard to come by:
Speech Recognition
Document Classification
Image & Video annotation

Introduction
Speech Recognition
Accurate labeling of speech utterances is
extremely time consuming and requires
trained linguists
Annotation at the word level can take ten
times longer than the actual audio (e.g.,
one minute of speech takes ten minutes to
label), and annotating phonemes can take
400 times as long (e.g., nearly seven hours)
The problem is compounded for rare
languages or dialects

Labeling bottleneck
Active learning systems attempt to overcome the labeling bottleneck by asking queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human annotator)
The active learner aims to achieve high accuracy using as few labeled instances as possible, thereby minimizing the cost of obtaining labeled data
Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to obtain

Introduction
Document classification
Large pool of unlabelled documents available
Randomly pick documents to be labeled manually
OR
Carefully choose (or query) documents from the pool that are to be labeled

Introduction
Parameter estimation and structure discovery tasks
Studying lung cancer in a medical setting
We have a preliminary list of the ages and smoking habits of possible candidates that we have the option of further examining
Ability/resources to give only a few people a thorough examination
Instead of randomly choosing a subset of the candidate population to examine, we may query for candidates that fit certain profiles (e.g., "We want to examine someone who is over 50 & who smokes")

Active Learning
We need not fix our desired queries in advance
Instead, we can choose our next query based upon the answers to our previous queries
The process of guiding the sampling process by querying for certain types of instances based upon the data that we have seen so far is called active learning

Active Learning

An active learner differs from a passive learner, which simply receives a random data set from the world and then outputs a classifier or model
Figure taken from Simon Tong's PhD thesis

Active Learning
An interesting analogy!
A passive learner is a student that gathers information by sitting and listening to a teacher, while an active learner is a student that asks the teacher questions, listens to the answers, and asks further questions based upon the teacher's response
This extra ability to adaptively query the world based upon past responses would allow an active learner to perform better than a passive learner

Active Learning
The core difference between an active learner and a passive learner is the ability to ask queries about the world based upon past queries and responses
The notion of what exactly a query is and what response it receives will depend upon the exact task at hand
The possibility of using active learning can arise naturally in a variety of domains & in several variants

Active Learning
The key hypothesis is that if the learning algorithm is allowed to choose the data from which it learns (to be "curious", if you will), it will perform better with less training
A curious student generally performs well!!
Do you agree??
You better agree and become a curious student

Active Learning
ML algorithms choose the training tuples from a large pool
What do they gain by doing so?
Improved Accuracy?
If YES, how?

Active Learning
Also called Query Learning in ML
Optimal Experiment Design in Statistics
By querying unlabelled data
What kind of queries?
How are queries formulated?
Query strategy frameworks
Active Learning provides more efficient and more accurate solutions as compared to Passive Learning

Some Motivating Examples*

Learning Threshold Functions
Consider first the task of learning a threshold function of a single variable.
A single-variable threshold function f_θ : ℝ → {−1, +1}, parametrized by a real threshold value θ ∈ ℝ, is defined by f_θ(x) = +1 if x ≥ θ, and f_θ(x) = −1 otherwise
*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Threshold Functions
Used for classifying univariate data (recall decision stump)
A passive learner will be presented with n labeled examples and will produce a predictor that minimizes the number of disagreements
That is, the learner could choose a threshold value θ ∈ ℝ such that:
|{1 ≤ i ≤ n : f_θ(x_i) ≠ y_i}| is minimized
*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Threshold Functions
For now, we assume that all of the labels actually correspond to some threshold function f_θ, so y_i = f_θ(x_i) for all 1 ≤ i ≤ n.
Therefore, the learner can easily find a threshold value with no disagreements with the given examples, i.e., a θ with |{1 ≤ i ≤ n : f_θ(x_i) ≠ y_i}| = 0
*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Threshold Functions
An active learner can also find a threshold value that has no disagreements with the (x_i, y_i), and it can do so after requesting just log₂ n of the labels!
Compare with binary search!!
For the target threshold θ:
if a requested label y_i is +1, then we can infer that θ ≤ x_i, and therefore y_j = +1 for all x_j ≥ x_i;
if y_i is −1, then θ > x_i, and therefore y_j = −1 for all x_j ≤ x_i.
Thus, one can simply choose to request the label of a point x_i at the median of the unlabeled points; this is guaranteed to result in an outcome that lets the learner label (for free) at least half of the other unlabeled points.

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010
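
A minimal Python sketch of this binary-search querying strategy (the function and helper names are illustrative, and it assumes the oracle's labels really are consistent with some threshold):

```python
# Active learning of a single-variable threshold function.
# Assumes the oracle's labels are consistent with some threshold t:
# label(x) = +1 if x >= t, else -1. Names are illustrative.

def learn_threshold(xs, query_label):
    """Find a consistent threshold with ~log2(n) label requests.

    xs          -- unlabeled points
    query_label -- callable returning +1 or -1 for a given point
    """
    xs = sorted(xs)
    lo, hi = 0, len(xs)            # threshold lies just before xs[hi]
    while lo < hi:
        mid = (lo + hi) // 2       # query the median of the ambiguous region
        if query_label(xs[mid]) == +1:
            hi = mid               # threshold is at or below xs[mid]
        else:
            lo = mid + 1           # threshold is above xs[mid]
    # All points xs[lo:] are +1 and all xs[:lo] are -1, labeled "for free"
    return xs[lo] if lo < len(xs) else float("inf")


# Example: true threshold 0.5, 16 points, only log2(16) = 4 labels requested
points = [i / 10 for i in range(-8, 8)]
print(learn_threshold(points, lambda x: +1 if x >= 0.5 else -1))  # 0.5
```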

Some Motivating Examples*

Learning Threshold Functions
The strategy for learning single-variable threshold functions represents a best-case scenario for active learning: just log₂ n label requests are needed to deduce all of the n labels
What aspects of the learning problem made this possible?
At any point in the interactive process, the active learner could always make a query (label request) that results in labeling (for free) at least half of the other unlabeled points. Viewed another way, the query eliminates at least half of the potential classifiers still in contention.
We crucially made the assumption that the labels y_i = f_θ(x_i) correspond to some threshold function f_θ
Unfortunately, these aspects do not always carry over to other learning problems: there need not always be queries that provide the information needed for a binary-search-like process, even when the labels perfectly correspond to a simple function.

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Some Motivating Examples*

Learning Interval Functions (Do it Yourself)
Even in the case where the labels correspond exactly to some interval function f_{a,b}, the active learner may need to request all labels in order to distinguish between intervals that include any particular x_i (i.e., one for which f_{a,b}(x_i) = +1), and an interval that includes none of the x_i (i.e., one for which f_{a,b}(x_i) = −1 for all 1 ≤ i ≤ n) [Das05].

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010
Das05: S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.

Some Motivating Examples*

Learning Interval Functions (Do it Yourself)
Consider the following two-phase strategy for learning a single-variable interval function f_{a,b}, also described in [Das05]:
Phase 1: Request the label of a randomly chosen x_i until some y_i is found such that y_i = +1. If no y_i = +1, then return the empty interval function.
Phase 2: Use the binary-search-like procedure for learning single-variable threshold functions to determine the interval boundaries a and b, and return f_{a,b}.
The crucial observation behind this algorithm is that an interval function can be described by two single-variable threshold functions (a sketch follows below)

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010
Das05: S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems, 2005.
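
A minimal Python sketch of this two-phase strategy (names are illustrative; it assumes the oracle's labels are consistent with some interval function):

```python
import random

def learn_interval(xs, query_label, seed=0):
    """Two-phase sketch for an interval function f_{a,b}
    (f_{a,b}(x) = +1 iff a <= x <= b)."""
    xs = sorted(xs)
    rng = random.Random(seed)
    labeled = {}

    def ask(x):                        # cache labels: never query a point twice
        if x not in labeled:
            labeled[x] = query_label(x)
        return labeled[x]

    # Phase 1: random queries until some positive point is found
    order = xs[:]
    rng.shuffle(order)
    pos = next((x for x in order if ask(x) == +1), None)
    if pos is None:
        return None                    # empty interval function

    def first_index(points, is_match):
        """Smallest index whose point satisfies is_match, assuming the
        matching points form a contiguous suffix of the sorted list."""
        lo, hi = 0, len(points)
        while lo < hi:
            mid = (lo + hi) // 2
            if is_match(points[mid]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Phase 2: two binary searches outward from the positive point
    left = [x for x in xs if x < pos]   # labels look like -,...,-,+,...,+
    right = [x for x in xs if x > pos]  # labels look like +,...,+,-,...,-
    a_idx = first_index(left, lambda x: ask(x) == +1)
    b_idx = first_index(right, lambda x: ask(x) == -1)
    a = left[a_idx] if a_idx < len(left) else pos
    b = right[b_idx - 1] if b_idx > 0 else pos
    return a, b


# Example: target interval [0.25, 0.65] over 20 points
pts = [i / 20 for i in range(20)]
print(learn_interval(pts, lambda x: +1 if 0.25 <= x <= 0.65 else -1))
```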

Some Motivating Examples*

Learning Interval Functions (Do it Yourself)
The crucial observation behind this algorithm is that an interval function can be described by two single-variable threshold functions:
The binary search for b pretends that all points to the left of the positive point x_i have a negative label; the binary search for a is similar.
The first phase of the algorithm is certainly not like binary search, but it serves the useful purpose of identifying a starting point for binary search in the second phase.
In the worst case, the algorithm may end up querying every label before transitioning into this second phase.
But if a significant fraction of the points are labeled +1 by f_{a,b}, then the first phase ends quickly.

*Algorithms for Active Learning, Daniel Joseph Hsu, Columbia Univ. Dissertation, 2010

Types of Active Learning

Largely falls into one of these three types:
Membership Query Synthesis
Learner constructs examples for labeling
Stream-Based Active Learning
Consider one unlabeled example at a time
Decide whether to query its label or ignore it
Pool-Based Active Learning
Given: a large unlabeled pool of examples
Rank examples in order of informativeness
Query the labels for the most informative example(s)

Active Learning Scenarios

Figure taken from Burr Settles' article

Membership Query Synthesis

One of the earliest AL scenarios (Angluin 1988)
The learner may request labels for any unlabeled instance in the input space, including (and typically assuming) queries that the learner generates de novo, rather than those sampled from some underlying natural distribution
D. Angluin. Queries and concept learning. Machine Learning, 2:319-342, 1988.

Membership Query Synthesis

Query synthesis is reasonable for many problems
But labeling such arbitrary instances can be awkward if the oracle is a human annotator
E.g., using human oracles to train an ANN to classify handwritten characters:
Many of the query images generated by the learner contained no recognizable symbols, only artificial hybrid characters with no semantic meaning

Membership Query Synthesis

Membership queries for NLP tasks might create streams of text or speech that amount to gibberish
Proposed solutions:
Stream-based scenario
Pool-based scenario

Membership Query Synthesis

Innovative Application
A Robot Scientist executes a series of autonomous biological experiments to discover metabolic pathways in yeast
An instance is a mixture of chemical solutions that constitutes a growth medium, together with a particular yeast mutant
The label is whether or not the mutant thrived in the growth medium
All experiments were autonomously synthesized and physically performed using a laboratory robot
3-fold decrease in cost

Membership Query Synthesis

In domains where labels come not from human annotators but from experiments such as this, query synthesis may be a promising direction for automated scientific discovery

Types of Active Learning


Stream-Based Active Learning

Figure: Slides of Piyush Rai, CS5350/6350: Machine Learning

Stream-based selective sampling

Alternative to synthesizing queries
Obtaining an unlabeled instance is free or inexpensive
The instance is first sampled from the actual distribution, and the learner then decides whether or not to request its label

Stream-based selective sampling

How to decide whether or not to query an instance?
Informativeness measure or query strategy
Region of uncertainty
The part of the instance space that is still ambiguous to the learner
Query only those instances that fall in the region
Example applications: part-of-speech tagging, learning ranking functions for IR, word sense disambiguation

Types of Active Learning


Pool-Based Active Learning

Figure: Slides of Piyush Rai, CS5350/6350: Machine Learning

Pool-based Active Learning


Starts with a small labeled training set
Requests labels for 1 or more carefully selected instances
Focus on difficult-to-label tuples
Analogy with Boosting?
Focus on the most informative instances
Greedy approach?
Uses new knowledge to choose which instances to query next
Newly labeled instances are added to the labeled set

Pool-based sampling

In many real-world problems, large collections of unlabelled data, U, can be gathered at once
Small set of labeled data, L
U is assumed to be closed (static)
Instances are queried in a greedy manner according to an informativeness measure
Text classification, image/video classification and retrieval, speech recognition, and cancer diagnosis are examples of pool-based sampling

Pool-based sampling

Main difference with stream-based:
Stream-based: scans through the data sequentially and makes query decisions individually
Pool-based: evaluates and ranks the entire collection before selecting the best query
Pool-based scenarios are more common!
Settings where stream-based is more appropriate??
When memory or processing power is limited, as with mobile and embedded devices
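
A minimal sketch contrasting the two selection loops (informativeness and oracle are hypothetical stand-ins for a real query strategy and a human or automated annotator):

```python
def stream_based(stream, informativeness, oracle, threshold=0.5):
    """Scan instances one at a time; query only sufficiently informative ones."""
    labeled = []
    for x in stream:
        if informativeness(x) > threshold:     # per-instance decision, then move on
            labeled.append((x, oracle(x)))
    return labeled

def pool_based(pool, informativeness, oracle, budget=10):
    """Rank the whole pool and greedily query the single best instance per round."""
    pool, labeled = list(pool), []
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=informativeness)  # evaluate and rank the entire pool
        pool.remove(best)
        labeled.append((best, oracle(best)))
    return labeled
```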

Potential of Active Learning

An illustrative example of pool-based active learning:
(a) A toy data set of 400 instances, evenly sampled from two class Gaussians centered at (-2,0) & (2,0) with standard deviation σ = 1
(b) A logistic regression model trained with 30 labeled instances randomly drawn (iid) from the problem domain (70% accuracy)
(c) A logistic regression model trained with 30 actively queried instances using uncertainty sampling (90% accuracy)
Figure taken from Burr Settles' article
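
A rough reconstruction of this toy comparison (a sketch only, assuming scikit-learn is available; the exact accuracies will vary with the random seed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((-2, 0), 1, (200, 2)),   # class 0 Gaussian
               rng.normal((2, 0), 1, (200, 2))])   # class 1 Gaussian
y = np.array([0] * 200 + [1] * 200)

def run(active, n_queries=30):
    # seed the model with one labeled instance from each class
    labeled = [int(rng.integers(0, 200)), int(rng.integers(200, 400))]
    pool = [i for i in range(len(X)) if i not in labeled]
    model = LogisticRegression()
    while len(labeled) < n_queries:
        model.fit(X[labeled], y[labeled])
        if active:   # uncertainty sampling: posterior closest to 0.5
            probs = model.predict_proba(X[pool])[:, 1]
            pick = pool[int(np.argmin(np.abs(probs - 0.5)))]
        else:        # passive learning: random sampling
            pick = int(rng.choice(pool))
        pool.remove(pick)
        labeled.append(pick)
    model.fit(X[labeled], y[labeled])
    return model.score(X, y)        # accuracy over the full data set

print("random:", run(active=False), "uncertainty:", run(active=True))
```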

Potential of Active Learning

Active learners use uncertainty sampling to focus on instances closest to the decision boundary
Something similar to what we do in SVMs?

Figure taken from Burr Settles' article

Document Classification

Learner has to distinguish between BASEBALL & HOCKEY documents
20 Newsgroups corpus
2000 Usenet documents, equally divided among the two classes

Document Classification

Learning curves: baseball vs. hockey.
Curves plot classification accuracy as a function of the number of documents queried for two selection strategies: uncertainty sampling (active learning) and random sampling (passive learning).
Figure taken from Burr Settles' article

Learning Curves
Active learning algorithms are evaluated by constructing learning curves
Evaluation metric (e.g., accuracy) as a function of the number of new instance queries that are labeled and added to the training set
Uncertainty sampling query strategy vs. random sampling

How Active Learning Works?

Active Learning proceeds in rounds
Each round has a current model (learned using the labeled data seen so far)
The current model is used to assess the informativeness of unlabeled examples, using one of the query selection strategies
The most informative example(s) is/are selected
The labels are obtained (from the labeling oracle)
The (now) labeled example(s) is/are included in the training data
The model is re-trained using the new training data
The process repeats as long as we have budget left for getting labels, or until we have attained the desired accuracy!
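
A generic sketch of this round-based loop (model, query_strategy, oracle, and evaluate are placeholders for whatever components a real system would use):

```python
def active_learning_loop(model, labeled, pool, oracle, query_strategy,
                         budget, target_accuracy, evaluate):
    """labeled is a list of (x, y) pairs; pool is a list of unlabeled x."""
    X = [x for x, _ in labeled]
    y = [lab for _, lab in labeled]
    model.fit(X, y)                            # current model from the seed labels
    while budget > 0 and evaluate(model) < target_accuracy:
        x_star = query_strategy(model, pool)   # most informative unlabeled example
        pool.remove(x_star)
        X.append(x_star)
        y.append(oracle(x_star))               # label obtained from the oracle
        model.fit(X, y)                        # re-train on the enlarged training set
        budget -= 1
    return model
```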

Query Selection Strategies

Any Active Learning algorithm requires a query selection strategy. Some examples:
Uncertainty Sampling
Query By Committee (QBC)
Expected Model Change
Expected Error Reduction
Variance Reduction
Density-Weighted Methods

Query Strategy Frameworks

All AL scenarios involve evaluating the informativeness of unlabeled instances
Many proposed solutions for formulating such query strategies
x*_A denotes the most informative instance (i.e., the best query) according to some query selection algorithm A

Uncertainty Sampling

[Lewis & Gale, 1994]
Query the event that the current classifier is most uncertain about
If uncertainty is measured as (Euclidean) distance to the decision boundary: query the instance closest to the boundary (see figure)
Used trivially in SVMs, graphical models, etc.

Figure courtesy: Irina Rish, IBM T.J. Watson Research Center

Uncertainty sampling

The active learner queries the instances about which it is least certain how to label
For a probabilistic model on a binary classification task, uncertainty sampling queries the instance whose posterior probability of being positive is closest to 0.5
For 3 or more class labels: query the instance whose most likely label has the lowest posterior probability (the least confident strategy)
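
For reference, the usual formulation of the least confident criterion (following Settles' survey, with θ denoting the model's parameters):

```latex
x^{*}_{LC} = \operatorname*{argmax}_{x}\; 1 - P_{\theta}(\hat{y} \mid x),
\qquad \hat{y} = \operatorname*{argmax}_{y} P_{\theta}(y \mid x)
```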

Uncertainty sampling

The least confident strategy only considers information about the most probable label
It throws away information about the remaining label distribution
Enter margin sampling: query the instance with the smallest margin between the posteriors of its two most probable labels (see the formula below)
Still not a good strategy for problems with large label sets
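
The standard margin sampling criterion (again following Settles' survey), where ŷ₁ and ŷ₂ are the first and second most probable labels under model θ:

```latex
x^{*}_{M} = \operatorname*{argmin}_{x}\; P_{\theta}(\hat{y}_{1} \mid x) - P_{\theta}(\hat{y}_{2} \mid x)
```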

Uncertainty sampling

Entropy as an uncertainty measure: query the instance whose predicted label distribution has the highest entropy
Reduces to least confident and margin sampling for binary classification problems
All 3 strategies are then equivalent: querying the instance with a class posterior closest to 0.5
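
A small Python sketch computing the three measures from a model's class-posterior vector; in the binary case all three peak at a posterior of (0.5, 0.5):

```python
import numpy as np

# Higher value = more uncertain = better query candidate.

def least_confident(p):
    return 1.0 - np.max(p)

def margin(p):
    top2 = np.sort(p)[-2:]
    return -(top2[1] - top2[0])     # negated: a small margin means high uncertainty

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

for p in [np.array([0.9, 0.1]), np.array([0.6, 0.4]), np.array([0.5, 0.5])]:
    print(p, least_confident(p), margin(p), round(float(entropy(p)), 3))
```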

Uncertainty sampling

Query by Committee (QBC)

The QBC approach involves maintaining a committee of models which are all trained on the current labeled data L, but represent competing hypotheses
Each committee member is allowed to vote on the labelings of query candidates
The most informative query is the one about which they most disagree

Query by Committee (QBC)

Minimize the version space
The version space is the region that is still unknown to the overall model class, i.e., the set of hypotheses that are consistent with the current labeled training data L
In other words, if any two models of the same model class (but different parameter settings) agree on all the labeled data, but disagree on some unlabeled instance, then that instance lies within the region of uncertainty

Query by Committee (QBC)

In ML, we search for the best model in the version space
In AL, we try to constrain the size of the version space as much as possible
Why?
So that the search can be more precise with as few labeled instances as possible

Query by Committee (QBC): Version Space

Query by Committee (QBC)

To implement the QBC algorithm, we must:
Be able to construct a committee of models that represent different regions of the version space
Have some measure of disagreement among committee members

Query by Committee (QBC)

Construction of the committee of models:
Boosting & Bagging

Query by Committee (QBC)

Measure of disagreement:
Vote Entropy
a QBC generalization of entropy-based uncertainty sampling (see the sketch below)
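
A minimal query-by-bagging sketch, assuming scikit-learn: a bagged committee votes on every pool instance and the instance with the highest vote entropy is chosen as the next query:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def qbc_select(X_labeled, y_labeled, X_pool, n_members=5, seed=0):
    """Return the index of the pool instance with the highest vote entropy."""
    committee = BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=n_members, random_state=seed)
    committee.fit(X_labeled, y_labeled)
    # One row of predicted labels per committee member
    votes = np.array([m.predict(X_pool) for m in committee.estimators_])

    def vote_entropy(column):
        _, counts = np.unique(column, return_counts=True)
        p = counts / n_members
        return -(p * np.log(p)).sum()

    disagreement = [vote_entropy(votes[:, i]) for i in range(len(X_pool))]
    return int(np.argmax(disagreement))   # the most disputed instance
```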

Query by Committee

[Seung et al. 1992, Freund et al. 1997]
Prior distribution over hypotheses
Samples a set of classifiers from the distribution
Queries an example based on the degree of disagreement between the committee of classifiers

Figure courtesy: Irina Rish, IBM T.J. Watson Research Center

Query by Committee

Which unlabelled point should you choose?

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee


Yellow = valid hypotheses

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee

A point on the max-margin hyperplane does not reduce the number of valid hypotheses by much

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee

Queries an example based on the degree of disagreement between the committee of classifiers

Slides by Barbara Engelhardt and Alex Shyr

Query by Committee

Prior distribution over classifiers/hypotheses
Sample a set of classifiers from the distribution
Natural for ensemble methods, which are already samples
Random forests, bagged classifiers, etc.
Measures of disagreement:
Entropy of predicted responses

Web Searching
A web-based company wishes to gather particular types of pages (e.g., pages containing lists of people's publications). It employs a number of people to hand-label some web pages so as to create a training set for an automatic classifier that will eventually be used to classify and extract pages from the rest of the web.
Since human expertise is a limited resource, the company wishes to reduce the number of pages the employees have to label. Rather than labeling pages randomly drawn from the web, the computer uses active learning to request targeted pages that it believes will be most informative to label.

Personalized Email Filter
The user wishes to create a personalized automatic junk email filter
In the learning phase the automatic learner has access to the user's past email files.
Using active learning, it interactively brings up a past email and asks the user whether the displayed email is junk mail or not. Based on the user's answer it brings up another email and queries the user.
The process is repeated some number of times and the result is an email filter tailored to that specific person.

Relevance feedback
The user wishes to sort through a database/website for items (images, articles, etc.) that are of personal interest; an "I'll know it when I see it" type of search
The computer displays an item and the user tells the learner whether the item is interesting or not
Based on the user's answer the learner brings up another item from the database. After some number of queries the learner then returns a number of items in the database that it believes will be of interest to the user.

Active Learning
Happy ACTIVE LEARNING from now on!!
