Prepared by: Gabrielle Anderson
Index: Type bundle index here


Date: 4/17/02 DOC Library: Type library name here
Job Code: 320087 DOC Number: Type document number here

2nd CLASS Briefing


Reviewed by: Type reviewer name here
Review Date: Type review date here

Record of Interview

Purpose To find out how the CLASS namecheck system operates

Contact Method In-person meeting

Contact Place State Department, Consular Affairs Bureau

Contact Date March 27, 2002

Participants State:
Dave Williams
Cathy Baskay

GAO:
Judy McCloskey
Jody Woods
Kate Brentzel
Gabrielle Anderson
Richard Hung

We received an explanation of the principal techniques that are used in
the CLASS namecheck system. We also learned that the reason for the
failure of the original Al-Jiddi namecheck was most likely a country-
relationship table that did not take into account a possible country
association between Canada and Tunisia. Finally, we discussed the
resource issues that would be involved if biometrics were introduced into
the system.

Architectural Concepts of Name Searching

Mr. Williams informed us that there are five basic questions that need to
be answered when constructing a namecheck system.

The first question is whether or not each namecheck query will consult all
the records in the system. In the case of CLASS, there would clearly not
be enough capacity for all 6 million records in the system to be scanned
during one namecheck, especially with many namechecks being
conducted each day.

Therefore, the second question involves determining the criteria for
establishing which subset of these 6 million records is to be searched. This
determination is referred to as Phase I.

The third question to be answered concerns what techniques (linguistic
and logical) are to be used to evaluate that particular subset of records.
This is considered Phase II.

The fourth question concerns the criteria required to constitute a "hit,"
i.e., what is considered a close enough match to be returned as a hit. As
the subset of records is being evaluated, each record receives points
based on how close a match it is. How many points must a record receive
in order to be considered a legitimate hit?

The fifth question concerns the way in which the resulting hit list will be
ordered, e.g., with exact name matches first, or with CLASS I hits first?
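
The five questions above can be read as the stages of a small pipeline. The following Python sketch is illustrative only: the function names, the first-letter subset rule, and the position-by-position scoring rule are assumptions made up for this example and are not taken from CLASS.

```python
def namecheck(query, records, select_subset, score, hit_threshold):
    """Return (score, record) pairs that qualify as hits, best score first."""
    # Questions 1 and 2 (Phase I): narrow the full record set to a subset.
    subset = select_subset(query, records)
    # Question 3 (Phase II): evaluate every record in the subset.
    scored = [(score(query, rec), rec) for rec in subset]
    # Question 4: keep only records with enough points to count as a hit.
    hits = [(points, rec) for points, rec in scored if points >= hit_threshold]
    # Question 5: order the resulting hit list (here: highest score first).
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


def same_first_letter(query, records):
    """Toy Phase I rule (assumption): keep records sharing the first letter."""
    return [rec for rec in records if rec[:1] == query[:1]]


def positional_score(query, rec):
    """Toy Phase II rule (assumption): share of letters equal by position."""
    matches = sum(1 for a, b in zip(query, rec) if a == b)
    return matches / max(len(query), len(rec))


records = ["WILSON", "WILSEN", "WATSON", "SMITH"]
hits = namecheck("WILSON", records, same_first_letter, positional_score, 0.8)
print([rec for _, rec in hits])
```

Raising or lowering the threshold (0.8 here) changes how many records come back, which is the trade-off the fourth question describes.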

Namechecking Techniques

Mr. Williams ran through some of the techniques that may be used in
order to run a namecheck system. He stressed that no system will use just
one of these techniques and that each technique should be considered as a
tool. Since each technique has its strengths and weaknesses, a good
namecheck system will combine a variety of them in order to achieve the
best possible results. He also stressed that visa adjudication involves a
good deal of subjective decision-making on the part of the consular
officer.

Also, the information that is required when performing a CLASS
namecheck is surname and gender. Additional information for a
namecheck is preferred, but not required, e.g., estimated date of birth,
country of birth and first name.

Name Compression:
This technique takes the first letter of a surname, drops all its vowels and
reduces any double consonants to a single consonant. The system will
then return any surnames that fit this pattern. Its strength lies in the fact
that it is fast and precise. However, this technique produces near misses
if a surname is spelled slightly differently (e.g., reducing Gutierrez to
GTRZ would miss Gutierres, which would be compressed to GTRS). If, in
an attempt to account for these near misses, you were to require that the
system return all matches within one character, you would pull in far too
many hits (since there is a maximum of 6 characters in a compressed
name). Another weakness of this technique is that it does not work well
on short names, e.g., Lee.
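
A minimal Python sketch of the compression rule described above may make the near-miss problem concrete. It assumes that only A, E, I, O and U count as vowels and that doubled letters are collapsed before vowels are dropped. The function name is hypothetical, and the 6-character limit is the one mentioned above.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def compress_surname(surname, max_len=6):
    """Keep the first letter, collapse doubled letters, drop later vowels."""
    letters = [ch for ch in surname.upper() if ch.isalpha()]
    if not letters:
        return ""
    # Reduce any doubled letter (e.g. "RR") to a single letter.
    collapsed = [letters[0]]
    for ch in letters[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    # Keep the first letter as-is and drop vowels from the remainder.
    kept = [collapsed[0]] + [ch for ch in collapsed[1:] if ch not in VOWELS]
    return "".join(kept)[:max_len]


print(compress_surname("Gutierrez"))  # GTRZ
print(compress_surname("Gutierres"))  # GTRS
print(compress_surname("Lee"))        # L
```

The one-letter result for "Lee" illustrates why short names are a weak spot: under these assumptions, every short surname that starts with L and has no other consonants would share the same compressed form.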

Synonym Association:
This technique can be used with several of the namecheck fields. For
example, synonym association can be used to establish a relationship
between the name "Joe" and its derivations such as Joey, Jose, Joseph,
Giuseppe, etc. Thus, a search for Joe would turn up not only persons with
this exact first name but also those whose name was one of these
derivatives. In the case of country, Russia has been equated with all of the
former Soviet republics, so that a search for "Russia" will result in initial
hits on any of the current independent republics, e.g., Azerbaijan, Belarus,
Estonia, Georgia, etc. Additionally, the synonym association technique
ensures that surname qualifiers (e.g., Van, De, Al, etc.) are separated out
when namechecks are performed.
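
As a rough illustration of synonym association, the Python sketch below uses small lookup tables. The table contents are limited to the examples given above, and the table layout, the function names, and the rule of splitting qualifiers on spaces and hyphens are assumptions for this example only.

```python
# Illustrative tables only, populated from the examples in the text.
FIRST_NAME_SYNONYMS = {
    "JOE": {"JOE", "JOEY", "JOSE", "JOSEPH", "GIUSEPPE"},
}
COUNTRY_SYNONYMS = {
    "RUSSIA": {"RUSSIA", "AZERBAIJAN", "BELARUS", "ESTONIA", "GEORGIA"},
}
SURNAME_QUALIFIERS = {"VAN", "DE", "AL"}


def expand(value, table):
    """Return every value associated with the search term, itself included."""
    key = value.upper()
    return table.get(key, {key})


def split_qualifiers(surname):
    """Separate qualifiers such as 'Al' from the rest of the surname."""
    parts = surname.upper().replace("-", " ").split()
    qualifiers = [p for p in parts if p in SURNAME_QUALIFIERS]
    core = " ".join(p for p in parts if p not in SURNAME_QUALIFIERS)
    return qualifiers, core


print(sorted(expand("Joe", FIRST_NAME_SYNONYMS)))
print(split_qualifiers("Al-Jiddi"))
```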

N-Gram Analysis:
The bi-gram analysis breaks down a surname by two letters at a time. For
example, Gutierrez is broken down into "_G, GU, UT, TI, IE, ER, RR, RE,
EZ, Z_." This particular technique compares the bi-gram for the desired
name with the bi-grams for all other names in the data subset. At present
in CLASS, if half of the bi-grams in a particular name match the bi-grams
in the desired name, then this name is returned as a hit. However, the
level required to return a hit based upon this bi-gram analysis can be
changed. The same is true for tri-gram analysis, which is identical except
that it breaks down a surname into three-letter components. The strength
of the N-gram technique is that it is highly tunable, but its weakness lies in
the fact that it has a low level of discrimination. Hence, the N-gram
analysis is a coarse method, one that is used to develop subsections of
data rather than to produce the desired "hit."
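
The bi-gram comparison described above can be sketched in Python as follows. The function names are hypothetical, and dividing by the number of n-grams in the candidate name is an assumption about how "half of the bi-grams" is measured. The 0.5 threshold is the level quoted above.

```python
from collections import Counter


def ngrams(name, n=2, pad="_"):
    """Break a name into n-letter pieces, with one pad character at each end."""
    padded = pad + name.upper() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_match_fraction(desired, candidate, n=2):
    """Share of the candidate's n-grams that also occur in the desired name."""
    wanted = Counter(ngrams(desired, n))
    found = Counter(ngrams(candidate, n))
    total = sum(found.values())
    if total == 0:
        return 0.0
    shared = sum((wanted & found).values())
    return shared / total


def is_ngram_hit(desired, candidate, n=2, threshold=0.5):
    """Tunable test: the threshold can be raised or lowered."""
    return ngram_match_fraction(desired, candidate, n) >= threshold


print(ngrams("Gutierrez"))
print(is_ngram_hit("Wilson", "Sonils"))      # True: 4 of 7 bi-grams match
print(is_ngram_hit("Waldmirr", "Vladimir"))  # False: 3 of 9 bi-grams match
```

Passing n=3 gives the tri-gram variant, and changing the threshold shows the tunability the text describes.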

Position Discounting:
This technique allows you to determine how many of the bi-gram or tri-
gram hits fall into the same position as they do in the desired name. For
example, a namecheck on "Wilson," using a simple bi-gram analysis,
would return "Sonils" as a hit (since 4 of the 7 bi-grams in these names
match). However, when position discounting is used along with the bi-
gram analysis, "Sonils" is rejected as a hit, since none of the matching bi-
grams in "Sonils" occupy the same positions as they do in "Wilson."
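
The "Wilson"/"Sonils" example can be checked with a short Python sketch. It assumes the strictest possible form of position discounting, in which a bi-gram counts only if it sits at exactly the same index in both names. A production system might instead reduce the score gradually with the offset, which the description above does not specify. All function names are hypothetical.

```python
def bigrams(name):
    """Bi-grams of a name, with one pad character at each end."""
    padded = "_" + name.upper() + "_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def same_position_matches(desired, candidate):
    """Count bi-grams that match AND occupy the same position in both names."""
    pairs = zip(bigrams(desired), bigrams(candidate))
    return sum(1 for a, b in pairs if a == b)


def is_hit_with_position_discounting(desired, candidate, threshold=0.5):
    """Apply the bi-gram threshold to position-matched bi-grams only."""
    total = len(bigrams(candidate))
    return same_position_matches(desired, candidate) / total >= threshold


print(same_position_matches("Wilson", "Sonils"))             # 0
print(is_hit_with_position_discounting("Wilson", "Sonils"))  # False
print(is_hit_with_position_discounting("Wilson", "Wilsen"))  # True
```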

Component Comparison:
This technique assigns a value to surname endings based on the likelihood
that a surname with a particular ending belongs to someone from a
particular country. For example, the Russian surname ending in "-ichna"
is assigned a value of 0.93, indicating that there is a 93% likelihood that a
person whose surname ends in "-ichna" is from a Russian-speaking or
Slavic country. It is then clear that the most appropriate algorithm to use
is the Russian/Slavic algorithm.

Another component comparison technique to determine the appropriate
algorithm is the tri-gram probability table. In this table, all the possible tri-
gram combinations in the alphabet (from "_AA" to "ZZ_") are listed, along
with percentages that indicate to which linguistic algorithm they are likely
to belong. For example, with the tri-gram "MAS," there is a 38.5%
likelihood that a name containing this tri-gram will be Russian/Slavic and
a 46.9% likelihood that it will be Arabic. This is a tool to select
which algorithms to apply in each namecheck case.
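
The two lookups described above (surname endings and tri-gram probabilities) might be combined along the following lines. This Python sketch is an assumption-laden illustration: the tables hold only the two figures quoted above, and the 0.9 cut-off for endings, the rule of summing tri-gram percentages, and the function name are all invented for the example.

```python
# Figures quoted in the text; real tables would be far larger.
SUFFIX_TABLE = {"ICHNA": ("Russian/Slavic", 0.93)}
TRIGRAM_TABLE = {"MAS": {"Russian/Slavic": 0.385, "Arabic": 0.469}}


def select_algorithm(surname, min_suffix_likelihood=0.9, default="generic"):
    """Pick the linguistic algorithm that looks most appropriate for a name."""
    name = surname.upper()
    # Step 1: a sufficiently distinctive ending decides the matter outright.
    for suffix, (algorithm, likelihood) in SUFFIX_TABLE.items():
        if name.endswith(suffix) and likelihood >= min_suffix_likelihood:
            return algorithm
    # Step 2 (assumed rule): add up tri-gram likelihoods per algorithm.
    padded = "_" + name + "_"
    totals = {}
    for i in range(len(padded) - 2):
        for algorithm, likelihood in TRIGRAM_TABLE.get(padded[i:i + 3], {}).items():
            totals[algorithm] = totals.get(algorithm, 0.0) + likelihood
    if not totals:
        return default
    return max(totals, key=totals.get)


print(select_algorithm("Ivanovichna"))  # Russian/Slavic
print(select_algorithm("Masri"))        # Arabic
print(select_algorithm("Lee"))          # generic
```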

Cultural Regularization:
This technique involves transliterating a name from its foreign alphabet
spelling into the many forms it could take using the Roman alphabet
(e.g., Qadafi, Khadafi, Cadhafi, etc.). This ensures that one spelling of
a name will turn up other versions of the same name, provided that all
possible spellings for that Arabic name have been entered.

Letter Based Re-Write Rules:
This is an alternative way of addressing the issue of names with multiple
transliterations. This technique tries to regularize all spellings of a name
into a single entry. It does so by assigning a standard spelling to the
phonetic sounds that make up the name. For example, the system will
convert Mafouz, Mahfoudh, and Mehfouth into Mahfouz for searching
purposes. Letter based re-write rules are currently being used for Arabic
names. Both the strength and the weakness of this technique lie in its
global reach. Although the technique prevents you from having to enter in
every possible spelling of a name, it is also likely to pull in a vast number
of hits (e.g., with Arabic or Hispanic names) precisely because the system
recognizes only one version of the name.
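
To show the mechanics of letter based re-write rules, the Python sketch below reduces the four spellings mentioned above to a single search key. The three rules are invented purely for this example and are not real Arabic transliteration rules or the rules used in CLASS. The key they produce ("MAFOUZ") is also an internal form, not the standard spelling "Mahfouz" mentioned above.

```python
import re

# Hypothetical rules, applied in order. Each is (regex pattern, replacement).
REWRITE_RULES = [
    (r"DH|TH", "Z"),           # treat "dh" and "th" as the same sound as "z"
    (r"E", "A"),               # fold "e" into "a"
    (r"(?<=[AEIOU])H", ""),    # drop an "h" that follows a vowel
]


def regularize(name):
    """Rewrite a name into one search key using the ordered rules above."""
    key = name.upper()
    for pattern, replacement in REWRITE_RULES:
        key = re.sub(pattern, replacement, key)
    return key


for spelling in ["Mafouz", "Mahfoudh", "Mehfouth", "Mahfouz"]:
    print(spelling, "->", regularize(spelling))
```

Because the rules apply to every name, unrelated names can collapse into the same key as well, which is the "global reach" weakness described above.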
Phonetic Transcription:
This particular technique assigns a phonetic spelling to every name, e.g.,
'Stephen' becomes 'Steven.' This is useful because, when presented with
unfamiliar names, people tend to spell phonetically. Many names received
from the intelligence community are spelled phonetically since they are
often names that are overheard. However, the use of phonetic
transcription, which is tonal in nature, may require significant manual
oversight.

Edit Distance Algorithm:
This technique measures how many edits are necessary to change a name
in the system into the desired name, i.e., what it takes to make the two
names equal. For example, if you enter "Waldmirr," the edit-distance
algorithm will take this name and compare it to a name in the system such
as Vladimir. It will determine how many edits need to be done in order to
change Waldmirr into Vladimir. In this case, there are 4 edit operations
that need to take place: substitution (of 'V' for 'W'); insertion (of the
middle 'I' in Vladimir); deletion (of the extra 'R' in Waldmirr); and reversal
(of the 'AL' to 'LA'). Next, the technique looks at the positions of these
changes within the two names and assigns values to the distances
between them. Using a formula to assess both the number of edits and
the distances between them in the two names, the namecheck system will
return Vladimir as a hit for Waldmirr. However, if the bi-gram method
were used on this particular example, the name Vladimir would not have
been returned as a hit. The edit-distance algorithm is a very strong
technique; it is, in fact, the primary technique used in spellcheckers. Its
weakness is that it is machine-intensive.
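
The count of four edits in the Waldmirr/Vladimir example can be reproduced with a standard edit-distance routine that allows substitution, insertion, deletion and reversal of adjacent letters (the "optimal string alignment" variant of Damerau-Levenshtein distance). The Python sketch below computes only the number of edits. The position-based weighting formula mentioned above is not described in enough detail to reproduce, so it is left out, and the function name is hypothetical.

```python
def edit_distance(source, target):
    """Minimum number of substitutions, insertions, deletions and
    adjacent-letter reversals needed to turn source into target."""
    a, b = source.upper(), target.upper()
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Reversal of two adjacent letters, e.g. "AL" -> "LA".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("Waldmirr", "Vladimir"))  # 4
```

The nested loops visit every pair of letter positions for every record compared, which is one way to see why the technique is machine-intensive.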

Al-Jiddi Namecheck

We asked Mr. Williams about the Al-Jiddi namecheck done earlier this
year by the U.S. Consulate in Montreal. They ran a namecheck on Al-
Jiddi, a known Al-Qaeda terrorist, entering in his known name, country of
birth, estimated date of birth, and current nationality. This did not result
in a hit. Only after country of birth and nationality were left blank did the
system return a CLASS II hit for Al-Jiddi.

Mr. Williams gave the likely reason for this. When setting up the
namecheck system as a whole, one of the first problems that must be
addressed is establishing the criteria that will determine which records
(out of 6 million) will be checked. This is Phase I of the search, i.e., when
CLASS establishes a searchable subset of the 6 million total names. One
of the most important criteria used in Phase I is the country field. In
Phase I, the country field is analyzed using country-relationship tables.
These tables indicate the likelihood that a person from the country
entered in the search will also possess biographical data from another
country. The country-relationship tables in CLASS do not indicate that a
person of Canadian citizenship is likely to have a Tunisian background.
Hence, Al-Jiddi's record was thrown out in Phase I, i.e., it was not
included in the subset of names that were then searched. Once the
country fields were left blank, the country-relationship tables were not
used to establish a subset and therefore Al-Jiddi's record was returned as
a hit.
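
The Phase I behavior described above can be illustrated with a small Python sketch. Everything in it is hypothetical: the record layout, the placeholder record names, the table contents, and the simplification to a single country field. It only shows how a one-directional relationship table with no Canada-to-Tunisia link would exclude a record before Phase II, and how a blank country field bypasses the table.

```python
# Hypothetical one-directional table: query country -> associated countries.
COUNTRY_RELATIONS = {
    "CANADA": set(),              # no association with Tunisia is listed
    "COUNTRY X": {"COUNTRY Y"},   # X implies Y, but Y does not imply X
}


def phase_one_subset(records, query_country=None):
    """Return the records that go forward to Phase II."""
    if not query_country:
        # Blank country field: the relationship table is not consulted.
        return list(records)
    country = query_country.upper()
    allowed = {country} | COUNTRY_RELATIONS.get(country, set())
    return [rec for rec in records if rec["country"].upper() in allowed]


# Made-up records for illustration only.
records = [
    {"name": "RECORD A", "country": "Tunisia"},
    {"name": "RECORD B", "country": "Canada"},
]

print([r["name"] for r in phase_one_subset(records, "Canada")])  # ['RECORD B']
print([r["name"] for r in phase_one_subset(records)])  # ['RECORD A', 'RECORD B']
```

Adding "TUNISIA" to the "CANADA" entry would bring RECORD A into the subset, and in a real table of 6 million records it would bring in every other Tunisia-associated record as well.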

However, Mr. Williams mentioned that attempting to fix a problem, such
as that posed by the Al-Jiddi namecheck, could have unintended
consequences. Re-establishing the threshold for the subset may pull in Al-
Jiddi's record but may very well pull in a great many more records that will
also have to be examined.

Country-Relationship Tables

In terms of establishing these country-relationship tables in the first place,
Mr. Williams stated that they rely on officers in the field to report back to
the Visa Office on migration patterns (which determine country
associations). Based on this new information, the Visa Office can adjust
the table relationships. These country-relationships do not have to be
reciprocal. The last time such an adjustment took place was under John
Brennan's predecessor.

CLASS

There are about 4 major CLASS releases each year, e.g., screen changes,
table changes, or new algorithms. Posts have access to the same
algorithms that exist at headquarters. The algorithms currently running in
CLASS are: Russian/Slavic; Arabic; Hispanic; generic; date of birth; and
country of birth. Linguistics teams usually put together four groups of
names to test the various algorithms, but it is important to note that they
cannot test outliers.

Mr. Williams mentioned that on April 22nd, there would be a 4-day CLASS
course for mid-level and senior consular officers and visa managers,
though he admitted that the course might be of some interest to junior
officers as well. The focus of the course would be on the Arabic language
namecheck. Since this course was just starting up, there were still many
questions surrounding it.

The CLASS back-up system is known as BNS. When BNS is in use, posts
can make local updates on their local BNS system. But global changes to
BNS, i.e., incorporating the changes made at individual locations
worldwide, are compiled at headquarters and sent out to posts once a
month.

Mr. Williams also noted that there is a new NIV system that is currently in
the beta-testing phase. It will be piloted in London.

Biometrics

Mr. Williams viewed biometrics as another tool to use in conducting a
comprehensive security check. The use of biometrics would be a move
toward the development of an identity system, rather than simply a
namecheck system. An individual would have to be much more intelligent
to foil an identity system.

Mr. Williams asserted that, despite vendor claims to the contrary, facial
recognition techniques are not especially successful. At present, both
facial recognition and fingerprinting run on very limited databases. If
either of these techniques were to become part of a standard identity
check, there would have to be a significant increase in resources to
accommodate the millions of new records. In checking fingerprints, for
example, a turn-around time of a few seconds would be needed. At
present, a fingerprint inquiry sent to the FBI takes 24-48 hours. The
introduction of biometrics would also have a significant impact on
operations at post. Consular officers want to be able to adjudicate a visa
application in the course of one day, or in as little time as possible.

Documents

We would like to obtain copies of the country-relationship tables used in
CLASS.
