
Ramakrishnan Srikant

www.almaden.ibm.com/cs/people/srikant/

Motivation

Privacy & Data Mining: Adversaries forever?

The primary task in data mining: development of

models about aggregated data.

Can we develop accurate models without access to

precise information in individual data records?

1

Papers

R. Agrawal and R. Srikant, "Privacy Preserving Data Mining", Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, Texas, May 2000.

Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining", Crypto 2000, August 2000.

2

Talk Overview

Randomization protects information at the

individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions in data mining

algorithms, e.g. to build decision-tree classifier.

How well does it work?

3

Privacy Surveys

[CRA99b] survey of web users:

{ 17% privacy fundamentalists

{ 56% pragmatic majority

{ 27% marginally concerned

[Wes99] survey of web users:

{ 82% : privacy policy matters

{ 14% don't care

Not equally protective of every field

{ may not divulge certain fields at all;

{ may not mind giving true values of certain fields;

{ may be willing to give modified values rather than true values of certain fields.

4

Related Work: Statistical Databases

Statistical Databases : provide statistical information without

compromising sensitive information about individuals (surveys:

[AW89] [Sho82])

Query Restriction

{ restrict the size of query result (e.g. [FEL72][DDS79])

{ control overlap among successive queries (e.g. [DJL79])

{ keep audit trail of all answered queries (e.g. [CO82])

{ suppress small data cells (e.g. [Cox80])

{ cluster entities into mutually exclusive atomic populations

(e.g. [YC77])

Data Perturbation

{ replace the original database by a sample from the same

distribution (e.g. [LST83][LCL85][Rei84])

{ sample the result of a query (e.g. [Den80])

{ swap values between records (e.g. [Den82])

{ add noise to the query result (e.g. [Bec80])

{ add noise to the values (e.g. [TYW84][War65])

5

Related Work: Statistical Databases

(cont.)

Negative results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89].

The situation differs for privacy preserving data mining:

{ Also want to prevent disclosure of confidential information.

{ But sufficient to reconstruct the original distribution of data values, i.e. not interested in high quality point estimates.

6

Using Randomization to Protect Privacy

Return xi + r instead of xi, where r is a random

value drawn from a distribution.

{ Uniform

{ Gaussian

Fixed perturbation { not possible to improve

estimates by repeating queries.

Algorithm knows parameters of r's distribution.

7
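As a concrete illustration, the fixed-perturbation step above can be sketched as follows (a minimal sketch; the function name and parameter defaults are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(x, method="uniform", alpha=0.5, sigma=0.25):
    """Return x_i + r_i, drawing each r_i exactly once (fixed perturbation:
    repeating the query cannot improve an adversary's estimate)."""
    x = np.asarray(x, dtype=float)
    if method == "uniform":
        r = rng.uniform(-alpha, alpha, size=x.shape)   # r ~ U[-alpha, alpha]
    else:
        r = rng.normal(0.0, sigma, size=x.shape)       # r ~ N(0, sigma^2)
    return x + r

ages = np.array([23.0, 17.0, 43.0, 68.0, 32.0, 20.0])
perturbed = randomize(ages, method="uniform", alpha=10.0)
```

The data miner sees only `perturbed` together with the parameters of the noise distribution (`alpha` or `sigma`), never the original values.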

Talk Overview

Randomization protects information at the

individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions to build decision-

tree classifier.

How well does it work?

8

Reconstruction Problem

Original values x_1, x_2, ..., x_n

{ realizations of i.i.d. random variables X_1, X_2, ..., X_n,

{ each with the same distribution as random variable X.

To hide these values, we use y_1, y_2, ..., y_n

{ realizations of i.i.d. random variables Y_1, Y_2, ..., Y_n,

{ each with the same distribution as random variable Y.

Given

x_1 + y_1, x_2 + y_2, ..., x_n + y_n and

the density function f_Y for Y,

estimate the density function f_X for X.

9

Using Bayes' Rule

Assume we know both f_X and f_Y. Let w_i = x_i + y_i.

f_{X_1}(a \mid X_1 + Y_1 = w_1)
  = \frac{f_{X_1+Y_1}(w_1 \mid X_1 = a) \, f_{X_1}(a)}{f_{X_1+Y_1}(w_1)}
  = \frac{f_{X_1+Y_1}(w_1 \mid X_1 = a) \, f_{X_1}(a)}{\int_{-\infty}^{\infty} f_{X_1+Y_1}(w_1 \mid X_1 = z) \, f_{X_1}(z) \, dz}
  = \frac{f_{Y_1}(w_1 - a) \, f_{X_1}(a)}{\int_{-\infty}^{\infty} f_{Y_1}(w_1 - z) \, f_{X_1}(z) \, dz}    (Y_1 independent of X_1)
  = \frac{f_Y(w_1 - a) \, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_1 - z) \, f_X(z) \, dz}    (f_{X_1} \equiv f_X, f_{Y_1} \equiv f_Y)

Averaging over all n observations gives the posterior estimate:

f'_X(a) = \frac{1}{n} \sum_{i=1}^{n} f_{X_i}(a \mid X_i + Y_i = w_i)
        = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a) \, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z) \, f_X(z) \, dz}

10

Reconstruction Method: Algorithm

f_X^0 := Uniform distribution

j := 0 // Iteration number

repeat

    Use the update equation to compute a new estimate f_X^{j+1}.

    j := j + 1

until (stopping criterion met)

Stopping criterion: stop when the difference between successive estimates of the original distribution becomes very small (1% of the threshold of the chi-squared test).

11

Using Partitioning to Speed Computation

Approximate distance(z, w_i) by the distance between the mid-points of the intervals in which they lie, and the density function f_X(a) by the average of the density function over the interval in which a lies.

The update equation

f'_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a) \, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z) \, f_X(z) \, dz}

becomes

Pr'(X \in I_p) = \frac{1}{n} \sum_{s=1}^{k} N(I_s) \, \frac{f_Y(m(I_s) - m(I_p)) \, Pr(X \in I_p)}{\sum_{t=1}^{k} f_Y(m(I_s) - m(I_t)) \, Pr(X \in I_t)}

where m(I) is the mid-point of interval I, N(I_s) is the number of w_i in interval I_s, and k is the number of intervals.

12
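The partitioned update above can be implemented directly; here is a minimal NumPy sketch (the function names, interval layout, and fixed iteration count are illustrative assumptions, replacing the chi-squared stopping test):

```python
import numpy as np

def reconstruct(w, f_Y, edges, iters=50):
    """Estimate Pr(X in I_p) for each interval, given perturbed values w = x + y.

    f_Y   : vectorized noise density, evaluated at mid-point differences
    edges : interval boundaries defining I_1..I_k (k = len(edges) - 1)
    """
    k = len(edges) - 1
    mids = 0.5 * (edges[:-1] + edges[1:])          # m(I_p)
    N, _ = np.histogram(w, bins=edges)             # N(I_s): number of w_i per interval
    n = len(w)
    pr = np.full(k, 1.0 / k)                       # f_X^0: uniform starting estimate
    fy = f_Y(mids[:, None] - mids[None, :])        # fy[s, p] = f_Y(m(I_s) - m(I_p))
    for _ in range(iters):
        denom = np.maximum(fy @ pr, 1e-300)        # sum_t f_Y(...) Pr(X in I_t)
        pr = (N / n) @ (fy * pr[None, :] / denom[:, None])
        pr /= pr.sum()                             # guard against numerical drift
    return pr
```

For example, uniform noise in [-0.5, 0.5] has density f_Y(d) = 1 for |d| <= 0.5 and 0 otherwise; feeding the perturbed values and that density in recovers the shape of the original distribution.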

How well does this work?

Uniform random variable [-0.5, 0.5]

[Two plots of Number of Records vs. Attribute Value (range -1 to 2), each showing the Original, Randomized, and Reconstructed distributions.]

13

Talk Overview

Randomization protects information at the

individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions to build decision-

tree classifier.

How well does it work?

14

Decision Tree Classification

Classification: Given a set of classes, and a set of records in each class, develop a model that predicts the class of a new record.

Partition(Data S )

begin

if (most points in S are of the same class) then

return;

for each attribute A do

evaluate splits on attribute A;

Use best split to partition S into S1 and S2;

Partition(S1 );

Partition(S2 );

end

Initial call: Partition(TrainingData)

15
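The Partition pseudocode above can be fleshed out as a runnable sketch; the gini-index split criterion and helper names are my own illustrative choices, not prescribed by the slide:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels):
    """Evaluate splits on every attribute; return (score, attr, threshold)."""
    best = None
    for a in range(len(records[0])):
        for t in sorted({r[a] for r in records}):
            left = [l for r, l in zip(records, labels) if r[a] < t]
            right = [l for r, l in zip(records, labels) if r[a] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def partition(records, labels):
    """Recursive tree building, mirroring Partition(Data S) above."""
    if len(set(labels)) == 1:                # all points are of the same class
        return labels[0]
    split = best_split(records, labels)
    if split is None:                        # no separating split exists
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    s1 = [(r, l) for r, l in zip(records, labels) if r[a] < t]
    s2 = [(r, l) for r, l in zip(records, labels) if r[a] >= t]
    return (a, t,
            partition([r for r, _ in s1], [l for _, l in s1]),
            partition([r for r, _ in s2], [l for _, l in s2]))

def predict(tree, record):
    """Follow splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        a, t, yes, no = tree
        tree = yes if record[a] < t else no
    return tree
```

Running `partition` on the Age/Salary training set of the next slide yields a tree that classifies all six training records correctly.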

Example: Decision Tree Classification

Training Set:

Age | Salary | Credit Risk
23  | 50K    | High
17  | 30K    | High
43  | 40K    | High
68  | 50K    | Low
32  | 70K    | Low
20  | 20K    | High

Decision Tree:

Age < 25
  yes -> High
  no  -> Salary < 50K
           yes -> High
           no  -> Low

16

Training using Randomized Data

Need to modify two key operations:

{ Determining a split point.

{ Partitioning the data.

When and how do we reconstruct the original

distribution?

{ Reconstruct using the whole data (globally) or

reconstruct separately for each class?

{ Reconstruct once at the root node or reconstruct

at every node?

17

Training using Randomized Data (cont.)

Determining split points:

{ Candidate splits are interval boundaries.

{ Use statistics from the reconstructed distribution.

Partitioning the data:

{ Reconstruction gives estimate of number of

points in each interval.

{ Associate each data point with an interval by

sorting the values.

18

Algorithms

Global:

Reconstruct for each attribute once at the

beginning.

Induce decision tree using reconstructed data.

ByClass:

For each attribute, first split by class, then

reconstruct separately for each class.

Induce decision tree using reconstructed data.

Local:

As in ByClass, split by class and reconstruct

separately for each class.

However, reconstruct at each node (not just once).

19

Talk Overview

Randomization protects information at the

individual level.

Algorithm to reconstruct the distribution of values.

Use reconstructed distributions to build decision-

tree classifier.

How well does it work?

20

Methodology

Compare accuracy of Global, ByClass and Local

against

{ Original: unperturbed data without randomization.

{ Randomized: perturbed data but without making

any corrections for randomization.

Synthetic data generator from [AGI+92].

Training set of 100,000 records, split equally

between the two classes.

21

Quantifying Privacy

If it can be estimated with c% confidence that a value x lies in the interval [x1, x2], then the interval width (x2 - x1) defines the amount of privacy at the c% confidence level.

Example: Privacy Level for Age in [10, 90]

{ Given a perturbed value 40

{ 95% confidence that the true value lies in [30, 50]

{ Interval width: 20, range: 80 => 25% privacy level

Uniform: perturbation between [-α, +α]

Gaussian: mean 0 and standard deviation σ

Confidence   50%          95%           99.9%
Uniform      0.5 × 2α     0.95 × 2α     0.999 × 2α
Gaussian     1.34 σ       3.92 σ        6.8 σ

22
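The interval widths in the table above can be reproduced numerically; this sketch (function names are my own) computes the confidence-interval width for each noise distribution and the resulting privacy level:

```python
from statistics import NormalDist

def uniform_width(alpha, c):
    """Width of the c-confidence interval for noise uniform in [-alpha, alpha]."""
    return c * 2 * alpha

def gaussian_width(sigma, c):
    """Width 2 * z * sigma, where z is the (1 + c)/2 quantile of N(0, 1)."""
    z = NormalDist().inv_cdf((1 + c) / 2)
    return 2 * z * sigma

def privacy_level(width, attribute_range):
    """Interval width as a percentage of the attribute's range."""
    return 100.0 * width / attribute_range
```

For the Age example: `privacy_level(20, 80)` gives 25 (%), and `gaussian_width(1, 0.95)` gives approximately 3.92, matching the table.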

Synthetic Data Functions

Class A if function is true, Class B otherwise.

F1: (age < 40) ∨ (60 ≤ age)

F2: ((age < 40) ∧ (50K ≤ salary ≤ 100K)) ∨
    ((40 ≤ age < 60) ∧ (75K ≤ salary ≤ 125K)) ∨
    ((age ≥ 60) ∧ (25K ≤ salary ≤ 75K))

F3: ((age < 40) ∧
      (((elevel ∈ [0..1]) ∧ (25K ≤ salary ≤ 75K)) ∨
       ((elevel ∈ [2..3]) ∧ (50K ≤ salary ≤ 100K)))) ∨
    ((40 ≤ age < 60) ∧
      (((elevel ∈ [1..3]) ∧ (50K ≤ salary ≤ 100K)) ∨
       ((elevel = 4) ∧ (75K ≤ salary ≤ 125K)))) ∨
    ((age ≥ 60) ∧
      (((elevel ∈ [2..4]) ∧ (50K ≤ salary ≤ 100K)) ∨
       ((elevel = 1) ∧ (25K ≤ salary ≤ 75K))))

F4: ((0.67 × (salary + commission)) − (0.2 × equity) − 10K) > 0,
    where equity = 0.1 × hvalue × max(hyears − 20, 0)

23
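To make the predicates concrete, here is a sketch of function F3 from the slide above (assuming salary is given in thousands and elevel is an education level in 0..4; both assumptions are mine):

```python
def f3(age, salary, elevel):
    """F3 from the synthetic data generator: True => Class A."""
    if age < 40:
        return (elevel in (0, 1) and 25 <= salary <= 75) or \
               (elevel in (2, 3) and 50 <= salary <= 100)
    if age < 60:
        return (elevel in (1, 2, 3) and 50 <= salary <= 100) or \
               (elevel == 4 and 75 <= salary <= 125)
    return (elevel in (2, 3, 4) and 50 <= salary <= 100) or \
           (elevel == 1 and 25 <= salary <= 75)

def label(age, salary, elevel):
    """Class A if the function is true, Class B otherwise."""
    return "A" if f3(age, salary, elevel) else "B"
```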

Classification Accuracy

[Two bar charts of Accuracy (%) over datasets Fn1-Fn5, comparing Original, Local, ByClass, Global, and Randomized: the first at a privacy level of 25% of the attribute range (accuracy axis 75-100%), the second at 100% of the attribute range (accuracy axis 50-100%).]

24

Change in Accuracy with Privacy

[Two charts of Accuracy (%) versus Privacy Level (%), for Fn 1 and Fn 3, comparing Original, ByClass(G), ByClass(U), Random(G), and Random(U), where G = Gaussian and U = Uniform perturbation; accuracy axes 50-100%.]

25

Conclusions

Have your cake and mine it too!

{ Preserve privacy at the individual level, but still

build accurate models.

Future work:

{ other data mining algorithms,

{ characterize loss in accuracy,

{ other randomization functions,

{ better reconstruction techniques, ...

26
