1、Anonymization Algorithms - Microaggregation and Clustering,Li XiongCS573 Data Privacy and Anonymity,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonym
2、ity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007,Anonymization MethodsPerturbative: distort the data statistics computed on the perturbed dataset should not differ
3、significantly from the original microaggregation, additive noise Non-perturbative: dont distort the data generalization: combine several categories to form new less specific category suppression: remove values of a few attributes in some records, or entire records,Types of data Continuous: attribute
4、 is numeric and arithmetic operations can be performed on it Categorical: attribute takes values over a finite set and standard arithmetic operations dont make sense Ordinal: ordered range of categories , min and max operations are meaningful Nominal: unordered only equality comparison operation is
5、meaningful,Measure tradeoffsk-Anonymity: a dataset satisfies k-anonymity for k 1 if at least k records exist for each combination of quasi-identifier valuesassuming k-anonymity is enough protection against disclosure risk, one can concentrate on information loss measures,Satisfying k-anonymity using
6、 generalization and suppression is NP-hard Computational cost of finding the optimal generalization How to determine the subset of appropriate generalizations semantics of categories and intended use of data e.g., ZIP code: 08201, 08205 - 0820* makes sense 08201, 05201 - 0*201 doesnt,Critique of Gen
7、eralization/Suppression,Problems cont. How to apply a generalization globally may generalize records that dont need it locally difficult to automate and analyze number of generalizations is even larger Generalization and suppression on continuous data are unsuitable a numeric attribute becomes categ
8、orical and loses its numeric semantics,Problems cont. How to optimally combine generalization and suppression is unknown Use of suppression is not homogenous suppress entire records or only some attributes of some records blank a suppressed value or replace it with a neutral value,Microaggregation/C
9、lustering,Two steps: Partition original dataset into clusters of similar records containing at least k records For each cluster, compute an aggregation operation and use it to replace the original records e.g., mean for continuous data, median for categorical data,Advantages: a unified approach, unl
10、ike combination of generalization and suppression Near-optimal heuristics exist Doesnt generate new categories Suitable for continuous data without removing their numeric semantics,Advantages cont.,Reduces data distortion K-anonymity requires an attribute to be generalized or suppressed, even if all
11、 but one tuple in the set have the same value. Clustering allows a cluster center to be published instead, “enabling us to release more information.”,Original Table,2-Anonymity with Generalization,Generalization allows pre-specified ranges,2-Anonymity with Clustering,Cluster centers (27,70, 37,115)
12、published,27=(25+27+29)/3 70=(50+60+100)/3,37=(35+39)/2 115=(110+120)/2,Another example: no common value among each attribute,Generalization vs. clustering,Generalized version of the table would need to suppress all attributes. Clustered Version of the table would publish the cluster center as (1, 1
13、, 1, 1), and the radius as 1.,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anon
14、ymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007,Multivariate microaggregation algorithm MDAV-generic: Generic version of MDAV algorithm (Maximum Distance to Average Vector) from previous papers Works with any type of data (continuous
15、, ordinal, nominal), aggregation operator and distance calculation,MDAV-generic(R: dataset, k: integer)while |R| 3k1. compute average record x of all records in R2. find most distant record xr from x3. find most distant record xs from xr4. form two clusters from k-1 records closest to xr and k-1 clo
16、sest to xs5. Remove the clusters from R and run MDAV-generic onthe remaining datasetend whileif 3k-1 |R| 2k1. compute average record x of remaining records in R2. find the most distant record xr from x3. form a cluster from k-1 records closest to x4. form another cluster containing the remaining rec
17、ordselse (fewer than 2k records in R) form a new cluster from the remaining records,MDAV-generic for continuous attributes use arithmetic mean and Euclidean distance standardize attributes (subtract mean and divide by standard deviation) to give them equal weight for computing distances After MDAV-g
18、eneric, destandardize attributes xij is value of k-anonymized jth attribute for the ith record m10(j) and m2(j) are mean and variance of the k-anonymized jth attribute u10(j) and u2(j) are mean and variance of the original jth attribute,MDAV-generic for ordinal attributes The distance between two ca
19、tegories a and b in an attribute Vi: dord(a,b) = (|i| i b|) / |D(Vi)| i.e., the number of categories separating a and b divided by the number of categories in the attribute Nominal attributes The distance between two values is defined according to equality: 0 if theyre equal, else 1,Empirical Result
20、s,Continuous attributes From the U.S. Current Population Survey (1995) 1080 records described by 13 continuous attributes Computed k-anonymity for k = 3, ., 9 and quasi-identifiers with 6 and 13 attributes Categorical attributes From the U.S. Housing Survey (1993) Three ordinal and eight nominal att
21、ributes Computed k-anonymity for k = 2, ., 9 and quasi-identifiers with 3, 4, 8 and 11 attributes,IL measures for continuous attributes IL1 = mean variation of individual attributes in original and k-anonymous datasets IL2 = mean variation of attribute means in both datasets IL3 = mean variation of
22、attribute variances IL4 = mean variation of attribute covariances IL5 = mean variation of attribute Pearsons correlations IL6 = 100 times the average of IL1-6,MDAV-generic preserves means and variances The impact on the non-preserved statistics grows with the quasi-identifier length, as one would ex
23、pect For a fixed-quasi-identifier length, the impact on the non-preserved statistics grows with k,IL measures for categorical attributes Dist: direct comparison of original and protected values using a categorical distance CTBIL: mean variation of frequencies in contingency tables for original and p
24、rotected data (based on another paper by Domingo-Ferrer and Torra) ACTBIL: CTBIL divided by the total number of cells in all considered tables EBIL: Entropy-based information loss (based on another paper by Domingo-Ferrer and Torra),Ordinal attribute protection using median,Ordinal attribute protect
25、ion using convex median,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation for Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity
26、via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007,r-Clustering,Attributes from a table are first redefined as points in metric space. These points are clustered, and then the cluster centers are published, rather than the original quasi-iden
27、tifiers. r is the lower bound on the number of members in each cluster. r is used instead of k to denote the minimum degree of anonymity because k is typically used in clustering to denote the number of clusters.,Data published for clusters,Three features are published for the clustered data the qua
28、si-identifying attributes of the cluster center the number of points within the cluster the set of sensitive values for the cluster (which remain unchanged, as with k-anonymity) A measure of the quality of the clusters will also be published.,Defining the records in metric space,Some attributes, suc
29、h as age and height, are easily mapped to metric space. Others, such as zip may first need to be converted, for example to longitude and latitude. Some attributes may need to be scaled, such as location, which may differ by thousands of miles. Some attributes such as race or nationality may not conv
30、ert to points in metric space easily.,How to measure the quality of the cluster,Measures how much it distorts the original data. Maximum radius (r-GATHER problem) Maximum radius of all clusters Cellular cost (r-CELLULAR CLUSTERING problem) Each cluster incurs a “facility cost” to set up the cluster
31、center. Each cluster incurs a “service cost” which is equal to the radius times the number of points in the cluster Sum of the facility and services costs for each of the clusters.,25 points, radius 10,14 points, radius 8,17 points, radius 7,Points arranged in clusters,Cluster quality measurements,M
32、aximum radius = 10 Facility cost plus service cost: Facility cost = f(c) Service cost = (17 x 7) + (14 x 8) + (25 x 10) = 481,r-GATHER problem,“The r-Gather problem is to cluster n points in a metric space into a set of clusters, such that each cluster has at least r points. The objective is to mini
33、mize the maximum radius among the clusters.”,“Outlier” points,r-GATHER and r-CELLULAR CLUSTERING, like k-anonymity, are sensitive to outlier points (i.e., points which are far removed from the rest of the data). The clustering solutions in this paper are generalized to allow an e fraction of outlier
34、s to be removed from the data, that is, e fraction of the tuples can be suppressed.,(r,e)-GATHER Clustering,The (r, e)-GATHER clustering formulation of the problem allows an e fraction of the outlier points to be unclustered (i.e., these tuples are suppressed). The paper finds that there is a polyno
35、mial time algorithm that provides a 4-approximation for the (r,e)-GATHER problem.,r-CELLULAR CLUSTERING defined,The CELLULAR CLUSTERING problem is to arrange n points into clusters with each cluster has at least r points and with the minimum total cellular cost.,(r,e)-CELLULAR CLUSTERING,There is al
36、so a (r,e)-CELLULAR CLUSTERING problem in which an e fraction of the points can be excluded. The details of the constant-factor approximation of this problem are deferred to the full version of this paper.,Anonymization using Microaggregation or Clustering,Practical Data-Oriented Microaggregation fo
37、r Statistical Disclosure Control, Domingo-Ferrer, TKDE 2002 Ordinal, Continuous and Heterogeneous k-anonymity through microaggregation, Domingo-Ferrer, DMKD 2005 Achieving anonymity via clustering, Aggarwal, PODS 2006 Efficient k-anonymization using clustering techniques, Byun, DASFAA 2007,41,Anonym
38、ization And Clustering,k-Member Clustering Problem From a given set of n records, find a set of clusters such that Each cluster contains at least k records, and The total intra-cluster distance is minimized. The problem is NP-complete,42,Distance Metrics,Distance metric for records Measure the dissi
39、milarities between two data points Sum of all dissimilarities between corresponding attributes. Numerical values Categorical values,43,Distance between two numerical values,DefinitionLet D be a finite numeric domain. Then the normalized distance between two values vi, vj D is defined as:where |D| is
40、 the domain size measured by the difference between the maximum and minimum values in D.,Example 1 Distance between r1 and r2 with respect to Age attribute is |57-41|/|57-24| = 16/33 = 0.4848,Example 2 Distance between r5 and r6 with respect to Age attribute is |24-45|/|57-24| = 21/33 = 0.6363,44,Di
41、stance between two categorical values,Equally different to each other. 0 if they are the same 1 if they are differentRelationships can be easily captured in a taxonomy tree.,Taxonomy tree of Country,Taxonomy tree of Occupation,45,Distance between two categorical values,DefinitionLet D be a categoric
42、al domain and TD be a taxonomy tree defined for D. The normalized distance between two values vi, vj D is defined as:where (x, y) is the subtree rooted at the lowest common ancestor of x and y, and H(T) represents the height of tree T.,Taxonomy tree of Country,Example: The distance between India and
43、 USA is 3/3 = 1.The distance between India and Iran is 2/3 = 0.66.,46,Distance between two records,DefinitionLet QT = N1, . . . ,Nm, C1, . . . ,Cn be the quasi-identifier of table T , where Ni(i = 1, . . . ,m) is an attribute with a numeric domain and Ci(i = 1, . . . , n) is an attribute with a cate
44、gorical domain. The distance of two records r1, r2 T is defined as:where N is the distance function for numeric attribute, and C is the distance function for categorical attribute.,47,Distance between two records Continued,Taxonomy tree of Country,Taxonomy tree of Occupation,48,Cost Function - Infor
45、mation loss (IL),The amount of distortion (i.e., information loss) caused by the generalization process.Note: Records in each cluster are generalized to share the same quasi-identifier value that represents every original quasi-identifier value in the cluster.Definition: Let e = r1, . . . , rk be a
46、cluster (i.e., equivalence class). Then the amount of information loss in e, denoted by IL(e), is defined as:where |e| is the number of records in e, |N| represents the size of numeric domain N, (Cj) is the subtree rooted at the lowest common ancestor of every value in Cj, and H(T) is the height of
47、tree T.,49,Cost Function - Information loss (IL),Taxonomy tree of Country,Cluster e1,Example,Cluster e2,50,Greedy k-member clustering algorithm,51,Diversity Metrics,The Equal Diversity metric (ED) assumes all sensitive attribute values are equally sensitivewhere (e, s) = 1 if every record in e has t
48、he same s value; (e, s) = 0, otherwise.Modification to the greedy algorithm:,Sensitive Diversity metric (SD) assumes there are two types of values in a sensitive attribute: truly-sensitive not-so-sensitivewhere (e, s) = 1 if every record in e has the same s value that is truly-sensitive; (e, s) = 0,
49、 otherwiseModification to the greedy algorithm,52,classification metric (CM),preserve the correlation between quasi-identifier and class labels (non-sensitive values)Where N is the total number of records, and Penalty(row r) = 1 if r is suppressed or the class label of r is different from the class
50、label of the majority in the equivalence group.Modification to the greedy algorithm:,53,Experimentl Results,Experimentl Setup Data: Adult dataset from the UC Irvine Machine Learning Repository 10 attributes (2 numeric, 7 categorical, 1 class) Compare with 2 other algorithms Median partitioning (mondrian algorithm) k-Nearest neighbor,