DIMACS Working Group on Privacy / Confidentiality of Health Data
Rutgers University Center, Piscataway, New Jersey
December 10-12, 2003

Health Care Databases under HIPAA: Statistical Approaches to De-identification of Protected Health Information

Judith E. Beach, Ph.D., Esq.
Associate General Counsel, Regulatory Affairs
Chief Privacy Officer
Chair, Council on Data Protection and Council on Research Ethics

Outline

1. Evolution of De-identification Standards: HIPAA Privacy Regulation
2. De-identification Standards for Health Information in Research
   a. Safe Harbor
   b. Statistician Method
      - HIPAA Provisions
      - Quintiles Experience and Methodology
   c. Limited Data Set
3. Preemption of State Laws on De-identification Standards for Health Information
4. Health Information Privacy - Cases and Controversies

Evolution of De-Identification Standards in HIPAA Privacy Regulation

Federal Policy: De-Identification of Health Information

- Government's intent: to provide a balance of stringent standards flexible enough not to be a disincentive to use or disclose de-identified health information, wherever possible.
- De-identified health data is one of the best mechanisms for avoiding wrongful disclosure of Protected Health Information (PHI).
- See Draft (05/27/03) DHHS Policy and Procedure Manual “De-Identification Policy d11” (effective date 6/1/03), which applies to DHHS agencies: HIPAA covered health care components and Internal Business Associates.

Federal Policy: Use of De-identified Health Data Rather than PHI for Research

“We [HHS] expressed the hope that covered entities, their business associates and others would make greater use of de-identified health information . . . when it is sufficient for the research purpose and that such practice would reduce the burden and the confidentiality concerns that
result from the use of individually identifiable health information for some of these purposes.”
HHS, in final privacy rule, 65 Fed. Reg. at 82543 (Dec. 28, 2000), citing proposed privacy rule of Nov. 3, 1999

HIPAA's Jurisdiction

- Individually Identifiable Health Information (IIHI): a subset of health information, including demographic information, that identifies the individual or with respect to which there is a reasonable basis to believe the information can be used to identify the individual.
- Protected Health Information (PHI): individually identifiable health information (IIHI = Health Information + Identifier) that is transmitted or maintained electronically, or transmitted or maintained in any other form or medium.
- An investigator who submits health claims would be a HIPAA covered entity (CE).
- CE + Health Information + Identifier = PHI
- CE + Identifier - Health Information = NOT PHI
- Health Information + Identifier - CE = NOT PHI

De-identification Standards for Health Information in Research

De-identified Health Information

- Definition: health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual. 45 CFR 164.514(a)
- The Privacy Rule permits de-identification of PHI so that such information may be used and disclosed freely, without being subject to the Privacy Rule's requirements. Once de-identified, the data is out of the Privacy Rule.

HIPAA De-identification Standards

Two methods for the de-identification of health information:
- “Safe Harbor”: remove 18 specified identifiers. Intended to provide a simple, definitive method for de-identifying health information with protection from litigation.
- “Statistician Method”: retain some of the 18 safe
  harbor's specified identifiers; the standard is met if a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods, e.g., a biostatistician, determines and documents that the risk of re-identification is very small. 45 CFR 164.514

Limited Data Set

- Final rule: added another method requiring removal of facial identifiers, the “Limited Data Set”
- Under confidentiality agreements, for research, public health, and health care operations
- Regarded as PHI, NOT de-identified, and therefore still subject to Privacy Rule requirements
  such as the minimum necessary rule.

Safe Harbor Method

Safe Harbor

Covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used alone or in combination to identify a subject of the information.

The identifiers to be removed include
- direct identifiers such as name, address, SSN
- indirect identifiers such as birth date, admission and discharge dates, and five-digit zip code

45 CFR 164.514(b)(2)

Safe Harbor

The safe harbor does allow for the disclosure of
- All geographic subdivisions no smaller than a State, as well as the initial three digits of a zip code IF the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people
- AGE, if less than 90, gender, ethnicity and other demographic information not listed.

Safe Harbor's 18 Identifiers

1. Names
2. All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the currently available data from the Bureau of the Census:
   - The geographic unit formed by combining
     all zip codes with the same three initial digits contains more than 20,000 people; and
   - The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people are changed to 000;
3. All elements of dates (except year) for dates directly relating to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
4. Telephone numbers;
5. Fax numbers;
6. Electronic mail addresses;
7. Social security numbers;
8. Medical record numbers;
9. Health plan beneficiary numbers;
10. Account numbers;
11. Certificate/license numbers;
12. Vehicle identifiers and serial numbers, including license plate numbers;
13. Device identifiers and serial numbers;
14. Web Universal Resource Locators (URLs);
15. Internet Protocol (IP) address numbers;
16. Biometric identifiers, including finger and voice prints;
17. Full face photographic images and any comparable images; and
18. Any other unique identifying number, characteristic, or code.

Sources of Authority

In the Privacy Rule Preamble, HHS recognizes two sources of authority as to what constitutes such principles and methods for de-identification adequate for posting a de-identified database on the Internet. 65 Fed. Reg. at 82,709-82,710 (Dec. 28, 2000)
- “Paper 22”: Statistical Policy Working Paper 22, Report on Statistical Disclosure Limitation Methodology
- “The Checklist”: The Checklist on Disclosure Potential of Proposed Data Releases, “intended primarily for use in the development of public-use data products.”

Safe Harbor

BUT many researchers and other groups have complained that the Safe Harbor renders the de-identified data virtually useless for research, so
that the result will be MORE research using PHI.
- No dates of service, no patient initials, no date of birth
- Can have “deltas” such as number of patient visits over time

However, the safe harbor was NOT designed for research, but to provide an approved method of de-identification for any purpose by any covered entity, regardless of sophistication. For instance, such de-identified data would be deemed to be safely posted on the Internet.

Statistician Method

Statistician Method

For this method, the covered entity
- must remove all direct identifiers
- reduce the number of variables on which a match might be made
- should limit the distribution of records through a “data use agreement” or “restricted access agreement”

65 Fed. Reg. at 82,709-710 (Dec. 28, 2000)

Opinion of Statistician

The statistician
- must determine that there is a “very small risk” of re-identification after applying “generally accepted statistical and scientific principles and methods for rendering information not individually identifiable”
- must document the methods and results of the analysis that justify such determination.

45 CFR 164.514(b)(1)

Statistician Method

This method has been generally ignored by covered entities, who
- prefer a safe harbor approach, with “safe” being the operative word
- consider the statistician alternative too complicated.

Statistician Method: Quintiles Experience

- An expert statistician calculated the statistical likelihood of re-identification IF all 18 safe harbor identifiers were removed, that is, the “de-identification probability.”
- Then, the statistician calculated the likelihood of re-identification if certain dates of service of medical or pharmacy claims were retained and if, rather than age or year of birth, which is allowed in the safe harbor, the month and year of birth were included.

Statistician's Opinion

This calculated number, the “de-identification probability,” served as a benchmark of a “very small risk of re-identification” against which the statistician method would be compared.

Analysis: Comparison of Both Methods

To ensure the statistical likelihood of
re-identification was comparable to that of the calculated safe harbor benchmark, the following data fields were made stricter than permitted by the safe harbor:
- For all patients older than 85 years of age (rather than 90), the year of birth was modified to make them all 85 years old.
- All five-digit patient zip codes were truncated to the first 3 digits and further merged so that no resulting 3-digit code has a total population of less than 200,000.

Factors Considered by Statistician

In the analysis, the statistician pointed out the obvious:
- The de-identified data received is conveyed under a confidentiality agreement, which specifically prohibits re-identification or further disclosure of the data except in statistically aggregated form.
- The database is maintained on a physically and technically secure, password-protected server.

Statistician's Opinion

“Applying generally accepted statistical and scientific principles and methods for rendering information not individually identifiable, . . . I conclude that the risk is very small that the information . . . could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information. . . . In practice the actual reidentification probabilities are much, much lower . . . arguably de minimis.”

Statistician Method

It is clear that most persons who have reviewed the Privacy Rule have failed to appreciate the significance of the statistician opinion to de-identification and, instead, have focused almost exclusively on the “safe harbor.” In particular, many have failed to understand the importance of “restricted access” as it relates to the statistician opinion approach to de-identification.

Ensuring HIPAA Compliance

Diagram labels: Data Warehouse; Data Encryption Process; Patient identifiable electronic healthcare claims (standard health claims data fields); De-identified data

All data handled is de-identified using a unique patient identifier that is irreversibly encrypted.
* zip = 3 digit
* DOB = modified

Upon completion of
the de-identification process, a unique patient identifier is created, which is irreversibly encrypted.

Core Data Elements

Jan 98 - to date
July 98 - to date
Note: Payor Type not available on all records

Physician Demographics

- Specialty
- Region
- Number of years in practice
- Prescribing volume
- Type of practice
- Number of HMO / PPO / IPA affiliations
- % patient volume by insurance type
- Physician race
- Physician age

Patient Characteristics

- Location of contact
- Height and weight
- Age
- Gender
- Race
- Blood pressure
- Cholesterol levels (total, HDL, LDL, triglycerides)
- Insurance type
- Physician reimbursement method (fee-for-service vs. capitation)
- Smoker or non-smoker

Disease Entities

- Visits (with and without drugs)
- Visits per physician per year
- Total patients seeking treatment
- Newly diagnosed patients
- Visit type (first vs. subsequent)
- Referrals and referring specialty
- Severity of condition
- Tests ordered or completed during visit
- Existing medical conditions not treated
- Number of times seen and days since last visit
- Number of patient drug requests for condition

Treatment Regimens

- Dosage form, strength and signa
- Formulary impact
- Quantity prescribed and number of refills (mean and frequency)
- Weighted diagnosis value
- Dispensing instructions
- Occurrences per physician per year
- Therapy type:
  - New
  - First-line versus adjunct therapy
  - Drug replacement and reason
  - Continued

Treatment Regimens

- Desired action
- Concomitant drugs (to treat same diagnosis)
- Concurrent drugs (regardless of diagnosis)
- Drug issuance
- Sample days of therapy (mean and frequency)
- Prescribed days of therapy (mean and frequency)
- Daily average consumption (DACON)
- Non-drug therapy

Limited Data Set (LDS)

HHS Solution: Limited Data Set

- For research, public health, or health care operations purposes
- Authorization not required
- A limited data use agreement must be in place between the covered entity and the recipient of the limited data set (LDS). 45 CFR 164.514(e)
- “Data Use Agreements would only be needed for those public health, research, or health care operation uses and disclosures that are not otherwise permitted by federal or state laws.” See Draft (05/27/03) DHHS Policy and Procedure Manual “De-Identification Policy d11”

LDS = Still PHI

Regarded as PHI, that is, not de-identified data, and therefore subject to requirements for protection of PHI, such as:
- Re-identification or any attempt to contact individuals by the recipient is prohibited, BUT a re-identification code is permitted for the covered entity
- Subject to minimum necessary standards, BUT no accounting of disclosures or IRB approval

Limited Data Set Specifications

- May be useful for records-based research such as epidemiological and other population research
- But may NOT be useful for patient recruitment, because re-identification of individuals or any attempt to contact them by a third party is prohibited, even by a researcher (without IRB or internal privacy board approval), unless the contact is made by the Covered Entity or the Covered Entity's workforce.

LDS: Remove 16 Identifiers

1. Name
2. Postal address information (other than city, state, zip code)
3. Telephone number
4. Fax number
5. E-mail address
6. Social Security Number
7. Medical record / prescription numbers
8. Health plan beneficiary numbers
9. Account numbers
10. Certificate / license numbers
11. Vehicle identity / serial numbers
12. Device numbers
13. Web URL
14. IP address
15. Biometric identifiers (e.g., fingerprints, retinal scans)
16. Full-face and similar photographic images

45 CFR 164.514(e)(2)

LDS: Retain Indirect Identifiers

- Five-digit zip code
- Dates of service (e.g., admission / discharge)
- Dates of birth and death
- Geographic subdivision (e.g., state, county, city, precinct), but not street address
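
As a rough illustration of the LDS slides, the following Python sketch drops the 16 direct identifiers from a record and keeps everything else, including the indirect identifiers that an LDS may retain. The field names and the sample record are hypothetical and are not taken from the presentation. A real implementation would depend on the source system's schema and on the governing data use agreement.

```python
# Hypothetical field names, one per direct identifier on the
# "LDS: Remove 16 Identifiers" slide.
LDS_DIRECT_IDENTIFIERS = frozenset({
    "name",
    "street_address",
    "telephone",
    "fax",
    "email",
    "ssn",
    "medical_record_number",
    "health_plan_beneficiary_number",
    "account_number",
    "certificate_license_number",
    "vehicle_identifier",
    "device_identifier",
    "url",
    "ip_address",
    "biometric_identifier",
    "full_face_photo",
})


def to_limited_data_set(record: dict) -> dict:
    """Return a copy of the record without the 16 direct identifiers."""
    return {
        field: value
        for field, value in record.items()
        if field not in LDS_DIRECT_IDENTIFIERS
    }


# Hypothetical example record (not real patient data).
claim = {
    "name": "Jane Doe",
    "street_address": "1 Main St",
    "ssn": "000-00-0000",
    "zip_code": "08854",
    "city": "Piscataway",
    "admission_date": "2003-01-15",
    "birth_date": "1950-06-01",
    "diagnosis": "hypertension",
}
lds_record = to_limited_data_set(claim)
```

In this sketch the five-digit zip code, city, service date and birth date pass through unchanged, which is why the slides treat an LDS as still being PHI rather than de-identified data.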
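
Looking back at the Safe Harbor slides, three of the rules are mechanical enough to sketch in code: truncating zip codes to three digits subject to the 20,000-person threshold, reducing dates to the year, and aggregating ages of 90 and over. The Python below is a minimal sketch under stated assumptions: the population table, field names and sample record are hypothetical, and it covers only these three rules, not the full list of 18 identifiers.

```python
from datetime import date

# Hypothetical population totals by three-digit zip prefix. Real figures
# would come from current Census Bureau data.
ZIP3_POPULATION = {"088": 750_000, "036": 15_000}


def generalize_zip(zip5: str, populations: dict, min_population: int = 20_000) -> str:
    """Keep the three-digit prefix only if its area has more than min_population people."""
    zip3 = zip5[:3]
    # Prefixes missing from the table are treated as too small and suppressed.
    return zip3 if populations.get(zip3, 0) > min_population else "000"


def generalize_age(age: int, top_code: int = 90) -> str:
    """Report ages below the top code; aggregate the rest into one category."""
    return str(age) if age < top_code else f"{top_code} or older"


def generalize_date(value: date) -> int:
    """Drop month and day, keeping only the year."""
    return value.year


def generalize_record(record: dict, populations: dict) -> dict:
    """Apply the three generalizations to a record with hypothetical field names."""
    return {
        "zip3": generalize_zip(record["zip_code"], populations),
        "age": generalize_age(record["age"]),
        "admission_year": generalize_date(record["admission_date"]),
        "gender": record["gender"],
    }


# Hypothetical example record (not real patient data).
patient = {
    "zip_code": "03601",
    "age": 93,
    "admission_date": date(2003, 12, 10),
    "gender": "F",
}
generalized = generalize_record(patient, ZIP3_POPULATION)
```

The thresholds are parameters, so the stricter settings from the Comparison of Both Methods slide (a top code of 85 and a 200,000 population floor) could be approximated by passing different values. The slide describes merging three-digit zip areas rather than replacing them with 000, and this sketch does not attempt that.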
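
The Ensuring HIPAA Compliance slide refers to a unique patient identifier that is irreversibly encrypted but does not say how it is produced. One common way to get a stable, one-way identifier is a keyed hash. The Python sketch below uses HMAC-SHA256 from the standard library purely as an assumption; it is not the method described in the presentation, and the function name, key and sample identifiers are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable one-way token from a patient identifier.

    The same identifier and key always give the same token, so records for
    one patient can still be linked, but the token cannot be turned back
    into the identifier without guessing inputs and knowing the key.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and identifiers. A real key would be generated randomly
# and stored separately from the de-identified data.
key = b"example-key-kept-outside-the-warehouse"
token_a = pseudonymize("MRN-0001", key)
token_b = pseudonymize("MRN-0002", key)
```

Keeping the key away from the data warehouse matters in a design like this, because anyone holding both the key and a list of candidate identifiers could recompute the tokens and match them.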