Working in banking, insurance, healthcare and other industries that deal with protected data, I routinely encounter situations that require sanitizing data or generating PII-like data. Looking to automate this, I wonder, “Are there easier ways to share, report or distribute documents containing PHI / PII?”
I need a tool to:
- Fabricate PHI / PII for training, presentations, sales and demonstrations.
- Provide documents that originally contained PHI / PII for reporting, with the protected data removed.
- Use existing PHI / PII as a template to assist with third-party integrations.
- Generate example PHI / PII for creating and testing new features.
Let’s install the application we’ll be using to handle these scenarios. The tool, redact-phi, is flexible enough to cover the cases above (and more). We’ll need Node.js (the recommended version), and we’ll use npm, the package manager bundled with Node, to install redact-phi. Once you’ve installed Node, redact-phi can be installed from the command line with:
npm i -g redact-phi
With redact-phi installed, we’ll redact an example file and remove instances of protected data. First, download Redact Demo and extract it. We should now see two files: demo_redact.csv and demo_redact.json.
The CSV file (demo_redact.csv) is our PII / PHI input; this is the file we’ll be redacting. The JSON file (demo_redact.json) contains the strategies we want to employ to redact data in the CSV file. Let’s look at the CSV file:
Suppose we need to share this file with a department or third party that does not have permission to view protected data. We need to remove the email and phone number fields while leaving the order data. Let’s look at the JSON file:
The JSON file provides strategies we’ll use to redact. Each entry within “columns” has four parameters:
- redactWith: This versatile field gives us multiple approaches to redact, or even create, PHI / PII. In this example, we’ll redact the email and phone number columns using “REDACTED …”.
- columnNum: This field (along with columnKey) identifies the column in the CSV. If needed, we can provide a zero-based column number (the first column in the document is “0”). Because this file has headers, we’ll use “columnKey” instead.
- columnKey: This field uses the CSV’s header row to identify the data to be redacted.
- tracked: This option lets us preserve relationships within the data while still redacting PHI / PII. We’ll use it shortly.
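To make the column-identification idea concrete, here is a minimal Python sketch of how an entry might be resolved to a column index. It is not redact-phi's implementation: the `config` values, the header order, the `resolve_column` helper and the rule that `columnKey` takes priority over `columnNum` are all assumptions made for illustration.

```python
# Hypothetical config shaped like the four parameters described above.
# The values are illustrative assumptions, not the contents of demo_redact.json.
config = {
    "columns": [
        {"redactWith": "REDACTED EMAIL", "columnKey": "email", "columnNum": None, "tracked": False},
        {"redactWith": "REDACTED PHONE", "columnKey": None, "columnNum": 1, "tracked": False},
    ]
}


def resolve_column(entry, header):
    """Return the zero-based index of the column an entry points at."""
    if entry.get("columnKey") is not None:
        # Look the name up in the header row (assumed to take priority).
        return header.index(entry["columnKey"])
    # Otherwise fall back to the zero-based position.
    return entry["columnNum"]


# Assumed header order for the sketch.
header = ["email", "phone number", "order date", "order amount"]
indexes = [resolve_column(entry, header) for entry in config["columns"]]
print(indexes)  # [0, 1]
```

Either field ends up pointing at the same kind of thing, a column position, which is why a file without a header row can still be redacted by number.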
Let’s redact! Run the following on the command line from the directory containing the CSV and JSON files:
redact demo_redact.csv
You should see:
> redact demo_redact.csv
Finished Redacting, 6 records processed
Created File: demo_redact_redacted.csv
If you receive an error, confirm you’re running the command from the directory you extracted the demo files into. Let’s view the redacted file (demo_redact_redacted.csv):
Perfect! We’ve fully redacted the file’s protected information while retaining other data useful for reporting.
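Conceptually, this kind of redaction overwrites every value in the chosen columns with a fixed string and leaves the rest of each row alone. The Python sketch below illustrates that idea with the standard `csv` module; the `redact_constant` helper and the sample rows are invented for the example and say nothing about how redact-phi works internally.

```python
import csv
import io


def redact_constant(csv_text, replacements):
    """Replace every value in the named columns with a fixed string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column, placeholder in replacements.items():
            row[column] = placeholder  # overwrite only the protected columns
        writer.writerow(row)
    return out.getvalue()


# Invented sample data, not the demo file.
sample = (
    "email,phone number,order amount\n"
    "jane@example.com,555-0100,12.50\n"
    "sam@example.com,555-0101,8.00\n"
)
redacted = redact_constant(
    sample, {"email": "REDACTED EMAIL", "phone number": "REDACTED PHONE"}
)
print(redacted)
```

The order amounts pass through unchanged, which is what keeps the redacted file useful for reporting.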
What if I want to replace real PHI / PII with fabricated, PHI-like data? Such a file could be used as a training tool, provided to a sales team, employed for executive reporting or shared with third parties without risk of leaking protected data.
To replace our “REDACTED …” values with made-up data, I’ll make a small change to the JSON file:
Let’s replace the first “redactWith” entry (which previously contained “REDACTED EMAIL”) with “internet.email” and the second “redactWith” entry (which previously contained “REDACTED PHONE”) with “phone.phoneNumber”. Now re-run the redact program from the command line:
redact demo_redact.csv
Let’s view our updated redacted file (demo_redact_redacted.csv):
Note: Our changes to the JSON file now generate completely random data that looks like PHI / PII. Your entries in the email and phone number columns will differ.
This is another approach to redacting PHI / PII. We are using faker to create fictitious data that mimics the protected data. The “redactWith” parameter accepts any faker function and replaces the data in that column with made-up data which resembles PHI / PII.
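One way to picture a field that accepts either a fixed string or a generator name is a lookup table keyed by dotted names. The Python sketch below uses simple stand-in generators, not faker, and the `GENERATORS` registry, the `redact_value` helper and the fall-back-to-literal rule are assumptions for illustration rather than redact-phi's actual behavior.

```python
import random
import string


# Stand-in generators: these are NOT faker, just simple illustrations.
def fake_email(rng):
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.com"


def fake_phone(rng):
    return f"555-{rng.randint(0, 9999):04d}"


# Hypothetical registry keyed by the dotted names used in the article.
GENERATORS = {
    "internet.email": fake_email,
    "phone.phoneNumber": fake_phone,
}


def redact_value(redact_with, rng):
    """Call a registered generator, or treat the setting as a literal string."""
    generator = GENERATORS.get(redact_with)
    return generator(rng) if generator else redact_with


rng = random.Random()
print(redact_value("internet.email", rng))     # a random address at example.com
print(redact_value("phone.phoneNumber", rng))  # a random 555-xxxx number
print(redact_value("REDACTED EMAIL", rng))     # unchanged literal
```

Because each call produces a fresh value, two rows that shared a real email address end up with two different fabricated ones.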
What if I want to preserve relationships within the data? This would allow my fake document to retain the structure of the original file while removing all PHI / PII. My original document contains two entries for “Myrtie.McGlynn@hotmail.com” but, in the redacted file, Myrtie McGlynn becomes two different people. With a simple change I can fix this. Going back to the JSON file I’ll update the “tracked” parameter from “false” to “true”:
And run the redact program from the command line:
redact demo_redact.csv
Then view our redacted file (demo_redact_redacted.csv):
Success! All entries for Myrtie McGlynn have been replaced with the same fictitious person, each with the same phone number. I could mark any PHI / PII column (SSN, address, date of birth, etc.) as “tracked” and consistently replace that data with the same made-up person.
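The idea behind tracking can be sketched as a cache: the first time an original value is seen, a replacement is generated and remembered, and every later occurrence reuses it. The Python below is a minimal illustration of that technique; `make_tracked`, `next_email` and the second sample address are hypothetical and are not taken from redact-phi.

```python
import itertools


def make_tracked(generator):
    """Wrap a generator so each distinct original value maps to one fake value."""
    seen = {}

    def tracked(original):
        if original not in seen:
            seen[original] = generator()  # generate once, then reuse
        return seen[original]

    return tracked


# Simple stand-in generator that numbers its fake people.
counter = itertools.count(1)


def next_email():
    return f"person{next(counter)}@example.com"


tracked_email = make_tracked(next_email)
emails = [
    "Myrtie.McGlynn@hotmail.com",
    "other@example.org",  # invented second customer
    "Myrtie.McGlynn@hotmail.com",
]
fakes = [tracked_email(e) for e in emails]
print(fakes)  # ['person1@example.com', 'person2@example.com', 'person1@example.com']
```

The mapping lives only for the duration of one run in this sketch, so the same original value would map to a different replacement on the next run.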
What if I need to quickly generate made-up PHI / PII? I’ll make a quick change to the CSV file by removing the “order date” and “order amount” columns, then duplicating the “email” and “phone number” header row for as many entries as I need. My demo_redact.csv file now looks like this (I want to create three fake entries):
To make every fabricated entry unique, I need to turn off tracking in the JSON file:
I’m ready to create fake people! Run:
redact demo_redact.csv
And view the results (demo_redact_redacted.csv):
If I need more pretend people, I’ll add more entries to my CSV file.
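Duplicating header rows by hand gets tedious for larger batches, so the template file can be generated instead. This Python sketch writes a header row followed by a chosen number of placeholder rows; it assumes the placeholder rows simply repeat the header text, as in the walkthrough above, and `build_template` is a hypothetical helper.

```python
import csv
import io


def build_template(headers, count):
    """Return CSV text with a header row plus `count` placeholder rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(headers)
    for _ in range(count):
        # Placeholder values; the redaction step is expected to overwrite them.
        writer.writerow(headers)
    return out.getvalue()


template = build_template(["email", "phone number"], 3)
print(template)
```

Saving the returned text as demo_redact.csv and running the redact command should then yield one fabricated entry per placeholder row, provided tracking is turned off.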
With these approaches, I have numerous tools for controlling PHI / PII. Because I build the strategies, I am confident that data is removed, redacted or fabricated without risk of leaking protected data. For more information about working with PHI / PII see redact-phi.
Disclaimer: I am a contributor to redact-phi.