Logo
Data Masking

Data Masking #

With the introduction of laws on user data protection, the protection of personal privacy data has risen to the legal level. Traditional application systems generally lack protection measures for personal privacy data.

Data masking can achieve encryption, masking, and replacement of the data returned by the production database according to user-defined masking rules without any changes to the data in the production database to ensure the sensitivity of the production environment data can be protected.

Challenges & Goals #

In real business scenarios, relevant DevOps teams often need to implement and maintain a set of masking functions by themselves according to data masking requirements, and the masking functions are often coupled with various business logic. Additionally, different business systems are difficult to reuse. When the masking scenario changes, the masking function maintained by itself often faces the risk of refactoring or modification.

According to industry needs for data masking and the pain points of business transformation, it provides a complete, safe, transparent, and low transformation cost data masking integration solution, which is the main design goal of the SphereEx-DBPlusEngine data masking module.

Currently, the database products supported by the SphereEx-DBPlusEngine data masking plug-in are MySQL, PostgreSQL, openGauss, SQLServer, Oracle, Hive and Presto. The specific supported versions are as follows.

databaseversion supported
1MySQL5.7.x ~ 8.x
2PostgreSQL9.6 and above
3openGauss2.1.0 and above
4SQLServer15.0 and above
5Oracle10g and above
6Hive2.3.2
7Presto0.181, 0.272

Application Scenarios #

Whether we’re talking about a new or a mature business that has already been launched, you can access the data masking function of SphereEx-DBPlusEngine to quickly complete the configuration of mask rules. Customers can use the data masking function transparently without developing a masking function coupled to the business system, and without changing any business logic and SQL.

Usage Limitations: #

  • Masked columns only support string types and do not support other non-string types.
  • Views created on tables with configured masking cannot use encryption and decryption functionality.

Process #

SphereEx-DBPlusEngine achieves the desensitization of the original data by parsing the SQL queried by users and masking the SQL execution results according to the desensitization rules provided by users.

Overall Architecture #

Mask Architecture

The desensitization module intercepts the SQL initiated by the user and analyzes and executes it through the SQL syntax parser. It then masks the query results by finding out the fields to be desensitized and the desensitization algorithm to be used according to the rules passed specified by the user and returns to the client.

Mask Rules #

Before explaining the whole process in detail, we need to first understand the desensitization rules and configuration, which is the basis of understanding the whole process.

Desensitization configuration is mainly divided into three parts: data source configuration, desensitization algorithm configuration, and desensitization table configuration:

Note: The mask rule takes effect after it is created.

mask rule

Data source configuration: the configuration of the data source.

Mask algorithm configuration: currently, SphereEx-DBPlusEngine has a variety of built-in masking algorithms: MD5, KEEP_FIRST_N_LAST_M, KEEP_FROM_X_TO_Y, MASK_FIRST_N_LAST_M, MASK_FROM_X_TO_Y, MASK_BEFORE_SPECIAL_CHARS, MASK_AFTER_SPECIAL_CHARS, SphereEx:PERSONAL_IDENTITY_NUMBER_RANDOM_REPLACE, SphereEx:MILITARY_IDENTITY_NUMBER_RANDOM_REPLACE, SphereEx:TELEPHONE_RANDOM_REPLACE, SphereEx:LANDLINE_NUMBER_RANDOM_REPLACE, SphereEx:UNIFIED_CREDIT_CODE_RANDOM_REPLACE and GENERIC_TABLE_RANDOM_REPLACE.

Users can also implement a set of desensitization algorithms by implementing the interface provided by ShardingSphere.

Please refer to Built-in Mask Algorithms for more details about the type of algorithm.

Mask table configuration: used to tell ShardingSphere which column in the data table is used for data desensitization and which algorithm is used for desensitization.

Logic column: The logical name used to calculate the masked column, which is the logical identifier of a column in SQL.

Mask Process #

For example, if there is a table in the database called t_user and there is a field in the table called phone_number that uses MASK_FROM_X_TO_Y, Apache ShardingSphere does not change the data stored.

To achieve the desensitization effect, it’ll only mask the result according to the desensitization algorithm.

As shown in the picture below:

mask process

Sample #

The following is a sample of masking the telephone number field to learn the configuration and use of the SphereEx-DBPlusEngine.

Please refer to DistSQL Maskingfor more details about the syntax of masking.

  1. Prepare a MySQL database instance for access. Create a new database: demo_ds_0.
mysql> CREATE DATABASE demo_ds_0;
  1. Create a logic database: mask_db in SphereEx-DBPlusEngine.
mysql> create database mask_db;

mysql> show databases;

mysql> use mask_db;

mysql> REGISTER STORAGE UNIT ds_0 (
    HOST="192.168.xx.103",
    PORT=3306,
    DB="demo_ds_0",
    USER="test",
    PASSWORD="Test@123"
);

mysql> SHOW STORAGE UNITS;
  1. Create a table in mask_db and insert the data.
mysql> CREATE TABLE t_mask (
    `id` int(11) NOT NULL,
    `telephone` varchar(64) DEFAULT NULL,
    `address` varchar(64) DEFAULT NULL,
    PRIMARY KEY (`id`)
);
Query OK, 0 rows affected (0.22 sec)

mysql> INSERT INTO t_mask (id, telephone, address) VALUES 
	(1,'13888888888','Beijing'),
	(2,'18000100001','Nanjing'),
	(3,'17337312345','Hangzhou'),
	(4,'13654054321','Chengdu');
Query OK, 4 rows affected (0.06 sec)
  1. Create a mask rule

Mask the characters of the telephone field with * from position 4 to 7.

mysql> CREATE MASK RULE t_mask (
 	COLUMNS(
 	(NAME=telephone,TYPE(NAME='KEEP_FIRST_N_LAST_M',PROPERTIES("first-n"=3,"last-m"=4,"replace-char"="*"))),
 	(NAME=address,TYPE(NAME='MD5'))
 	));
Query OK, 0 rows affected (0.36 sec)

mysql> SHOW MASK RULES;
+--------+-----------+---------------------+-----------------------------------+
| table  | column    | algorithm_type      | algorithm_props                   |
+--------+-----------+---------------------+-----------------------------------+
| t_mask | telephone | KEEP_FIRST_N_LAST_M | first-n=3,replace-char=*,last-m=4 |
| t_mask | address   | MD5                 |                                   |
+--------+-----------+---------------------+-----------------------------------+
2 rows in set (0.15 sec)
  1. Query the data after masking.
mysql> SELECT * FROM t_mask;
+----+-------------+----------------------------------+
| id | telephone   | address                          |
+----+-------------+----------------------------------+
|  1 | 138****8888 | 78fb473f134eed43c959f9ebdeeb4050 |
|  2 | 180****0001 | 080fa4ae8318b4cc0be7f76fefe4838e |
|  3 | 173****2345 | 95e9c48630a252346b29cd52b365d3e2 |
|  4 | 136****4321 | f4aa575f70b3f78887deb96ce611b187 |
+----+-------------+----------------------------------+
4 rows in set (0.00 sec)

You can see that part of the telephone numbers has been masked to * as expected.

  1. Change the masking rule based on the masking configuration above and check again.

Mask the telephone field with ? except for the first and the last characters.

mysql> ALTER MASK RULE t_mask (
    COLUMNS(
    (NAME=telephone,TYPE(NAME='KEEP_FIRST_N_LAST_M',PROPERTIES("first-n"=1,"last-m"=1,"replace-char"="?"))),
    (NAME=address,TYPE(NAME='MD5'))
    ));
Query OK, 0 rows affected (0.27 sec)

mysql> SHOW MASK RULES;
+--------+-----------+---------------------+-----------------------------------+
| table  | column    | algorithm_type      | algorithm_props                   |
+--------+-----------+---------------------+-----------------------------------+
| t_mask | telephone | KEEP_FIRST_N_LAST_M | first-n=1,last-m=1,replace-char=? |
| t_mask | address   | MD5                 |                                   |
+--------+-----------+---------------------+-----------------------------------+
2 rows in set (0.04 sec)
  1. Query the data again and check the results.
mysql> SELECT * FROM t_mask;
+----+-------------+----------------------------------+
| id | telephone   | address                          |
+----+-------------+----------------------------------+
|  1 | 1?????????8 | 9bd8ad60a4e25ee4f07b545c65e268c6 |
|  2 | 1?????????1 | 99116a7dc5f614c84f71834820505048 |
|  3 | 1?????????5 | 7e2d6cf42a4f42358373f9363f341d3a |
|  4 | 1?????????1 | f418a8bcbe75adb8721a9eacf16eeb0d |
+----+-------------+----------------------------------+
4 rows in set (0.01 sec)
  1. Drop the masking rule and query the data.
mysql> DROP MASK RULE t_mask;
Query OK, 0 rows affected (0.09 sec)

mysql> SHOW MASK RULES;
Empty set (0.00 sec)

mysql> SELECT * FROM t_mask;
+----+-------------+----------+
| id | telephone   | address  |
+----+-------------+----------+
|  1 | 13888888888 | Beijing  |
|  2 | 18000100001 | Nanjing  |
|  3 | 17337312345 | Hangzhou |
|  4 | 13654054321 | Chengdu  |
+----+-------------+----------+
4 rows in set (0.01 sec)

After dropping the rule, the data is recovered and can be queried properly as expected.