Guide to Anonymising data

We anonymize data to mask confidential information, but preserving the insights that emerge from the data. Here’s a step-by-step process:

  1. List all columns
  2. Pick columns action: Drop, Keep or Change. If a column has confidential business results, drop or change it.
  3. Fill the strategy to anonymize the column. Here are some strategies:
    • Categories (text columns with few unique values): Replace values. E.g.:
      • State: Replace Indian state names with US state names
      • Product: Replace banking products with retail products
      • City: Shuffle the cities (i.e. replace values with others in the same list)
    • Ordered categories (categories with order): Replace preserving order. E.g.:
      • Designation: Replace preserving order (i.e. If Manager -> Boss, Asst Manager -> Asst Boss)
    • Hierarchies (related columns): Replace as a group. E.g.:
      • State & District: Replace (State, District) with a new (State, District) combination
    • IDs (e.g. email ID, mobile, etc). Substitute alphanumerics. Retain symbols.
    • Words: Replace sensitive words
    • Dates: May be retained
    • Integers: Add a random integer. For example, ROUND(±20% * val * RANDOM())
    • Floats: Add a random number. For example, ±20% * value * RANDOM()
  4. Reduce data size by sampling. Take a natural subset by applying a filter. Use 2+ values so that filters show multiple values. E.g:
    • instead of world data, use data for any 2+ continents, or 2+ countries
    • instead of all products, pick any 2+ categories or 2+ products
    • instead of 12 months data, pick 2+ months

This is example of a plan:

Anonymization plan

Useful tools:

Reading material