Hashing on Pandas DataFrame More Effectively

Heang Yuthakarn
2 min read · Dec 3, 2019


A >10x faster way to hash column(s) than a for loop

Photo by Markus Spiske on Unsplash

We use hashing to protect sensitive data in multiple ways, e.g. to conceal data before sharing it out, or to store passwords instead of clear text (with a ‘salt’ and multi-layer hashing, of course).

However, with a well-known hashing algorithm like SHA-512, if you know someone’s ID, it is easy to compute that person’s hash value. A salt overcomes this: it is concatenated to the original text before hashing, so knowing the ID alone is no longer enough to know the hash value.
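The idea can be sketched in a few lines (a minimal example; the `salted_hash` helper name and sample values are illustrative, not from the article):

```python
import hashlib

def salted_hash(value: str, salt: str) -> str:
    # Concatenate the salt to the original value before hashing,
    # so the ID alone is not enough to reproduce the digest.
    return hashlib.sha512((value + salt).encode("utf-8")).hexdigest()

# Same ID, different salts -> completely different digests
print(salted_hash("1782860041522", "Winterfell"))
print(salted_hash("1782860041522", "Dragonstone"))
```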

For Loop

╔═══════════════╦════════════════════╦═══════════════╗
║ ID ║ Name ║ Salt ║
╠═══════════════╬════════════════════╬═══════════════╣
║ 1782860041522 ║ Jon Snow ║ Winterfell ║
║ 0109330999106 ║ Cersei Lannister ║ Kings Landing ║
║ 5748532164332 ║ Daenerys Targaryen ║ Dragonstone ║
║ 4353787122878 ║ Pyke Doe ║ Euron Greyjoy ║
╚═══════════════╩════════════════════╩═══════════════╝

Suppose that we have a DataFrame df containing 100,000 rows of records like the example above. Below is a function that performs the following actions:

  1. take the source DataFrame as sourcedf
  2. concatenate the values of the columns listed in the column list (ID and Salt in this case)
  3. generate a SHA-512 hash of each concatenated value
  4. put the hashed values into the destination DataFrame destinationdf, in a column named hash_ plus all the column names in the list (hash_IDSalt in this case)
import hashlib
import pandas as pd

def hash(sourcedf, destinationdf, *column):
    # Build the destination column name, e.g. 'hash_IDSalt'
    columnName = 'hash_'
    for i in column:
        columnName = columnName + i
    hashColumn = pd.Series(dtype=str)
    # Row by row: concatenate the listed columns, then hash the result
    for i in range(len(sourcedf[column[0]])):
        concatstr = ''
        for j in column:
            concatstr = concatstr + sourcedf[j][i]
        hashColumn.at[i] = hashlib.sha512(concatstr.encode("utf-8")).hexdigest()
    destinationdf[columnName] = hashColumn

hash(df, df, 'ID', 'Salt')

It took more than 7 minutes to run.

Lambda Function

The for-loop function above is slow because it goes through the columns row by row. If we had 10,000,000 records, how long would it take?

I googled around for a faster solution and finally found the lambda function, applied to a vectorized column expression. Below is the function refactored to use it:

import hashlib
import pandas as pd

def hash(sourcedf, destinationdf, *column):
    # Concatenate the listed columns in one vectorized step,
    # then apply SHA-512 to each combined string
    destinationdf['hash_' + ''.join(column)] = (
        pd.DataFrame(sourcedf[list(column)].values.sum(axis=1))[0]
        .str.encode('utf-8')
        .apply(lambda x: hashlib.sha512(x).hexdigest().upper())
    )

hash(df, df, 'ID', 'Salt')

And voilà! It took 202 ms.
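As a sanity check, the vectorized approach produces exactly the same digests as hashing row by row; the sketch below verifies this on a tiny synthetic frame (the sample values are illustrative, and concatenation is written with `+` on the two columns, equivalent to the `values.sum(axis=1)` step above):

```python
import hashlib
import pandas as pd

# Tiny synthetic frame standing in for the 100,000-row df
df = pd.DataFrame({
    "ID": ["1782860041522", "0109330999106"],
    "Salt": ["Winterfell", "Kings Landing"],
})

# Vectorized: concatenate the columns, then hash each combined string
vectorized = (
    (df["ID"] + df["Salt"])
    .str.encode("utf-8")
    .apply(lambda x: hashlib.sha512(x).hexdigest().upper())
)

# Row-by-row loop, for comparison
looped = [
    hashlib.sha512((i + s).encode("utf-8")).hexdigest().upper()
    for i, s in zip(df["ID"], df["Salt"])
]

assert list(vectorized) == looped
print("digests match")
```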
