Hashing on Pandas DataFrame More Effectively
A >10x faster way to hash column(s) than a for loop
We use hashing to protect sensitive data in multiple ways, e.g. to conceal data before sharing it out, or to store hashed passwords instead of clear text (with a salt and multiple hashing rounds, of course).
However, with a well-known hashing algorithm like SHA512, if you know someone's ID, it is easy to compute that person's hash value. A salt overcomes this: we concatenate it to the original text before hashing, so knowing the ID alone is no longer enough to reproduce the hash value.
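To make the idea concrete, here is a minimal sketch of hashing an ID with and without a salt (the sample values are from the table below):

```python
import hashlib

id_value = "1782860041522"
salt = "Winterfell"

# Without salt: anyone who knows the ID can reproduce this digest.
unsalted = hashlib.sha512(id_value.encode("utf-8")).hexdigest()

# With salt: the ID alone is no longer enough to reproduce the digest.
salted = hashlib.sha512((id_value + salt).encode("utf-8")).hexdigest()

print(unsalted != salted)  # True: the two digests differ
```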
For Loop
╔═══════════════╦════════════════════╦═══════════════╗
║ ID ║ Name ║ Salt ║
╠═══════════════╬════════════════════╬═══════════════╣
║ 1782860041522 ║ Jon Snow ║ Winterfell ║
║ 0109330999106 ║ Cersei Lannister ║ Kings Landing ║
║ 5748532164332 ║ Daenerys Targaryen ║ Dragonstone ║
║ 4353787122878 ║ Pyke Doe ║ Euron Greyjoy ║
╚═══════════════╩════════════════════╩═══════════════╝
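For reference, the sample table above can be reconstructed as a small DataFrame (the article later assumes the same shape scaled up to 100,000 rows; this construction is just a sketch):

```python
import pandas as pd

# Build the sample table from the article as a DataFrame.
df = pd.DataFrame({
    'ID':   ['1782860041522', '0109330999106', '5748532164332', '4353787122878'],
    'Name': ['Jon Snow', 'Cersei Lannister', 'Daenerys Targaryen', 'Pyke Doe'],
    'Salt': ['Winterfell', 'Kings Landing', 'Dragonstone', 'Euron Greyjoy'],
})
print(df)
```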
Suppose that we have a DataFrame df containing 100,000 rows of records like the example above. Below is a function that performs the following actions:
- takes the source DataFrame as sourcedf
- concatenates the values of the columns listed in column (ID and Salt in this case)
- generates a SHA512 hash of each concatenated value
- puts the hashed values into the destination DataFrame destinationdf, in a new column whose name starts with hash_ followed by the names of all columns in column (the column name will be hash_IDSalt in this case)
import hashlib
import pandas as pd

def hash(sourcedf, destinationdf, *column):
    # Build the destination column name, e.g. hash_IDSalt
    columnName = 'hash_'
    for i in column:
        columnName = columnName + i
    hashColumn = pd.Series(dtype='object')
    # Concatenate the listed columns row by row and hash each result
    for i in range(len(sourcedf[column[0]])):
        concatstr = ''
        for j in column:
            concatstr = concatstr + sourcedf[j][i]
        hashColumn.at[i] = hashlib.sha512(concatstr.encode("utf-8")).hexdigest()
    destinationdf[columnName] = hashColumn

hash(df, df, 'ID', 'Salt')
You can see the result in the screenshot below. It took more than 7 minutes.
Lambda Function
The for-loop function above is slow because it goes through the columns row by row. Think about it: if we had 10,000,000 records, how long would it take?
I searched around for a faster solution and finally found one using a lambda function. Below is the function refactored to use a lambda:
import hashlib
import pandas as pd

def hash(sourcedf, destinationdf, *column):
    # Concatenate the listed columns in one vectorized step, then hash each value
    destinationdf['hash_' + ''.join(column)] = (
        pd.DataFrame(sourcedf[list(column)].values.sum(axis=1))[0]
        .str.encode('utf-8')
        .apply(lambda x: hashlib.sha512(x).hexdigest().upper())
    )

hash(df, df, 'ID', 'Salt')
And voilà! It took 202 ms.
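One detail worth noting: the lambda version upper-cases the digest while the for-loop version does not, so the two columns differ in case only. A quick sketch to confirm the two approaches otherwise agree (using a hypothetical two-row sample):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({
    'ID':   ['1782860041522', '0109330999106'],
    'Salt': ['Winterfell', 'Kings Landing'],
})

# Row-by-row digest, as in the for-loop version.
loop_digest = df.apply(
    lambda row: hashlib.sha512((row['ID'] + row['Salt']).encode('utf-8')).hexdigest(),
    axis=1,
)

# Vectorized digest, as in the lambda version (upper-cased).
vector_digest = (
    pd.DataFrame(df[['ID', 'Salt']].values.sum(axis=1))[0]
    .str.encode('utf-8')
    .apply(lambda x: hashlib.sha512(x).hexdigest().upper())
)

print((loop_digest.str.upper() == vector_digest).all())  # True
```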