2008-08-21 23:15:13

Hash fields in a CSV file with Python and SHA256

I was recently asked for a quick and dirty script that takes a CSV file and hashes selected fields. I decided to use Python because of the nice speed of the hashlib module (and because I simply like Python!).

If you look at the code, it doesn't contain two things ... comments and error checking. I know, I know. That's coming in the next version. Remember, I was asked for "quick and dirty"!

One thing to note is that the script can handle multiple field arguments. So, for example, you can use the following arguments: "-f 2 -f 3 -f 8" and it will hash the 3rd, 4th and 9th fields, since counting begins at zero :)
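To make the repeated "-f" behaviour concrete, here is a minimal sketch that rebuilds only the field option and feeds it an explicit argument list. The sample row is made up for illustration, and the snippet is written to run under Python 3, where optparse is still available.

```python
from optparse import OptionParser

# Rebuild only the -f/--field option from the script for illustration.
parser = OptionParser()
parser.add_option('-f', '--field',
                  dest='field_info',
                  action='append',   # each -f appends to a list
                  type='int',
                  metavar='FIELDNUM')

# Pass an explicit argument list instead of reading sys.argv.
opts, args = parser.parse_args(['-f', '2', '-f', '3', '-f', '8'])
print(opts.field_info)   # [2, 3, 8]

# Zero-based indexes 2, 3 and 8 pick the 3rd, 4th and 9th columns.
row = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']   # hypothetical sample row
print([row[i] for i in opts.field_info])   # ['c', 'd', 'i']
```

If no "-f" is given at all, the list is never created and the option value stays None, which is why the script checks it before doing any work.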

The other thing to notice is that there is a --md5 option. I was asked to add that feature since the SHA256 output is extremely long and takes up a lot of space in a database when you're talking a few billion rows. The --md5 option takes the SHA256 hex output (64 characters) and performs an MD5 hash on it, reducing it to a 32-character string. Keep in mind that a shorter digest has a higher chance of collisions than the full SHA256 output, so the result is compact rather than strictly unique.
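The size difference is easy to see in a small sketch. This is an illustration rather than the script itself: it assumes Python 3, where hashlib needs bytes, so the text is encoded as UTF-8 first. The sample value is made up.

```python
import hashlib

def perf_hash(data, use_md5=False):
    """Return the SHA256 hex digest of data, optionally re-hashed with MD5."""
    # Assumption: Python 3, so the text must be encoded to bytes before hashing.
    sha_hex = hashlib.sha256(data.encode('utf-8')).hexdigest()
    if use_md5:
        # MD5 is applied to the SHA256 hex string, not to the original value.
        return hashlib.md5(sha_hex.encode('ascii')).hexdigest()
    return sha_hex

value = 'alice@example.com'   # hypothetical field value
print(len(perf_hash(value)))         # 64
print(len(perf_hash(value, True)))   # 32
```

Halving the stored string is the whole point of the option; the MD5 step adds no security on top of SHA256.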

#!/usr/bin/env python
#################################################################
# Who: James Conner
# When: June 16, 2008
# What: csv_hash.py
# Version: 1.0.1
# Why: Hash fields within a CSV file
#################################################################
# Updates:
# Ver:Who:When:Why
# 0.0.1:James Conner:Jun 16 2008:Initial creation
# 0.0.2:James Conner:Jul 07 2008:Added multi field hashing
# 1.0.0:James Conner:Aug 21 2008:Added MD5 output of SHA256 hash
# 1.0.1:James Conner:Aug 21 2008:Fixed field check
#################################################################

#################################################################
# Import Modules
#################################################################
import sys
import hashlib
import csv
from optparse import OptionParser

#################################################################
# Global Variables
#################################################################

#################################################################
# Option Parser
#################################################################
parser = OptionParser(version = "1.0.1")

parser.add_option('-d','--delimiter',
        dest='delimiter_info',
        default=',',
        metavar='DELIMIT',
        help=('The delimiter. Default is \',\'.'))

parser.add_option('-i','--infile',
        dest='infilename_info',
        default='',
        metavar='INFILENAME',
        help=('Name of the input file'))

parser.add_option('-f','--field',
        dest='field_info',
        action='append',
        type='int',
        metavar='FIELDNUM',
        #default='0', # int does not allow a default value
        help=('Which field to hash (zero-based). Repeat -f for multiple fields.'))

parser.add_option('-o','--outfile',
        dest='outfilename_info',
        default='outfile.csv',
        metavar='OUTFILENAME',
        help=('Name of the output file'))

parser.add_option('--md5',
        dest='md5_info',
        action='store_true',
        default=False,
        metavar='MD5',
        help=('Use MD5 against the SHA256 hash.'))

(opts, arg) = parser.parse_args()

#################################################################
# Functions
#################################################################

def check_options(option,option_name):
        if not option:
                print "ERROR: %s variable has not been assigned!" % option_name
                sys.exit(10)

def perf_hash (DATA,MD5):
        # Always compute the SHA256 hex digest of the field value.
        SHA_HEX = hashlib.sha256(DATA).hexdigest()
        if MD5:
                # Optionally shorten the result: MD5 of the SHA256 hex string.
                return hashlib.md5(SHA_HEX).hexdigest()
        return SHA_HEX

def csv_file (FILENAME,DELIMITER,FIELDNUM,OUTFILE,MD5FLAG):
        # Python 2's csv module expects files opened in binary mode.
        r = csv.reader(open(FILENAME, 'rb'), dialect='excel', delimiter=DELIMITER)
        w = csv.writer(open(OUTFILE, 'wb'), dialect='excel', delimiter=DELIMITER)
        for rows in r:
                # Replace each requested (zero-based) field with its hash.
                for f in FIELDNUM:
                        rows[int(f)] = perf_hash(rows[int(f)],MD5FLAG)
                w.writerow(rows)
        sys.exit(0)

#################################################################
# Program Execution
#################################################################

check_options(opts.infilename_info, "infilename_info")
check_options(opts.field_info, "field_info")

csv_file(opts.infilename_info, opts.delimiter_info, opts.field_info, opts.outfilename_info, opts.md5_info)
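As mentioned earlier, this version has no error checking. One way the missing field check could look is sketched below. It is not the promised next version of the script: it assumes Python 3, and the names hash_value and hash_rows are hypothetical, as is the sample data. It works on file objects so it can be tried without touching the disk.

```python
import csv
import hashlib
import io

def hash_value(text, use_md5=False):
    """SHA256 hex digest of text, optionally re-hashed with MD5."""
    # Assumption: Python 3, so text is encoded to bytes before hashing.
    digest = hashlib.sha256(text.encode('utf-8')).hexdigest()
    if use_md5:
        digest = hashlib.md5(digest.encode('ascii')).hexdigest()
    return digest

def hash_rows(infile, outfile, fields, delimiter=',', use_md5=False):
    """Copy CSV rows from infile to outfile, hashing the zero-based fields."""
    reader = csv.reader(infile, dialect='excel', delimiter=delimiter)
    writer = csv.writer(outfile, dialect='excel', delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        for f in fields:
            # Fail with a clear message instead of a bare IndexError.
            if not 0 <= f < len(row):
                raise ValueError('row %d has no field %d' % (line_no, f))
            row[f] = hash_value(row[f], use_md5)
        writer.writerow(row)

# Made-up sample data; like the script, every row is hashed, header included.
src = io.StringIO('id,name,email\n1,Ann,ann@example.com\n', newline='')
dst = io.StringIO(newline='')
hash_rows(src, dst, fields=[2])
print(dst.getvalue())
```

With real files, opening both with open(name, newline='') keeps the csv module in charge of line endings.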

James Conner