XLS to CSV

A common situation for those of us in the programming world is that we don’t always work with the same tools as our non-tech peers.

Case in point: I received a big batch of files in XLS/XLSX format, files so big that LibreOffice has trouble working with them. Since I want to perform quick processing on that data, and I already have scripts that process similar data in CSV, the simplest path is to transform those files into plain, ugly, useful CSV files.

Then again, there are 100 files, and I don’t feel like dancing around each one: opening it, clicking Save As, selecting CSV, telling LibreOffice that this is a semicolon-separated CSV file, etc.

Enter pandas.

Pandas

Pandas is a huge Python library. Apparently it is quite suited for data analysis and has a lot of goodies in it. For me, right now, it is a nice library that will let me open an .xlsx file and save it as CSV. BINGO!

Basic XLS to CSV

Opening an XLS file in pandas is a delight. There’s a single function to call: read_excel. It supports a bunch of options that will prove really useful, but for a quick test you can simply write something along the lines of:

    import pandas as pd
    data = pd.read_excel('myfile.xlsx')

Once the file is opened and parsed into the data variable, saving is equally easy:

    data.to_csv(outfile)

This is the vanilla way to do it. No options, and, as you might expect, something weird may happen. The rest of this post deals with the options and sub-options that fine-tune the reading and writing process.

Multiple Sheets

It is a fact: an Excel file, more often than not, has multiple sheets. I am only interested in one of them, and it can be selected with the sheet_name parameter (called sheetname in older pandas versions) on load.

    import pandas as pd
    data = pd.read_excel('myfile.xlsx', 'MySheet')

It is the second parameter, so I don’t bother with the name and just pass it positionally.

Column index

Not my case, but you may have a specific column that acts as an index. The index_col parameter takes care of that; I set it to None.

    import pandas as pd
    data = pd.read_excel('myfile.xlsx', 'MySheet', 
                        index_col=None)

CSV Styling

There’s not a lot to do with the CSV format; nevertheless, you can perform some basic tidy-ups, like forcing UTF-8 encoding, using a semicolon separator, and specifying the float format.

    data.to_csv('output.csv', 
                encoding='utf-8', 
                sep=';', 
                float_format='%.2f')
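One more option worth knowing about (not used in the snippet above): by default to_csv also writes the DataFrame’s numeric index as an extra first column, which pollutes the output if the index is just the automatic 0, 1, 2, … counter. Passing index=False suppresses it. A minimal sketch, using a small hypothetical frame in place of the parsed xlsx data:

```python
import pandas as pd

# Hypothetical small frame standing in for the parsed xlsx data.
data = pd.DataFrame({'id': [1, 2], 'value': [3.5, 4.25]})

# index=False drops the spurious 0,1,... index column; with no path
# argument, to_csv returns the CSV text as a string.
csv_text = data.to_csv(sep=';', float_format='%.2f', index=False)
print(csv_text)
# id;value
# 1;3.50
# 2;4.25
```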

Ints are floats?!

I’ve found some problems when dealing with integer data. The xlsx input has columns that I know are integers, but pandas treats them as floats.

So, after my first xlsx-to-CSV conversion I got something like:

500000 (in excel) -> 499999.89999 (in CSV)

As you can imagine, this is not acceptable at all. For float values I don’t mind losing some precision in the process (I actually did not find any precision loss in my dataset), but integers should be integers.
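One common reason for this behavior (an assumption on my part, not something I verified against these exact files) is missing cells: pandas’ classic int64 dtype has no way to represent NaN, so a column with any gap in it gets upcast to float64 on load. A tiny demonstration:

```python
import pandas as pd

# A column with a missing cell is upcast to float64, because the
# classic int64 dtype has no representation for NaN.
with_gaps = pd.Series([1, 2, None])
print(with_gaps.dtype)   # float64

# The same data with no gaps keeps an integer dtype.
no_gaps = pd.Series([1, 2, 3])
print(no_gaps.dtype)
```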

This can be fixed by forcing the cast type of each column. For that, the converters parameter is our friend. It takes a dictionary mapping column headers to casting functions.


    converters = {
        'header1': int,
        'header2': float
    }

    data_xls = pd.read_excel(fileName,
                             'ResultsSheet',
                             index_col=None,
                             converters=converters,
                             na_values=['NoData'])

No data values

Some of the input files differentiate between an empty cell and a no-data value by using a string. The na_values option lists all the strings or values that should be treated as no data.


    data_xls = pd.read_excel(fileName,
                             'ResultsSheet',
                             na_values=['NoData', 'nd', 'nodata', 'NONE'])

Those values are read as NaN into the data_xls variable.

Column renaming

I noted that it’s frequent to find weird column headers. Mind you, not weird for a human, but weird when processing those columns. In my humble case, I can’t accept newlines in headers. They look nice in Excel, but they mess up the CSV output.

What pandas offers is a rename function. It takes a dictionary mapping current header names to new ones:

    headers = {
        'Header with \n newlines': 'new header',
        'whateverName': 'my new name',
    }

    data.rename(columns=headers, inplace=True)

To clean up the input headers we just have to loop over the existing headers and remove any unwanted characters. To get the headers from a pandas data frame, we can call the list function on it.

    fileHeaders = list(data)
    newHeaders = {}
    for header in fileHeaders:
        newHeaders[header] = header.replace('\n','')

    data.rename(columns=newHeaders, inplace=True)
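With all of the above in place, the batch part (the whole point, remember, was not clicking through 100 files) becomes a small loop. This is just a sketch under my assumptions: the sheet name, the na_values list and the separator are placeholder values carried over from the previous examples, so adjust them to your own files:

```python
from pathlib import Path

import pandas as pd

# Placeholder settings taken from the examples above; adjust to taste.
SHEET = 'MySheet'
NA_VALUES = ['NoData', 'nd', 'nodata', 'NONE']

def convert(xlsx_path):
    data = pd.read_excel(xlsx_path, SHEET, index_col=None,
                         na_values=NA_VALUES)
    # Strip newlines from the headers, as in the rename section.
    data.rename(columns={h: h.replace('\n', '') for h in list(data)},
                inplace=True)
    out_path = Path(xlsx_path).with_suffix('.csv')
    data.to_csv(out_path, encoding='utf-8', sep=';', float_format='%.2f')
    return out_path

# Convert every xlsx file sitting in the current directory.
for xlsx in sorted(Path('.').glob('*.xlsx')):
    print(xlsx, '->', convert(xlsx))
```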

Conclusion

I am quite surprised at the pandas library and its wide range of options. While reading the documentation I’ve found a lot of stuff to like, but this post only plays with a very, very small subset of what pandas can do.

References

Pandas website: https://pandas.pydata.org/
