XLS to CSV

A common situation for us (people in the programming/computing/processing world) is that we don’t always work with the same tools as some of our non-tech peers.

Case in point, I received a big bunch of files in XLS/XLSX format, very big files, LibreOffice has trouble working with them. Since I want to perform quick processing on that data, and I already have scripts that process similar data in CSV, the simplest path is to transform those files to plain, ugly, useful CSV files.

Then again, there are 100 files, and I don’t feel like dancing around each one: opening, clicking save as, selecting CSV, telling LibreOffice that this is a semicolon separated CSV file … etc etc.

Entering pandas.

Pandas

Pandas is a huge python library. Apparently, it is quite suited for data analysis and has a lot of goodies within it. For me, right now, it is a nice library that will let me open an .xlsx and save it as a CSV. BINGO!

Basic example
Multiple Sheets
Column Indices
CSV output style
Float or Int
Column Headers
No data type

Basic XLS to CSV

Opening one XLS in pandas is a delight. There’s simply one function to call: read_excel. It supports a bunch of options that will prove really useful, but for a quick test you can simply write something on the lines of:

    import pandas as pd
    data = pd.read_excel('myfile.xlsx')

Once the file is opened and parsed into the data variable, saving is equally easy:

    data.to_csv(outfile)

This is the vanilla way to do it. No options, and as you expect, something weird may happen. The rest of this post simply deals with options and sub-options to fine-grain our reading and writing process.

Multiple Sheets

It is a fact, an excel file, more often than not, has multiple sheets. I am just interested in one of the sheets, and it can be selected using the sheetname parameter on load.

    import pandas as pd
    data = pd.read_excel('myfile.xlsx', 'MySheet')

It is the second parameter, so I don’t bother with the name and just use it as a positional parameter.

Column index

Not my case, but you may have a specific column that acts as an index. The index_col parameter takes care of that. I set it to None.

    import pandas as pd
    data = pd.read_excel('myfile.xlsx', 'MySheet', 
                        index_col=None)

CSV Styling

There’s not a lot to do with the CSV format, nevertheless you can perform some basic tidy ups like: forcing an UTF-8 encoding, a semicolon separator and specify the float format.

    data.to_csv('output.csv', 
                encoding='utf-8', 
                sep=';', 
                float_format='%.2f')

Ints are floats?!

I’ve found some problems while dealing with integer data. The xlsx input does have columns that I know are integers, but pandas is actually treating those as floats.

So, after my first xlsx to CSV format change I got something like:

500000 (in excel) -> 499999.89999 (in CSV)

As you can imagine, this is not acceptable at all. For float values I don’t mind losing some precision in the process (I actually did not find any precision loss in my dataset), but integers should be integers.

This can be achieved forcing the casting type of each column. For that the converters parameters is our friend. It takes a dictionary with caster functions.


    converters = {
        'header1' : int,
        'header2' : float
    }

    data_xls = pd.read_excel(fileName, 
        'ResultsSheet', 
        index_col=None, 
        converters=converters, 
        na_values = ['NoData'])

No data values

Some of the input files differentiate between an empty set and no data value using a string. The na_values option keeps track of all the strings or values that should be treated as no_data.


    data_xls = pd.read_excel(fileName, 
        'ResultsSheet', 
        na_values = ['NoData', 'nd', 'nodata', 'NONE'])

Those values are read as None into the data_xls variable.

Column renaming

I noted that it’s frequent to find weird column headers. Mind me, not weird for a human, but weird when processing those columns. In my humble case, I can’t accept newlines in headers. They look nice in excel, but mess the CSV output.

What pandas offers is a rename function. It takes a dictionary of headers with the current name and the new value:

    headers = {
    'Header with \n newlines' : 'new header',
    'whateverName' : 'my new name',
    }

    data.rename(columns=headers, inplace=True)

To curate the input headers we just have to loop the existing headers and remove any unwanted characters. To get the headers from a pandas data frame, we use the list function.

    fileHeaders = list(data)
    newHeaders = {}
    for header in fileHeaders:
        newHeaders[header] = header.replace('\n','')

    data.rename(columns=newHeaders, inplace=True)

Conclusion

I am quite surprised at the pandas library and its wide range of options. While reading the documentation I’ve found a lot of stuff to like but this post only plays with a very very very small subset of what pandas can do.

References

Pandas website ⇒GO

XLS to CSV

Pandas

Basic XLS to CSV

Multiple Sheets

Column index

CSV Styling

Ints are floats?!

No data values

Column renaming

Conclusion

References

One response to “XLS to CSV”

Leave a comment Cancel reply

Currently:

Blogroll

Tags!

Follow by Email

RSS

XLS to CSV

Pandas

Basic XLS to CSV

Multiple Sheets

Column index

CSV Styling

Ints are floats?!

No data values

Column renaming

Conclusion

References

Bump:

Related

One response to “XLS to CSV”

Leave a comment Cancel reply

Currently:

Blogroll

Tags!

Follow by Email