A common situation for us (people in the programming/computing/processing world) is that we don’t always work with the same tools as some of our non-tech peers.
Case in point, I received a big bunch of files in XLS/XLSX format, very big files, LibreOffice has trouble working with them. Since I want to perform quick processing on that data, and I already have scripts that process similar data in CSV, the simplest path is to transform those files to plain, ugly, useful CSV files.
Then again, there are 100 files, and I don’t feel like dancing around each one: opening, clicking save as, selecting CSV, telling LibreOffice that this is a semicolon separated CSV file … etc etc.
Entering pandas.
Pandas
Pandas is a huge python library. Apparently, it is quite suited for data analysis and has a lot of goodies within it. For me, right now, it is a nice library that will let me open an .xlsx and save it as a CSV. BINGO!
- Basic example
- Multiple Sheets
- Column Indices
- CSV output style
- Float or Int
- Column Headers
- No data type
Basic XLS to CSV
Opening one XLS in pandas is a delight. There’s simply one function to call: read_excel. It supports a bunch of options that will prove really useful, but for a quick test you can simply write something on the lines of:
import pandas as pd data = pd.read_excel('myfile.xlsx')
Once the file is opened and parsed into the data variable, saving is equally easy:
data.to_csv(outfile)
This is the vanilla way to do it. No options, and as you expect, something weird may happen. The rest of this post simply deals with options and sub-options to fine-grain our reading and writing process.
Multiple Sheets
It is a fact, an excel file, more often than not, has multiple sheets. I am just interested in one of the sheets, and it can be selected using the sheetname
parameter on load.
import pandas as pd data = pd.read_excel('myfile.xlsx', 'MySheet')
It is the second parameter, so I don’t bother with the name and just use it as a positional parameter.
Column index
Not my case, but you may have a specific column that acts as an index. The index_col
parameter takes care of that. I set it to None.
import pandas as pd data = pd.read_excel('myfile.xlsx', 'MySheet', index_col=None)
CSV Styling
There’s not a lot to do with the CSV format, nevertheless you can perform some basic tidy ups like: forcing an UTF-8 encoding, a semicolon separator and specify the float format.
data.to_csv('output.csv', encoding='utf-8', sep=';', float_format='%.2f')
Ints are floats?!
I’ve found some problems while dealing with integer data. The xlsx input does have columns that I know are integers, but pandas is actually treating those as floats.
So, after my first xlsx to CSV format change I got something like:
500000 (in excel) -> 499999.89999 (in CSV)
As you can imagine, this is not acceptable at all. For float values I don’t mind losing some precision in the process (I actually did not find any precision loss in my dataset), but integers should be integers.
This can be achieved forcing the casting type of each column. For that the converters
parameters is our friend. It takes a dictionary with caster functions.
converters = { 'header1' : int, 'header2' : float } data_xls = pd.read_excel(fileName, 'ResultsSheet', index_col=None, converters=converters, na_values = ['NoData'])
No data values
Some of the input files differentiate between an empty set and no data value using a string. The na_values
option keeps track of all the strings or values that should be treated as no_data.
data_xls = pd.read_excel(fileName, 'ResultsSheet', na_values = ['NoData', 'nd', 'nodata', 'NONE'])
Those values are read as None into the data_xls
variable.
Column renaming
I noted that it’s frequent to find weird column headers. Mind me, not weird for a human, but weird when processing those columns. In my humble case, I can’t accept newlines in headers. They look nice in excel, but mess the CSV output.
What pandas offers is a rename
function. It takes a dictionary of headers with the current name and the new value:
headers = { 'Header with \n newlines' : 'new header', 'whateverName' : 'my new name', } data.rename(columns=headers, inplace=True)
To curate the input headers we just have to loop the existing headers and remove any unwanted characters. To get the headers from a pandas data frame, we use the list
function.
fileHeaders = list(data) newHeaders = {} for header in fileHeaders: newHeaders[header] = header.replace('\n','') data.rename(columns=newHeaders, inplace=True)
Conclusion
I am quite surprised at the pandas library and its wide range of options. While reading the documentation I’ve found a lot of stuff to like but this post only plays with a very very very small subset of what pandas can do.
References
Pandas website ⇒GO
Pingback: The dropped posts, and 2017 closing | Castells