Microsoft Excel is the most widely-used spreadsheet software in the world, and for good reason: the user-friendly interface and powerful built-in tools make it simple to work with data.
But if you want to do more advanced data processing, you’ll need to go beyond Excel’s capabilities and start using a scripting/programming language like Python. Rather than manually copying your data into databases, here’s a quick tutorial on how to load your Excel data into Python using Pandas.
What Is Pandas?
Python Data Analysis Library (“Pandas”) is an open-source library for the Python programming language that’s used for data analysis and data manipulation.
Pandas loads data into Python objects known as Dataframes, which store data in rows and columns just like a traditional database. Once a Dataframe is created it can be manipulated using Python, opening up a world of possibilities.
Note: You must have Python 2.7 or later to install Pandas.
To begin working with Pandas on your machine you will need to import the Pandas library. If you’re in search of a heavyweight solution you can download the Anaconda Python Distribution, which has Pandas built-in. If you don’t have a use for Anaconda, Pandas is simple to install in your terminal.
Pandas is a PyPI package, which means you can install using PIP for Python via the command line. Modern Mac systems come with PIP. For other Windows, Linux, and older systems it’s easy to learn how to install PIP for Python.
Once you’ve opened your terminal, the latest version of Pandas can be installed using the command:
>> pip install pandas
Pandas also requires the NumPy library, let’s also install this on the command line:
>> pip install numpy
You now have Pandas installed and ready to create your first DataFrame!
Prepping the Excel Data
For this example, let’s use a sample data set: an Excel workbook titled Cars.xlsx.
This data set displays the make, model, color, and year of cars entered into the table. The table is displayed as an Excel range. Pandas is smart enough to read the data appropriately.
This workbook is saved to the Desktop directory, here is the file path used:
You will need to know the file path of the workbook to utilize Pandas. Let’s begin by opening up Visual Studio Code to write the script. If you don’t have a text editor, we recommend either Visual Studio Code or Atom Editor.
Writing the Python Script
Now that you have your text editor of choice, the real fun begins. We’re going to bring together Python and our Cars workbook to create a Pandas DataFrame.
Importing the Python Libraries
Open your text editor and create a new Python file. Let’s call it Script.py.
In order to work with Pandas in your script, you will need to import it into your code. This is done with one line of code:
import pandas as pd
Here we are loading the Pandas library and attaching it to a variable “pd”. You can use any name you would like, we are using “pd” as short for Pandas.
To work with Excel using Pandas, you need an additional object named ExcelFile. ExcelFile is built into the Pandas ecosystem, so you import directly from Pandas:
from pandas import ExcelFile
Working With the File Path
In order to give Pandas access to your workbook, you need to direct your script to the location of the file. The easiest way to do this is by providing your script with the full path to the workbook.
Recall our path in this example: /Users/grant/Desktop/Cars.xlsx
You will need this file path referenced in your script to extract the data. Rather than referencing the path inside of the Read_Excel function, keep code clean by storing the path in a variable:
Cars_Path = '/Users/grant/Desktop/Cars.xlsx'
You are now ready to extract the data using a Pandas function!
Extract Excel Data Using Pandas.Read_Excel()
With Pandas imported and your path variable set, you can now utilize functions in the Pandas object to accomplish our task.
The function you will need to use is appropriately named Read_Excel. The Read_Excel function takes the file path of an Excel Workbook and returns a DataFrame object with the contents of the Workbook. Pandas codes this function as:
The “path” argument is going to be the path to our Cars.xlsx workbook, and we have already set the path string to the variable Cars_Path.
You’re ready to create the DataFrame object! Let’s put it all together and set the DataFrame object to a variable named “DF”:
DF = pd.read_excel(Cars_Path)
Lastly, you want to view the DataFrame so let’s print the result. Add a print statement to the end of your script, using the DataFrame variable as the argument:
Time to run the script in your terminal!
Running the Python Script
Open your terminal or command line, and navigate to the directory which houses your script. In this case, I have “Script.py” located on the desktop. To execute the script, use the python command followed by the script file:
Python will pull the data from “Cars.xlsx” into your new DataFrame, and print the DataFrame to the terminal!
A Closer Look at the DataFrame Object
At first glance, the DataFrame looks very similar to a regular Excel table. Pandas DataFrames are easy to interpret as a result.
Your headers are labeled at the top of the data set, and Python has filled in the rows with all your information read from the “Cars.xlsx” workbook.
Notice the leftmost column, an index starting at 0 and numbering the columns. Pandas will apply this index to your DataFrame by default, which can be useful in some cases. If you do not want this index generated, you can add an additional argument into your code:
DF = pd.read_excel(Cars_Path, index=False)
Setting the argument “index” to False will remove the index column, leaving you with just your Excel data.
Doing More With Python
Now that you have the ability to read data from Excel worksheets, you can apply Python programming any way you choose. Working with Pandas is a simple way for experienced Python programmers to work with data stored in Excel Workbooks.
The ease with which Python can be used to analyze and manipulate data is one of the many reasons why Python is the programming language of the future.
Image Credit: Rawpixel/Depositphotos