pandas
pandas is a data analysis library for Python. It provides high-level data structures and analytical tools for data analysis.
Installation
pandas can be installed with pip
:
$ pip install pandas
or conda
:
$ conda install pandas
The Dataframe
The core data structure in pandas is the DataFrame
. A DataFrame
is a container for holding tabular data (2D), and supports labelled rows and columns.
You can create a DataFrame
by passing in a dict
, where each key is a column name (of string
type) and the value is a list
containing the data for that column (one entry per row):
df = pd.DataFrame({ 'Name': [ 'John', 'Geoff', 'Brett' ], 'Age': [ 45, 23, 30 ], 'Height': [ 1.23, 4.56, 7.89 ],})
You can then print
the dataframe, and pandas will render the data nicely in a tabular form:
print(df)# Name Age Height# 0 John 45 1.23# 1 Geoff 23 4.56# 2 Brett 30 7.89
Selecting Columns
You can then select (extract) certain columns of data by passing in a list
of the column names you want:
print(df[['Name', 'Height']])# Name Height# 0 John 1.23# 1 Geoff 4.56# 2 Brett 7.8
The command above returns a dataframe.
Selecting Rows Based On A Column Value
To select all rows in a dataframe in where a particular column has a certain value (filtering), use the following code:
df.loc[df['column_name'] == some_value]
This returns a new dataframe with only the applicable rows included.
For example:
import pandas as pd
df = pd.DataFrame({ 'A': [ 1, 5, 6, 3, 4 ], 'B': [ 'foo', 'bar', 'bar', 'foo', 'foo' ]})print(df)# A B# 0 1 foo# 1 5 bar# 2 6 bar# 3 3 foo# 4 4 foo
filtered_df = df.loc[df['B'] == 'foo']print(filtered_df)# A B# 0 1 foo# 3 3 foo# 4 4 foo
For more advanced selection criteria, you can provide your own filter function, which takes one argument, the current row
:
def filter(row): # This will filter out all rows where A is not an even number. if row['A'] % 2 == 0: return True else: return False
filt_df = df.loc[df.apply(filter, axis=1)]print(filt_df)# A B# 2 6 bar# 4 4 foo
Sorting
You can sort a dataframe by a specific column using the sort_values()
function, providing a column name to the in
parameter to specify what column to sort by:
df = pd.DataFrame({ 'A': [ 1, 5, 6, 3, 4 ], 'B': [ 'foo', 'bar', 'bar', 'foo', 'foo' ]})sorted_df = df.sort_values(by='A')print(sorted_df)# A B# 0 1 foo# 3 3 foo# 4 4 foo# 1 5 bar# 2 6 bar
By default, pandas
will sort in ascending order. To sort in descending order, provide the optional parameter ascending=False
:
sorted_df = df.sort_values(by='A')print(sorted_df, ascending=False)# A B# 2 6 bar# 1 5 bar# 4 4 foo# 3 3 foo# 0 1 foo
Parsing CSV Files
pandas
has first-tier support for CSV files. It can load in a CSV file directly into a DataFrame
, ready for analysing, without having to write any line-by-line CSV parsing. It will also label the columns if the CSV file has a header row (which is recommended!).
To load a CSV file into a DataFrame
:
df = pandas.read_csv('file_path.csv')
Integration With Jupyter
pandas
has good integration into Jupyter. It can render dataframes as formatted and styled HTML tables, either by typing the dataframe variable on the last line of a cell or by using the display(my_dataframe)
syntax. When dealing with large amounts of data inside a dataframe, it will truncate internal cells (with ...
) to limit the table height and width (similar to when you print a large numpy
array). Typically you should always leverage Jupyter’s dataframe rendering ability, rather than using print(my_dataframe)
(which just prints the dataframe as a string).
Merging Tables (VLOOKUP Equivalent)
Pandas provides the ability to merge tables together in a similar fashion to the VLOOKUP
function in Excel, or similar to a JOIN
in SQL.
The syntax for merge()
is:
pd.merge(left, right, left_on, right_on, how)
The different types of joins (the how
parameter), which follow the same naming convention as SQL:
inner
: Rows which have matching values in both tables.left
: All rows in the left table plus those matching from the right.right
: All rows in the right table plus those matching from the left.outer
: All rows from both tables.