Skip to content
Published On:
Jun 11, 2019
Last Updated:
Mar 26, 2025

pandas is a data analysis library for Python. It provides high-level data structures and analytical tools for data analysis.

The pandas logo.

Installation

pandas can be installed with pip:

Terminal window
$ pip install pandas

or conda:

Terminal window
$ conda install pandas

The Dataframe

The core data structure in pandas is the DataFrame. A DataFrame is a container for holding tabular data (2D), and supports labelled rows and columns.

You can create a DataFrame by passing in a dict, where each key is a column name (of string type) and the value is a list containing the data for that column (one entry per row):

df = pd.DataFrame({
'Name': [ 'John', 'Geoff', 'Brett' ],
'Age': [ 45, 23, 30 ],
'Height': [ 1.23, 4.56, 7.89 ],
})

You can then print the dataframe, and pandas will render the data nicely in a tabular form:

print(df)
# Name Age Height
# 0 John 45 1.23
# 1 Geoff 23 4.56
# 2 Brett 30 7.89

Selecting Columns

You can then select (extract) certain columns of data by passing in a list of the column names you want:

print(df[['Name', 'Height']])
# Name Height
# 0 John 1.23
# 1 Geoff 4.56
# 2 Brett 7.8

The command above returns a dataframe.

Selecting Rows Based On A Column Value

To select all rows in a dataframe in where a particular column has a certain value (filtering), use the following code:

df.loc[df['column_name'] == some_value]

This returns a new dataframe with only the applicable rows included.

For example:

import pandas as pd
df = pd.DataFrame({
'A': [ 1, 5, 6, 3, 4 ],
'B': [ 'foo', 'bar', 'bar', 'foo', 'foo' ]
})
print(df)
# A B
# 0 1 foo
# 1 5 bar
# 2 6 bar
# 3 3 foo
# 4 4 foo
filtered_df = df.loc[df['B'] == 'foo']
print(filtered_df)
# A B
# 0 1 foo
# 3 3 foo
# 4 4 foo

For more advanced selection criteria, you can provide your own filter function, which takes one argument, the current row:

def filter(row): # This will filter out all rows where A is not an even number.
if row['A'] % 2 == 0:
return True
else:
return False
filt_df = df.loc[df.apply(filter, axis=1)]
print(filt_df)
# A B
# 2 6 bar
# 4 4 foo

Sorting

You can sort a dataframe by a specific column using the sort_values() function, providing a column name to the in parameter to specify what column to sort by:

df = pd.DataFrame({
'A': [ 1, 5, 6, 3, 4 ],
'B': [ 'foo', 'bar', 'bar', 'foo', 'foo' ]
})
sorted_df = df.sort_values(by='A')
print(sorted_df)
# A B
# 0 1 foo
# 3 3 foo
# 4 4 foo
# 1 5 bar
# 2 6 bar

By default, pandas will sort in ascending order. To sort in descending order, provide the optional parameter ascending=False:

sorted_df = df.sort_values(by='A')
print(sorted_df, ascending=False)
# A B
# 2 6 bar
# 1 5 bar
# 4 4 foo
# 3 3 foo
# 0 1 foo

Parsing CSV Files

pandas has first-tier support for CSV files. It can load in a CSV file directly into a DataFrame, ready for analysing, without having to write any line-by-line CSV parsing. It will also label the columns if the CSV file has a header row (which is recommended!).

To load a CSV file into a DataFrame:

df = pandas.read_csv('file_path.csv')

Be warned that if you have whitespace between values in your CSV file, you may need to do some extra massaging to get the correct data. One easy way to this is to pass skipinitialspace=True into the read_csv() function:

df = pandas.read_csv('file_path.csv', skipinitialspace=True)

This will work as long as the white space is always prior to value, and there is no whitespace between the end of the value and the separator (e.g. comma). This is normally the case if whitespace was used to visually align the columns in the CSV file.

Integration With Jupyter

pandas has good integration into Jupyter. It can render dataframes as formatted and styled HTML tables, either by typing the dataframe variable on the last line of a cell or by using the display(my_dataframe) syntax. When dealing with large amounts of data inside a dataframe, it will truncate internal cells (with ...) to limit the table height and width (similar to when you print a large numpy array). Typically you should always leverage Jupyter’s dataframe rendering ability, rather than using print(my_dataframe) (which just prints the dataframe as a string).

Merging Tables (VLOOKUP Equivalent)

Pandas provides the ability to merge tables together in a similar fashion to the VLOOKUP function in Excel, or similar to a JOIN in SQL.

The syntax for merge() is:

pd.merge(left, right, left_on, right_on, how)

The different types of joins (the how parameter), which follow the same naming convention as SQL:

  • inner: Rows which have matching values in both tables.
  • left: All rows in the left table plus those matching from the right.
  • right: All rows in the right table plus those matching from the left.
  • outer: All rows from both tables.