pandas is a data analysis library for Python. It provides high-level data structures and the tools to analyse them.
pandas can be installed with pip by running pip install pandas.
The core data structure in pandas is the DataFrame. A DataFrame is a container for tabular (2D) data, and supports labelled rows and columns.
You can create a DataFrame by passing in a dict, where each key is a column name (of string type) and the value is a list containing the data for that column (one entry per row):
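As a minimal sketch of the dict-of-lists construction described above (the column names and values are hypothetical sample data, not taken from the original article):

```python
import pandas as pd

# Hypothetical sample data: each key is a column name,
# and each list holds one value per row.
data = {
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 27, 45],
    "city": ["Oslo", "Lima", "Pune"],
}

df = pd.DataFrame(data)

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels, in dict insertion order
```

All lists should have the same length, since each one supplies exactly one value per row.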
You can then select (extract) certain columns of data by passing in a list of the column names you want:
The command above returns a dataframe.
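A short sketch of column selection, again using hypothetical sample data. It also shows the contrast with passing a single column name, which gives a Series rather than a dataframe:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 27, 45],
    "city": ["Oslo", "Lima", "Pune"],
})

# Passing a list of column names returns a DataFrame with just those columns.
subset = df[["name", "age"]]

# Passing a single column name (no list) returns a Series instead.
single = df["name"]

print(type(subset).__name__)
print(type(single).__name__)
```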
Selecting Rows Based On A Column Value
To select all rows in a dataframe where a particular column has a certain value (filtering), use the following code:
This returns a new dataframe with only the applicable rows included.
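One common way to do this is boolean indexing. The sketch below assumes a hypothetical city column and filters on the value "Oslo":

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "city": ["Oslo", "Lima", "Oslo"],
})

# Comparing a column to a value yields a boolean Series (one flag per row).
mask = df["city"] == "Oslo"

# Indexing with that Series keeps only the rows where the flag is True.
oslo_rows = df[mask]

print(oslo_rows)
```

The original dataframe is left unchanged; oslo_rows is a separate object that keeps the original row labels.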
For more advanced selection criteria, you can provide your own filter function, which takes one argument, the current
You can sort a dataframe by a specific column using the sort_values() method, passing a column name to the by parameter to specify which column to sort by:
By default, pandas sorts in ascending order. To sort in descending order, provide the optional parameter ascending=False.
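A minimal sketch of both sort directions, using hypothetical sample data. Here sort_values() takes the column name through its by parameter, and ascending=False reverses the order:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 27, 45],
})

# Ascending order is the default.
youngest_first = df.sort_values(by="age")

# ascending=False sorts from largest to smallest.
oldest_first = df.sort_values(by="age", ascending=False)

print(youngest_first)
print(oldest_first)
```

sort_values() returns a new, sorted dataframe and leaves df itself in its original order.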
Parsing CSV Files
pandas has first-class support for CSV files. It can load a CSV file directly into a DataFrame, ready for analysing, without you having to write any line-by-line CSV parsing. It will also label the columns if the CSV file has a header row (which is recommended!).
To load a CSV file into a DataFrame, use the pandas read_csv() function.
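A small sketch of loading CSV data with read_csv(). To keep the example self-contained it reads from an in-memory string (hypothetical contents) instead of a file on disk; in practice you would pass a file path:

```python
import io

import pandas as pd

# Hypothetical CSV contents; the first line is the header row.
csv_text = "name,age\nAlice,34\nBob,27\n"

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))  # column labels taken from the header row
print(df)
```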
Integration With Jupyter
pandas integrates well with Jupyter. It can render dataframes as formatted and styled HTML tables, either by placing the dataframe variable on the last line of a cell or by calling display(my_dataframe). When a dataframe holds a large amount of data, the rendering truncates the inner rows and columns (shown as ...) to limit the table height and width (similar to printing a large numpy array). In general, prefer Jupyter's dataframe rendering to print(my_dataframe), which just prints the dataframe as plain text.
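The truncation behaviour can be explored outside a notebook too. The sketch below uses hypothetical data and assumes otherwise default display options. It lowers the display.max_rows option temporarily and calls _repr_html_(), the hook Jupyter uses to obtain the HTML table, alongside the plain-text form that print() would show:

```python
import pandas as pd

# Hypothetical data: 100 rows, enough to trigger truncation.
df = pd.DataFrame({
    "n": range(100),
    "square": [i * i for i in range(100)],
})

# Temporarily lower the row limit; the setting reverts when the block exits.
with pd.option_context("display.max_rows", 10):
    html = df._repr_html_()  # HTML table, as rendered in a notebook cell
    text = repr(df)          # plain-text form, as shown by print(df)

print(text)
```

In a notebook you would not normally call _repr_html_() yourself; leaving df on the last line of a cell has the same effect.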
Merging Tables (VLOOKUP Equivalent)
pandas provides the ability to merge tables together in a similar fashion to the VLOOKUP function in Excel, or to a JOIN in SQL.
The syntax for
The how parameter selects the type of join. The options follow the same naming convention as SQL:
inner: Rows which have matching values in both tables.
left: All rows in the left table plus those matching from the right.
right: All rows in the right table plus those matching from the left.
outer: All rows from both tables.
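The four join types can be compared side by side. This is a minimal sketch using two hypothetical tables that share a customer_id column, merged with pd.merge() and its on and how parameters:

```python
import pandas as pd

# Hypothetical tables for illustration; customer 30 has no entry in
# `customers`, and customer 40 has no entry in `orders`.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 20, 30],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 40],
    "name": ["Alice", "Bob", "Dana"],
})

# `orders` is the left table and `customers` is the right table.
inner = pd.merge(orders, customers, on="customer_id", how="inner")
left = pd.merge(orders, customers, on="customer_id", how="left")
right = pd.merge(orders, customers, on="customer_id", how="right")
outer = pd.merge(orders, customers, on="customer_id", how="outer")

for label, result in [("inner", inner), ("left", left),
                      ("right", right), ("outer", outer)]:
    print(label, len(result))
```

Where a row has no match in the other table, pandas fills the missing columns with NaN.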
This work is licensed under a Creative Commons Attribution 4.0 International License.