Last updated August 10, 2022
The Pandas Data Analysis Library brings SQL-like sorting and querying to semi-structured data in Python. The examples below were shamelessly lifted from the book "Python for Data Analysis."
Installing Python Pandas:
From the command line, install the Python package manager pip if you haven't done so yet:
sudo apt-get install python-pip
Pandas requires numpy, so install both from pip:
sudo pip install numpy
sudo pip install pandas
At the start of your Python program, import the necessary libraries:
from pandas import Series
from pandas import DataFrame
import pandas as pd
Working with Arrays: Series
(You can run the code below from this file)
To know pandas you need to know all about series and data frames. Let's start with a series. A series is a one-dimensional array of data with an associated index. Pandas will let you create a series:
obj = Series([13, 23, 2, 15])
If no index is given, one is created automatically. You can also create a series and define the index:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
Use the index to assign a certain value:
obj2['d'] = 6
obj2['a'] = 14
You can create a series from a Python dict:
Dict2SeriesData = {'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234}
Dict2Series = Series(Dict2SeriesData)
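The series operations described so far can be pulled together in one short, runnable sketch. It assumes a current pandas installation and uses the `pd.Series` spelling rather than the bare `Series` import; the `visits` name is made up for illustration.

```python
import pandas as pd

# With no index given, pandas creates a default integer index 0..n-1.
obj = pd.Series([13, 23, 2, 15])
default_labels = list(obj.index)

# With an explicit index, values are read and assigned by label.
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2['d'] = 6
value_at_a = obj2['a']

# A dict becomes a Series whose index labels are the dict's keys.
# 'visits' is a hypothetical name; the numbers come from the example above.
visits = pd.Series({'Monday': 2200, 'Tuesday': 3528})

print(obj2)
print(visits['Tuesday'])
```

Label-based access is what separates a series from a plain Python list: `obj2['a']` looks up the label 'a', not a position.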
Reorder a series by passing the index labels in the order you want
(Note: pandas assigns NaN to any label it does not find in the data):
Days = ['Wednesday', 'Friday', 'Monday', 'Tuesday']
SortedDays = Series(Dict2SeriesData, index=Days)
You can combine two series by adding them. Values are matched by index label, and a label present in only one series yields NaN:
Dict3SeriesData = {'Monday': 1400, 'Tuesday': 10000, 'Wednesday': 5, 'Sunday': 2365}
Dict3Series = Series(Dict3SeriesData)
Dailies = Dict3Series + Dict2Series
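The sketch below shows where the NaN values appear when reordering and when adding, using the same numbers as above. The variable names (`week1`, `week2`, and so on) are made up for illustration, and `Series.add` with `fill_value=0` is shown as one possible way to treat a missing day as zero, not as something the original example does.

```python
import pandas as pd

week1_data = {'Monday': 2200, 'Tuesday': 3528, 'Wednesday': 123299, 'Thursday': 3234}
week2_data = {'Monday': 1400, 'Tuesday': 10000, 'Wednesday': 5, 'Sunday': 2365}
week1 = pd.Series(week1_data)
week2 = pd.Series(week2_data)

# Reordering by label: 'Friday' is not in the data, so it becomes NaN.
days = ['Wednesday', 'Friday', 'Monday', 'Tuesday']
reordered = pd.Series(week1_data, index=days)

# Plain + aligns on labels; a label found in only one series gives NaN.
strict_total = week1 + week2

# Series.add with fill_value=0 treats a missing label as zero instead.
lenient_total = week1.add(week2, fill_value=0)

print(reordered)
print(strict_total)
print(lenient_total)
```

Which behavior is right depends on the data: NaN keeps the gap visible, while a fill value assumes the missing day really had no activity.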
Working with Arrays: Data Frames
A data frame is a two-dimensional labeled data structure whose columns can hold different data types; it resembles a spreadsheet. It has an index for both the rows and the columns. (Operational code samples for this section are available here.)
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2001, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
# print is a function in Python 3
print(frame)
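As a rough illustration of what "an index for both rows and columns" means in practice, the sketch below builds the same frame and then selects by column and by row. The `populations` and `nevada` names are made up for illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2001, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# A single column is itself a Series. Bracket access is used here because
# frame.pop would refer to the DataFrame.pop method, not the 'pop' column.
populations = frame['pop']

# A boolean mask selects the rows where the condition holds.
nevada = frame[frame['state'] == 'Nevada']

# Each column keeps its own data type.
print(frame.dtypes)
print(nevada)
```

The row labels here are the default integers 0 through 4, and the column labels are the dict keys.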
Reading in Data:
The next example requires users.dat, ratings.dat, and movies.dat. Run the code here.
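# Run these commands in IPython, or as a stand-alone Python program
import pandas as pd
# Column names must not carry stray leading spaces, or the merges below cannot match them
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# A multi-character separator such as '::' is handled by the Python parsing engine
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')
# Peek at the first five rows of each table
users[:5]
movies[:5]
ratings
# Merge all three tables so that each rating row also carries the movie title
data = pd.merge(pd.merge(ratings, users), movies)
# Assumption: ratings_by_title (not defined in the original) is the number of ratings per title
ratings_by_title = data.groupby('title').size()
# Keep only the titles with at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]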
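If the .dat files are not at hand, the same read, merge, and filter steps can be tried on a few in-memory rows. This is a minimal sketch: the rows and movie titles are invented, `io.StringIO` stands in for the files, `pd.read_csv` is used in place of `read_table`, and the threshold is lowered from 250 to 2 (the hypothetical `min_ratings`) so the tiny sample produces a result.

```python
import io
import pandas as pd

# Hypothetical stand-ins for ratings.dat and movies.dat (made-up rows).
ratings_text = "1::10::5::978300760\n2::10::3::978302109\n3::20::4::978301968\n"
movies_text = "10::Movie A::Drama\n20::Movie B::Comedy\n"

# '::' is a multi-character separator, so the Python parsing engine is requested.
ratings = pd.read_csv(io.StringIO(ratings_text), sep='::', header=None,
                      names=['user_id', 'movie_id', 'rating', 'timestamp'],
                      engine='python')
movies = pd.read_csv(io.StringIO(movies_text), sep='::', header=None,
                     names=['movie_id', 'title', 'genres'], engine='python')

# With no 'on' argument, merge joins on the column name both frames share: movie_id.
data = pd.merge(ratings, movies)

# Count ratings per title, then keep the titles at or above the threshold.
ratings_by_title = data.groupby('title').size()
min_ratings = 2  # the article uses 250 on the full data set
active_titles = ratings_by_title.index[ratings_by_title >= min_ratings]

print(list(active_titles))
```

Because `merge` picks its join key from shared column names, a stray space in a name such as ' movie_id' is enough to leave two frames with nothing in common to join on.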