Data Structures–Python vs. NumPy vs. Pandas: Why care? How do they work?

Feature image is my college computer science textbook–little did I know I would revisit “data structures” 4 years later.

Why?

Why should we care about data structures in data science?

Actuaries and other business professionals often deal with data that is small in size and stored in user-friendly tools such as Microsoft Excel and Access. Their data is usually presented in a clean and intuitive 2-dimensional table format. Data scientists, however, aren’t so lucky. They need to deal with raw data that is large in size and messy in form.

When I was in my previous actuarial pricing role, most of my analyses were done in Excel (a piece of software many actuaries, myself included, love and hate at the same time). In my current data science roles, Excel simply can’t handle the amount of data (e.g. one of my tables is over 500GB in size–it has over a billion rows and hundreds of columns).

The GIF below illustrates this point: the flour (data) thrown at actuaries often comes in bags (nice tangible tables), while data scientists often get loose flour (a large amount of raw data).

GIF caption: “Data is like a pile of flour. You never know what you’re gonna get.”

Once data is brought into the Python environment, understanding data structures and using the right operations allows us to easily bake the flour (data) into cakes that we can slice and eat (analyze) in whatever way we desire.

Image caption: This Rubik’s Cube cake represents a 3-dimensional array.

How?

How do they work?

Python has its own built-in data structures, with the following building blocks: list, tuple, set, and dictionary. Pandas and NumPy are Python packages–extra baking tools you can use to manipulate the flour (data) more effectively. They have their own data structures, which are similar to Python’s but easier to work with for data analysis.
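To make the four building blocks concrete, here is a minimal sketch. The variable names and flour weights are made up for illustration; only the container types themselves come from Python.

```python
# The four built-in containers, each holding (made-up) flour weights in grams.
flour_list = [500, 250, 250]                # list: ordered, mutable, allows duplicates
flour_tuple = (500, 250, 250)               # tuple: ordered, immutable
flour_set = {500, 250, 250}                 # set: unordered, keeps unique items only
flour_dict = {"bag_a": 500, "bag_b": 250}   # dictionary: key -> value lookup

flour_list.append(100)                # lists can grow in place
unique_weights = sorted(flour_set)    # the duplicate 250 collapses in a set
bag_a_weight = flour_dict["bag_a"]    # look up a value by its key

print(flour_list)
print(flour_tuple)
print(unique_weights)
print(bag_a_weight)
```

A rough rule of thumb: reach for a list when order matters, a tuple when the contents should not change, a set when you only care about unique values, and a dictionary when you want to look things up by name.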

So what is the relationship among Python’s native data structures, Pandas, and NumPy?

Modeled after MATLAB, NumPy adds multi-dimensional arrays, matrices, and vectorized mathematical functions to Python. A NumPy array is similar to a list in Python, but its operations are vectorized, so it’s much faster than using loops in Python, even faster than list comprehensions.
Pandas was built on top of NumPy.
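
As a rough illustration of the relationship, the sketch below doubles the same numbers with a list comprehension and with a NumPy array, then pulls the underlying NumPy array out of a Pandas Series. The data and variable names are made up; the speed difference usually only becomes noticeable on large arrays, and you can measure it yourself with the standard `timeit` module.

```python
import numpy as np
import pandas as pd

weights = [1.0, 2.0, 3.0, 4.0]

# Plain Python: a list comprehension touches one item at a time.
doubled_list = [w * 2 for w in weights]

# NumPy: the same operation is applied to the whole array at once (vectorized).
weights_arr = np.array(weights)
doubled_arr = weights_arr * 2

# Pandas sits on top of NumPy: a Series can hand back its values as an ndarray.
weights_series = pd.Series(weights)
underlying = weights_series.to_numpy()

print(doubled_list)
print(doubled_arr)
print(type(underlying))
```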

What?

If there are only one or two things you should remember about each of the three things discussed above, what would they be?

Python: the four building blocks of its data structures are list, tuple, set, and dictionary.


NumPy’s most important object is the ndarray, an N-dimensional array that holds a collection of items of the same type. To access information stored in the array, you need indexing and slicing. Basic slicing uses start:stop:step notation inside brackets. HERE is a great intro to indexing.


>>> import numpy as np
>>> x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> x[1:7:2]  # start=1, stop=7 (exclusive), step=2; in Python, indexing starts at 0
array([1, 3, 5])

>>> x[-2:10]
array([8, 9])

>>> x[5:]
array([5, 6, 7, 8, 9])
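
The same start:stop:step notation extends to more than one dimension, with one slice per axis separated by commas. Below is a small sketch on a 2-dimensional array; the array `grid` and the other names are made up for illustration.

```python
import numpy as np

# A 3x4 array holding 0..11, filled row by row:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

first_row = grid[0]            # array([0, 1, 2, 3])
second_col = grid[:, 1]        # array([1, 5, 9])
corner = grid[:2, :2]          # rows 0-1, columns 0-1
every_other = grid[::2, ::2]   # rows 0 and 2, columns 0 and 2

# All items share one type: mixing an int and a float gives a float array.
mixed = np.array([1, 2.5])

print(grid.shape, grid.ndim)
print(corner)
print(every_other)
print(mixed.dtype)
```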

Pandas: the two data structures you should know are Series and DataFrame.

Pandas Series ≈ a 1-dimensional NumPy ndarray with labels ≈ a Python dictionary (label → value).
Pandas DataFrame ≈ a dictionary of Series ≈ 2-dimensional table with index (row labels) and columns (column labels).
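
To see why a Series feels like both an ndarray and a dictionary, here is a minimal sketch. The labels and values are made up for illustration.

```python
import pandas as pd

# Built from a dictionary: the keys become the index (row labels).
s = pd.Series({"a": 1.0, "b": 2.0, "c": 3.0})

by_label = s["b"]         # dictionary-style lookup by label
by_position = s.iloc[1]   # array-style lookup by integer position
doubled = s * 2           # vectorized arithmetic, like an ndarray
has_c = "c" in s          # membership test checks the labels, like dict keys

print(by_label, by_position)
print(doubled)
print(has_c)
```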


In [1]: import numpy as np
In [2]: import pandas as pd

In [3]: d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
   ....:      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
   ....: 

In [4]: df = pd.DataFrame(d)

In [5]: df
Out[5]: 
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

To access information stored in a DataFrame, you also need indexing. Below is a quick summary; for more details, check out this tutorial I found very helpful.

Operation	Syntax	Result
Select column	`df[col]`	Series
Select row by label	`df.loc[label]`	Series
Select row by integer location	`df.iloc[loc]`	Series
Slice rows	`df[5:10]`	DataFrame
Select rows by boolean vector	`df[bool_vec]`	DataFrame
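
Here is a sketch that runs each row of the summary table against the small DataFrame built earlier. The particular labels, positions, and the `> 2.0` condition are just choices made for illustration.

```python
import pandas as pd

# Same DataFrame as in the example above.
df = pd.DataFrame({
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
})

col = df["one"]                  # select column -> Series
row_by_label = df.loc["b"]       # select row by label -> Series
row_by_pos = df.iloc[3]          # select row by integer location -> Series
row_slice = df[1:3]              # slice rows -> DataFrame (rows 'b' and 'c')
filtered = df[df["two"] > 2.0]   # boolean vector -> DataFrame (rows 'c' and 'd')

print(type(col).__name__, type(row_slice).__name__)
print(row_by_label)
print(filtered)
```

Note that `df.iloc[3]` picks the row labeled 'd', whose value in column 'one' is NaN because that Series had no entry for 'd'.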

Other learning resources I found helpful:

Given there are so many expertly developed learning resources out there, I feel it’s more valuable to include links to helpful resources plus my commentary here than to reinvent the wheel.

NumPy: Quoting Python for Data Analysis by Wes McKinney, “While NumPy by itself does not provide very much high-level data analytical functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools like pandas much more effectively.” Below is a great intro to NumPy: https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html