In [1]:
import sys
import pandas as pd
import numpy as np

print(f"python version {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"pandas version {pd.__version__}")
print(f"numpy version {np.__version__}")
python version 3.10.11
pandas version 1.5.2
numpy version 1.23.5

The default data type for strings in pandas DataFrames is the object type. However, the pandas documentation recommends explicitly using StringDtype for storing strings, as it's more efficient and allows for more specific string operations.

pd.StringDtype() is a dedicated data type for storing strings. It is an extension data type that allows for more specific string operations. StringDtype is still considered experimental as of pandas 2.0.1.

pandas can utilize PyArrow with StringDtype by using pd.StringDtype(storage="pyarrow").

In [2]:
pd.StringDtype()
Out[2]:
string[python]
In [3]:
pd.StringDtype(storage="pyarrow")
Out[3]:
string[pyarrow]

object is a more general data type that can store any Python object, including strings. However, it is less efficient for storing strings, and because an object column may hold a mix of strings and non-strings, it offers no string-specific guarantees.
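As a small illustration of how the two dtypes behave differently, the sketch below stores the same values under object and "string" and compares how a missing value is represented and what dtype a .str method returns. The variable names are made up for this example, and dtype=object is passed explicitly so the result does not depend on the pandas default.

```python
import pandas as pd

# The same values stored under both dtypes (object is requested explicitly).
values = ["hello", None, "pandas"]
s_obj = pd.Series(values, dtype=object)
s_str = pd.Series(values, dtype="string")

# Missing values: object keeps the original None, "string" converts it to pd.NA.
missing_obj = s_obj[1]
missing_str = s_str[1]

# .str.len(): object falls back to float64 with NaN for the missing entry,
# "string" returns the nullable Int64 dtype with <NA>.
len_obj = s_obj.str.len()
len_str = s_str.str.len()

print(repr(missing_obj), repr(missing_str))
print(len_obj.dtype, len_str.dtype)
```

The nullable result dtype is one practical reason to opt into StringDtype: integer results stay integers even when some values are missing.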

Despite StringDtype being recommended by the official pandas documentation, distinguishing between StringDtype and object is rarely required. Having to specify a specific string data type feels like a nuisance. When does it actually matter?


Pandas string types

There are three string data types available in pandas (as of version 2.0.0).

1. object

This is the default data type for strings.

In [4]:
s_object = pd.Series(['hello', 'world'])
display(s_object)
0    hello
1    world
dtype: object
In [5]:
print(s_object.dtype)
object

2. StringDtype

This is an extension data type for string data and was introduced in pandas 1.0.

In [6]:
s_string = pd.Series(['hello', 'world'], dtype=pd.StringDtype())
display(s_string)
0    hello
1    world
dtype: string
In [7]:
print(s_string.dtype)
string

"string" is an alias for pd.StringDtype().

In [8]:
s_string2 = pd.Series(['hello', 'world'], dtype="string")
display(s_string2)
0    hello
1    world
dtype: string

3. StringDtype (PyArrow)

This is another extension data type for string data. It uses a columnar memory format.

In [9]:
s_string_pyarrow = pd.Series(['hello', 'world'], dtype=pd.StringDtype("pyarrow"))
display(s_string_pyarrow)
0    hello
1    world
dtype: string
In [10]:
print(s_string_pyarrow.dtype)
string

"string[pyarrow]" is an alias for pd.StringDtype("pyarrow").

In [11]:
s_string_pyarrow2 = pd.Series(['hello', 'world'], dtype="string[pyarrow]")
display(s_string_pyarrow2)
0    hello
1    world
dtype: string
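Both flavors print as "string", so the printed dtype alone does not reveal which backend holds the data. A minimal sketch of one way to tell them apart is to read the dtype's storage attribute. The dtypes dictionary is a name invented for this example, and the PyArrow variant is only created when PyArrow is installed.

```python
import pandas as pd

# Build one dtype per storage backend; the PyArrow one needs the pyarrow package.
dtypes = {"python": pd.StringDtype(storage="python")}
try:
    dtypes["pyarrow"] = pd.StringDtype(storage="pyarrow")
except ImportError:
    pass  # PyArrow is not installed, so only the Python-backed dtype is shown

for name, dtype in dtypes.items():
    # str(dtype) is what print() shows; .storage names the backend in use.
    print(f"printed as: {dtype}   storage: {dtype.storage}")
```

On a Series, the same information is available through s.dtype.storage.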

Q: Why does pandas use the object data type for strings?

pandas is built on top of NumPy. NumPy uses a fixed-width string data type, similar to how C uses a char array to represent a string. A fixed-width representation is a poor fit for data analysis, where string lengths vary from row to row, so pandas stores Python's native string objects as a workaround.


NumPy's byte-sized char representation

arr_s5 is a NumPy array of byte strings with a maximum length of 5. The width counts only the characters themselves; no terminating zero byte is included in the count.

In [12]:
arr_s5 = np.array(['hello', 'world'], dtype='|S5')
arr_s5
Out[12]:
array([b'hello', b'world'], dtype='|S5')

Attempting to assign a string that exceeds the maximum length will result in truncated elements.

In [13]:
arr_s5[1] = 'a whole new world'
arr_s5[1]
Out[13]:
b'a who'
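The fixed width is a property of the whole array, not of each element. The sketch below, using illustrative variable names, shows that NumPy infers the width from the longest element when no dtype is given and that every element then occupies that full width.

```python
import numpy as np

# Without an explicit dtype, NumPy sizes the array for its longest element.
inferred = np.array(["hi", "hello"])
print(inferred.dtype)            # a 5-character Unicode dtype
print(inferred.dtype.itemsize)   # bytes per element (4 bytes per character)

# With '|S5', every element takes exactly 5 bytes, however short the value is.
fixed = np.array(["hi", "hello"], dtype="S5")
print(fixed.itemsize, fixed.nbytes)

# Storing longer values requires a new, wider array; the original is unchanged.
wider = fixed.astype("S17")
wider[0] = b"a whole new world"
print(wider[0], fixed.dtype)
```

Short values are padded up to the array's width, which is why a single long string makes every element more expensive.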

Using variable-width strings in NumPy

To use a variable-width string data type, use the object type. NumPy will use the native Python string data type.

In [14]:
arr_obj = np.array(['hello', 'world'], dtype=object)
arr_obj
Out[14]:
array(['hello', 'world'], dtype=object)

Because an object array has no maximum length, assigning a longer string works without truncation.

In [15]:
arr_obj[1] = 'a whole new world'
arr_obj[1]
Out[15]:
'a whole new world'

Question: Why does pandas use the object data type for strings?

Answer: Using a fixed-width string data type would be too limiting for most analytical applications. Before the string extension type was added, pandas relied on Python's native string type to support variable-width strings.

Comparison 1: Memory efficiency

How do the three pandas string data types stack up against each other in terms of memory efficiency?

Let's compare the memory usage of 100,000 strings with a uniform length of 8.

In [16]:
random_strings = np.random.randint(
    low=10 ** 7,
    high=10 ** 8,
    size=100000
).astype(str)

random_strings[:10]
Out[16]:
array(['15589323', '20800295', '41036913', '68823562', '51830538',
       '76865317', '59543769', '15347449', '51126384', '25913394'],
      dtype='<U11')
In [17]:
# object type
s_obj = pd.Series(random_strings)
print(f"dtype object uses {s_obj.memory_usage(deep=True)} bytes")

# StringDtype
# dtype="string" is an alias for dtype=pd.StringDtype()
s_string = pd.Series(
    random_strings,
    dtype="string"
)
print(f"dtype string uses {s_string.memory_usage(deep=True)} bytes")

# StringDtype with PyArrow
# dtype="string[pyarrow]" is an alias for dtype=pd.StringDtype("pyarrow")
s_string_pyarrow = pd.Series(
    random_strings,
    dtype="string[pyarrow]"
)
print(f"dtype string[pyarrow] uses {s_string_pyarrow.memory_usage(deep=True)} bytes")
dtype object uses 6500128 bytes
dtype string uses 6500128 bytes
dtype string[pyarrow] uses 1200128 bytes
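These numbers can be roughly reproduced by hand. The back-of-the-envelope sketch below assumes a 64-bit CPython build for the Python-backed estimate, and assumes that Arrow stores each string as its raw UTF-8 bytes plus one 32-bit offset. Both are assumptions of this sketch, not something the output above confirms. All three reported totals end in 128, which is presumably the index and is left out of the estimates.

```python
import sys
import numpy as np

n_strings = 100_000
sample = "15589323"  # one of the 8-character values shown above

# object / string[python]: one pointer in the array plus one Python str object.
pointer_bytes = np.dtype(object).itemsize   # size of a pointer on this platform
object_bytes = sys.getsizeof(sample)        # interpreter-dependent str overhead
python_estimate = n_strings * (pointer_bytes + object_bytes)

# string[pyarrow] (assumption): raw UTF-8 bytes plus a 4-byte offset per string.
arrow_estimate = n_strings * (len(sample.encode("utf-8")) + 4)

print(f"Python-backed estimate: {python_estimate} bytes")
print(f"Arrow-backed estimate:  {arrow_estimate} bytes")
```

Under these assumptions, the gap comes almost entirely from the per-object overhead of Python strings, which dwarfs the 8 bytes of actual text in each value.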
In [18]:
import plotly.express as px

fig = px.bar(
    x=['object', 'string', 'string[pyarrow]'],
    y=[
        s_obj.memory_usage(deep=True) / 10 ** 6,
        s_string.memory_usage(deep=True) / 10 ** 6,
        s_string_pyarrow.memory_usage(deep=True) / 10 ** 6
    ],
    text=[
        f"{round(s_obj.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string_pyarrow.memory_usage(deep=True) / 10 ** 6, 1)} MB",
    ],
    title='Pandas memory usage of 100,000 strings by data type (lower is better)',
    template="simple_white"
)

fig.update_layout(
    xaxis_title="Data Type",
    yaxis_title="Memory Usage in MB",
)

fig.show()