import sys
import pandas as pd
import numpy as np
print(f"python version {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"pandas version {pd.__version__}")
print(f"numpy version {np.__version__}")
The default data type for strings in pandas DataFrames is the object type. However, the pandas documentation recommends explicitly using StringDtype for storing strings, as it is more efficient and allows for more specific string operations.
pd.StringDtype() is a dedicated extension data type for storing strings, and it allows for more specific string operations. StringDtype is still considered experimental as of pandas 2.0.1.
pandas can use PyArrow with StringDtype by specifying pd.StringDtype(storage="pyarrow").
pd.StringDtype()
pd.StringDtype(storage="pyarrow")
The object type is a more general data type that can store any kind of Python object, including strings. However, it is less efficient for storing strings, and it breaks dtype-specific operations such as DataFrame.select_dtypes(), which cannot select text columns while excluding other object columns.
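As a small illustration of how general the object type is, the sketch below stores a string, an integer, and a float in a single object Series. The variable name s_mixed is illustrative and not part of the original notebook.

```python
import pandas as pd

# An object Series holds references to arbitrary Python objects,
# so a string can sit next to numbers without raising an error.
s_mixed = pd.Series(["hello", 42, 3.14], dtype=object)

print(s_mixed.dtype)
# Each element keeps its original Python type.
print([type(v).__name__ for v in s_mixed])
```

The dtype alone therefore says nothing about whether a column really contains only strings.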
Despite StringDtype being recommended by the official pandas documentation, distinguishing between StringDtype and object is rarely required. Having to specify a specific string data type feels like a nuisance. When does it actually matter?
Pandas string types
There are three string data types available in pandas (as of version 2.0.0).
1. object
This is the default data type for strings.
s_object = pd.Series(['hello', 'world'])
display(s_object)
print(s_object.dtype)
2. StringDtype
This is an extension data type for string data and was introduced in pandas 1.0.
s_string = pd.Series(['hello', 'world'], dtype=pd.StringDtype())
display(s_string)
print(s_string.dtype)
"string" is an alias for pd.StringDtype().
s_string2 = pd.Series(['hello', 'world'], dtype="string")
display(s_string2)
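One place where the choice becomes visible is in how missing values and numeric results are handled. The following is a minimal sketch, assuming pandas 1.0 or later; the names values, s_obj_na, and s_str_na are illustrative and not part of the original notebook.

```python
import pandas as pd

values = ["hello", None, "pandas"]
s_obj_na = pd.Series(values, dtype=object)
s_str_na = pd.Series(values, dtype="string")

# Both Series report the second element as missing,
# but StringDtype represents it with the pd.NA sentinel.
print(pd.isna(s_obj_na[1]), pd.isna(s_str_na[1]))
print(s_str_na[1] is pd.NA)

# .str methods that return numbers fall back to float64 (with NaN)
# for object, and return a nullable Int64 for StringDtype.
len_obj = s_obj_na.str.len()
len_str = s_str_na.str.len()
print(len_obj.dtype, len_str.dtype)
```

With StringDtype, string lengths stay integers even when some values are missing.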
3. StringDtype (PyArrow)
This is another extension data type for string data. It uses a columnar memory format.
s_string_pyarrow = pd.Series(['hello', 'world'], dtype=pd.StringDtype("pyarrow"))
display(s_string_pyarrow)
print(s_string_pyarrow.dtype)
"string[pyarrow]" is an alias for pd.StringDtype("pyarrow").
s_string_pyarrow2 = pd.Series(['hello', 'world'], dtype="string[pyarrow]")
display(s_string_pyarrow2)
Q: Why does pandas use the object data type for strings?
pandas is built on top of NumPy. NumPy uses a fixed-width string data type, similar to how C uses a char array to represent a string. NumPy's fixed-width string representation would not work well for data analysis, where string lengths vary widely, so pandas uses Python's native string data type as a workaround.
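To make the cost of fixed-width storage concrete, the sketch below builds a NumPy array from one short and one long string. NumPy picks a width that fits the longest element and reserves that width for every element. The variable name arr is illustrative.

```python
import numpy as np

# NumPy infers a fixed-width unicode dtype wide enough for the longest string.
arr = np.array(["a", "abcdefghij"])

print(arr.dtype)     # unicode dtype with a width of 10 characters
print(arr.itemsize)  # bytes reserved per element: 10 characters * 4 bytes each
print(arr.nbytes)    # total bytes: the 1-character string costs as much as the 10-character one
```

A single long value in a column would inflate the storage of every other value, which is one reason fixed-width strings fit poorly with real-world text data.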
NumPy's byte-sized char representation
The arr_s5 array is a NumPy array of byte strings with a maximum length of 5. A terminating null byte is not included in the character count.
arr_s5 = np.array(['hello', 'world'], dtype='|S5')
arr_s5
Attempting to assign a string that exceeds the maximum length will result in truncated elements.
arr_s5[1] = 'a whole new world'
arr_s5[1]
Using variable-width strings in NumPy
To use a variable-width string data type, use the object type. NumPy will use the native Python string data type.
arr_obj = np.array(['hello', 'world'], dtype=object)
arr_obj
Assigning a longer string works without an issue, because an object array has no maximum string length.
arr_obj[1] = 'a whole new world'
arr_obj[1]
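The reason an object array accepts strings of any length is that it stores references to Python objects rather than the characters themselves. The following sketch, with the illustrative name arr_ptr, shows that each slot has the size of one pointer regardless of how long the string is.

```python
import numpy as np

arr_ptr = np.array(
    ["hi", "a much longer string than the first one"],
    dtype=object,
)

# Every slot holds one reference, so the item size equals the pointer size
# of the platform, no matter how long the referenced string is.
print(arr_ptr.itemsize)

# The elements are ordinary Python str objects, stored outside the array buffer.
print(type(arr_ptr[1]), len(arr_ptr[1]))
```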
Question: Why does pandas use the object data type for strings?
Answer: Using a fixed-width string data type would be too limiting for most analytical applications. Before the string extension types were added, pandas relied on Python's native string type to support variable-width strings.
Comparison 1: Memory efficiency
How do the three pandas string data types stack up against each other in terms of memory efficiency?
Let's compare the memory usage of one million strings with a uniform length of 8.
# One million random 8-digit numbers, converted to strings of length 8
random_strings = np.random.randint(
    low=10 ** 7,
    high=10 ** 8,
    size=1_000_000
).astype(str)
random_strings[:10]
# object type
s_obj = pd.Series(random_strings)
print(f"dtype object uses {s_obj.memory_usage(deep=True)} bytes")
# StringDtype
# dtype="string" is an alias for dtype=pd.StringDtype()
s_string = pd.Series(
    random_strings,
    dtype="string"
)
print(f"dtype string uses {s_string.memory_usage(deep=True)} bytes")
# StringDtype with PyArrow
# dtype="string[pyarrow]" is an alias for dtype=pd.StringDtype("pyarrow")
s_string_pyarrow = pd.Series(
    random_strings,
    dtype="string[pyarrow]"
)
print(f"dtype string[pyarrow] uses {s_string_pyarrow.memory_usage(deep=True)} bytes")
import plotly.express as px
# Bar chart of deep memory usage per dtype, in megabytes
fig = px.bar(
    x=['object', 'string', 'string[pyarrow]'],
    y=[
        s_obj.memory_usage(deep=True) / 10 ** 6,
        s_string.memory_usage(deep=True) / 10 ** 6,
        s_string_pyarrow.memory_usage(deep=True) / 10 ** 6
    ],
    text=[
        f"{round(s_obj.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string_pyarrow.memory_usage(deep=True) / 10 ** 6, 1)} MB",
    ],
    title='Pandas memory usage of one million strings by data type (lower is better)',
    template="simple_white"
)
fig.update_layout(
    xaxis_title="Data Type",
    yaxis_title="Memory Usage in MB",
)
fig.show()
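To see where the memory of an object Series goes, the figure can be split into the array of references and the Python string objects those references point to. The sketch below does this on a small sample, assuming CPython; the names values, s_small, pointer_bytes, and string_bytes are illustrative and not part of the original notebook.

```python
import sys
import pandas as pd

# 1,000 strings of length 8, stored as Python objects
values = [str(v) for v in range(10_000_000, 10_001_000)]
s_small = pd.Series(values, dtype=object)

# The array itself only holds one reference per element.
pointer_bytes = s_small.to_numpy().nbytes

# The Python str objects live outside the array and carry per-object overhead.
string_bytes = sum(sys.getsizeof(v) for v in values)

# deep=False counts only the references; deep=True also inspects the objects.
shallow = s_small.memory_usage(index=False, deep=False)
deep = s_small.memory_usage(index=False, deep=True)

print(f"references: {pointer_bytes} bytes, str objects: {string_bytes} bytes")
print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

This is why deep=True is used throughout the comparison: without it, the object dtype would appear far smaller than it really is.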