import sys
import pandas as pd
import numpy as np
print(f"python version {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"pandas version {pd.__version__}")
print(f"numpy version {np.__version__}")
The default data type for strings in pandas DataFrames is the object type. However, the pandas documentation recommends explicitly using StringDtype for storing strings, as it is more efficient and allows for string-specific operations.
pd.StringDtype() is a dedicated data type for storing strings that enables string-specific operations. StringDtype is still considered experimental as of pandas 2.0.1. pandas can utilize PyArrow with StringDtype by using pd.StringDtype(storage="pyarrow").
object is a more general data type that can store any type of data, including strings. However, object columns are less efficient for storing strings and do not support string-specific operations.
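A short sketch of that difference: an object-typed Series will happily hold non-string values side by side with strings, while the string extension type guarantees string-only contents.

```python
import pandas as pd

# An object-typed Series is just an array of Python object references,
# so strings, ints, and None can sit side by side in one column.
s_mixed = pd.Series(['hello', 42, None], dtype=object)
print(s_mixed.dtype)                        # object
print([type(v).__name__ for v in s_mixed])  # ['str', 'int', 'NoneType']

# The string extension type guarantees every element is a string (or NA).
s_typed = pd.Series(['hello', 'world'], dtype='string')
print(s_typed.dtype)
```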
Despite StringDtype being recommended by the official pandas documentation, distinguishing between StringDtype and object is rarely required, and having to specify a dedicated string data type can feel like a nuisance. When does it actually matter?
Pandas string types¶
There are three string data types available in pandas (as of version 2.0.0).
1. object¶
This is the default data type for strings.
s_object = pd.Series(['hello', 'world'])
display(s_object)
print(s_object.dtype)
2. StringDtype¶
This is an extension data type for string data and was introduced in pandas 1.0.
s_string = pd.Series(['hello', 'world'], dtype=pd.StringDtype())
display(s_string)
print(s_string.dtype)
"string"
is an alias for pd.StringDtype()
.
s_string2 = pd.Series(['hello', 'world'], dtype="string")
display(s_string2)
3. StringDtype (PyArrow)¶
This is another extension data type for string data. It uses a columnar memory format.
s_string_pyarrow = pd.Series(['hello', 'world'], dtype=pd.StringDtype("pyarrow"))
display(s_string_pyarrow)
print(s_string_pyarrow.dtype)
"string[pyarrow]"
is an alias for pd.StringDtype("pyarrow")
.
s_string_pyarrow2 = pd.Series(['hello', 'world'], dtype="string[pyarrow]")
display(s_string_pyarrow2)
Q: Why does pandas use the object data type for strings?¶
pandas is built on top of numpy. numpy uses a fixed-width string data type, similar to how C uses a char array to represent a string. numpy's fixed-width string representation would be too restrictive for data analysis, so pandas uses Python's native string data type as a workaround.
NumPy's byte-sized char representation¶
arr_s5 is a numpy array of strings with a maximum length of 5 characters. The terminating null byte is not included in the character count.
arr_s5 = np.array(['hello', 'world'], dtype='|S5')
arr_s5
Attempting to assign a string that exceeds the maximum length silently truncates the value.
arr_s5[1] = 'a whole new world'
arr_s5[1]
Using variable-width strings in NumPy¶
To use a variable-width string data type, use the object
type. NumPy will use the native Python string data type.
arr_obj = np.array(['hello', 'world'], dtype=object)
arr_obj
Assigning a longer string works without issue, since there is no fixed maximum length.
arr_obj[1] = 'a whole new world'
arr_obj[1]
Question: Why does pandas use the object data type for strings?
Answer: A fixed-width string data type would be too limiting for most analytical applications. pandas used Python's native string type to support variable-width strings before the string extension type was added.
Comparison 1: Memory efficiency¶
How do the three pandas string data types stack up against each other in terms of memory efficiency?
Let's compare the memory usage of a million strings with a uniform length of 8.
random_strings = np.random.randint(
    low=10 ** 7,
    high=10 ** 8,
    size=10 ** 6
).astype(str)
random_strings[:10]
# object type
s_obj = pd.Series(random_strings)
print(f"dtype object uses {s_obj.memory_usage(deep=True)} bytes")
# StringDtype
# dtype="string" is an alias for dtype=pd.StringDtype()
s_string = pd.Series(
    random_strings,
    dtype="string"
)
print(f"dtype string uses {s_string.memory_usage(deep=True)} bytes")
# StringDtype with PyArrow
# dtype="string[pyarrow]" is an alias for dtype=pd.StringDtype("pyarrow")
s_string_pyarrow = pd.Series(
    random_strings,
    dtype="string[pyarrow]"
)
print(f"dtype string[pyarrow] uses {s_string_pyarrow.memory_usage(deep=True)} bytes")
import plotly.express as px
fig = px.bar(
    x=['object', 'string', 'string[pyarrow]'],
    y=[
        s_obj.memory_usage(deep=True) / 10 ** 6,
        s_string.memory_usage(deep=True) / 10 ** 6,
        s_string_pyarrow.memory_usage(deep=True) / 10 ** 6
    ],
    text=[
        f"{round(s_obj.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string_pyarrow.memory_usage(deep=True) / 10 ** 6, 1)} MB",
    ],
    title='Pandas memory usage of a million strings by data type (lower is better)',
    template="simple_white"
)
fig.update_layout(
    xaxis_title="Data Type",
    yaxis_title="Memory Usage in MB",
)
fig.show()
Takeaways
- object and StringDtype ("string") consume the same amount of memory.
- StringDtype with PyArrow ("string[pyarrow]") is over 5x more memory-efficient.
- The result also holds for variable-length strings.
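To sanity-check the last takeaway, here is a minimal sketch comparing memory usage with variable-length strings (lengths drawn from 1 to 15 characters). The string[pyarrow] measurement assumes the pyarrow package is installed, so it is guarded with a try/except.

```python
import numpy as np
import pandas as pd

# Variable-length strings: lengths drawn uniformly from 1 to 15 characters.
rng = np.random.default_rng(0)
var_strings = ['x' * int(n) for n in rng.integers(1, 16, size=10_000)]

mem_object = pd.Series(var_strings, dtype=object).memory_usage(deep=True)
mem_string = pd.Series(var_strings, dtype='string').memory_usage(deep=True)
print(f"object: {mem_object:,} bytes")
print(f"string: {mem_string:,} bytes")

# The PyArrow-backed type requires the pyarrow package.
try:
    mem_pa = pd.Series(var_strings, dtype='string[pyarrow]').memory_usage(deep=True)
    print(f"string[pyarrow]: {mem_pa:,} bytes")
except ImportError:
    print("pyarrow not installed; skipping string[pyarrow]")
```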
Comparison 2: Casting Missing Values¶
Another noteworthy difference among the three string data types is how missing values are handled when casting to those types.
df_original is a DataFrame with the following two columns:
- Column A: an object-typed column containing one text value and three different null-like values.
- Column B: a float-typed column containing three numeric values and one NaN value.
df_original = pd.DataFrame({
    'A': ['String One', pd.NA, np.nan, None],
    'B': [1, 2, np.nan, 4]
})
df_original
df_original.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_original['A'].tolist()
df_original['B'].tolist()
Check cell-wise missing values.
df_original.isna()
df_original.isna().sum()
1. To Python's native string type (object in pandas)¶
Cast to str to make pandas use the Python native string type.
df_converted_object = df_original.astype({
    'A': str,
    'B': str
})
df_converted_object
Checking the data types displays object for both columns.
df_converted_object.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_converted_object['A'].tolist()
df_converted_object['B'].tolist()
df_converted_object.info()
Using the default type (object) converts null-like values to strings! 🤯
- pd.NA becomes '<NA>'.
- np.nan becomes 'nan'.
- None becomes 'None'.
Double-check that the missing values are no longer missing.
df_converted_object.isna()
df_converted_object.isna().sum()
This is unexpected behavior for many users and a common pitfall when casting to str. Many users have reported the issue on GitHub, and the community (mainly @makbigc) has made efforts to fix it (Pull Request #28176), although the PR hasn't been merged yet.
Is this a bug or a feature?¶
Users who run into this behavior will find it surprising. But is it really a bug? It's difficult to say, since pandas simply replicates numpy's behavior when casting null-like values to Python's native string type.
# pandas
pd.Series([1, 2, np.nan, 4]).astype(str).tolist()
# numpy
np.array([1, 2, np.nan, 4]).astype(str)
If this is not a bug, it feels like a really, really badly designed feature.
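Bug or not, the behavior can be worked around. One possible approach (a sketch, not the only option) is to stringify with astype(str) and then restore the missing cells using the original notna() mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['String One', pd.NA, np.nan, None],
    'B': [1, 2, np.nan, 4],
})

# astype(str) stringifies everything, including the null-like values,
# so restore the missing cells afterwards using the original notna() mask.
df_str = df.astype(str).where(df.notna())
print(df_str['A'].tolist())          # ['String One', nan, nan, nan]
print(df_str.isna().sum().tolist())  # [3, 1]
```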
2. To string extension type (pd.StringDtype())¶
Cast to "string" to make pandas use the string extension type.
df_converted_string = df_original.astype({
    'A': "string",
    'B': "string"
})
df_converted_string
Checking the data types displays string for both columns.
df_converted_string.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_converted_string['A'].tolist()
df_converted_string['B'].tolist()
df_converted_string.info()
The string extension type handles the null-like values as expected.
Double-check with isna().
df_converted_string.isna()
df_converted_string.isna().sum()
3. To string extension type with PyArrow (pd.StringDtype("pyarrow"))¶
Cast to "string[pyarrow]" to make pandas use the string extension type with PyArrow.
df_converted_string_pyarrow = df_original.astype({
    'A': "string[pyarrow]",
    'B': "string[pyarrow]"
})
df_converted_string_pyarrow
Checking the data types displays string for both columns.
df_converted_string_pyarrow.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_converted_string_pyarrow['A'].tolist()
df_converted_string_pyarrow['B'].tolist()
df_converted_string_pyarrow.info()
The string extension type with pyarrow also handles the null-like values as expected.
Double-check with isna().
df_converted_string_pyarrow.isna()
df_converted_string_pyarrow.isna().sum()
Comparison 3: Non-string Assignment¶
The final difference between the data types is how pandas handles non-string value assignments.
1. Python's native string type (object in pandas)¶
Create an object-typed Series with three elements.
s_object = pd.Series(['A', 'B', 'C'], dtype=object)
s_object
Assign two values: one bool and one int value.
s_object[1] = True
s_object[2] = 100
s_object
If a Series uses Python's native string type (object), pandas silently accepts the assignments; the boolean and integer values are stored as-is, leaving a Series of mixed types rather than strings.
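A quick way to verify what those assignments actually did is to inspect the element types:

```python
import pandas as pd

s_object = pd.Series(['A', 'B', 'C'], dtype=object)
s_object[1] = True
s_object[2] = 100

# The Series still reports dtype object, but its elements now have
# mixed Python types -- nothing was turned into a string.
print(s_object.dtype)                        # object
print([type(v).__name__ for v in s_object])  # ['str', 'bool', 'int']
```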
2. string extension types (pd.StringDtype() and pd.StringDtype(storage="pyarrow"))¶
Create a pd.StringDtype()-typed Series with three elements.
s_string = pd.Series(['A', 'B', 'C'], dtype="string")
s_string
Assign two values: one bool and one int value.
s_string[1] = True
s_string[2] = 100
s_string
If a Series uses pandas' extension type (pd.StringDtype()), pandas throws an error when trying to assign non-string values.
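A sketch of handling that error; note the exact exception type is not guaranteed across pandas versions, so this catches both TypeError and ValueError.

```python
import pandas as pd

s_string = pd.Series(['A', 'B', 'C'], dtype='string')

# Assigning a non-string value is rejected. The exact exception type has
# varied across pandas versions, so catch both TypeError and ValueError.
try:
    s_string[1] = True
except (TypeError, ValueError) as exc:
    print(f"rejected: {exc}")

# Explicitly stringifying the value first is accepted.
s_string[1] = str(True)
print(s_string.tolist())  # ['A', 'True', 'C']
```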
Summary¶

| dtype | Memory Efficiency | Casting Missing Values | Non-string Assignments |
|---|---|---|---|
| object | Same as "string" | Unexpected behavior: converts missing values to strings instead of preserving them | Silently accepts non-string values (mixed types) |
| "string" | Same as object | Expected behavior: preserves missing values | Throws an error (type-safe) |
| "string[pyarrow]" | Most efficient (by >5x) | Expected behavior: preserves missing values | Throws an error (type-safe) |