In [1]:
import sys
import pandas as pd
import numpy as np

print(f"python version {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"pandas version {pd.__version__}")
print(f"numpy version {np.__version__}")
python version 3.10.11
pandas version 1.5.2
numpy version 1.23.5

The default data type for strings in pandas DataFrames is the object type. However, the pandas documentation recommends explicitly using StringDtype for storing strings, as it is more efficient and enables string-specific operations.

pd.StringDtype() is a dedicated data type for storing strings. It is an extension type that allows for more specific string operations. StringDtype is still considered experimental as of pandas 2.0.1.
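One concrete example of those string-specific behaviors: comparisons on a StringDtype Series return pandas' nullable boolean type and propagate missing values, rather than collapsing them. A minimal sketch:

```python
import pandas as pd

# Comparisons on a StringDtype Series return the nullable "boolean"
# dtype, and missing values propagate as <NA> instead of becoming False.
s = pd.Series(["apple", pd.NA, "cherry"], dtype="string")
mask = s == "apple"

print(mask.dtype)    # boolean
print(mask.tolist())
```

On an object-typed Series, the same comparison returns a plain bool column in which missing values simply compare as False.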

pandas can utilize PyArrow with StringDtype by using pd.StringDtype(storage="pyarrow").

In [2]:
pd.StringDtype()
Out[2]:
string[python]
In [3]:
pd.StringDtype(storage="pyarrow")
Out[3]:
string[pyarrow]

The object type is a more general data type that can store any kind of data, including strings. However, it is less efficient for storing strings, and it offers no guarantee that every element actually is a string.
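For instance, because an object column can hold anything, the .str accessor degrades gracefully instead of enforcing types: non-string elements quietly come back as NaN. A minimal illustration:

```python
import pandas as pd

# An object Series happily stores non-strings; .str methods then
# return NaN for those elements instead of raising an error.
mixed = pd.Series(["hello", 42, None], dtype=object)
upper = mixed.str.upper()
print(upper.tolist())
```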

Despite StringDtype being recommended by the official pandas documentation, distinguishing between StringDtype and object is rarely required. Having to specify a dedicated string data type feels like a nuisance. When does it actually matter?


Pandas string types

There are three string data types available in pandas (as of version 2.0.0).

1. object

This is the default data type for strings.

In [4]:
s_object = pd.Series(['hello', 'world'])
display(s_object)
0    hello
1    world
dtype: object
In [5]:
print(s_object.dtype)
object

2. StringDtype

This is an extension data type for string data and was introduced in pandas 1.0.

In [6]:
s_string = pd.Series(['hello', 'world'], dtype=pd.StringDtype())
display(s_string)
0    hello
1    world
dtype: string
In [7]:
print(s_string.dtype)
string

"string" is an alias for pd.StringDtype().

In [8]:
s_string2 = pd.Series(['hello', 'world'], dtype="string")
display(s_string2)
0    hello
1    world
dtype: string

3. StringDtype (PyArrow)

This is another extension data type for string data, backed by PyArrow's columnar memory format.

In [9]:
s_string_pyarrow = pd.Series(['hello', 'world'], dtype=pd.StringDtype("pyarrow"))
display(s_string_pyarrow)
0    hello
1    world
dtype: string
In [10]:
print(s_string_pyarrow.dtype)
string

"string[pyarrow]" is an alias for pd.StringDtype("pyarrow").

In [11]:
s_string_pyarrow2 = pd.Series(['hello', 'world'], dtype="string[pyarrow]")
display(s_string_pyarrow2)
0    hello
1    world
dtype: string

Q: Why does pandas use the object data type for strings?

pandas is built on top of numpy, and numpy uses a fixed-width string data type, similar to how C uses a char array to represent a string. A fixed-width string representation would be too restrictive for data analysis, so pandas falls back to Python's native string data type as a workaround.


NumPy's byte-sized char representation

The arr_s5 array is a numpy array of strings with a maximum length of 5 characters. The null-terminating byte is not included in the character count.

In [12]:
arr_s5 = np.array(['hello', 'world'], dtype='|S5')
arr_s5
Out[12]:
array([b'hello', b'world'], dtype='|S5')

Assigning a string that exceeds the maximum length silently truncates it.

In [13]:
arr_s5[1] = 'a whole new world'
arr_s5[1]
Out[13]:
b'a who'

Using variable-width strings in NumPy

To use a variable-width string data type, use the object type. NumPy will use the native Python string data type.

In [14]:
arr_obj = np.array(['hello', 'world'], dtype=object)
arr_obj
Out[14]:
array(['hello', 'world'], dtype=object)

Assigning a string of any length now works without issue, since object arrays have no fixed width.

In [15]:
arr_obj[1] = 'a whole new world'
arr_obj[1]
Out[15]:
'a whole new world'

Question: Why does pandas use the object data type for strings?

Answer: Using a fixed-width string data type would be too limiting for most analytical applications. pandas used the Python-native string type to support variable-width strings before the string extension type was added.

Comparison 1: Memory efficiency

How do the three pandas string data types stack up against each other in terms of memory efficiency?

Let's compare the memory usage of 100,000 strings with a uniform length of 8.

In [16]:
random_strings = np.random.randint(
    low=10 ** 7,
    high=10 ** 8,
    size=100000
).astype(str)

random_strings[:10]
Out[16]:
array(['15589323', '20800295', '41036913', '68823562', '51830538',
       '76865317', '59543769', '15347449', '51126384', '25913394'],
      dtype='<U11')
In [17]:
# object type
s_obj = pd.Series(random_strings)
print(f"dtype object uses {s_obj.memory_usage(deep=True)} bytes")

# StringDtype
# dtype="string" is an alias for dtype=pd.StringDtype()
s_string = pd.Series(
    random_strings,
    dtype="string"
)
print(f"dtype string uses {s_string.memory_usage(deep=True)} bytes")

# StringDtype with PyArrow
# dtype="string[pyarrow]" is an alias for dtype=pd.StringDtype("pyarrow")
s_string_pyarrow = pd.Series(
    random_strings,
    dtype="string[pyarrow]"
)
print(f"dtype string[pyarrow] uses {s_string_pyarrow.memory_usage(deep=True)} bytes")
dtype object uses 6500128 bytes
dtype string uses 6500128 bytes
dtype string[pyarrow] uses 1200128 bytes
In [18]:
import plotly.express as px

fig = px.bar(
    x=['object', 'string', 'string[pyarrow]'],
    y=[
        s_obj.memory_usage(deep=True) / 10 ** 6,
        s_string.memory_usage(deep=True) / 10 ** 6,
        s_string_pyarrow.memory_usage(deep=True) / 10 ** 6
    ],
    text=[
        f"{round(s_obj.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string_pyarrow.memory_usage(deep=True) / 10 ** 6, 1)} MB",
    ],
    title='pandas memory usage of 100,000 strings by data type (lower is better)',
    template="simple_white"
)

fig.update_layout(
    xaxis_title="Data Type",
    yaxis_title="Memory Usage in MB",
)

fig.show()

Takeaways

  • object and StringDtype ("string") consume the same amount of memory.
  • StringDtype with PyArrow ("string[pyarrow]") is over 5x more memory-efficient.
  • The same result holds for variable-length strings.
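The roughly 65 bytes per element reported above for object and "string" can be reproduced from first principles. Each element is a full Python str object plus an 8-byte pointer in the backing array (a sketch; sizes assume 64-bit CPython, as in the session above):

```python
import sys

# On 64-bit CPython, an 8-character ASCII str costs 49 bytes of
# object overhead plus 1 byte per character = 57 bytes.
per_string = sys.getsizeof("12345678")
print(per_string)

# Add the 8-byte pointer stored in the numpy object array:
# 57 + 8 = 65 bytes per element, matching 6,500,128 bytes for
# 100,000 strings (plus a little index overhead).
print(per_string + 8)
```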

Comparison 2: Casting Missing Values

Another noteworthy difference among the three string data types is how missing values are handled when casting to those types.

df_original is a DataFrame with the following two columns:

  • Column A: an object-typed column containing one text value and three different null-like values.
  • Column B: a float-typed column containing three numeric values and one NaN value.
In [19]:
df_original = pd.DataFrame({
    'A': ['String One', pd.NA, np.nan, None],
    'B': [1, 2, np.nan, 4]
})

df_original
Out[19]:
            A    B
0  String One  1.0
1        <NA>  2.0
2         NaN  NaN
3        None  4.0
In [20]:
df_original.dtypes
Out[20]:
A     object
B    float64
dtype: object

Print the columns as Python lists to check whether each value has single quotes around it.

In [21]:
df_original['A'].tolist()
Out[21]:
['String One', <NA>, nan, None]
In [22]:
df_original['B'].tolist()
Out[22]:
[1.0, 2.0, nan, 4.0]

Check cell-wise missing values.

In [23]:
df_original.isna()
Out[23]:
       A      B
0  False  False
1   True  False
2   True   True
3   True  False
In [24]:
df_original.isna().sum()
Out[24]:
A    3
B    1
dtype: int64

1. To Python's native string type (object in pandas)

Cast to str to make pandas use the Python native string type.

In [25]:
df_converted_object = df_original.astype({
    'A': str,
    'B': str
})

df_converted_object
Out[25]:
            A    B
0  String One  1.0
1        <NA>  2.0
2         nan  nan
3        None  4.0

Checking the data types displays object for both columns.

In [26]:
df_converted_object.dtypes
Out[26]:
A    object
B    object
dtype: object

Print the columns as Python lists to check whether each value has single quotes around it.

In [27]:
df_converted_object['A'].tolist()
Out[27]:
['String One', '<NA>', 'nan', 'None']
In [28]:
df_converted_object['B'].tolist()
Out[28]:
['1.0', '2.0', 'nan', '4.0']
In [29]:
df_converted_object.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       4 non-null      object
 1   B       4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes

Using the default type (object) converts null-like values to strings! 🤯

  • pd.NA becomes '<NA>'.
  • np.nan becomes 'nan'.
  • None becomes 'None'.

Double-check that the missing values are no longer missing.

In [30]:
df_converted_object.isna()
Out[30]:
       A      B
0  False  False
1  False  False
2  False  False
3  False  False
In [31]:
df_converted_object.isna().sum()
Out[31]:
A    0
B    0
dtype: int64

This behavior surprises many users and is a common pitfall when casting to str. The issue has been reported on GitHub many times, and the community (mainly @makbigc) has worked on a fix (Pull Request #28176), though the PR has not been merged.

Is this a bug or a feature?

Users who run into this behavior find it surprising. But is it really a bug? It's hard to say, since pandas is simply replicating numpy's behavior when casting null-like values to Python's native string type.

In [32]:
# pandas
pd.Series([1, 2, np.nan, 4]).astype(str).tolist()
Out[32]:
['1.0', '2.0', 'nan', '4.0']
In [33]:
# numpy
np.array([1, 2, np.nan, 4]).astype(str)
Out[33]:
array(['1.0', '2.0', 'nan', '4.0'], dtype='<U32')

If this is not a bug, it certainly feels like a badly designed feature.
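Until the behavior changes upstream, one defensive pattern (a sketch, not an official recommendation) is to cast to str and then restore the missing values with where():

```python
import pandas as pd
import numpy as np

# Cast to str, then put NaN back wherever the original was missing,
# so null-like values are not turned into the literal string 'nan'.
s = pd.Series([1, 2, np.nan, 4])
s_str = s.astype(str).where(s.notna())

print(s_str.tolist())  # ['1.0', '2.0', nan, '4.0']
```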

2. To string extension type (pd.StringDtype())

Cast to "string" to make pandas use the string extension type.

In [34]:
df_converted_string = df_original.astype({
    'A': "string",
    'B': "string"
})

df_converted_string
Out[34]:
            A     B
0  String One   1.0
1        <NA>   2.0
2        <NA>  <NA>
3        <NA>   4.0

Checking the data types displays string for both columns.

In [35]:
df_converted_string.dtypes
Out[35]:
A    string
B    string
dtype: object

Print the columns as Python lists to check whether each value has single quotes around it.

In [36]:
df_converted_string['A'].tolist()
Out[36]:
['String One', <NA>, <NA>, <NA>]
In [37]:
df_converted_string['B'].tolist()
Out[37]:
['1.0', '2.0', <NA>, '4.0']
In [38]:
df_converted_string.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       1 non-null      string
 1   B       3 non-null      string
dtypes: string(2)
memory usage: 192.0 bytes

The string extension type handles the null-like values as expected.

Double-check with isna().

In [39]:
df_converted_string.isna()
Out[39]:
       A      B
0  False  False
1   True  False
2   True   True
3   True  False
In [40]:
df_converted_string.isna().sum()
Out[40]:
A    3
B    1
dtype: int64

3. To string extension type with pyarrow (pd.StringDtype("pyarrow"))

Cast to "string[pyarrow]" to make pandas use the string extension type with PyArrow.

In [41]:
df_converted_string_pyarrow = df_original.astype({
    'A': "string[pyarrow]",
    'B': "string[pyarrow]"
})

df_converted_string_pyarrow
Out[41]:
            A     B
0  String One   1.0
1        <NA>   2.0
2        <NA>  <NA>
3        <NA>   4.0

Checking the data types displays string for both columns.

In [42]:
df_converted_string_pyarrow.dtypes
Out[42]:
A    string
B    string
dtype: object

Print the columns as Python lists to check whether each value has single quotes around it.

In [43]:
df_converted_string_pyarrow['A'].tolist()
Out[43]:
['String One', <NA>, <NA>, <NA>]
In [44]:
df_converted_string_pyarrow['B'].tolist()
Out[44]:
['1.0', '2.0', <NA>, '4.0']
In [45]:
df_converted_string_pyarrow.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       1 non-null      string
 1   B       3 non-null      string
dtypes: string(2)
memory usage: 181.0 bytes

The string extension type with pyarrow also handles the null-like values as expected.

Double-check with isna().

In [46]:
df_converted_string_pyarrow.isna()
Out[46]:
       A      B
0  False  False
1   True  False
2   True   True
3   True  False
In [47]:
df_converted_string_pyarrow.isna().sum()
Out[47]:
A    3
B    1
dtype: int64

Comparison 3: Non-string Assignment

The final difference among the three data types is how pandas handles non-string value assignments.

1. Python's native string type (object in pandas)

Create an object-typed Series with three elements.

In [48]:
s_object = pd.Series(['A', 'B', 'C'], dtype=object)
s_object
Out[48]:
0    A
1    B
2    C
dtype: object

Assign two values: one bool and one int.

In [49]:
s_object[1] = True
s_object[2] = 100
s_object
Out[49]:
0       A
1    True
2     100
dtype: object

If a Series uses Python's native string type (object), pandas silently accepts the boolean and integer assignments. Nothing is converted to a string; the Series now holds mixed types.
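A quick type check confirms that nothing is converted — the assigned objects are stored as-is:

```python
import pandas as pd

# With object dtype, assignment stores the Python objects untouched:
# the Series now mixes str, bool, and int.
s = pd.Series(['A', 'B', 'C'], dtype=object)
s[1] = True
s[2] = 100
print([type(v).__name__ for v in s])  # ['str', 'bool', 'int']
```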

2. string extension types (pd.StringDtype() and pd.StringDtype(storage="pyarrow"))

Create a pd.StringDtype()-typed Series with three elements.

In [50]:
s_string = pd.Series(['A', 'B', 'C'], dtype="string")
s_string
Out[50]:
0    A
1    B
2    C
dtype: string

Assign two values: one bool and one int.

In [51]:
s_string[1] = True
s_string[2] = 100
s_string
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\series.py:1105, in Series.__setitem__(self, key, value)
   1104 try:
-> 1105     self._set_with_engine(key, value)
   1106 except KeyError:
   1107     # We have a scalar (or for MultiIndex or object-dtype, scalar-like)
   1108     #  key that is not present in self.index.

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\series.py:1178, in Series._set_with_engine(self, key, value)
   1177 # this is equivalent to self._values[key] = value
-> 1178 self._mgr.setitem_inplace(loc, value)

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\internals\managers.py:2099, in SingleBlockManager.setitem_inplace(self, indexer, value)
   2097     self._cache.clear()
-> 2099 super().setitem_inplace(indexer, value)

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\internals\base.py:190, in SingleDataManager.setitem_inplace(self, indexer, value)
    188     value = np_can_hold_element(arr.dtype, value)
--> 190 arr[indexer] = value

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\arrays\string_.py:403, in StringArray.__setitem__(self, key, value)
    402     elif not isinstance(value, str):
--> 403         raise ValueError(
    404             f"Cannot set non-string value '{value}' into a StringArray."
    405         )
    406 else:

ValueError: Cannot set non-string value 'True' into a StringArray.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[51], line 1
----> 1 s_string[1] = True
      2 s_string[2] = 100
      3 s_string

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\series.py:1132, in Series.__setitem__(self, key, value)
   1129 except (TypeError, ValueError, LossySetitemError):
   1130     # The key was OK, but we cannot set the value losslessly
   1131     indexer = self.index.get_loc(key)
-> 1132     self._set_values(indexer, value)
   1134 except InvalidIndexError as err:
   1135     if isinstance(key, tuple) and not isinstance(self.index, MultiIndex):
   1136         # cases with MultiIndex don't get here bc they raise KeyError
   1137         # e.g. test_basic_getitem_setitem_corner

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\series.py:1215, in Series._set_values(self, key, value)
   1212 if isinstance(key, (Index, Series)):
   1213     key = key._values
-> 1215 self._mgr = self._mgr.setitem(indexer=key, value=value)
   1216 self._maybe_update_cacher()

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\internals\managers.py:393, in BaseBlockManager.setitem(self, indexer, value)
    388 if _using_copy_on_write() and not self._has_no_reference(0):
    389     # if being referenced -> perform Copy-on-Write and clear the reference
    390     # this method is only called if there is a single block -> hardcoded 0
    391     self = self.copy()
--> 393 return self.apply("setitem", indexer=indexer, value=value)

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\internals\managers.py:352, in BaseBlockManager.apply(self, f, align_keys, ignore_failures, **kwargs)
    350         applied = b.apply(f, **kwargs)
    351     else:
--> 352         applied = getattr(b, f)(**kwargs)
    353 except (TypeError, NotImplementedError):
    354     if not ignore_failures:

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\internals\blocks.py:1419, in EABackedBlock.setitem(self, indexer, value)
   1417     values[indexer] = value
   1418 except (ValueError, TypeError) as err:
-> 1419     _catch_deprecated_value_error(err)
   1421     if is_interval_dtype(self.dtype):
   1422         # see TestSetitemFloatIntervalWithIntIntervalValues
   1423         nb = self.coerce_to_target_dtype(orig_value)

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\internals\blocks.py:1417, in EABackedBlock.setitem(self, indexer, value)
   1414 check_setitem_lengths(indexer, value, values)
   1416 try:
-> 1417     values[indexer] = value
   1418 except (ValueError, TypeError) as err:
   1419     _catch_deprecated_value_error(err)

File ~\miniconda3\envs\sp2023\lib\site-packages\pandas\core\arrays\string_.py:403, in StringArray.__setitem__(self, key, value)
    401         value = libmissing.NA
    402     elif not isinstance(value, str):
--> 403         raise ValueError(
    404             f"Cannot set non-string value '{value}' into a StringArray."
    405         )
    406 else:
    407     if not is_array_like(value):

ValueError: Cannot set non-string value 'True' into a StringArray.

If a Series uses pandas' string extension type (pd.StringDtype()), pandas raises a ValueError when a non-string value is assigned.
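If you do need to store such values in a "string" Series, convert them to strings explicitly before assigning (a minimal sketch):

```python
import pandas as pd

# Explicit str() conversion satisfies the StringArray type check.
s = pd.Series(['A', 'B', 'C'], dtype="string")
s[1] = str(True)   # 'True'
s[2] = str(100)    # '100'
print(s.tolist())  # ['A', 'True', '100']
```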

Summary

dtype              Memory Efficiency        Casting Missing Values                          Non-string Assignments
object             Same as "string"         Unexpected: converts missing values to strings  Silently accepts non-string values
"string"           Same as "object"         Expected: preserves missing values              Raises an error (type-safe)
"string[pyarrow]"  Most efficient (by >5x)  Expected: preserves missing values              Raises an error (type-safe)