import sys
import pandas as pd
import numpy as np
print(f"python version {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"pandas version {pd.__version__}")
print(f"numpy version {np.__version__}")
The default data type for strings in pandas DataFrames is the object type. However, the pandas documentation recommends explicitly using StringDtype for storing strings, as it is more efficient and allows for string-specific operations.
pd.StringDtype() is a dedicated data type for storing strings that enables string-specific operations. StringDtype is still considered experimental as of pandas 2.0.1. pandas can utilize PyArrow with StringDtype by using pd.StringDtype(storage="pyarrow").
object is a more general data type that can store any type of data, including strings. However, object columns are less efficient for storing strings and do not support string-specific operations.
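A short sketch of that difference: an object-typed Series will happily hold non-string values side by side with strings, while the string extension type guarantees string-only contents.

```python
import pandas as pd

# An object-typed Series is just an array of Python object references,
# so strings, ints, and None can sit side by side in one column.
s_mixed = pd.Series(['hello', 42, None], dtype=object)
print(s_mixed.dtype)                        # object
print([type(v).__name__ for v in s_mixed])  # ['str', 'int', 'NoneType']

# The string extension type guarantees every element is a string (or NA).
s_typed = pd.Series(['hello', 'world'], dtype='string')
print(s_typed.dtype)
```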
Despite StringDtype being recommended by the official pandas documentation, distinguishing between StringDtype and object is rarely required, and having to specify a dedicated string data type can feel like a nuisance. When does it actually matter?
Pandas string types¶
There are three string data types available in pandas (as of version 2.0.0).
1. object¶
This is the default data type for strings.
s_object = pd.Series(['hello', 'world'])
display(s_object)
print(s_object.dtype)
2. StringDtype¶
This is an extension data type for string data and was introduced in pandas 1.0.
s_string = pd.Series(['hello', 'world'], dtype=pd.StringDtype())
display(s_string)
print(s_string.dtype)
"string"
is an alias for pd.StringDtype()
.
s_string2 = pd.Series(['hello', 'world'], dtype="string")
display(s_string2)
3. StringDtype (PyArrow)¶
This is another extension data type for string data. It uses a columnar memory format.
s_string_pyarrow = pd.Series(['hello', 'world'], dtype=pd.StringDtype("pyarrow"))
display(s_string_pyarrow)
print(s_string_pyarrow.dtype)
"string[pyarrow]"
is an alias for pd.StringDtype("pyarrow")
.
s_string_pyarrow2 = pd.Series(['hello', 'world'], dtype="string[pyarrow]")
display(s_string_pyarrow2)
Q: Why does pandas use the object data type for strings?¶
pandas is built on top of numpy. numpy uses a fixed-width string data type, similar to how C uses a char array to represent a string. numpy's fixed-width string representation would be too restrictive for data analysis, so pandas uses Python's native string data type as a workaround.
NumPy's byte-sized char representation¶
arr_s5 is a numpy array of strings with a maximum length of 5 characters. The terminating null byte is not included in the character count.
arr_s5 = np.array(['hello', 'world'], dtype='|S5')
arr_s5
Attempting to assign a string that exceeds the maximum length silently truncates the value.
arr_s5[1] = 'a whole new world'
arr_s5[1]
Using variable-width strings in NumPy¶
To use a variable-width string data type, use the object
type. NumPy will use the native Python string data type.
arr_obj = np.array(['hello', 'world'], dtype=object)
arr_obj
Assigning a longer string works without issue, since there is no fixed maximum length.
arr_obj[1] = 'a whole new world'
arr_obj[1]
Question: Why does pandas use the object data type for strings?
Answer: A fixed-width string data type would be too limiting for most analytical applications. pandas used Python's native string type to support variable-width strings before the string extension type was added.
Comparison 1: Memory efficiency¶
How do the three pandas string data types stack up against each other in terms of memory efficiency?
Let's compare the memory usage of a million strings with a uniform length of 8.
random_strings = np.random.randint(
    low=10 ** 7,
    high=10 ** 8,
    size=10 ** 6
).astype(str)
random_strings[:10]
# object type
s_obj = pd.Series(random_strings)
print(f"dtype object uses {s_obj.memory_usage(deep=True)} bytes")
# StringDtype
# dtype="string" is an alias for dtype=pd.StringDtype()
s_string = pd.Series(
    random_strings,
    dtype="string"
)
print(f"dtype string uses {s_string.memory_usage(deep=True)} bytes")
# StringDtype with PyArrow
# dtype="string[pyarrow]" is an alias for dtype=pd.StringDtype("pyarrow")
s_string_pyarrow = pd.Series(
    random_strings,
    dtype="string[pyarrow]"
)
print(f"dtype string[pyarrow] uses {s_string_pyarrow.memory_usage(deep=True)} bytes")
import plotly.express as px
fig = px.bar(
    x=['object', 'string', 'string[pyarrow]'],
    y=[
        s_obj.memory_usage(deep=True) / 10 ** 6,
        s_string.memory_usage(deep=True) / 10 ** 6,
        s_string_pyarrow.memory_usage(deep=True) / 10 ** 6
    ],
    text=[
        f"{round(s_obj.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string.memory_usage(deep=True) / 10 ** 6, 1)} MB",
        f"{round(s_string_pyarrow.memory_usage(deep=True) / 10 ** 6, 1)} MB",
    ],
    title='Pandas memory usage of a million strings by data type (lower is better)',
    template="simple_white"
)
fig.update_layout(
    xaxis_title="Data Type",
    yaxis_title="Memory Usage in MB",
)
fig.show()
Takeaways
- object and StringDtype ("string") consume the same amount of memory.
- StringDtype with PyArrow ("string[pyarrow]") is over 5x more memory-efficient.
- The result also holds for variable-length strings.
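To sanity-check the last takeaway, here is a minimal sketch comparing memory usage with variable-length strings (lengths drawn from 1 to 15 characters). The string[pyarrow] measurement assumes the pyarrow package is installed, so it is guarded with a try/except.

```python
import numpy as np
import pandas as pd

# Variable-length strings: lengths drawn uniformly from 1 to 15 characters.
rng = np.random.default_rng(0)
var_strings = ['x' * int(n) for n in rng.integers(1, 16, size=10_000)]

mem_object = pd.Series(var_strings, dtype=object).memory_usage(deep=True)
mem_string = pd.Series(var_strings, dtype='string').memory_usage(deep=True)
print(f"object: {mem_object:,} bytes")
print(f"string: {mem_string:,} bytes")

# The PyArrow-backed type requires the pyarrow package.
try:
    mem_pa = pd.Series(var_strings, dtype='string[pyarrow]').memory_usage(deep=True)
    print(f"string[pyarrow]: {mem_pa:,} bytes")
except ImportError:
    print("pyarrow not installed; skipping string[pyarrow]")
```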
Comparison 2: Casting Missing Values¶
Another noteworthy difference among the three string data types is how missing values are handled when casting to those types.
df_original is a DataFrame with the following two columns:
- Column A: an object-typed column containing one text value and three different null-like values.
- Column B: a float-typed column containing three numeric values and one NaN value.
df_original = pd.DataFrame({
    'A': ['String One', pd.NA, np.nan, None],
    'B': [1, 2, np.nan, 4]
})
df_original
df_original.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_original['A'].tolist()
df_original['B'].tolist()
Check cell-wise missing values.
df_original.isna()
df_original.isna().sum()
1. To Python's native string type (object in pandas)¶
Cast to str to make pandas use the Python native string type.
df_converted_object = df_original.astype({
    'A': str,
    'B': str
})
df_converted_object
Checking the data types displays object for both columns.
df_converted_object.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_converted_object['A'].tolist()
df_converted_object['B'].tolist()
df_converted_object.info()
Using the default type (object) converts null-like values to strings! 🤯
- pd.NA becomes '<NA>'.
- np.nan becomes 'nan'.
- None becomes 'None'.
Double-check that the missing values are no longer missing.
df_converted_object.isna()
df_converted_object.isna().sum()
This is unexpected behavior for many users and a common pitfall when casting to str. Many users have reported the issue on GitHub, and the community (mainly @makbigc) has made efforts to fix it (Pull Request #28176), although the PR hasn't been merged yet.
Is this a bug or a feature?¶
Users who run into this behavior will find it surprising. But is it really a bug? It's difficult to say, since pandas simply replicates numpy's behavior when casting null-like values to Python's native string type.
# pandas
pd.Series([1, 2, np.nan, 4]).astype(str).tolist()
# numpy
np.array([1, 2, np.nan, 4]).astype(str)
If this is not a bug, it feels like a really, really badly designed feature.
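Bug or not, the behavior can be worked around. One possible approach (a sketch, not the only option) is to stringify with astype(str) and then restore the missing cells using the original notna() mask.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['String One', pd.NA, np.nan, None],
    'B': [1, 2, np.nan, 4],
})

# astype(str) stringifies everything, including the null-like values,
# so restore the missing cells afterwards using the original notna() mask.
df_str = df.astype(str).where(df.notna())
print(df_str['A'].tolist())          # ['String One', nan, nan, nan]
print(df_str.isna().sum().tolist())  # [3, 1]
```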
2. To string extension type (pd.StringDtype())¶
Cast to "string" to make pandas use the string extension type.
df_converted_string = df_original.astype({
    'A': "string",
    'B': "string"
})
df_converted_string
Checking the data types displays string for both columns.
df_converted_string.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_converted_string['A'].tolist()
df_converted_string['B'].tolist()
df_converted_string.info()
The string extension type handles the null-like values as expected.
Double-check with isna().
df_converted_string.isna()
df_converted_string.isna().sum()
3. To string extension type with PyArrow (pd.StringDtype("pyarrow"))¶
Cast to "string[pyarrow]" to make pandas use the string extension type with PyArrow.
df_converted_string_pyarrow = df_original.astype({
    'A': "string[pyarrow]",
    'B': "string[pyarrow]"
})
df_converted_string_pyarrow
Checking the data types displays string for both columns.
df_converted_string_pyarrow.dtypes
Print the columns as Python lists to check whether each value has single quotes around it.
df_converted_string_pyarrow['A'].tolist()
df_converted_string_pyarrow['B'].tolist()
df_converted_string_pyarrow.info()
The string extension type with pyarrow also handles the null-like values as expected.
Double-check with isna().
df_converted_string_pyarrow.isna()
df_converted_string_pyarrow.isna().sum()
Comparison 3: Non-string Assignment¶
The final difference between the data types is how pandas handles non-string value assignments.
1. Python's native string type (object in pandas)¶
Create an object-typed Series with three elements.
s_object = pd.Series(['A', 'B', 'C'], dtype=object)
s_object
Assign two values: one bool and one int value.
s_object[1] = True
s_object[2] = 100
s_object
If a Series uses Python's native string type (object), pandas silently accepts the assignments; the boolean and integer values are stored as-is, leaving a Series of mixed types rather than strings.
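A quick way to verify what those assignments actually did is to inspect the element types:

```python
import pandas as pd

s_object = pd.Series(['A', 'B', 'C'], dtype=object)
s_object[1] = True
s_object[2] = 100

# The Series still reports dtype object, but its elements now have
# mixed Python types -- nothing was turned into a string.
print(s_object.dtype)                        # object
print([type(v).__name__ for v in s_object])  # ['str', 'bool', 'int']
```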
2. string extension types (pd.StringDtype() and pd.StringDtype(storage="pyarrow"))¶
Create a pd.StringDtype()-typed Series with three elements.
s_string = pd.Series(['A', 'B', 'C'], dtype="string")
s_string
Assign two values: one bool and one int value.
s_string[1] = True
s_string[2] = 100
s_string
If a Series uses pandas' extension type (pd.StringDtype()), pandas throws an error when trying to assign non-string values.
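A sketch of handling that error; note the exact exception type is not guaranteed across pandas versions, so this catches both TypeError and ValueError.

```python
import pandas as pd

s_string = pd.Series(['A', 'B', 'C'], dtype='string')

# Assigning a non-string value is rejected. The exact exception type has
# varied across pandas versions, so catch both TypeError and ValueError.
try:
    s_string[1] = True
except (TypeError, ValueError) as exc:
    print(f"rejected: {exc}")

# Explicitly stringifying the value first is accepted.
s_string[1] = str(True)
print(s_string.tolist())  # ['A', 'True', 'C']
```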
Summary¶

| dtype | Memory Efficiency | Casting Missing Values | Non-string Assignments |
|---|---|---|---|
| object | Same as "string" | Unexpected behavior: converts missing values to strings instead of preserving them | Silently accepts non-string values (mixed types) |
| "string" | Same as object | Expected behavior: preserves missing values | Throws an error (type-safe) |
| "string[pyarrow]" | Most efficient (by >5x) | Expected behavior: preserves missing values | Throws an error (type-safe) |