Memory Utilization
Memory utilization of a DataFrame depends largely on the data types of each column.
We can use DataFrame.memory_usage() to see the memory usage of each column (in bytes). Most of the common data types, such as int, float, datetime, and bool, have a fixed size in memory, so their memory usage is simply that fixed size multiplied by the number of data points. For string columns, the memory usage reported by default is the number of data points times 8 bytes. This accounts for the 64 bits required for the pointer to each string's address in memory, but not for the string values themselves. The actual memory required for a string value is 49 bytes of object overhead plus an additional byte for each character. Passing deep=True gives a more accurate report that includes the system-level memory consumed by the contained string values.
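These per-string numbers can be checked directly with sys.getsizeof(). On a 64-bit CPython build, an ASCII string reports roughly 49 bytes of object overhead plus one byte per character (the exact constant is a CPython implementation detail):

```python
import sys

# on 64-bit CPython, an ASCII str reports ~49 bytes of overhead
# plus 1 byte per character
for s in ['', 'a', 'purchase']:
    print(repr(s), sys.getsizeof(s))

# the marginal cost of each extra character is 1 byte
assert sys.getsizeof('purchase') - sys.getsizeof('') == len('purchase')
```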
Separately, we've provided a dli_utils.make_decimal() function to convert a memory size into units based on powers of 2 (e.g., KiB, MiB). In contrast to decimal units based on powers of 10, this binary convention is commonly used to report memory capacity. More information about the two definitions can be found here.
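dli_utils is provided with the course, so its exact implementation isn't shown here, but a minimal stand-in (hypothetical name format_bytes) that converts a byte count into power-of-2 units might look like:

```python
def format_bytes(n_bytes):
    """Hypothetical stand-in for dli_utils.make_decimal():
    convert a byte count into binary (power-of-2) units."""
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB']
    size = float(n_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f'{size:.1f} {unit}'
        size /= 1024

print(format_bytes(3891))   # 3891 bytes is ~3.8 KiB
print(format_bytes(512))
```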
# import dependencies
import pandas as pd
import sys
import random
# import utility
from dli_utils import make_decimal
# import data
df=pd.read_csv('2020-Mar.csv')
# preview DataFrame
df.head()
# convert the event_time column to the datetime data type
df['event_time']=pd.to_datetime(df['event_time'])
# lists each column at 8 bytes/row
memory_usage_df=df.memory_usage(index=False)
memory_usage_df.name='memory_usage'
dtypes_df=df.dtypes
dtypes_df.name='dtype'
# show that each column uses roughly number of rows * 8 bytes:
# 8 bytes per 64-bit numerical value, or 8 bytes for the pointer to each object-type value
byte_size=len(df) * 8 * len(df.columns)
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
pd.concat([memory_usage_df, dtypes_df], axis=1)
# lists each column's full memory usage
memory_usage_df=df.memory_usage(deep=True, index=False)
memory_usage_df.name='memory_usage'
byte_size=memory_usage_df.sum()
# show total memory usage
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
pd.concat([memory_usage_df, dtypes_df], axis=1)
# alternatively, use sys.getsizeof() instead
byte_size=sys.getsizeof(df)
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
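sys.getsizeof() works here because pandas implements DataFrame.__sizeof__() as the deep memory usage including the index; sys.getsizeof() then adds a small garbage-collector header on top. A tiny hypothetical DataFrame (not the course CSV) illustrates the relationship:

```python
import sys
import pandas as pd

# hypothetical toy DataFrame (not the course CSV)
toy = pd.DataFrame({'x': [1, 2, 3], 's': ['a', 'bb', 'ccc']})

deep_total = toy.memory_usage(index=True, deep=True).sum()
# sys.getsizeof delegates to DataFrame.__sizeof__, plus a small GC header
print(deep_total, sys.getsizeof(toy))
```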
# pick a random string-typed (object) column to inspect
string_cols=[col for col in df.columns if df[col].dtype=='object']
column_to_check=random.choice(string_cols)
overhead=49       # CPython str object overhead (ASCII strings)
pointer_size=8    # 64-bit pointer stored in the object array
# NaN != NaN, so "item == item" is False only for missing values
# each NaN (a NumPy float64 scalar) reports 32 bytes of memory
string_bytes=sum((len(item)+overhead+pointer_size) if item==item else 32
                 for item in df[column_to_check].values)
print(f'{column_to_check} column uses: {string_bytes} bytes of memory.')
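This per-value arithmetic should agree with pandas' own deep accounting, since memory_usage(deep=True) also charges an 8-byte pointer per element plus the size of each Python string object. A small check on hypothetical data (assuming CPython):

```python
import sys
import pandas as pd

# hypothetical string Series (CPython assumed)
s = pd.Series(['cat', 'dog', 'mouse'])

deep = s.memory_usage(index=False, deep=True)
# 8-byte pointer per element + the size of each str object
manual = sum(8 + sys.getsizeof(v) for v in s.values)
print(deep, manual)
```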

