Memory Utilization
Memory utilization of a DataFrame depends largely on the data types of each column.
We can use DataFrame.memory_usage() to see the memory usage of each column (in bytes). Most of the common data types, such as int, float, datetime, and bool, have a fixed size in memory, so their memory usage is simply that fixed size multiplied by the number of data points. For string columns, the memory usage reported by default is the number of data points times 8 bytes. This accounts for the 64 bits required for the pointer to each string's address in memory, but not for the string values themselves. The actual memory required for a string value is 49 bytes of object overhead plus an additional byte for each character. Passing deep=True gives a more accurate report that includes the system-level memory consumed by the contained string values.
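These per-string numbers can be checked directly with sys.getsizeof(). On a 64-bit CPython build, an ASCII string reports roughly 49 bytes of object overhead plus one byte per character (the exact constant is a CPython implementation detail):

```python
import sys

# on 64-bit CPython, an ASCII str reports ~49 bytes of overhead
# plus 1 byte per character
for s in ['', 'a', 'purchase']:
    print(repr(s), sys.getsizeof(s))

# the marginal cost of each extra character is 1 byte
assert sys.getsizeof('purchase') - sys.getsizeof('') == len('purchase')
```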
Separately, we've provided a dli_utils.make_decimal() function to convert a memory size into units based on powers of 2 (e.g., KiB, MiB). In contrast to decimal units based on powers of 10, this binary convention is commonly used to report memory capacity. More information about the two definitions can be found here.
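dli_utils is provided with the course, so its exact implementation isn't shown here, but a minimal stand-in (hypothetical name format_bytes) that converts a byte count into power-of-2 units might look like:

```python
def format_bytes(n_bytes):
    """Hypothetical stand-in for dli_utils.make_decimal():
    convert a byte count into binary (power-of-2) units."""
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB']
    size = float(n_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f'{size:.1f} {unit}'
        size /= 1024

print(format_bytes(3891))   # 3891 bytes is ~3.8 KiB
print(format_bytes(512))
```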
# import dependencies
import pandas as pd
import sys
import random
# import utility
from dli_utils import make_decimal
# import data
df=pd.read_csv('2020-Mar.csv')
# preview DataFrame
df.head()
# convert the event_time column to the datetime data type
df['event_time']=pd.to_datetime(df['event_time'])
# lists each column at 8 bytes/row
memory_usage_df=df.memory_usage(index=False)
memory_usage_df.name='memory_usage'
dtypes_df=df.dtypes
dtypes_df.name='dtype'
# show that each column uses roughly number of rows * 8 bytes:
# 8 bytes per 64-bit numerical value, or 8 bytes for the pointer to each object-type value
byte_size=len(df) * 8 * len(df.columns)
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
pd.concat([memory_usage_df, dtypes_df], axis=1)
# lists each column's full memory usage
memory_usage_df=df.memory_usage(deep=True, index=False)
memory_usage_df.name='memory_usage'
byte_size=memory_usage_df.sum()
# show total memory usage
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
pd.concat([memory_usage_df, dtypes_df], axis=1)
# alternatively, use sys.getsizeof() instead
byte_size=sys.getsizeof(df)
print(f'Total memory use is {byte_size} bytes or ~{make_decimal(byte_size)}.')
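sys.getsizeof() works here because pandas implements DataFrame.__sizeof__() as the deep memory usage including the index; sys.getsizeof() then adds a small garbage-collector header on top. A tiny hypothetical DataFrame (not the course CSV) illustrates the relationship:

```python
import sys
import pandas as pd

# hypothetical toy DataFrame (not the course CSV)
toy = pd.DataFrame({'x': [1, 2, 3], 's': ['a', 'bb', 'ccc']})

deep_total = toy.memory_usage(index=True, deep=True).sum()
# sys.getsizeof delegates to DataFrame.__sizeof__, plus a small GC header
print(deep_total, sys.getsizeof(toy))
```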
# pick a random string-typed (object) column to inspect
string_cols=[col for col in df.columns if df[col].dtype=='object']
column_to_check=random.choice(string_cols)
overhead=49       # CPython str object overhead (ASCII strings)
pointer_size=8    # 64-bit pointer stored in the object array
# NaN != NaN, so "item == item" is False only for missing values
# each NaN (a NumPy float64 scalar) reports 32 bytes of memory
string_bytes=sum((len(item)+overhead+pointer_size) if item==item else 32
                 for item in df[column_to_check].values)
print(f'{column_to_check} column uses: {string_bytes} bytes of memory.')
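This per-value arithmetic should agree with pandas' own deep accounting, since memory_usage(deep=True) also charges an 8-byte pointer per element plus the size of each Python string object. A small check on hypothetical data (assuming CPython):

```python
import sys
import pandas as pd

# hypothetical string Series (CPython assumed)
s = pd.Series(['cat', 'dog', 'mouse'])

deep = s.memory_usage(index=False, deep=True)
# 8-byte pointer per element + the size of each str object
manual = sum(8 + sys.getsizeof(v) for v in s.values)
print(deep, manual)
```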

