NumPy vs Pandas: When to Use Each for Data Processing
The Problem You're Solving
You need to process 10 million rows of numerical data:
# ❌ Pandas (reads everything into a DataFrame first - memory intensive)
import pandas as pd
data = pd.read_csv('huge.csv') # 2.5 GB in memory
result = data['price'].sum()
# ✅ NumPy (loads only the needed column - ~80% less memory)
import numpy as np
data = np.genfromtxt('huge.csv', delimiter=',', usecols=[3], skip_header=1)
result = data.sum()  # Only the price column ever touches memory
That's the difference between 2.5 GB and 0.5 GB in memory, and roughly 5x faster processing.
Without knowing when to use each, you'll write code that crashes on realistic data. With the right choice, you handle datasets 10x larger on the same hardware.
Data processing optimization appears in 22% of data science interviews and directly impacts model training speed.
NumPy Fundamentals
NumPy works with homogeneous arrays (all same type):
import numpy as np
# ✅ NumPy array - fast, fixed type
arr = np.array([1, 2, 3, 4, 5])
print(type(arr)) # <class 'numpy.ndarray'>
print(arr.dtype) # int64 (fixed)
# ❌ Python list - slow, mixed types allowed
lst = [1, 2, 3, 4, 5]
print(type(lst)) # <class 'list'>
Performance: NumPy vs Lists
import numpy as np
import time
# NumPy array
arr = np.arange(10_000_000)
# Python list
lst = list(range(10_000_000))
# Sum speed test
start = time.time()
np.sum(arr)
np_time = time.time() - start
start = time.time()
sum(lst)
py_time = time.time() - start
print(f"NumPy: {np_time:.4f}s") # ~0.01s
print(f"Python: {py_time:.4f}s") # ~0.5s
# NumPy is 50x faster!
Memory Usage: NumPy vs Lists
import sys
import numpy as np
# NumPy array (10 million integers)
arr = np.arange(10_000_000)
print(f"NumPy memory: {arr.nbytes / 1e6:.1f} MB")  # ~80 MB (int64)
# Python list (10 million integers)
lst = list(range(10_000_000))
list_mb = (sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)) / 1e6
print(f"Python list memory: {list_mb:.1f} MB")  # ~360 MB (pointers + int objects)
# NumPy uses roughly 4x less memory!
Pandas Fundamentals
Pandas works with heterogeneous data (mixed types):
import pandas as pd
# ✅ Pandas DataFrame - can mix types
df = pd.DataFrame({
'id': [1, 2, 3], # int
'name': ['Alice', 'Bob', 'Charlie'], # str
'salary': [50000, 60000, 70000], # float
'hired': [True, False, True] # bool
})
print(df.dtypes)
# id int64
# name object
# salary float64
# hired bool
When Pandas Excels
import pandas as pd
# ✅ Perfect for mixed data
sales = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=5),
'product': ['A', 'B', 'A', 'C', 'B'],
'quantity': [10, 5, 8, 12, 6],
'price': [99.99, 49.99, 99.99, 15.99, 49.99],
'region': ['US', 'EU', 'US', 'ASIA', 'EU']
})
# Easy operations
print(sales[sales['quantity'] > 8]) # Filter
print(sales.groupby('product')['quantity'].sum()) # Group by
print(sales['price'].mean()) # Aggregate
Head-to-Head Comparison
Task 1: Sum 10M Numbers
import numpy as np
import pandas as pd
import time
data = np.arange(10_000_000)
# NumPy
start = time.time()
result_np = np.sum(data)
np_time = time.time() - start
# Pandas
start = time.time()
result_pd = pd.Series(data).sum()
pd_time = time.time() - start
print(f"NumPy: {np_time:.4f}s") # ~0.01s
print(f"Pandas: {pd_time:.4f}s") # ~0.05s
# NumPy is 5x faster for simple arrays
Task 2: Filter and Aggregate (Mixed Data)
import numpy as np
import pandas as pd
import time
# Setup: 1M rows of mixed data
np.random.seed(42)
data = {
'id': np.arange(1_000_000),
'value': np.random.rand(1_000_000),
'category': np.random.choice(['A', 'B', 'C'], 1_000_000),
'date': pd.date_range('2024-01-01', periods=1_000_000, freq='h')
}
# NumPy approach (manual)
start = time.time()
values = data['value']
categories = data['category']
mask = (values > 0.5) & (categories == 'A')
result_np = values[mask].sum()
np_time = time.time() - start
# Pandas approach (built-in)
start = time.time()
df = pd.DataFrame(data)
result_pd = df[(df['value'] > 0.5) & (df['category'] == 'A')]['value'].sum()
pd_time = time.time() - start
print(f"NumPy: {np_time:.4f}s") # ~0.05s (but manual filtering)
print(f"Pandas: {pd_time:.4f}s") # ~0.03s (convenient)
# Pandas simpler AND faster here
Task 3: Numerical Computation (Matrix Operations)
import numpy as np
import pandas as pd
import time
# 5000x5000 matrix
matrix = np.random.rand(5000, 5000)
# NumPy (specialized operations)
start = time.time()
result_np = np.dot(matrix, matrix.T) # Matrix multiplication
np_time = time.time() - start
# Pandas (converts to NumPy internally)
start = time.time()
df = pd.DataFrame(matrix)
result_pd = df.dot(df.T) # Same matrix product; delegates to NumPy
pd_time = time.time() - start
print(f"NumPy: {np_time:.4f}s") # ~2.5s
print(f"Pandas: {pd_time:.4f}s") # ~3.0s (overhead)
# NumPy better for pure numerical work
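When the data already lives in a DataFrame, you can still get NumPy speed for the heavy step by extracting the underlying array first. A minimal sketch (the 500x500 size is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(500, 500))
m = df.to_numpy()   # plain ndarray; no DataFrame overhead in the hot path
result = m @ m.T    # pure NumPy matrix multiply
print(result.shape)  # (500, 500)
```

This gives you Pandas for loading and labeling, and NumPy for the math, without paying DataFrame overhead inside the computation.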
Decision Tree: NumPy vs Pandas
Question 1: Mixed data types?
YES → Use Pandas (easy type handling)
NO → Go to Question 2
Question 2: Heavy filtering/aggregation on columns?
YES → Use Pandas (groupby, pivot tables)
NO → Go to Question 3
Question 3: Pure numerical computation?
YES → Use NumPy (fastest)
NO → Go to Question 4
Question 4: Need time series features?
YES → Use Pandas (built-in datetime)
NO → NumPy or Pandas equally fine
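The four questions above can be sketched as a small helper function (the function name and flags are illustrative, not a real API):

```python
def choose_tool(mixed_types: bool, heavy_grouping: bool,
                pure_numeric: bool, time_series: bool) -> str:
    """Walk the four-question decision tree and return a suggestion."""
    if mixed_types:
        return "pandas"   # easy type handling
    if heavy_grouping:
        return "pandas"   # groupby, pivot tables
    if pure_numeric:
        return "numpy"    # fastest for raw numbers
    if time_series:
        return "pandas"   # built-in datetime support
    return "either"       # both work equally well

print(choose_tool(mixed_types=False, heavy_grouping=False,
                  pure_numeric=True, time_series=False))  # numpy
```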
Real-World Examples
Example 1: ETL Pipeline (Pandas)
import pandas as pd
# Load mixed data
df = pd.read_csv('sales.csv')
# Clean: handle missing values
df['price'] = df['price'].fillna(df['price'].mean())
# Transform: add computed column
df['total'] = df['quantity'] * df['price']
# Aggregate: summarize by region
summary = df.groupby('region').agg({
'total': 'sum',
'quantity': 'mean',
'date': 'count'
}).reset_index()
# Output
summary.to_csv('regional_summary.csv', index=False)
Why Pandas: Mixed data types, date handling, easy groupby.
Example 2: Scientific Computation (NumPy)
import numpy as np
from scipy import optimize
# Generate 1M sensor readings
sensor_data = np.random.normal(loc=100, scale=15, size=1_000_000)
# Statistical operations (pure numeric)
mean = np.mean(sensor_data)
std = np.std(sensor_data)
percentile_95 = np.percentile(sensor_data, 95)
# Filter anomalies (> 3 std)
anomalies = sensor_data[np.abs(sensor_data - mean) > 3 * std]
print(f"Mean: {mean:.2f}")
print(f"Anomalies: {len(anomalies)}")
Why NumPy: Pure numerical, fast filtering, no type conversions needed.
Example 3: Time Series Analysis (Pandas)
import pandas as pd
# Load stock data
df = pd.read_csv('stock.csv', parse_dates=['date'])
# Resample: convert daily to weekly
weekly = df.set_index('date').resample('W')['price'].mean()
# Rolling window: 30-day moving average
df['ma_30'] = df['price'].rolling(window=30).mean()
# Percent change
df['pct_change'] = df['price'].pct_change()
# Filter: good trading days (up >5%)
good_days = df[df['pct_change'] > 0.05]
Why Pandas: Date indexing, resampling, time series methods.
Memory Optimization Strategies
Strategy 1: Read Only Needed Columns (Pandas)
# ❌ SLOW - Loads all 50 columns
df = pd.read_csv('huge.csv')
result = df[['id', 'value']].sum()
# ✅ FAST - Loads only 2 columns
df = pd.read_csv('huge.csv', usecols=['id', 'value'])
result = df.sum()
# Even better: Use NumPy for pure numeric
import numpy as np
data = np.genfromtxt('huge.csv', delimiter=',', dtype=float, usecols=[0, 2])
Strategy 2: Use Appropriate Data Types
import numpy as np
import pandas as pd
# ❌ SLOW - Default types (int64, float64)
df = pd.DataFrame({
'age': [25, 30, 35], # Uses 64-bit, only needs 8-bit
'value': [1.5, 2.5, 3.5] # Uses 64-bit, needs 32-bit
})
print(df.memory_usage(deep=True)) # ~50 bytes
# ✅ FAST - Optimized types
df = pd.DataFrame({
'age': pd.array([25, 30, 35], dtype='uint8'), # 1 byte each
'value': np.array([1.5, 2.5, 3.5], dtype='float32') # 4 bytes each
})
print(df.memory_usage(deep=True)) # ~15 bytes
Strategy 3: Process in Chunks (Large Files)
import pandas as pd
# ❌ SLOW - Loads a 1GB file all at once
df = pd.read_csv('huge.csv')
result = df['value'].sum()
# ✅ FAST - Process in 100,000-row chunks
chunk_size = 100_000
total = 0
for chunk in pd.read_csv('huge.csv', chunksize=chunk_size):
total += chunk['value'].sum()
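A sum combines trivially across chunks, but an aggregate like the mean needs both a running sum and a running row count. A sketch using a small stand-in file (the filename and data are illustrative):

```python
import pandas as pd

# Build a small demo file standing in for the hypothetical huge.csv
pd.DataFrame({'value': range(1, 1001)}).to_csv('demo.csv', index=False)

total, count = 0.0, 0
for chunk in pd.read_csv('demo.csv', chunksize=100):
    total += chunk['value'].sum()
    count += len(chunk)

mean = total / count
print(mean)  # 500.5 - matches the full-file mean exactly
```

The same pattern extends to variance and other aggregates, as long as you carry the right running totals across chunks.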
Common Mistakes
❌ Mistake 1: Using Pandas for Pure Numeric Work
# SLOW - Pandas overhead
import pandas as pd
data = pd.Series(range(100_000_000))
result = (data ** 2).sum() # ~0.5s
# FAST - NumPy native
import numpy as np
data = np.arange(100_000_000)
result = (data ** 2).sum() # ~0.05s
❌ Mistake 2: Not Specifying Data Types
# SLOW - Pandas guesses every column's type
import pandas as pd
df = pd.read_csv('data.csv')
# FAST - Specify explicit types
df = pd.read_csv('data.csv', dtype={
'id': 'int64',
'value': 'float32',
'category': 'category' # Saves memory for repeated values
})
❌ Mistake 3: Unnecessary Type Conversions
# SLOW - Round-trips through Pandas
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
values = df['value'].values  # DataFrame → Series → NumPy array
result = np.sum(values)
# FAST - Direct NumPy operation
import numpy as np
values = np.genfromtxt('data.csv', delimiter=',', usecols=[1])
result = np.sum(values)
FAQ: Common Questions & Optimization
Q1: Should I always use NumPy for speed?
A: No. Pandas is faster when you need convenient filtering/grouping.
# ❌ Slow NumPy (manual filtering on mixed data)
import numpy as np
data = np.array([(1, 'A', 100), (2, 'B', 200)], dtype=object)
# Complex manual indexing needed
# ✅ Fast Pandas (built-in filtering)
import pandas as pd
df = pd.DataFrame({
'id': [1, 2],
'category': ['A', 'B'],
'value': [100, 200]
})
result = df[df['category'] == 'A'] # Easy!
Q2: Memory: Can I use NumPy on 100GB data?
A: Not directly, but chunking works.
import numpy as np
# Process 100GB file in 1GB chunks
chunk_size = 10_000_000
total = 0
with open('huge.dat', 'rb') as f:
while True:
chunk = np.fromfile(f, dtype=np.float64, count=chunk_size)
if len(chunk) == 0:
break
total += chunk.sum()
print(f"Total: {total}")
Q3: Why is Pandas slower for simple operations?
A: Overhead of DataFrame structure (indexing, type checking).
import numpy as np
import pandas as pd
import time
# 10M integers
arr = np.arange(10_000_000)
series = pd.Series(arr)
# NumPy (pure)
start = time.time()
result = np.sum(arr)
print(f"NumPy: {time.time() - start:.4f}s") # ~0.001s
# Pandas (includes index, dtype checks)
start = time.time()
result = series.sum()
print(f"Pandas: {time.time() - start:.4f}s") # ~0.01s
Q4: When should I use NumPy over built-in Python?
A: Always. NumPy is 50-100x faster for numerical work.
import numpy as np
import time
# Python list
start = time.time()
lst = [i ** 2 for i in range(10_000_000)]
py_time = time.time() - start
# NumPy
start = time.time()
arr = (np.arange(10_000_000) ** 2)
np_time = time.time() - start
print(f"Python: {py_time:.4f}s") # ~0.5s
print(f"NumPy: {np_time:.4f}s") # ~0.01s
# NumPy is 50x faster!
Q5: Can I mix NumPy and Pandas?
A: Yes, they work together seamlessly.
import numpy as np
import pandas as pd
# Start with Pandas for data loading
df = pd.read_csv('data.csv')
# Extract NumPy array for computation
values = df['price'].values # NumPy array
# Do NumPy operations
result = np.mean(values ** 2)
# Put back in Pandas
df['result'] = result
Q6: How do I profile which is faster?
A: Use timeit or line_profiler.
import timeit
import numpy as np
import pandas as pd
arr = np.arange(1_000_000)
series = pd.Series(arr)
# Time NumPy
np_time = timeit.timeit(lambda: np.sum(arr), number=1000)
print(f"NumPy: {np_time:.4f}s")
# Time Pandas
pd_time = timeit.timeit(lambda: series.sum(), number=1000)
print(f"Pandas: {pd_time:.4f}s")
Q7: Interview Question: Optimize a DataFrame with 1M rows × 100 columns.
A: Here's the strategy:
import pandas as pd
import numpy as np
# Load with optimizations
df = pd.read_csv('huge.csv',
    dtype={
        'age': 'uint8',         # 0-255, uses 1 byte
        'salary': 'float32',    # 32-bit instead of 64-bit
        'category': 'category'  # Repeated values stored as integer codes
    },
    parse_dates=['date'])       # dates go through parse_dates, not dtype
# Use only needed columns
df = df[['age', 'salary', 'category']]
# Use NumPy for pure calculations
median_salary = np.median(df['salary'].values)
# Use Pandas for grouped operations
summary = df.groupby('category').agg({
'salary': 'mean',
'age': 'median'
})
Interview insight: "I'd profile memory first with df.memory_usage(deep=True), then optimize types and use chunking for 1GB+ files."
Q8: When should I use NumPy matrix vs array?
A: Use arrays. Matrices are legacy.
import numpy as np
# ✅ CORRECT - Use arrays
arr = np.array([[1, 2], [3, 4]])
result = arr @ arr.T # Matrix multiplication
# ❌ OLD - Avoid np.matrix
mat = np.matrix([[1, 2], [3, 4]])
result = mat * mat.T # Works but deprecated
Q9: Performance on different operations?
A: Benchmark common operations:
import numpy as np
import pandas as pd
import timeit
# Setup
arr = np.arange(10_000_000)
df = pd.DataFrame({'value': arr})
# Sum
np_sum = timeit.timeit(lambda: np.sum(arr), number=100)
pd_sum = timeit.timeit(lambda: df['value'].sum(), number=100)
# Mean
np_mean = timeit.timeit(lambda: np.mean(arr), number=100)
pd_mean = timeit.timeit(lambda: df['value'].mean(), number=100)
# Std deviation
np_std = timeit.timeit(lambda: np.std(arr), number=100)
pd_std = timeit.timeit(lambda: df['value'].std(), number=100)
# Result: NumPy faster for pure operations, Pandas better for groupby
Q10: Which should I learn first?
A: Pandas first (more practical), then NumPy (more powerful).
Beginner: Pandas (CSV reading, filtering, groupby)
↓
Intermediate: NumPy (vectorization, performance)
↓
Advanced: Both (use where each excels)
Conclusion
NumPy vs Pandas isn't an either/or choice: it's about using the right tool for each job:
- Pure numerical data β NumPy (50x faster)
- Mixed/structured data β Pandas (easier, more features)
- Large files β Process in chunks
- Time series β Pandas (built-in support)
- Performance critical β Profile first, optimize second
Master both and you'll write data pipelines that handle 1B+ rows efficiently.