Pandas pd.concat(): Merge & Combine DataFrames (Complete Guide)
pd.concat() is one of the most essential Pandas functions for combining multiple DataFrames. Whether you're concatenating rows, columns, or handling indices, this guide covers everything you need to know.
What is pd.concat()?
pd.concat() combines multiple Pandas DataFrames or Series into a single DataFrame.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Combine them
result = pd.concat([df1, df2])
print(result)
# A B
# 0 1 3
# 1 2 4
# 0 5 7
# 1 6 8
Notice the indices repeat (0, 1, 0, 1). You often need to reset them.
Basic Syntax
pd.concat(objs, axis=0, ignore_index=False, keys=None, sort=False)
Parameters:
objs: List of DataFrames/Series to combineaxis: 0 (rows) or 1 (columns)ignore_index: If True, create new 0-1-2... indexkeys: Add hierarchical index levelssort: Sort column names (default False)
Combining Rows (axis=0) - Most Common
Basic Row Concatenation
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30],
'City': ['NYC', 'LA']
})
df2 = pd.DataFrame({
'Name': ['Charlie', 'David'],
'Age': [35, 28],
'City': ['Chicago', 'Houston']
})
# Combine by rows
result = pd.concat([df1, df2], axis=0)
print(result)
# Name Age City
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 0 Charlie 35 Chicago
# 1 David 28 Houston
Reset Index for Clean Results
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)
# Name Age City
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 2 Charlie 35 Chicago
# 3 David 28 Houston
Much cleaner! ignore_index=True creates a new 0-1-2-3 index.
Combining Rows with Different Columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]}) # Note: has C, not B
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)
# A B C
# 0 1 3.0 NaN
# 1 2 4.0 NaN
# 2 5 NaN 7.0
# 3 6 NaN 8.0
Missing columns are filled with NaN.
Combining Columns (axis=1)
Basic Column Concatenation
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
})
df2 = pd.DataFrame({
'Salary': [50000, 60000],
'Department': ['Sales', 'Engineering']
})
# Combine by columns (side by side)
result = pd.concat([df1, df2], axis=1)
print(result)
# Name Age Salary Department
# 0 Alice 25 50000 Sales
# 1 Bob 30 60000 Engineering
This is equivalent to adding new columns to df1.
Column Concatenation with Different Indices
df1 = pd.DataFrame({'A': [1, 2]}, index=['row1', 'row2'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['row2', 'row3'])
result = pd.concat([df1, df2], axis=1)
print(result)
# A B
# row1 1.0 NaN
# row2 2.0 3.0
# row3 NaN 4.0
Missing index combinations are filled with NaN.
pd.concat() vs merge() vs join()
| Method | Use Case | Axis | Index |
|---|---|---|---|
| concat() | Simple row/column stacking | Both | Preserves or resets |
| merge() | SQL-like joins (inner, outer) | 1 (columns) | Match on keys |
| join() | Join by index | 1 (columns) | Always by index |
When to Use Each
Use concat() for:
- Stacking multiple DataFrames (same structure)
- Combining rows from different sources
- Simple column addition
Use merge() for:
- Joining on specific columns
- Inner/outer/left/right joins
- When you need SQL-like behavior
Use join() for:
- Quick joins by index (no key matching)
- Cleaner syntax when both DataFrames have matching indices
Example Comparison
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])
df2 = pd.DataFrame({
'Salary': [50000, 60000]
}, index=[1, 2])
# All three give the same result:
print(pd.concat([df1, df2], axis=1))
print(df1.merge(df2, left_index=True, right_index=True))
print(df1.join(df2))
# Name Age Salary
# 1 Alice 25 50000
# 2 Bob 30 60000
All work, but concat() is simplest for this case.
Real-World Example: Combining CSV Files
import pandas as pd
import glob
# Read all CSV files from a folder
csv_files = glob.glob('data/*.csv')
dfs = [pd.read_csv(file) for file in csv_files]
# Combine all into one DataFrame
result = pd.concat(dfs, axis=0, ignore_index=True)
# Save combined data
result.to_csv('all_data_combined.csv', index=False)
This is perfect for processing data from multiple sources.
Handling Indices: keys Parameter
df1 = pd.DataFrame({'data': [1, 2]})
df2 = pd.DataFrame({'data': [3, 4]})
df3 = pd.DataFrame({'data': [5, 6]})
# Add hierarchical index showing source
result = pd.concat(
[df1, df2, df3],
keys=['Jan', 'Feb', 'Mar']
)
print(result)
# data
# Jan 0 1
# 1 2
# Feb 0 3
# 1 4
# Mar 0 5
# 1 6
Useful for tracking where each row came from.
Performance: concat() vs append()
Warning: DataFrame.append() is deprecated in modern Pandas. Use concat() instead.
# OLD (deprecated)
result = df1.append(df2)
# NEW (correct)
result = pd.concat([df1, df2])
concat() is faster and more flexible.
Handling Duplicate Columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Duplicate column names
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8
Works fine. The 'A' column from both DataFrames gets combined into one column.
Sorting Columns After Concatenation
df1 = pd.DataFrame({'B': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})
# Default (no sorting)
result = pd.concat([df1, df2], axis=1)
print(result)
# B A
# 0 1 3
# 1 2 4
# Sort columns
result = pd.concat([df1, df2], axis=1, sort=True)
print(result)
# A B
# 0 3 1
# 1 4 2
Use sort=True when you want alphabetical column ordering.
Ignoring Index vs Preserving It
df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'A': [3, 4]}, index=['c', 'd'])
# Preserve original indices
result = pd.concat([df1, df2], ignore_index=False)
print(result)
# A
# a 1
# b 2
# c 3
# d 4
# Reset to 0-1-2-3
result = pd.concat([df1, df2], ignore_index=True)
print(result)
# A
# 0 1
# 1 2
# 2 3
# 3 4
Choose based on whether you need the original indices.
Common Errors
ā Error 1: Empty DataFrame List
# WRONG
result = pd.concat([]) # Error: no objects to concatenate
# CORRECT
result = pd.concat([df1, df2])
ā Error 2: Forgetting ignore_index
# WRONG - duplicate indices
result = pd.concat([df1, df2]) # Indices: [0, 1, 0, 1]
# CORRECT
result = pd.concat([df1, df2], ignore_index=True) # Indices: [0, 1, 2, 3]
ā Error 3: Using append() (Deprecated)
# WRONG - deprecated
result = df1.append(df2)
# CORRECT
result = pd.concat([df1, df2])
Best Practices
- Always use
ignore_index=Truewhen combining row-wise data (unless you need specific indices) - Use
concat()for simple stacking,merge()for complex joins - Check column alignment when combining DataFrames with different columns
- Watch out for NaN values when columns don't match
- Use
keys=to track source of rows in multi-source data - Prefer
concat()over deprecatedappend()
Real-World Workflow
import pandas as pd
# 1. Read multiple data sources
sales_jan = pd.read_csv('sales_jan.csv')
sales_feb = pd.read_csv('sales_feb.csv')
sales_mar = pd.read_csv('sales_mar.csv')
# 2. Combine all months
all_sales = pd.concat([sales_jan, sales_feb, sales_mar], ignore_index=True)
# 3. Add computed columns from another source
regions = pd.read_csv('regions.csv')
merged = all_sales.merge(regions, on='store_id')
# 4. Save combined data
merged.to_csv('all_sales_with_regions.csv', index=False)
Conclusion
pd.concat() is essential for any data work in Python:
- Simple row stacking:
concat([df1, df2], ignore_index=True) - Column addition:
concat([df1, df2], axis=1) - Multi-source data: Use
keys=to track origins - Performance: Faster than deprecated
append()
Master concat() and you've mastered one of Pandas' most powerful tools.