Read Multiple CSV Files into One Frame in Python

Pandas

Via read_csv

import pandas as pd
from pathlib import Path

# create a Path instance and recursively collect every csv file under it
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

# read in all the csv files
all_csvs = [pd.read_csv(file) for file in files]

# lump into one table
all_csvs = pd.concat(all_csvs)

all_csvs
            City        Date    Sales  StoreID
0       Portland    1/1/2014   $45.27     1002
1       Portland    1/1/2014  $115.61     1002
2       Portland    1/1/2014   $35.33     1004
3       Portland    1/1/2014   $20.95     1004
4       Portland    1/1/2014   $14.25     1004
...          ...         ...      ...      ...
216073    Tacoma  12/31/2015   $33.68     1523
216074    Tacoma  12/31/2015  $215.98     1521
216075    Tacoma  12/31/2015  $236.86     1521
216076    Tacoma  12/31/2015   $33.02     1522
216077    Tacoma  12/31/2015   $11.32     1523

880349 rows × 4 columns
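
The output above ends at index label 216077 even though the frame has 880349 rows, because pd.concat keeps each file's own row labels by default. The following is a minimal sketch, assuming a continuous index and a reproducible file order are wanted. The helper name read_csv_folder and the two tiny sample files are made up for illustration and are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_folder(folder):
    """Read every CSV under `folder` (recursively) into one DataFrame."""
    # sorted() makes the file order, and so the row order, reproducible
    files = sorted(Path(folder).rglob("*.csv"))
    frames = [pd.read_csv(file) for file in files]
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)


# Made-up sample files standing in for ./Data_files/multiple_csvs/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("City,Sales\nOakland,9.83\nOakland,28.18\n")
    Path(tmp, "b.csv").write_text("City,Sales\nTacoma,33.68\n")
    combined = read_csv_folder(tmp)

print(combined)
```

Without ignore_index=True the same call would repeat the labels 0, 1, ... for each file, which can surprise later label-based lookups such as .loc.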

Via read_csv and the command line

import subprocess
from io import StringIO

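# awk keeps line 1 of the combined input (the header) and drops the
# first line of every later file; shell=True lets the shell expand *.csv
data = subprocess.run("awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv",
                      shell=True,
                      capture_output=True,
                      text=True).stdout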

df = pd.read_csv(StringIO(data))

df
            City        Date    Sales  StoreID
0        Oakland    1/1/2014    $9.83      982
1        Oakland    1/1/2014   $28.18      983
2        Oakland    1/1/2014    $6.83      982
3        Oakland    1/1/2014   $43.90      982
4        Oakland    1/1/2014   $17.16      980
...          ...         ...      ...      ...
880344    Tacoma  12/31/2015   $33.68     1523
880345    Tacoma  12/31/2015  $215.98     1521
880346    Tacoma  12/31/2015  $236.86     1521
880347    Tacoma  12/31/2015   $33.02     1522
880348    Tacoma  12/31/2015   $11.32     1523

880349 rows × 4 columns
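
In the awk expression above, NR counts lines across all input files and FNR restarts at 1 for each file, so the condition keeps the first line overall (the header) and drops the first line of every later file. The same rule can be sketched in plain Python for systems where awk is not available. The function name concat_csv_text and the sample files are hypothetical, and the sketch assumes every file is non-empty and starts with the same header.

```python
import tempfile
from pathlib import Path


def concat_csv_text(paths):
    """Join CSV files as text, keeping only the first file's header line."""
    lines = []
    for file_number, path in enumerate(paths):
        with open(path) as handle:
            for line_number, line in enumerate(handle):
                # mirrors awk's (NR==1) || (FNR>1): keep the first line
                # of the first file, skip the first line of later files
                if file_number == 0 or line_number > 0:
                    lines.append(line.rstrip("\r\n"))
    return "\n".join(lines) + "\n"


# Made-up sample files standing in for ./Data_files/multiple_csvs/
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("City,Sales\nOakland,9.83\n")
    Path(tmp, "b.csv").write_text("City,Sales\nTacoma,33.68\n")
    text = concat_csv_text(sorted(Path(tmp).glob("*.csv")))

print(text)
```

The returned string can be wrapped in StringIO and passed to pd.read_csv exactly like the awk output. Note that the shell pattern *.csv, like glob here, only matches files directly inside the folder, whereas rglob in the earlier section also descends into subfolders.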

datatable

Via iread

from datatable import fread, iread, rbind

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

# iread returns an iterator of frames, one per file
all_csvs = iread(tuple(files))

# combine into one table with rbind
all_csvs = rbind(all_csvs)

all_csvs
         City      Date        Sales    StoreID
0        Portland  1/1/2014    $45.27   1002
1        Portland  1/1/2014    $115.61  1002
2        Portland  1/1/2014    $35.33   1004
3        Portland  1/1/2014    $20.95   1004
4        Portland  1/1/2014    $14.25   1004
5        Portland  1/1/2014    $37.30   1002
6        Portland  1/1/2014    $38.40   1003
7        Portland  1/1/2014    $35.50   1003
8        Portland  1/1/2014    $37.49   1005
9        Portland  1/1/2014    $11.17   1002
10       Portland  1/1/2014    $179.88  1002
11       Portland  1/1/2014    $42.19   1003
12       Portland  1/1/2014    $80.73   1004
13       Portland  1/1/2014    $118.58  1003
14       Portland  1/1/2014    $132.68  1005
...      ...       ...         ...      ...
880,344  Tacoma    12/31/2015  $33.68   1523
880,345  Tacoma    12/31/2015  $215.98  1521
880,346  Tacoma    12/31/2015  $236.86  1521
880,347  Tacoma    12/31/2015  $33.02   1522
880,348  Tacoma    12/31/2015  $11.32   1523

Via fread

files = Path("./Data_files/multiple_csvs/").rglob("*.csv")

all_csvs = [fread(filename) for filename in files]

# combine into one frame
rbind(all_csvs)
         City      Date        Sales    StoreID
0        Portland  1/1/2014    $45.27   1002
1        Portland  1/1/2014    $115.61  1002
2        Portland  1/1/2014    $35.33   1004
3        Portland  1/1/2014    $20.95   1004
4        Portland  1/1/2014    $14.25   1004
5        Portland  1/1/2014    $37.30   1002
6        Portland  1/1/2014    $38.40   1003
7        Portland  1/1/2014    $35.50   1003
8        Portland  1/1/2014    $37.49   1005
9        Portland  1/1/2014    $11.17   1002
10       Portland  1/1/2014    $179.88  1002
11       Portland  1/1/2014    $42.19   1003
12       Portland  1/1/2014    $80.73   1004
13       Portland  1/1/2014    $118.58  1003
14       Portland  1/1/2014    $132.68  1005
...      ...       ...         ...      ...
880,344  Tacoma    12/31/2015  $33.68   1523
880,345  Tacoma    12/31/2015  $215.98  1521
880,346  Tacoma    12/31/2015  $236.86  1521
880,347  Tacoma    12/31/2015  $33.02   1522
880,348  Tacoma    12/31/2015  $11.32   1523

Via fread and the command line

# less verbose than the Pandas option
all_csvs = fread(cmd="awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv")
all_csvs
         City      Date        Sales    StoreID
0        Oakland   1/1/2014    $9.83    982
1        Oakland   1/1/2014    $28.18   983
2        Oakland   1/1/2014    $6.83    982
3        Oakland   1/1/2014    $43.90   982
4        Oakland   1/1/2014    $17.16   980
5        Oakland   1/1/2014    $14.29   982
6        Oakland   1/1/2014    $6.48    982
7        Oakland   1/1/2014    $232.13  982
8        Oakland   1/1/2014    $209.28  981
9        Oakland   1/1/2014    $12.11   981
10       Oakland   1/1/2014    $107.96  982
11       Oakland   1/1/2014    $47.64   980
12       Oakland   1/1/2014    $15.44   981
13       Oakland   1/1/2014    $52.72   982
14       Oakland   1/1/2014    $3.25    981
...      ...       ...         ...      ...
880,344  Tacoma    12/31/2015  $33.68   1523
880,345  Tacoma    12/31/2015  $215.98  1521
880,346  Tacoma    12/31/2015  $236.86  1521
880,347  Tacoma    12/31/2015  $33.02   1522
880,348  Tacoma    12/31/2015  $11.32   1523

Resources Used

datatable - fread

datatable - iread

pandas - read_csv

python - subprocess

python - pathlib

StackOverflow - read csv files via the command line