Read Multiple CSV Files into One Frame in Python
Link to Source data
pandas
Via read_csv
import pandas as pd
from pathlib import Path
# recursively collect every CSV file under the directory
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
# read each CSV file into its own DataFrame
all_csvs = [pd.read_csv(file) for file in files]
# concatenate the DataFrames into one table
all_csvs = pd.concat(all_csvs)
all_csvs
 | City | Date | Sales | StoreID |
---|---|---|---|---|
0 | Portland | 1/1/2014 | $45.27 | 1002 |
1 | Portland | 1/1/2014 | $115.61 | 1002 |
2 | Portland | 1/1/2014 | $35.33 | 1004 |
3 | Portland | 1/1/2014 | $20.95 | 1004 |
4 | Portland | 1/1/2014 | $14.25 | 1004 |
... | ... | ... | ... | ... |
216073 | Tacoma | 12/31/2015 | $33.68 | 1523 |
216074 | Tacoma | 12/31/2015 | $215.98 | 1521 |
216075 | Tacoma | 12/31/2015 | $236.86 | 1521 |
216076 | Tacoma | 12/31/2015 | $33.02 | 1522 |
216077 | Tacoma | 12/31/2015 | $11.32 | 1523 |
880349 rows × 4 columns
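Note that the index in the output above restarts for each file: the last label is 216077 although the frame has 880349 rows. In addition, `rglob` yields paths in an order that depends on the file system. A minimal sketch of one way to get a reproducible order and a continuous index, by sorting the paths and passing `ignore_index=True` to `pd.concat`. The helper name `read_all_csvs` and the tiny sample files are made up for illustration only:

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_all_csvs(folder):
    """Read every CSV under `folder` into one DataFrame with a fresh 0..n-1 index."""
    # sorted() gives a reproducible file order; rglob alone does not guarantee one
    files = sorted(Path(folder).rglob("*.csv"))
    frames = [pd.read_csv(file) for file in files]
    # ignore_index=True discards the per-file indexes and renumbers the rows
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample data, written to a temporary directory for demonstration
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b_tacoma.csv").write_text("City,Sales\nTacoma,$33.68\nTacoma,$11.32\n")
    Path(tmp, "a_oakland.csv").write_text("City,Sales\nOakland,$9.83\n")
    combined = read_all_csvs(tmp)

print(combined)
```

With sorted paths, the row order no longer depends on how the file system happens to list the directory.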
Via read_csv and the command line
import subprocess
from io import StringIO
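# NR==1 keeps the header line of the first file;
# FNR>1 keeps every line after the first of each file, so later headers are skipped
data = subprocess.run(
    "awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv",
    shell=True,
    capture_output=True,
    text=True,
).stdout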
df = pd.read_csv(StringIO(data))
df
 | City | Date | Sales | StoreID |
---|---|---|---|---|
0 | Oakland | 1/1/2014 | $9.83 | 982 |
1 | Oakland | 1/1/2014 | $28.18 | 983 |
2 | Oakland | 1/1/2014 | $6.83 | 982 |
3 | Oakland | 1/1/2014 | $43.90 | 982 |
4 | Oakland | 1/1/2014 | $17.16 | 980 |
... | ... | ... | ... | ... |
880344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
880349 rows × 4 columns
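The `awk` filter keeps the first line of the first file (`NR==1`) and drops the first line of every later file (`FNR>1`), so the header appears only once. Where `awk` or a POSIX shell is not available, the same idea can be sketched in plain Python. This is an illustrative sketch rather than part of the original approach: `concat_csv_text` is a hypothetical helper and the sample files are made up. Like the `awk` one-liner, it works line by line and assumes no quoted field spans several lines.

```python
import tempfile
from io import StringIO
from pathlib import Path


def concat_csv_text(paths):
    """Join CSV files as text, keeping the header line of the first file only."""
    out = StringIO()
    for position, path in enumerate(paths):
        with open(path, newline="") as handle:
            header = handle.readline()
            if position == 0:
                out.write(header)  # same role as NR==1
            for line in handle:  # same role as FNR>1
                # make sure a file without a trailing newline does not glue rows together
                out.write(line if line.endswith("\n") else line + "\n")
    out.seek(0)
    return out


# Hypothetical sample data, written to a temporary directory for demonstration
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("City,Sales\nOakland,$9.83\n")
    Path(tmp, "b.csv").write_text("City,Sales\nTacoma,$11.32")
    text = concat_csv_text(sorted(Path(tmp).glob("*.csv"))).getvalue()

print(text)
```

The returned buffer can be handed to `pd.read_csv` in the same way as `StringIO(data)` above.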
datatable
Via iread
from datatable import fread, iread, rbind
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
# iread returns an iterator that yields one Frame per file
all_csvs = iread(tuple(files))
# combine into one table with rbind
all_csvs = rbind(all_csvs)
all_csvs
 | City | Date | Sales | StoreID |
---|---|---|---|---|
0 | Portland | 1/1/2014 | $45.27 | 1002 |
1 | Portland | 1/1/2014 | $115.61 | 1002 |
2 | Portland | 1/1/2014 | $35.33 | 1004 |
3 | Portland | 1/1/2014 | $20.95 | 1004 |
4 | Portland | 1/1/2014 | $14.25 | 1004 |
5 | Portland | 1/1/2014 | $37.30 | 1002 |
6 | Portland | 1/1/2014 | $38.40 | 1003 |
7 | Portland | 1/1/2014 | $35.50 | 1003 |
8 | Portland | 1/1/2014 | $37.49 | 1005 |
9 | Portland | 1/1/2014 | $11.17 | 1002 |
10 | Portland | 1/1/2014 | $179.88 | 1002 |
11 | Portland | 1/1/2014 | $42.19 | 1003 |
12 | Portland | 1/1/2014 | $80.73 | 1004 |
13 | Portland | 1/1/2014 | $118.58 | 1003 |
14 | Portland | 1/1/2014 | $132.68 | 1005 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
880,344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880,345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880,346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880,347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880,348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
Via fread
files = Path("./Data_files/multiple_csvs/").rglob("*.csv")
# read each CSV file into its own Frame
all_csvs = [fread(filename) for filename in files]
# combine into one frame
rbind(all_csvs)
 | City | Date | Sales | StoreID |
---|---|---|---|---|
0 | Portland | 1/1/2014 | $45.27 | 1002 |
1 | Portland | 1/1/2014 | $115.61 | 1002 |
2 | Portland | 1/1/2014 | $35.33 | 1004 |
3 | Portland | 1/1/2014 | $20.95 | 1004 |
4 | Portland | 1/1/2014 | $14.25 | 1004 |
5 | Portland | 1/1/2014 | $37.30 | 1002 |
6 | Portland | 1/1/2014 | $38.40 | 1003 |
7 | Portland | 1/1/2014 | $35.50 | 1003 |
8 | Portland | 1/1/2014 | $37.49 | 1005 |
9 | Portland | 1/1/2014 | $11.17 | 1002 |
10 | Portland | 1/1/2014 | $179.88 | 1002 |
11 | Portland | 1/1/2014 | $42.19 | 1003 |
12 | Portland | 1/1/2014 | $80.73 | 1004 |
13 | Portland | 1/1/2014 | $118.58 | 1003 |
14 | Portland | 1/1/2014 | $132.68 | 1005 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
880,344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880,345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880,346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880,347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880,348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
Via fread and the command line
# less verbose than the pandas version: fread runs the shell command and parses its output directly
all_csvs = fread(cmd="awk '(NR==1) || (FNR>1)' ./Data_files/multiple_csvs/*.csv")
all_csvs
 | City | Date | Sales | StoreID |
---|---|---|---|---|
0 | Oakland | 1/1/2014 | $9.83 | 982 |
1 | Oakland | 1/1/2014 | $28.18 | 983 |
2 | Oakland | 1/1/2014 | $6.83 | 982 |
3 | Oakland | 1/1/2014 | $43.90 | 982 |
4 | Oakland | 1/1/2014 | $17.16 | 980 |
5 | Oakland | 1/1/2014 | $14.29 | 982 |
6 | Oakland | 1/1/2014 | $6.48 | 982 |
7 | Oakland | 1/1/2014 | $232.13 | 982 |
8 | Oakland | 1/1/2014 | $209.28 | 981 |
9 | Oakland | 1/1/2014 | $12.11 | 981 |
10 | Oakland | 1/1/2014 | $107.96 | 982 |
11 | Oakland | 1/1/2014 | $47.64 | 980 |
12 | Oakland | 1/1/2014 | $15.44 | 981 |
13 | Oakland | 1/1/2014 | $52.72 | 982 |
14 | Oakland | 1/1/2014 | $3.25 | 981 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
880,344 | Tacoma | 12/31/2015 | $33.68 | 1523 |
880,345 | Tacoma | 12/31/2015 | $215.98 | 1521 |
880,346 | Tacoma | 12/31/2015 | $236.86 | 1521 |
880,347 | Tacoma | 12/31/2015 | $33.02 | 1522 |
880,348 | Tacoma | 12/31/2015 | $11.32 | 1523 |
Resources Used
datatable - fread
datatable - iread
pandas - read_csv
python - subprocess
python - pathlib
StackOverflow - read csv files via the command line