Replicating .SD in Python Datatable

.SD - Subset of Data

I will be using Jose Morales excellent post to show how .SD's functionality can be replicated in python's datatable. Not all functions can be replicated; R data.table has a whole lot more functions and features that are not yet implemented in datatable.

Link to Notebook

from datatable import dt, by, sort, mean, count, update, max, f, fread
df = fread('Data_files/iris.csv')
df.head()
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
55.43.91.70.4setosa
64.63.41.40.3setosa
753.41.50.2setosa
84.42.91.40.2setosa
94.93.11.50.1setosa

Number of unique observations per column

# DT[, lapply(.SD, uniqueN)] --> Rdatatable

df.nunique()
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0352343223

Mean of all columns by species

# DT[, lapply(.SD, mean), by = species] --> Rdatatable

df[:, mean(f[:]), by('species')]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.0063.4281.4620.246
1versicolor5.9362.774.261.326
2virginica6.5882.9745.5522.026

Filtering

First two observations by species

# DT[, .SD[1:2], by = species]

df[:2, :, by('species')]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.13.51.40.2
1setosa4.931.40.2
2versicolor73.24.71.4
3versicolor6.43.24.51.5
4virginica6.33.362.5
5virginica5.82.75.11.9

In datatable, rows are selected in the i section after the grouping, unlike in R's data.table, where rows are selected in i before grouping, and rows selected in the .SD after grouping.

Last two observations by species

# DT[, tail(.SD, 2), by = species] 

df[-2:, :, by('species')]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.33.71.50.2
1setosa53.31.40.2
2versicolor5.12.531.1
3versicolor5.72.84.11.3
4virginica6.23.45.42.3
5virginica5.935.11.8

Again, the rows are selected after grouping by using Python's negative index slicing.

Select the top two sorted by sepal length in descending order

# DT[order(-sepal_length), head(.SD, 2), by = species] 

df[:2, :, by('species'), sort(-f.sepal_length)]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa5.841.20.2
1setosa5.74.41.50.4
2versicolor73.24.71.4
3versicolor6.93.14.91.5
4virginica7.93.86.42
5virginica7.73.86.72.2

In datatable, the sort function replicates the order function in R's data.table. Note the - symbol before the sepal_length f-expression; this instructs the dataframe to sort in descending order.

Select the top two sorted by the difference between the sepal length and sepal width

# DT[order(sepal_length - sepal_width), head(.SD, 2), by = species] 

df[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa4.63.610.2
1setosa5.24.11.50.1
2versicolor5.434.51.5
3versicolor5.22.73.91.4
4virginica4.92.54.51.7
5virginica5.62.84.92

Just like in R's data.table, boolean expressions can be passed to the sort function.

Filter observations above the mean of sepal_length by species

# DT[, .SD[sepal_length > mean(sepal_length)], by = species] 

df[:, update(temp = f.sepal_length > mean(f.sepal_length)), by('species')]
df[f.temp == 1, f[:-1]]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
15.43.91.70.4setosa
25.43.71.50.2setosa
35.841.20.2setosa
45.74.41.50.4setosa
55.43.91.30.4setosa
65.13.51.40.3setosa
75.73.81.70.3setosa
85.13.81.50.3setosa
95.43.41.70.2setosa
105.13.71.50.4setosa
115.13.31.70.5setosa
125.23.51.50.2setosa
135.23.41.40.2setosa
145.43.41.50.4setosa
636.73.15.62.4virginica
646.93.15.12.3virginica
656.83.25.92.3virginica
666.73.35.72.5virginica
676.735.22.3virginica

Unlike in R's data.table, boolean expressions can not be applied within the i section, in the presence of by. The next best thing is to break it down into two steps - create a temporary column to hold the boolean value, and then filter on that column.

Filter rows with group size greater than 10

# DT[, .SD[.N > 10], keyby = .(species, petal_width)] 

df[:, update(temp = count() > 10), by('species', 'petal_width')]
df[f.temp == 1, f[:-1]]
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.13.51.40.2setosa
14.931.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
453.61.40.2setosa
553.41.50.2setosa
64.42.91.40.2setosa
75.43.71.50.2setosa
84.83.41.60.2setosa
95.841.20.2setosa
105.43.41.70.2setosa
114.63.610.2setosa
124.83.41.90.2setosa
13531.60.2setosa
145.23.51.50.2setosa
486.22.84.81.8virginica
496.134.91.8virginica
506.43.15.51.8virginica
51634.81.8virginica
525.935.11.8virginica

Get the row with the max petal_length by species.

# DT[, .SD[which.max(petal_length)], by = species] OR 
# DT[, .SD[petal_length == max(petal_length)], by = species]  

# get rid of temp column
del df['temp']

df[0, :, by('species'), sort(-f.petal_length)]
speciessepal_lengthsepal_widthpetal_lengthpetal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
0setosa4.83.41.90.2
1versicolor62.75.11.6
2virginica7.72.66.92.3

In the above code, we take advantage of the fact that sorting is done within each group; this allows us to pick the first row per group when petal_length is sorted in descending order.

.SDCols

Including columns in .SD

# col_idx <- grep("^sepal", names(DT)) --> filter for the specicfic columns
# DT[, lapply(.SD, mean), .SDcols = col_idx]

# filter for the specific columns with a list comprehension
names = [f[name] for name in df.names
         if name.startswith('sepal')]

df[:, mean(names)]
sepal_lengthsepal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.843333.05733

Removing columns from .SD

# col_idx <- grep("^(petal|species)", names(DT))
# DT[, lapply(.SD, mean), .SDcols = -col_idx] --> exclusion occurs within .SDcols

# here, exclusion occurs within the list comprehension
names = [f[name] for name in df.names 
         if not name.startswith(('petal','species'))] 

df[:, mean(names)]
sepal_lengthsepal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.843333.05733

Column ranges

# DT[, lapply(.SD, mean), .SDcols = sepal_length:sepal_width]

df[:, mean(f['sepal_length':'sepal_width'])]
sepal_lengthsepal_width
▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪▪
05.843333.05733

Summary

We've seen how to replicate .SD in datatable. There are other functionalities in .SD that are not presently possible in Python's datatable. It is possible that in the future, .SD will be implemented to allow for custom aggregation functions. That would be truly awesome, as it would allow numpy functions and functions from other Python libraries into datatable.