.SD - Subset of Data
I will be using Jose Morales excellent post to show how .SD's functionality can be replicated in python's datatable. Not all functions can be replicated; R data.table has a whole lot more functions and features that are not yet implemented in datatable.
Link to Notebook
from datatable import dt, by, sort, mean, count, update, max, f, fread
df = fread('Data_files/iris.csv')
df.head()
| sepal_length | sepal_width | petal_length | petal_width | species |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
7 | 5 | 3.4 | 1.5 | 0.2 | setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
Number of unique observations per column
df.nunique()
| sepal_length | sepal_width | petal_length | petal_width | species |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | 35 | 23 | 43 | 22 | 3 |
Mean of all columns by species
df[:, mean(f[:]), by('species')]
| species | sepal_length | sepal_width | petal_length | petal_width |
| ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | setosa | 5.006 | 3.428 | 1.462 | 0.246 |
1 | versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
2 | virginica | 6.588 | 2.974 | 5.552 | 2.026 |
Filtering
First two observations by species
df[:2, :, by('species')]
| species | sepal_length | sepal_width | petal_length | petal_width |
| ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | setosa | 5.1 | 3.5 | 1.4 | 0.2 |
1 | setosa | 4.9 | 3 | 1.4 | 0.2 |
2 | versicolor | 7 | 3.2 | 4.7 | 1.4 |
3 | versicolor | 6.4 | 3.2 | 4.5 | 1.5 |
4 | virginica | 6.3 | 3.3 | 6 | 2.5 |
5 | virginica | 5.8 | 2.7 | 5.1 | 1.9 |
In datatable, rows are selected in the i
section after the grouping, unlike in R's data.table, where rows are selected in i
before grouping, and rows selected in the .SD
after grouping.
Last two observations by species
df[-2:, :, by('species')]
| species | sepal_length | sepal_width | petal_length | petal_width |
| ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | setosa | 5.3 | 3.7 | 1.5 | 0.2 |
1 | setosa | 5 | 3.3 | 1.4 | 0.2 |
2 | versicolor | 5.1 | 2.5 | 3 | 1.1 |
3 | versicolor | 5.7 | 2.8 | 4.1 | 1.3 |
4 | virginica | 6.2 | 3.4 | 5.4 | 2.3 |
5 | virginica | 5.9 | 3 | 5.1 | 1.8 |
Again, the rows are selected after grouping by using Python's negative index slicing.
Select the top two sorted by sepal length
in descending order
df[:2, :, by('species'), sort(-f.sepal_length)]
| species | sepal_length | sepal_width | petal_length | petal_width |
| ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | setosa | 5.8 | 4 | 1.2 | 0.2 |
1 | setosa | 5.7 | 4.4 | 1.5 | 0.4 |
2 | versicolor | 7 | 3.2 | 4.7 | 1.4 |
3 | versicolor | 6.9 | 3.1 | 4.9 | 1.5 |
4 | virginica | 7.9 | 3.8 | 6.4 | 2 |
5 | virginica | 7.7 | 3.8 | 6.7 | 2.2 |
In datatable, the sort function replicates the order
function in R's data.table. Note the -
symbol before the sepal_length f-expression; this instructs the dataframe to sort in descending order.
Select the top two sorted by the difference between the sepal length
and sepal width
df[:2, :, by('species'), sort(f.sepal_length - f.sepal_width)]
| species | sepal_length | sepal_width | petal_length | petal_width |
| ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | setosa | 4.6 | 3.6 | 1 | 0.2 |
1 | setosa | 5.2 | 4.1 | 1.5 | 0.1 |
2 | versicolor | 5.4 | 3 | 4.5 | 1.5 |
3 | versicolor | 5.2 | 2.7 | 3.9 | 1.4 |
4 | virginica | 4.9 | 2.5 | 4.5 | 1.7 |
5 | virginica | 5.6 | 2.8 | 4.9 | 2 |
Just like in R's data.table, boolean expressions can be passed to the sort function.
Filter observations above the mean of sepal_length
by species
df[:, update(temp = f.sepal_length > mean(f.sepal_length)), by('species')]
df[f.temp == 1, f[:-1]]
| sepal_length | sepal_width | petal_length | petal_width | species |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
2 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
3 | 5.8 | 4 | 1.2 | 0.2 | setosa |
4 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
5 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
6 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
7 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
8 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
9 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
10 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
11 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
12 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
13 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
14 | 5.4 | 3.4 | 1.5 | 0.4 | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
63 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
64 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
65 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
66 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
67 | 6.7 | 3 | 5.2 | 2.3 | virginica |
Unlike in R's data.table, boolean expressions can not be applied within the i
section, in the presence of by
. The next best thing is to break it down into two steps - create a temporary column to hold the boolean value, and then filter on that column.
Filter rows with group size greater than 10
df[:, update(temp = count() > 10), by('species', 'petal_width')]
df[f.temp == 1, f[:-1]]
| sepal_length | sepal_width | petal_length | petal_width | species |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪ |
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 3.6 | 1.4 | 0.2 | setosa |
5 | 5 | 3.4 | 1.5 | 0.2 | setosa |
6 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
7 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
8 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
9 | 5.8 | 4 | 1.2 | 0.2 | setosa |
10 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
11 | 4.6 | 3.6 | 1 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
13 | 5 | 3 | 1.6 | 0.2 | setosa |
14 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
48 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
49 | 6.1 | 3 | 4.9 | 1.8 | virginica |
50 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |
51 | 6 | 3 | 4.8 | 1.8 | virginica |
52 | 5.9 | 3 | 5.1 | 1.8 | virginica |
Get the row with the max petal_length by species.
del df['temp']
df[0, :, by('species'), sort(-f.petal_length)]
| species | sepal_length | sepal_width | petal_length | petal_width |
| ▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | setosa | 4.8 | 3.4 | 1.9 | 0.2 |
1 | versicolor | 6 | 2.7 | 5.1 | 1.6 |
2 | virginica | 7.7 | 2.6 | 6.9 | 2.3 |
In the above code, we take advantage of the fact that sorting is done within each group; this allows us to pick the first row per group when petal_length
is sorted in descending order.
.SDCols
Including columns in .SD
names = [f[name] for name in df.names
if name.startswith('sepal')]
df[:, mean(names)]
| sepal_length | sepal_width |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | 5.84333 | 3.05733 |
Removing columns from .SD
names = [f[name] for name in df.names
if not name.startswith(('petal','species'))]
df[:, mean(names)]
| sepal_length | sepal_width |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | 5.84333 | 3.05733 |
Column ranges
df[:, mean(f['sepal_length':'sepal_width'])]
| sepal_length | sepal_width |
| ▪▪▪▪▪▪▪▪ | ▪▪▪▪▪▪▪▪ |
0 | 5.84333 | 3.05733 |
Summary
We've seen how to replicate .SD
in datatable. There are other functionalities in .SD
that are not presently possible in Python's datatable. It is possible that in the future, .SD
will be implemented to allow for custom aggregation functions. That would be truly awesome, as it would allow numpy functions and functions from other Python libraries into datatable.