v1.72.0 TopCause is an algorithm that answers the question: "What’s the single biggest change I can make to improve my outcome?"
TopCause takes two inputs:

- `X`: all input variables that you can change, as a DataFrame
- `y`: an outcome variable you want to improve, as a Series

Say a rugby team is recruiting heavy people, and has weight data like this.
male | age | height | weight |
---|---|---|---|
1 | 90.0 | 151.7 | 47.8 |
0 | 90.0 | 139.7 | 36.4 |
0 | 90.0 | 136.5 | 31.8 |
1 | 20.0 | 156.8 | 53.0 |
0 | 10.0 | 145.4 | 41.2 |
See the data
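As a rough illustration of the two inputs, the sketch below rebuilds the five sample rows above with pandas and splits them into `X` and `y`. Dropping the outcome column from `X` is a choice made for this sketch only; it is not something the page prescribes.

```python
import pandas as pd

# The five sample rows from the table above
data = pd.DataFrame({
    'male': [1, 0, 0, 1, 0],
    'age': [90.0, 90.0, 90.0, 20.0, 10.0],
    'height': [151.7, 139.7, 136.5, 156.8, 145.4],
    'weight': [47.8, 36.4, 31.8, 53.0, 41.2],
})

# X: the variables you can change (a DataFrame).
# Assumption of this sketch: the outcome column is left out of X.
X = data.drop(columns=['weight'])

# y: the outcome you want to improve (a Series)
y = data['weight']

print(list(X.columns), y.name, len(X))
```

`X` and `y` share the same row index, so each row of inputs lines up with its outcome.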
If they want to know "What’s the single biggest driver of weight?", TopCause can answer that.
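Here is `topcausecalc.py`, which has a FunctionHandler that returns the drivers.

    import gramex.ml
    import gramex.cache
    from gramex.transforms import handler


    @handler
    def drivers():
        # Load the data set (gramex.cache.open caches the parsed file)
        data = gramex.cache.open('weight.csv')
        model = gramex.ml.TopCause()
        # X is the whole data set, y is the outcome column to improve
        model.fit(data, data['weight'])
        # result_ is a DataFrame with one row per column of X
        return model.result_
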
To set this up, use this `gramex.yaml`:

    url:
      topcause-drivers:
        pattern: drivers
        handler: FunctionHandler
        kwargs:
          function: topcausecalc.drivers

See the drivers
The result in `model.result_` is a DataFrame. For every column (feature) in `X`, there is a row in the result that shows the impact of that feature.
Here is a sample row:

feature | value | gain | p | type |
---|---|---|---|---|
height | 164.5 | 12.7 | 8.4e-13 | num |
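
The columns show:

- `value`: the best value for this feature (e.g. `height` of 164.5cm)
- `gain`: the improvement in the outcome at that value (e.g. increases `weight` by 12.7 kg)
- `p`: the probability that this feature does not impact the outcome (e.g. that `height` does not impact weight)
- `type`: the type of feature (`num` for numeric or `cat` for categorical)

The above example returns:
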
feature | value | gain | p | type |
---|---|---|---|---|
weight | 55.0 | 16.9 | 1.8e-267 | num |
height | 164.5 | 12.7 | 8.4e-13 | num |
male | NaN | NaN | 0.057 | num |
age | NaN | NaN | 0.453 | num |
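
This example says that:

- `weight` tops the list only because the example passes the whole data set, outcome column included, as `X`. It trivially predicts itself and can be ignored.
- `height` is the biggest real driver: a height of 164.5cm is associated with a weight gain of 12.7 kg, and p = 8.4e-13 makes a chance result very unlikely.
- `male` and `age` have p-values (0.057 and 0.453) above the default `max_p` of 0.05, so no `value` or `gain` is reported for them.

Summary: Recruiting tall people (~164cm) can increase team weight by ~12.7kg.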
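
To make the reading of the result concrete, here is a minimal pandas sketch that picks the top driver out of the example result shown above. It works on a hand-typed copy of that table, not on live `model.result_` output. Dropping the `weight` row is this sketch's choice: `weight` appears there only because the example passes the whole data set as `X`.

```python
import pandas as pd

nan = float('nan')

# Hand-typed copy of the example result above, indexed by feature name
result = pd.DataFrame(
    {
        'value': [55.0, 164.5, nan, nan],
        'gain': [16.9, 12.7, nan, nan],
        'p': [1.8e-267, 8.4e-13, 0.057, 0.453],
        'type': ['num', 'num', 'num', 'num'],
    },
    index=['weight', 'height', 'male', 'age'],
)

# Drop the outcome's own row, then any feature with no reported gain
actionable = result.drop(index='weight').dropna(subset=['gain'])

# Rank what is left by gain, largest first
ranked = actionable.sort_values('gain', ascending=False)
top = ranked.index[0]

print(top, ranked.loc[top, 'value'], ranked.loc[top, 'gain'])
```

Only `height` survives both filters here, which matches the summary above.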
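
The constructor `gramex.ml.TopCause()` accepts these parameters:

- `max_p`: float - maximum allowed probability of error (default: `0.05`)
    - `max_p=1` accepts every feature, whatever its p-value
    - `max_p=0.01` accepts only features with at most a 1% probability of error
- `percentile`: float - ignore high-performing outliers beyond this percentile (default: `0.95`)
    - `percentile=1` ignores no outliers
    - `percentile=0.95` ignores outliers above the 95th percentile
- `min_weight`: int - minimum samples in a group. Drop groups with fewer (default: `3`)
    - `min_weight=5` drops groups with fewer than 5 samples
    - `min_weight=0` keeps every group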
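
To get a feel for `max_p`, the sketch below applies a few thresholds by hand to the p-values from the example result above. It only re-filters the documented output and says nothing about how TopCause works internally. `passes` is a hypothetical helper, and treating a p-value equal to `max_p` as allowed is an assumption.

```python
import pandas as pd

# p-values copied from the example result above
p = pd.Series({'weight': 1.8e-267, 'height': 8.4e-13, 'male': 0.057, 'age': 0.453})


def passes(p_values, max_p):
    """Hypothetical helper: features whose p-value is at most max_p.

    Assumption: a p-value exactly equal to max_p counts as allowed.
    """
    return p_values[p_values <= max_p].index.tolist()


# Looser thresholds let more features through
for max_p in (1, 0.1, 0.05, 0.01):
    print(max_p, passes(p, max_p))
```

With these values, `male` (p = 0.057) is accepted at a threshold of 0.1 but not at the default 0.05, which is consistent with its `NaN` gain in the example result.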