v1.72.0: TopCause is an algorithm that answers the question: *What’s the single biggest change I can make to improve my outcome?*
TopCause takes two inputs:
- `X`: all input variables that you can change, as a DataFrame
- `y`: an outcome variable you want to improve, as a Series

Say a rugby team is recruiting heavy people and has weight data like this:

| male | age | height | weight |
|---|---|---|---|
| 1 | 90.0 | 151.7 | 47.8 |
| 0 | 90.0 | 139.7 | 36.4 |
| 0 | 90.0 | 136.5 | 31.8 |
| 1 | 20.0 | 156.8 | 53.0 |
| 0 | 10.0 | 145.4 | 41.2 |
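
As a minimal sketch of the two inputs, the five sample rows above can be typed into pandas by hand and split into `X` and `y`. Dropping the outcome column from `X` is a choice made in this sketch, not something the text above requires.

```python
import pandas as pd

# The five sample rows from the table above, typed in by hand
data = pd.DataFrame(
    {
        "male": [1, 0, 0, 1, 0],
        "age": [90.0, 90.0, 90.0, 20.0, 10.0],
        "height": [151.7, 139.7, 136.5, 156.8, 145.4],
        "weight": [47.8, 36.4, 31.8, 53.0, 41.2],
    }
)

# y: the outcome to improve, as a Series
y = data["weight"]

# X: the variables you can change, as a DataFrame
# (excluding the outcome is this sketch's assumption)
X = data.drop(columns=["weight"])

print(list(X.columns))
print(y.name)
```

Here `X` holds `male`, `age` and `height`, and `y` is the `weight` column.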
If they ask *What’s the single biggest driver of weight?*, TopCause can answer that.

Here is `topcausecalc.py`, which has a FunctionHandler that returns the drivers:
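```python
import gramex.ml
import gramex.cache
from gramex.transforms import handler


@handler
def drivers():
    # Load the weight data shown above
    data = gramex.cache.open('weight.csv')
    model = gramex.ml.TopCause()
    # X is the full table; y is the weight column
    model.fit(data, data['weight'])
    # result_ has one row per column of X
    return model.result_
```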
To set this up, use this `gramex.yaml`:

```yaml
url:
  topcause-drivers:
    pattern: drivers
    handler: FunctionHandler
    kwargs:
      function: topcausecalc.drivers
```
The result in `model.result_` is a DataFrame. For every column (feature) in `X`, the result has one row that shows the impact of that feature.

Here is a sample row:

| | value | gain | p | type |
|---|---|---|---|---|
| height | 164.5 | 12.7 | 8.4e-13 | num |
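The columns show:

- `value`: the best value for this feature (e.g. a `height` of 164.5cm)
- `gain`: the improvement in the outcome at that value (e.g. it increases `weight` by 12.7 kg)
- `p`: the probability that this feature does not impact the outcome (e.g. a probability of 8.4e-13 that `height` does not impact `weight`)
- `type`: the kind of feature (`num` or `cat`)

The above example returns:
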
| | value | gain | p | type |
|---|---|---|---|---|
| weight | 55.0 | 16.9 | 1.8e-267 | num |
| height | 164.5 | 12.7 | 8.4e-13 | num |
| male | NaN | NaN | 0.057 | num |
| age | NaN | NaN | 0.453 | num |
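This example says that:

- The `weight` row describes the outcome itself (the handler passes the full `data`, including `weight`, as `X`), so it is not an actionable driver.
- `height` is the biggest driver: a height of 164.5cm goes with a gain of 12.7kg (p = 8.4e-13).
- `male` and `age` have no `value` or `gain`; their p-values (0.057 and 0.453) are above the default `max_p` of 0.05.

Summary: Recruiting tall people (~164cm) can increase team weight by ~12.7kg.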
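
The same reading can be reproduced with ordinary pandas filtering. The sketch below re-types the result table above by hand rather than calling TopCause; the `outcome` and `max_p` variables, and the choice of `<=` for the comparison, are assumptions of this sketch.

```python
import pandas as pd

nan = float("nan")

# The result table shown above, typed in by hand
result = pd.DataFrame(
    {
        "value": [55.0, 164.5, nan, nan],
        "gain": [16.9, 12.7, nan, nan],
        "p": [1.8e-267, 8.4e-13, 0.057, 0.453],
        "type": ["num", "num", "num", "num"],
    },
    index=["weight", "height", "male", "age"],
)

outcome = "weight"  # the outcome column is not an actionable driver
max_p = 0.05        # same value as the documented default

# Drop the outcome row, then keep rows whose p-value is within the threshold
drivers = result.drop(index=outcome)
drivers = drivers[drivers["p"] <= max_p]

# Rank the remaining features by gain, biggest first
drivers = drivers.sort_values("gain", ascending=False)
top = drivers.index[0]
print(top, drivers.loc[top, "value"], drivers.loc[top, "gain"])
```

Only `height` survives the filter, which matches the summary above.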
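The constructor `gramex.ml.TopCause()` accepts these parameters:

- `max_p`: float - maximum allowed probability of error (default: 0.05).
    - `max_p=1`: no feature is excluded because of its p-value
    - `max_p=0.01`: a stricter threshold of 1%
- `percentile`: float - ignore high-performing outliers beyond this percentile (default: 0.95)
    - `percentile=1`: no outliers are ignored
    - `percentile=0.95`: outliers above the 95th percentile are ignored
- `min_weight`: int - minimum samples in a group. Drop groups with fewer (default: 3)
    - `min_weight=5`: groups with fewer than 5 samples are dropped
    - `min_weight=0`: no groups are dropped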
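
To illustrate why a minimum group size matters, here is a small pandas sketch of the `min_weight` idea on made-up toy data. It is not TopCause's implementation; the `group` and `outcome` columns and their values are hypothetical, and only the "drop groups with fewer samples" rule is taken from the description above.

```python
import pandas as pd

# Made-up toy data, for illustration only
df = pd.DataFrame(
    {
        "group": ["a", "a", "a", "a", "b", "b", "b", "c", "c"],
        "outcome": [10, 12, 11, 13, 20, 22, 21, 90, 95],
    }
)

min_weight = 3  # same value as the documented default

# Summarise each group: number of samples and mean outcome
stats = df.groupby("group")["outcome"].agg(["count", "mean"])

# Drop groups with fewer than min_weight samples
kept = stats[stats["count"] >= min_weight]

best_before = stats["mean"].idxmax()  # best group when every group counts
best_after = kept["mean"].idxmax()    # best group once small groups are dropped
print(best_before, best_after)
```

In this toy data, group `c` has the highest mean but only two samples, so it is dropped and `b` becomes the best remaining group.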