Machine learning

`gramex.ml`

`Classifier(**kwargs)`

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	data to train / re-train the model with	required
`model_class`	`str`	model class to use	`'sklearn.naive_bayes.BernoulliNB'`
`model_kwargs`	`dict`	kwargs to pass to model class constructor	`{}`
`output`	`str`	output column name	last column in training data
`input`	`list`	input column names	all columns except `output`
`labels`	`list`	list of possible output values	unique `output` values in training data

Source code in gramex\ml.py

def __init__(self, **kwargs):
    '''
    Parameters:

        data DataFrame: data to train / re-train the model with
        model_class str: model class to use (default: `sklearn.naive_bayes.BernoulliNB`)
        model_kwargs dict: kwargs to pass to model class constructor (defaults: `{}`)
        output str: output column name (default: last column in training data)
        input list: input column names (default: all columns except `output`)
        labels list: list of possible output values (default: unique `output` in training)
    '''

    vars(self).update(kwargs)
    self.model_class = kwargs.get('model_class', 'sklearn.naive_bayes.BernoulliNB')
    self.trained = False  # Boolean Flag
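
The constructor above stores every keyword argument as an instance attribute and only guarantees a value for `model_class`. A minimal sketch of that pattern follows, using a hypothetical stand-in class named `Settings` (not part of gramex) so it runs without any dependencies:

```python
class Settings:
    """Hypothetical stand-in that mimics how Classifier stores its keyword arguments."""

    def __init__(self, **kwargs):
        # Every keyword argument becomes an instance attribute
        vars(self).update(kwargs)
        # model_class always gets a value, falling back to the documented default
        self.model_class = kwargs.get('model_class', 'sklearn.naive_bayes.BernoulliNB')
        self.trained = False


plain = Settings()
custom = Settings(model_class='sklearn.linear_model.SGDClassifier', output='species')

print(plain.model_class)         # sklearn.naive_bayes.BernoulliNB
print(hasattr(plain, 'output'))  # False: not set until a default is resolved later
print(custom.output)             # species
```

Because unspecified settings are simply absent from the instance, the remaining defaults can only be resolved later, when training data is available.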

`train(data)`

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	data to train / re-train the model with	required
`model_class`	`str`	model class to use	`'sklearn.naive_bayes.BernoulliNB'`
`model_kwargs`	`dict`	kwargs to pass to model class constructor	`{}`
`output`	`str`	output column name	last column in training data
`input`	`list`	input column names	all columns except `output`
`labels`	`list`	list of possible output values	unique `output` values in training data

Only `data` is passed to `train()`. The other settings are read from the keyword arguments given to `Classifier()`.

Notes:

- If the model has already been trained, extend it. Otherwise, create it.

Source code in gramex\ml.py

def train(self, data: pd.DataFrame):
    '''
    Parameters:

        data DataFrame: data to train / re-train the model with
        model_class str: model class to use (default: `sklearn.naive_bayes.BernoulliNB`)
        model_kwargs dict: kwargs to pass to model class constructor (defaults: `{}`)
        output str: output column name (default: last column in training data)
        input list: input column names (default: all columns except `output`)
        labels list: list of possible output values (default: unique `output` in training)

    Notes:
    - If model has already been trained, extend the model. Else create it
    '''
    self.output = vars(self).get('output', data.columns[-1])
    self.input = vars(self).get('input', list(data.columns[:-1]))
    self.model_kwargs = vars(self).get('model_kwargs', {})
    self.labels = vars(self).get('labels', None)
    # If model_kwargs have changed since we trained last, re-train model.
    if not self.trained and hasattr(self, 'model'):
        vars(self).pop('model')
    if not hasattr(self, 'model'):
        # Split it into input (x) and output (y)
        x, y = data[self.input], data[self.output]
        # Transform the data
        from sklearn.preprocessing import StandardScaler

        self.scaler = StandardScaler()
        self.scaler.fit(x)
        # Train the classifier. Partially, if possible
        try:
            clf = locate(self.model_class)(**self.model_kwargs)
        except TypeError:
            raise ValueError('{0} is not a correct model class'.format(self.model_class))
        if self.labels and hasattr(clf, 'partial_fit'):
            try:
                clf.partial_fit(self.scaler.transform(x), y, classes=self.labels)
            except AttributeError:
                raise ValueError('{0} does not support partial fit'.format(self.model_class))
        else:
            clf.fit(self.scaler.transform(x), y)
        self.model = clf
    # Extend the model
    else:
        x, y = data[self.input], data[self.output]
        classes = set(self.model.classes_)
        classes |= set(y)
        self.model.partial_fit(self.scaler.transform(x), y)
    self.trained = True
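
The fallbacks in the source above can be sketched without pandas or scikit-learn. `resolve_columns` is a hypothetical helper written for this illustration, and the sketch assumes `locate` is `pydoc.locate`, since the import is not shown in the listing:

```python
from pydoc import locate


def resolve_columns(columns, settings):
    """Hypothetical helper: apply the same fallbacks train() uses for output and input."""
    output = settings.get('output', columns[-1])
    inputs = settings.get('input', list(columns[:-1]))
    return inputs, output


columns = ['sepal_length', 'sepal_width', 'species']

# No settings: the last column is the output, the rest are inputs
print(resolve_columns(columns, {}))

# The input fallback drops the last column, whichever column is the output
print(resolve_columns(columns, {'output': 'sepal_width'}))

# A dotted path resolves to a class; an unknown path resolves to None
model_class = locate('collections.OrderedDict')
missing = locate('no_such_package_xyz.Model')
print(model_class, missing)
```

Under that assumption, calling the `None` returned for an unknown path raises `TypeError`, which is the case the listing converts into a `ValueError`. The second call also suggests passing `input` explicitly whenever `output` is not the last column.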

`predict(data)`

Return a Series that has the results of the classification of data

Source code in gramex\ml.py

def predict(self, data):
    '''
    Return a Series that has the results of the classification of data
    '''
    # Convert list of lists or numpy arrays into DataFrame. Assume columns are as per input
    if not isinstance(data, pd.DataFrame):
        data = pd.DataFrame(data, columns=self.input)
    # Take only trained input columns
    return self.model.predict(self.scaler.transform(data))
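
Prediction reuses the scaler that was fitted during training, so new rows are standardized with the training mean and standard deviation rather than their own. The sketch below illustrates that idea in plain Python. `fit_scaler` and `transform` are hypothetical helpers, not gramex or scikit-learn functions:

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Hypothetical helper: learn per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Constant columns have zero deviation; use 1.0 to avoid dividing by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows using statistics learned from the training data."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_scaler(train_rows)

# A new row is scaled with the training statistics, not its own
scaled = transform([[2.0, 40.0]], means, stds)
print(scaled)
```

The first value equals the training mean and scales to zero, while the second lies outside the training range and scales to a value above 2.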

`save(path)`

Serializes the model and associated parameters

Source code in gramex\ml.py

def save(self, path):
    '''
    Serializes the model and associated parameters
    '''
    joblib.dump(self, path, compress=9)

`r(code=None, path=None, rel=True, conda=True, convert=True, repo='https://cran.r-project.org/', **kwargs)`

Runs the R script and returns the result.

Parameters:

Name	Type	Description	Default
`code`	`str`	R code to execute.	`None`
`path`	`str`	R script path. Cannot be used if code is specified	`None`
`rel`	`bool`	True treats path as relative to the caller function’s file	`True`
`conda`	`bool`	True overrides R_HOME to use the Conda R	`True`
`convert`	`bool`	True converts R objects to Pandas and vice versa	`True`
`repo`	`str`	CRAN repo URL	`'https://cran.r-project.org/'`

All other keyword arguments are passed to R as global environment variables.

Source code in gramex\ml.py

def r(
    code: str = None,
    path: str = None,
    rel: bool = True,
    conda: bool = True,
    convert: bool = True,
    repo: str = 'https://cran.r-project.org/',
    **kwargs,
):
    '''
    Runs the R script and returns the result.

    Parameters:

        code: R code to execute.
        path: R script path. Cannot be used if code is specified
        rel: True treats path as relative to the caller function's file
        conda: True overrides R_HOME to use the Conda R
        convert: True converts R objects to Pandas and vice versa
        repo: CRAN repo URL

    All other keyword arguments are passed to R as global environment variables
    '''
    # Use Conda R if possible
    if conda:
        r_home = _conda_r_home()
        if r_home:
            os.environ['R_HOME'] = r_home

    # Import the global R session
    try:
        from rpy2.robjects import r, pandas2ri, globalenv
    except ImportError:
        app_log.error('rpy2 not installed. Run "conda install rpy2"')
        raise
    except RuntimeError:
        app_log.error('Cannot find R. Set R_HOME env variable')
        raise

    # Set a repo so that install.packages() need not ask for one
    r('local({r <- getOption("repos"); r["CRAN"] <- "%s"; options(repos = r)})' % repo)

    # Activate or de-activate automatic conversion
    # https://pandas.pydata.org/pandas-docs/version/0.22.0/r_interface.html
    if convert:
        pandas2ri.activate()
    else:
        pandas2ri.deactivate()

    # Pass all other kwargs as global environment variables
    for key, val in kwargs.items():
        globalenv[key] = val

    if code and path:
        raise RuntimeError('Use r(code=) or r(path=...), not both')
    if path:
        # if rel=True, load path relative to parent directory
        if rel:
            stack = inspect.getouterframes(inspect.currentframe(), 2)
            folder = os.path.dirname(os.path.abspath(stack[1][1]))
            path = os.path.join(folder, path)
        result = r.source(path, chdir=True)
        # source() returns a withVisible: $value and $visible. Use only the first
        result = result[0]
    else:
        result = r(code)

    return result
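
With `rel=True`, the script path is resolved against the folder of the file that called `r()`, not the current working directory. A minimal sketch of that lookup, using a hypothetical helper named `resolve_script` that copies the frame inspection from the listing above:

```python
import inspect
import os


def resolve_script(path, rel=True):
    """Hypothetical helper mirroring how r() resolves a script path."""
    if rel:
        # stack[0] is this function; stack[1] is whoever called it
        stack = inspect.getouterframes(inspect.currentframe(), 2)
        folder = os.path.dirname(os.path.abspath(stack[1][1]))
        path = os.path.join(folder, path)
    return path


# Resolved against the folder of the calling file
print(resolve_script('analysis.R'))

# Left untouched when rel=False
print(resolve_script('analysis.R', rel=False))
```

Here `analysis.R` is a placeholder file name. The result depends on where the calling code lives, so a script can sit next to the Python module that uses it.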

`groupmeans(data, groups, numbers, cutoff=0.01, quantile=0.95, minsize=None, weight=None)`

DEPRECATED. Use TopCause() instead.

Yields the significant differences in average between every pair of groups and numbers.

Parameters:

Name	Type	Description	Default
`data`	`pd.DataFrame`	pandas.DataFrame to analyze	required
`groups`	`list`	category column names to group data by	required
`numbers`	`list`	numeric column names to summarize data by	required
`cutoff`	`float`	ignore anything with prob > cutoff. cutoff=None ignores significance checks, speeding it up a LOT.	`0.01`
`quantile`	`float`	number that represents target improvement. Defaults to .95. The `diff` returned is the % impact of everyone moving to the 95th percentile	`0.95`
`minsize`	`int`	each group should contain at least minsize values. If minsize=None, automatically set the minimum size to 1% of the dataset, or 10, whichever is larger.	`None`
`weight`	`str`	name of the column holding weights. If specified, averages are weighted by this column	`None`

Source code in gramex\ml.py

def groupmeans(
    data: pd.DataFrame,
    groups: list,
    numbers: list,
    cutoff: float = 0.01,
    quantile: float = 0.95,
    minsize: int = None,
    weight: str = None,
):
    '''
    **DEPRECATED**. Use TopCause() instead.

    Yields the significant differences in average between every pair of
    groups and numbers.

    Parameters:

        data: pandas.DataFrame to analyze
        groups: category column names to group data by
        numbers: numeric column names to summarize data by
        cutoff: ignore anything with prob > cutoff.
            cutoff=None ignores significance checks, speeding it up a LOT.
        quantile: number that represents target improvement. Defaults to .95.
            The `diff` returned is the % impact of everyone moving to the 95th
            percentile
        minsize: each group should contain at least minsize values.
            If minsize=None, automatically set the minimum size to
            1% of the dataset, or 10, whichever is larger.
    '''
    from scipy.stats.mstats import ttest_ind

    if minsize is None:
        minsize = max(len(data.index) // 100, 10)

    if weight is None:
        means = data[numbers].mean()
    else:
        means = weighted_avg(data, numbers, weight)
    results = []
    for group in groups:
        grouped = data.groupby(group, sort=False)
        if weight is None:
            ave = grouped[numbers].mean()
        else:
            ave = grouped.apply(lambda v: weighted_avg(v, numbers, weight))
        ave['#'] = sizes = grouped.size()
        # Each group should contain at least minsize values
        biggies = sizes[sizes >= minsize].index
        # ... and at least 2 groups overall, to compare.
        if len(biggies) < 2:
            continue
        for number in numbers:
            if number == group:
                continue
            sorted_cats = ave[number][biggies].dropna().sort_values()
            if len(sorted_cats) < 2:
                continue
            lo = data[number][grouped.groups[sorted_cats.index[0]]].values
            hi = data[number][grouped.groups[sorted_cats.index[-1]]].values
            _, prob = ttest_ind(
                np.ma.masked_array(lo, np.isnan(lo)), np.ma.masked_array(hi, np.isnan(hi))
            )
            if prob > cutoff:
                continue
            results.append(
                {
                    'group': group,
                    'number': number,
                    'prob': prob,
                    'gain': sorted_cats.iloc[-1] / means[number] - 1,
                    'biggies': ave.loc[biggies][number].to_dict(),
                    'means': ave[[number, '#']].sort_values(number).to_dict(),
                }
            )

    results = pd.DataFrame(results)
    if len(results) > 0:
        results = results.set_index(['group', 'number'])
    return results.reset_index()  # Flatten multi-index.
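
Two calculations in the listing above are easy to check by hand: the default `minsize`, and the `gain`, which is the best group's average divided by the overall average, minus 1. The sketch below reproduces both in plain Python with made-up data. `default_minsize` and `top_gain` are hypothetical helpers, and the significance test is deliberately left out:

```python
from collections import defaultdict
from statistics import fmean


def default_minsize(n_rows):
    """1% of the dataset, or 10, whichever is larger."""
    return max(n_rows // 100, 10)


def top_gain(rows, group, number, minsize):
    """Hypothetical helper: gain of the best large-enough group over the overall mean."""
    overall = fmean(row[number] for row in rows)
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group]].append(row[number])
    # Keep only groups with at least minsize rows
    big = {key: fmean(vals) for key, vals in buckets.items() if len(vals) >= minsize}
    # At least 2 groups are needed for a comparison
    if len(big) < 2:
        return None
    return max(big.values()) / overall - 1


rows = [
    {'city': 'A', 'sales': 10}, {'city': 'A', 'sales': 10},
    {'city': 'B', 'sales': 30}, {'city': 'B', 'sales': 30},
]
print(default_minsize(500), default_minsize(5000))  # 10 50
print(top_gain(rows, 'city', 'sales', minsize=2))   # 0.5
print(top_gain(rows, 'city', 'sales', minsize=3))   # None
```

A gain of 0.5 means the best group averages 50% more than the dataset as a whole.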

`weighted_avg(data, numeric_cols, weight)`

Computes the weighted average for the specified columns

Source code in gramex\ml.py

def weighted_avg(data, numeric_cols, weight):
    '''
    Computes the weighted average for the specified columns
    '''
    sumprod = data[numeric_cols].multiply(data[weight], axis=0).sum()
    return sumprod / data[weight].sum()
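
The calculation is the sum of value times weight, divided by the sum of the weights. A plain-Python sketch for a single column, with a hypothetical function name and made-up numbers:

```python
def weighted_average(values, weights):
    """Hypothetical single-column equivalent: sum(value * weight) / sum(weight)."""
    sumprod = sum(v * w for v, w in zip(values, weights))
    return sumprod / sum(weights)


# (10 * 1 + 20 * 3) / (1 + 3) = 17.5
print(weighted_average([10, 20], [1, 3]))  # 17.5

# Equal weights reduce to the ordinary mean
print(weighted_average([10, 20], [1, 1]))  # 15.0
```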

`translate(*q, source=None, target=None, key=None, cache=None, api='google', **kwargs)`

Translate strings using the Google Translate API.

translate('Hello', 'World', source='en', target='de', key='...')

returns a DataFrame

source  target  q       t
en      de      Hello   ...
en      de      World   ...

The results can be cached via a cache={...} that has parameters for [gramex.data.filter]. Example:

translate('Hello', key='...', cache={'url': 'translate.xlsx'})

Parameters:

Name	Type	Description	Default
`q`	`str`	one or more strings to translate	`()`
`source`	`str`	2-letter source language (e.g. en, fr, es, hi, cn, etc).	`None`
`target`	`str`	2-letter target language (e.g. en, fr, es, hi, cn, etc).	`None`
`key`	`str`	Google Translate API key	`None`
`cache`	`dict`	kwargs for [gramex.data.filter]. Has keys such as url (required), table (for databases), sheet_name (for Excel), etc.	`None`

Reference: https://cloud.google.com/translate/docs/apis

Source code in gramex\ml.py

def translate(
    *q: str,
    source: str = None,
    target: str = None,
    key: str = None,
    cache: dict = None,
    api: str = 'google',
    **kwargs,
):
    '''
    Translate strings using the Google Translate API.

    ```python
    translate('Hello', 'World', source='en', target='de', key='...')
    ```

    returns a DataFrame

    ```text
    source  target  q       t
    en      de      Hello   ...
    en      de      World   ...
    ```

    The results can be cached via a `cache={...}` that has parameters for
    [gramex.data.filter]. Example:

    ```python
    translate('Hello', key='...', cache={'url': 'translate.xlsx'})
    ```

    Parameters:

        q: one or more strings to translate
        source: 2-letter source language (e.g. en, fr, es, hi, cn, etc).
        target: 2-letter target language (e.g. en, fr, es, hi, cn, etc).
        key: Google Translate API key
        cache: kwargs for [gramex.data.filter]. Has keys such as
            url (required), table (for databases), sheet_name (for Excel), etc.

    Reference: https://cloud.google.com/translate/docs/apis
    '''
    import gramex.data

    if cache is not None and not isinstance(cache, dict):
        raise ValueError('cache= must be a FormHandler dict config, not %r' % cache)

    # Store data in cache with fixed columns: source, target, q, t
    result = pd.DataFrame(columns=['source', 'target', 'q', 't'])
    if not q:
        return result
    original_q = q

    # Fetch from cache, if any
    if cache:
        try:
            args = {'q': q, 'target': [target] * len(q)}
            if source:
                args['source'] = [source] * len(q)
            with _translate_cache_lock:
                result = gramex.data.filter(args=args, **cache)
        except Exception:
            app_log.exception('Cannot query %r in translate cache: %r', args, dict(cache))
        # Remove already cached  results from q
        q = [v for v in q if v not in set(result.get('q', []))]

    if len(q):
        new_data = translate_api[api](q, source, target, key)
        if new_data is not None:
            result = result.append(pd.DataFrame(new_data), sort=False)
            if cache:
                with _translate_cache_lock:
                    gramex.data.insert(id=['source', 'target', 'q'], args=new_data, **cache)

    # Sort results by q
    result['order'] = result['q'].map(original_q.index)
    result.sort_values('order', inplace=True)
    result.drop_duplicates(subset=['q'], inplace=True)
    del result['order']

    return result
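
The listing above does three things around the API call: it skips strings already in the cache, returns rows in the order the strings were requested, and keeps one row per string. A minimal sketch of that bookkeeping with plain lists and dicts follows. `pending_strings` and `order_results` are hypothetical helpers, and the translations are stand-in values rather than API output:

```python
def pending_strings(q, cached):
    """Hypothetical helper: strings that still need to be sent to the API."""
    return [v for v in q if v not in cached]


def order_results(rows, q):
    """Hypothetical helper: sort rows by first position in q, then drop repeats."""
    ordered = sorted(rows, key=lambda row: q.index(row['q']))
    seen, unique = set(), []
    for row in ordered:
        if row['q'] not in seen:
            seen.add(row['q'])
            unique.append(row)
    return unique


q = ('Hello', 'World', 'Hello')
cached = {'World': 'Welt'}  # stand-in for rows already stored in the cache

todo = pending_strings(q, cached)
print(todo)  # ['Hello', 'Hello']

# Cached rows come first, newly fetched rows are appended after them
rows = [{'q': 'World', 't': 'Welt'}] + [{'q': v, 't': 'Hallo'} for v in todo]
print(order_results(rows, q))
```

The final list holds one row for `Hello` followed by one for `World`, matching the order of the request.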

`languagetoolrequest(text, lang='en-us', **kwargs)`

Check grammar by making a request to the LanguageTool server.

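Parameters:

Name	Type	Description	Default
`text`	`str`	Text to check	required
`lang`	`str`	Language. See a list of supported languages here: https://languagetool.org/api/v2/languages	`'en-us'`

The remaining keyword arguments configure the server: `LT_URL` is the URL template that the query string is appended to, `LT_CMD` is the command used to start LanguageTool, and `LT_CWD` is the working directory for that command. `LT_URL` and `LT_CMD` are formatted with the keyword arguments.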

Source code in gramex\ml.py

@coroutine
def languagetoolrequest(text, lang='en-us', **kwargs):
    '''Check grammar by making a request to the LanguageTool server.

    Parameters
    ----------
    text : str
        Text to check
    lang : str, optional
        Language. See a list of supported languages here: https://languagetool.org/api/v2/languages
    '''
    client = AsyncHTTPClient()
    url = kwargs['LT_URL'].format(**kwargs)
    query = urlencode({'language': lang, 'text': text})
    url = url + query
    tries = 2  # See: https://github.com/gramener/gramex/pull/125#discussion_r266200480
    while tries:
        try:
            result = yield client.fetch(url)
            tries = 0
        except ConnectionRefusedError:
            # Start languagetool
            from gramex.cache import daemon

            cmd = [p.format(**kwargs) for p in kwargs['LT_CMD']]
            app_log.info('Starting: %s', ' '.join(cmd))
            if 'proc' not in _languagetool:
                import re

                _languagetool['proc'] = daemon(
                    cmd,
                    cwd=kwargs['LT_CWD'],
                    first_line=re.compile(r"Server started\s*$"),
                    stream=True,
                    timeout=5,
                    buffer_size=512,
                )
            try:
                result = yield client.fetch(url)
                tries = 0
            except ConnectionRefusedError:
                yield sleep(1)
                tries -= 1
    raise Return(result.body)
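
The request URL is built by formatting `LT_URL` with the keyword arguments and appending the encoded query directly, so the template has to end where a query string can start. A minimal sketch with standard-library calls follows. The `LT_URL` value and the `LT_PORT` key are assumptions made for illustration, not gramex defaults:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical settings: note the trailing "?" on the URL template
settings = {'LT_URL': 'http://localhost:{LT_PORT}/v2/check?', 'LT_PORT': 8081}

# Fill placeholders in the template from the same settings dict
url = settings['LT_URL'].format(**settings)
# Append the encoded query exactly as the function does
url += urlencode({'language': 'en-us', 'text': 'This are a test.'})

print(url)
# http://localhost:8081/v2/check?language=en-us&text=This+are+a+test.

# Decoding the query recovers the original text
print(parse_qs(urlsplit(url).query)['text'])  # ['This are a test.']
```

Without the trailing `?` in the template, the query would be glued onto the path and the server would not see the parameters.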
