LDA Explanation

class lda_explainer.LDA_Explainer(num_topics)

An LDA wrapper for explaining a predictor’s predictions. Currently supports only binary predictors. Optionally supports domain-ruled data (see methods API and demo).

Parameters
num_topics : int

Number of topics of the LDA model.

Attributes
num_topics : int

Number of LDA topics that was passed at initialization.

lda : gensim.models.LdaModel

The underlying LDA model. None before fit() is called.

domain_names : list of str

The domain names that were either given to fit() or defaulted to numbering. None before fit() is called in any case.

domain_labels : list of int

The domain labels that were given to fit(), or None if not given. None before fit() is called.

model_confidence : list of float

Model confidence that was passed to fit(). None before fit() is called.

doc_topic_mx : 2-D numpy.ndarray of type float

A (num_documents, num_topics) matrix that describes each document given to fit() as a mixture of topics. None before fit() is called.

sep_topics : list of tuple triplets

Separating topics chosen for each domain, together with their scores. See the documentation of fit() for further explanation. A list of (domain_name, seperating_topic_number, topic_score) triplets. Always contains the “All” domain (even when no domains are given). None before fit() is called.

Methods

display_seperating_topics([topn])

Presents the separating topics (one topic if no domains are given) in a pandas DataFrame.

display_topics([topn, colors])

Presents all topics in a pandas DataFrame.

fit(texts, model_confidence[, …])

Fits the explaining LDA model and evaluates topics (possibly for each domain).

load(fname)

Loads a saved model from files with the prefix fname.

plot_topic_confidence_trends([save_fname, …])

Plots topic-confidence trend lines for all documents and for each domain.

plot_topics_dominance([save_fstr])

Plots the average dominance of each topic in each domain (and in all domains).

save(fname)

Saves the model to files with the prefix fname.

display_seperating_topics(topn=15)

Presents the separating topics (one topic if no domains are given) in a pandas DataFrame.

Parameters
topn : int

Number of words to present for each topic (default = 15).

Returns
pandas.DataFrame

The resulting DataFrame. Each row corresponds to a domain (including “All”) and consists of the domain name as the index, the topic number, its score, and topn entries holding the top topn words of the topic.

Notes

Topics are numbered from 1 to num_topics.

See fit() for an explanation of the score and its sign.

display_topics(topn=15, colors=None)

Presents all topics in a pandas DataFrame.

Parameters
topn : int

Number of words to present for each topic (default = 15).

colors : dict of (str, list), optional

Instructions for coloring specific rows in specific colors. If n is in colors[c], then the font in the row of topic n will be colored in c.
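To illustrate the shape of the colors argument, here is a minimal sketch (not the library's implementation) that inverts such a dictionary into a per-topic lookup. The helper name topic_font_colors is hypothetical, and the behavior when one topic is listed under several colors is not documented here, so the sketch simply lets the later color win.

```python
def topic_font_colors(colors):
    """Map each topic number to the font color requested for its row.

    `colors` follows the documented shape: {color: [topic numbers]}.
    Hypothetical helper for illustration only.
    """
    lookup = {}
    for color, topic_numbers in colors.items():
        for n in topic_numbers:
            lookup[n] = color  # assumption: a later color overrides an earlier one
    return lookup


# Topics are numbered from 1, so this asks for topics 1 and 4 in red and topic 2 in blue.
colors = {"red": [1, 4], "blue": [2]}
row_colors = topic_font_colors(colors)
```

Topics that do not appear in any list are absent from the lookup and would keep the default font color.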

Returns
pandas.io.formats.style.Styler

The resulting table. Each row corresponds to a topic and consists of the topic name (number) as the index and topn entries holding the top topn words of the topic. The table is captioned f"Top {topn} Words".

Notes

Topics are numbered from 1 to num_topics.

fit(texts, model_confidence, domain_labels=None, domain_names=None)

Fits the explaining LDA model and evaluates topics (possibly for each domain). Support for domains is entirely optional; it may seem integral only because the class was designed for domain-ruled data.

Parameters
texts : list of str

List (or array-like) of the classified texts (as strings). This is the input to the LDA model, and possibly to the explained model as well.

model_confidence : list of float

List (or array-like) containing the confidence of the explained model that the text is classified positively (1), for each text.

domain_labels : list of int, optional

Optional list of the domain label for each entry. If given, should contain values in {0, 1, …, num_domains - 1}.

domain_names : list of str, optional

Optional list of the domain names; ignored if domain_labels is not given. Should be of length numpy.max(domain_labels) + 1. The name at index i corresponds to domain i in domain_labels. If domain_labels is given and domain_names is not, simple numbering is used. Cannot contain “All”, as that name is reserved for the aggregate of all domains.

Returns
LDA_Explainer

self

Raises
ValueError

If any of the parameters given is not as specified.

RuntimeError

If model is already fit (e.g., if fit() is called twice or if a loaded model is fit).
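The documented constraints on the fit() inputs can be sketched as a standalone check. This is a hypothetical helper, not the library's own validation: the function name, the error messages, and the equal-length checks are assumptions, while the rules for domain_names come from the parameter descriptions above.

```python
def check_domain_inputs(texts, model_confidence, domain_labels=None, domain_names=None):
    """Raise ValueError if the inputs break the documented constraints.

    Hypothetical helper; the equal-length checks are assumptions.
    """
    if len(texts) != len(model_confidence):
        raise ValueError("texts and model_confidence must have the same length")
    if domain_labels is None:
        return  # domain_names is ignored when no labels are given
    if len(domain_labels) != len(texts):
        raise ValueError("domain_labels must have one entry per text")
    if min(domain_labels) < 0:
        raise ValueError("domain labels must be non-negative")
    if domain_names is not None:
        if len(domain_names) != max(domain_labels) + 1:
            raise ValueError("domain_names must have length max(domain_labels) + 1")
        if "All" in domain_names:
            raise ValueError('"All" is reserved and cannot be a domain name')


texts = ["first text", "second text", "third text"]
confidence = [0.9, 0.2, 0.6]
# Two domains (labels 0 and 1), so exactly two names are expected.
check_domain_inputs(texts, confidence, [0, 1, 0], ["sports", "politics"])
```

Running such a check before calling fit() surfaces malformed inputs early, which matters because a model cannot be fit twice.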

Notes

The texts are preprocessed (lowercasing; removing punctuation, stopwords, and words shorter than 3 characters) before the LDA model is fitted, but the input for the model (if given) is passed as-is.
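The preprocessing steps listed above can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: the stopword list below is a tiny stand-in (the list actually used is not specified here), and punctuation is deleted rather than replaced with spaces.

```python
import string

# Tiny illustrative stopword list; the list actually used is not specified here.
STOPWORDS = {"the", "and", "was", "for", "with", "that", "this"}


def preprocess(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase, strip punctuation, and drop stopwords and words shorter than min_len."""
    lowered = text.lower()
    # Assumption: punctuation characters are deleted, not replaced by spaces.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if len(w) >= min_len and w not in stopwords]


tokens = preprocess("The game was won, in OT, by the Lakers!")
# tokens == ["game", "won", "lakers"]
```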

If \(Z\) is the set of all topics, the separating topic (for each domain) is chosen by

\[z^{sep} = \arg\max_{z \in Z} \left| \sum_{i \in I_{test}} \hat{y}_i \theta_z^i \right|\]

where \(\hat{y}_i\) is the prediction of the explained model for the \(i^{th}\) document (1 for the positive class and -1 for the negative class) and \(\theta_z^i\) is the probability of topic \(z\) in document \(i\).

Note that the sign of the score (without absolute value) is saved.

This definition induces symmetry between positive and negative classes.

While model confidences are required, currently only the predictions are used in practice. Model confidences can therefore be replaced with predictions (0 or 1).
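The selection rule above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name is hypothetical, the sum runs over whatever documents are passed in (standing in for \(I_{test}\)), and confidences are turned into predictions by thresholding at 0.5, which is an assumption.

```python
def separating_topic(predictions, doc_topic_mx):
    """Return (topic_number, signed_score) for the topic with the largest |score|.

    predictions: 0/1 predictions (confidences are thresholded at 0.5, an assumption).
    doc_topic_mx: one row per document, one topic probability per column.
    """
    signs = [1 if p >= 0.5 else -1 for p in predictions]
    num_topics = len(doc_topic_mx[0])
    scores = [
        sum(y * row[z] for y, row in zip(signs, doc_topic_mx))
        for z in range(num_topics)
    ]
    best = max(range(num_topics), key=lambda z: abs(scores[z]))
    return best + 1, scores[best]  # topics are numbered from 1


predictions = [1, 0, 0, 0]
doc_topic_mx = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
]
topic, score = separating_topic(predictions, doc_topic_mx)
# Topic 2 is selected with a score of about -1.8.
```

The negative sign here indicates that topic 2 is dominant mostly in documents predicted as the negative class, which is why the sign is kept alongside the absolute value used for selection.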

classmethod load(fname)

Loads a saved model from files with the prefix fname.

Parameters
fname : str

Optional directory plus file-name prefix. See save() for an example.

Returns
LDA_Explainer

The loaded model.

Raises
FileNotFoundError

If one or more of the files is not found.

plot_topic_confidence_trends([save_fname, colors])

Plots topic-confidence trend lines for all documents and for each domain.

Parameters
save_fname : str, optional

The file name and path for the saved figure (in PNG format). If not specified, the figure will not be saved.

colors : OrderedDict of (str, str) or list of str, optional

A (color_name : color) dictionary or simply a list of colors. These colors determine the trend line colors of the domains. The “All” trend line is always black. If there are more domains than colors, colors are reused rotationally. (default = matplotlib.colors.TABLEAU_COLORS)
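The rotational reuse of colors can be sketched with itertools.cycle. This is a hypothetical helper for illustration, not the library's code; the function name and the returned dictionary shape are assumptions.

```python
from itertools import cycle


def assign_domain_colors(domain_names, colors):
    """Pair each domain with a color, reusing colors in order when they run out.

    `colors` may be a {name: color} dict or a plain list, as documented.
    Hypothetical helper for illustration only.
    """
    palette = list(colors.values()) if isinstance(colors, dict) else list(colors)
    assigned = {"All": "black"}  # the "All" trend line is always black
    for name, color in zip(domain_names, cycle(palette)):
        assigned[name] = color
    return assigned


# Three domains but only two colors: the third domain wraps around to the first color.
domain_colors = assign_domain_colors(["A", "B", "C"], ["tab:blue", "tab:orange"])
```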

Returns
matplotlib.figure.Figure

The figure in which the trend lines are plotted.

Notes

The figure is generated as in [1]: if \(\theta^i_z\) is the probability of topic \(z\) in document \(i\), then for each \(j \in J = \{0.1, 0.2, ..., 1\}\) we take the average confidence of the explained model over \(I^j := \{i \mid \theta^i_z \in (j-0.1, j]\}\), i.e.,

\[f(\theta; j) = \frac{\sum_{i \in I^j} \hat{p}_i}{|I^j|}\]

where \(\hat{p}_i\) is the confidence of the model that document \(i\) belongs to the positive class.

If there are more than 10 domains and no color scheme is specified, colors will be reused for trend lines.
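The binned average \(f(\theta; j)\) can be sketched as follows for one fixed topic. This is an illustration under assumptions rather than the library's code: the function name is hypothetical, a probability of exactly 0 is placed in the first bin, and empty bins are reported as None.

```python
import math


def confidence_trend(topic_probs, confidences, num_bins=10):
    """Average confidence per topic-probability bin (j - 0.1, j] for j = 0.1, ..., 1.

    Returns a list of (j, average) pairs; average is None for an empty bin.
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for theta, p in zip(topic_probs, confidences):
        # Rounding guards against float noise such as 0.3 * 10 == 3.0000000000000004.
        scaled = round(theta * num_bins, 9)
        # Bin k covers (k / num_bins, (k + 1) / num_bins]; theta == 0 goes to bin 0 (assumption).
        k = min(max(math.ceil(scaled) - 1, 0), num_bins - 1)
        sums[k] += p
        counts[k] += 1
    return [
        ((k + 1) / num_bins, sums[k] / counts[k] if counts[k] else None)
        for k in range(num_bins)
    ]


topic_probs = [0.05, 0.08, 0.35, 0.4, 0.95]  # probability of one topic in each document
confidences = [0.2, 0.4, 0.5, 0.7, 0.9]      # model confidence for the positive class
trend = confidence_trend(topic_probs, confidences)
```

Note that 0.4 falls into the bin (0.3, 0.4] because the intervals are closed on the right, so it is averaged together with 0.35.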

References

[1] Oved, N., Feder, A. and Reichart, R., 2020. Predicting In-Game Actions from Interviews of NBA Players. Computational Linguistics, pp.1-46.

plot_topics_dominance(save_fstr=None)

Plots the average dominance of each topic in each domain (and in all domains). Draws a separate lollipop chart for each domain and one for all domains.

Parameters
save_fstr : str, optional

A formattable string for saving the figures. Must contain “%s”. E.g., “topic_dominance_%s.png”. Domain names will replace “%s”.

Returns
list of matplotlib.figure.Figure

List of the drawn figures.

Notes

Width of the figures is 6 * num_topics / 20 inches, height is 4 inches.

“.png” suffix is added to save_fstr if missing.
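The naming and sizing rules in these notes can be sketched as one small function. The helper is hypothetical and only mirrors the documented behavior; in particular, raising ValueError when "%s" is missing is an assumption.

```python
def dominance_figure_spec(save_fstr, domain_name, num_topics):
    """Return (file_name, (width, height)) for one domain's chart.

    Hypothetical helper; mirrors the documented naming and sizing rules.
    """
    if "%s" not in save_fstr:
        raise ValueError('save_fstr must contain "%s"')  # assumption: error type
    if not save_fstr.endswith(".png"):
        save_fstr += ".png"  # the suffix is added when missing
    file_name = save_fstr % domain_name
    figsize = (6 * num_topics / 20, 4)  # inches
    return file_name, figsize


file_name, figsize = dominance_figure_spec("topic_dominance_%s", "All", 30)
# file_name == "topic_dominance_All.png", figsize == (9.0, 4)
```

With 30 topics the figure is 9 inches wide, so the width grows linearly with the number of topics while the height stays fixed.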

save(fname)

Saves the model to files with the prefix fname.

Parameters
fname : str

Optional directory plus file-name prefix. For example, if fname = “./saved_models/explainer”, multiple files will be written in “./saved_models/” with the prefix “explainer”.

Raises
RuntimeError

If fit() was not called for the model earlier.

Notes

Nine files are saved.