LDA Explanation

class lda_explainer.LDA_Explainer(num_topics)

An LDA wrapper for explaining a predictor’s predictions. Currently supports only binary predictors. Optionally supports domain-ruled data (see methods API and demo).

Parameters

num_topics (int): Number of topics of the LDA model.

Attributes

num_topics (int): Number of LDA topics passed at initialization.
lda (gensim.models.LdaModel): The underlying LDA model. None before fit() is called.
domain_names (list of str): The domain names that were either given to fit() or defaulted to numbering. In either case, None before fit() is called.
domain_labels (list of int): The domain labels that were given to fit(), or None if not given. None before fit() is called.
model_confidence (list of float): The model confidence values passed to fit(). None before fit() is called.
doc_topic_mx (2-D numpy.ndarray of float): A (num_documents, num_topics) matrix that describes each document given to fit() as a mixture of topics. None before fit() is called.
sep_topics (list of tuple triplets): The separating topics chosen for each domain, with their scores, as a list of (domain_name, seperating_topic_number, topic_score) triplets. See the documentation of fit() for further explanation. Always contains the “All” domain (even when no domains are given). None before fit() is called.

Methods

`display_seperating_topics`([topn])	Presents the separating topics (one topic if no domains are given) in a pandas DataFrame.
`display_topics`([topn, colors])	Presents all topics in a pandas DataFrame.
`fit`(texts, model_confidence[, …])	Fits the explaining LDA model and evaluates topics (possibly for each domain).
`load`(fname)	Loads a saved model from files with the prefix fname.
`plot_topic_confidence_trends`([save_fname, …])	Plots topic-confidence trend line for all documents and for each domain.
`plot_topics_dominance`([save_fstr])	Plots the average dominance of each topic in each domain (and in all domains).
`save`(fname)	Saves the model to files with the prefix fname.

display_seperating_topics(topn=15)

Presents the separating topics (one topic if no domains are given) in a pandas DataFrame.

Parameters

topn (int): Number of words to present for each topic (default = 15).

Returns

pandas.DataFrame: The resulting DataFrame. Each row corresponds to a domain (including “All”) and consists of the domain name as the index, the topic number, its score, and topn entries holding the top topn words of the topic.

Notes

Topics are numbered from 1 to num_topics.

See fit() for an explanation of the score and its sign.

display_topics(topn=15, colors=None)

Presents all topics in a pandas DataFrame.

Parameters

topn (int): Number of words to present for each topic (default = 15).
colors (dict of (str, list), optional): Instructions for coloring specific rows in specific colors. If n is in colors[c], then the font in the row of topic n is colored c.
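The colors mapping goes from a color to the topics it applies to, so finding the color of a given row means inverting it. The following is a minimal sketch of that inversion, not the class's own code; `topic_font_colors` is a hypothetical helper, and the rule that a later color wins when a topic is listed twice is an assumption.

```python
def topic_font_colors(colors):
    """Invert {color: [topic numbers]} into {topic number: color}."""
    by_topic = {}
    for color, topic_numbers in colors.items():
        for n in topic_numbers:
            # Assumption: a later color overrides an earlier one for the same topic.
            by_topic[n] = color
    return by_topic


# Topics 1 and 4 in red, topic 2 in blue; unlisted topics are absent from the result.
print(topic_font_colors({"red": [1, 4], "blue": [2]}))
```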

Returns

pandas.io.formats.style.Styler: The resulting table. Each row corresponds to a topic and consists of the topic name (number) as the index and topn entries holding the top topn words of the topic. The table is captioned f"Top {topn} Words".

Notes

Topics are numbered from 1 to num_topics.

fit(texts, model_confidence, domain_labels=None, domain_names=None)

Fits the explaining LDA model and evaluates topics (possibly for each domain). Support for domains is entirely optional; it only seems integral because the class was designed for domain-ruled data.

Parameters

texts (list of str): List (or array-like) of the classified texts (as strings). This is used as input to the LDA model and possibly to the explained model as well.
model_confidence (list of float): List (or array-like) containing, for each text, the confidence of the explained model that the text is classified positively (1).
domain_labels (list of int, optional): The domain label of each entry. If given, should contain values in {0, 1, …, num_domains - 1}.
domain_names (list of str, optional): The domain names; ignored if domain_labels is not given. Should be of length numpy.max(domain_labels) + 1. The name at index i corresponds to domain i in domain_labels. If domain_labels is given and domain_names is not, simple numbering is used. Cannot contain “All”, as that name is reserved for all the domains together.
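The constraints on domain_labels and domain_names can be sketched as a small check. This is only an illustration of the rules stated above, not the class's own validation; `resolve_domain_names` is a hypothetical helper, and rendering "simple numbering" as the strings "0", "1", and so on is an assumption.

```python
def resolve_domain_names(domain_labels, domain_names=None):
    """Check domain_names against domain_labels, or fall back to numbering."""
    num_domains = max(domain_labels) + 1
    if domain_names is None:
        # Assumption: "simple numbering" is rendered here as "0", "1", ...
        return [str(i) for i in range(num_domains)]
    if len(domain_names) != num_domains:
        raise ValueError(
            f"expected {num_domains} domain names, got {len(domain_names)}"
        )
    if "All" in domain_names:
        raise ValueError('"All" is reserved for all the domains together')
    return list(domain_names)


print(resolve_domain_names([0, 1, 1, 2]))
print(resolve_domain_names([0, 1, 0], ["sports", "news"]))
```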

Returns

LDA_Explainer: self

Raises

ValueError: If any of the parameters given is not as specified.
RuntimeError: If the model is already fitted (e.g., if fit() is called twice or if a loaded model is fit).

Notes

The texts are preprocessed (lowercasing, then removing punctuation, stopwords and words shorter than 3 characters) before the LDA model is fitted, but input for the explained model (if given) is passed as-is.
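The preprocessing steps listed above can be sketched in plain Python. This is an illustration only: the stopword list below is a tiny hypothetical stand-in for whatever list the class uses, and whitespace tokenization and the order of the steps are assumptions.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "and", "was", "for", "with"}


def preprocess(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase, strip punctuation, drop stopwords and words shorter than min_len."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Assumption: tokens are split on whitespace.
    return [w for w in no_punct.split() if len(w) >= min_len and w not in stopwords]


print(preprocess("The game was GREAT, and we won it!"))
```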

If \(Z\) is the set of all topics, the separating topic (for each domain) is chosen by

\[z^{sep} = \arg\max_{z \in Z} \left| \sum_{i \in I_{test}} \hat{y}_i \theta_z^i \right|\]

where \(\hat{y}_i\) is the prediction of the explained model for the \(i^{th}\) document (1 for the positive class and -1 for the negative class) and \(\theta_z^i\) is the probability of topic \(z\) in document \(i\).

Note that the sign of the score (without absolute value) is saved.

This definition induces symmetry between positive and negative classes.

Although model confidence values are required, currently only the predictions are used in practice. Model confidence can therefore be replaced with predictions (0 or 1).
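The selection rule can be illustrated with a short, dependency-free sketch. It is not the class's implementation: `separating_topic` is a hypothetical helper that takes a document-topic matrix as a list of rows and 0/1 predictions, maps the predictions to -1/+1, and returns the 1-based topic number together with its signed score.

```python
def separating_topic(doc_topic_mx, predictions):
    """Return (topic_number, signed_score), with topics numbered from 1."""
    num_topics = len(doc_topic_mx[0])
    scores = [0.0] * num_topics
    for theta, pred in zip(doc_topic_mx, predictions):
        y = 1 if pred == 1 else -1  # map the 0/1 prediction to -1/+1
        for z in range(num_topics):
            scores[z] += y * theta[z]
    # Pick the topic with the largest absolute score, but keep the sign.
    best = max(range(num_topics), key=lambda z: abs(scores[z]))
    return best + 1, scores[best]


doc_topic_mx = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
]
predictions = [1, 1, 0, 0]
print(separating_topic(doc_topic_mx, predictions))
```

In this toy input the third topic dominates the negatively classified documents, so it is selected with a negative score, which shows how the sign indicates the class a topic is associated with.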

classmethod load(fname)

Loads a saved model from files with the prefix fname.

Parameters

fname (str): File name prefix, optionally preceded by a directory path. See save() for an example.

Returns

LDA_Explainer: The loaded model.

Raises

FileNotFoundError: If one or more of the files is not found.

plot_topic_confidence_trends(save_fname=None, colors=None)

Plots topic-confidence trend line for all documents and for each domain.

Parameters

save_fname (str, optional): The file name and path for the saved figure (in PNG format). If not specified, the figure is not saved.
colors (OrderedDict of (str, str) or list of str, optional): A (color_name : color) dictionary or a plain list of colors. These determine the trend line colors of the domains. The “All” trend line is always black. If there are more domains than colors, colors are reused cyclically (default = matplotlib.colors.TABLEAU_COLORS).
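The color assignment described for colors can be sketched as follows. This is a hedged illustration rather than the plotting code itself; `trend_line_colors` is a hypothetical helper that accepts either a dictionary or a list, fixes “All” to black, and cycles through the palette when there are more domains than colors.

```python
from itertools import cycle


def trend_line_colors(domain_names, colors):
    """Map each domain to a trend line color, reusing colors cyclically."""
    # A (color_name: color) dictionary contributes its values; a list is used as-is.
    palette = list(colors.values()) if isinstance(colors, dict) else list(colors)
    assigned = {"All": "black"}  # the "All" trend line is always black
    for name, color in zip(domain_names, cycle(palette)):
        assigned[name] = color
    return assigned


print(trend_line_colors(["sports", "news", "tech"], ["red", "blue"]))
```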

Returns

matplotlib.figure.Figure: The figure in which the trend lines are plotted.

Notes

The figure is generated as in [1]: if \(\theta_z^i\) is the probability of topic \(z\) in document \(i\), then for each \(j \in J = \{0.1, 0.2, \ldots, 1\}\) we take the average confidence of the explained model over \(I^j := \{i \mid \theta_z^i \in (j-0.1, j]\}\), i.e.,

\[f(\theta; j) = \frac{\sum_{i \in I^j} \hat{p}_i}{|I^j|}\]

where \(\hat{p}_i\) is the confidence of the model that document \(i\) belongs to the positive class.
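The binning behind the trend line can be sketched without any plotting. The helper below is hypothetical and only illustrates the definition: each topic probability is placed in one of ten half-open bins \((j-0.1, j]\), and the confidences in each bin are averaged. Returning None for an empty bin is an assumption, and a probability of exactly 0 falls in no bin under this definition.

```python
import math


def binned_confidence(topic_probs, confidences, num_bins=10):
    """Average confidence per bin; bin k (1-based) covers ((k-1)/num_bins, k/num_bins]."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for theta, p in zip(topic_probs, confidences):
        # Rounding guards against float noise such as 0.3 * 10 == 3.0000000000000004.
        k = math.ceil(round(theta * num_bins, 9))
        if 1 <= k <= num_bins:
            sums[k - 1] += p
            counts[k - 1] += 1
    # Assumption: bins with no documents are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


topic_probs = [0.05, 0.1, 0.15, 0.95, 1.0, 0.0]
confidences = [0.2, 0.4, 0.9, 0.7, 0.5, 0.3]
print(binned_confidence(topic_probs, confidences))
```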

If there are more than 10 domains and no color scheme is specified, colors will be reused for trend lines.

References

1: Oved, N., Feder, A. and Reichart, R., 2020. Predicting In-Game Actions from Interviews of NBA Players. Computational Linguistics, pp.1-46.

plot_topics_dominance(save_fstr=None)

Plots the average dominance of each topic in each domain (and in all domains). Draws a separate lollipop chart for each domain and one for all domains.

Parameters

save_fstr (str, optional): A formattable string for saving the figures. Must contain “%s”, e.g., “topic_dominance_%s.png”. Domain names will replace “%s”.

Returns

list of matplotlib.figure.Figure: List of the drawn figures.

Notes

Width of the figures is 6 * num_topics / 20 inches, height is 4 inches.

The “.png” suffix is added to save_fstr if it is missing.
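The file naming rule can be sketched as follows. `dominance_filenames` is a hypothetical helper, not part of the class; raising ValueError when “%s” is missing and using “All” as the name for the all-domains figure are both assumptions.

```python
def dominance_filenames(save_fstr, domain_names):
    """Build one PNG file name per domain from a template containing "%s"."""
    if "%s" not in save_fstr:
        # Assumption: the exception type for a bad template is not documented.
        raise ValueError('save_fstr must contain "%s"')
    if not save_fstr.endswith(".png"):
        save_fstr += ".png"
    # str.replace avoids surprises if the template contains other "%" characters.
    return [save_fstr.replace("%s", name) for name in domain_names]


print(dominance_filenames("topic_dominance_%s", ["All", "sports"]))
```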

save(fname)

Saves the model to files with the prefix fname.

Parameters

fname (str): File name prefix, optionally preceded by a directory path. For example, if fname = “./saved_models/explainer”, multiple files will be written in “./saved_models/” with the prefix “explainer”.

Raises

RuntimeError: If fit() was not called for the model earlier.

Notes

Nine files are saved.