LDA Explanation
-
class lda_explainer.LDA_Explainer(num_topics)
An LDA wrapper for explaining a predictor’s predictions. Currently supports only binary predictors. Optionally supports domain-ruled data (see methods API and demo).
- Parameters
- num_topics : int
Number of topics of the LDA model.
- Attributes
- num_topics : int
Number of LDA topics that was passed at initialization.
- lda : gensim.models.LdaModel
The underlying LDA model. None before fit() is called.
- domain_names : list of str
The domain names that were either given to fit() or defaulted to numbering. None before fit() is called in any case.
- domain_labels : list of int
The domain labels that were given to fit(), or None if not given. None before fit() is called.
- model_confidence : list of float
Model confidence that was passed to fit(). None before fit() is called.
- doc_topic_mx : 2-D numpy.ndarray of type float
A (num_documents, num_topics) matrix that describes each document given to fit() as a mixture of topics. None before fit() is called.
- sep_topics : list of tuple triplets
Separating topics chosen for each domain and their scores. See the documentation of fit() for further explanation. A list of (domain_name, separating_topic_number, topic_score) triplets. Always contains the “All” domain (even when no domains are given). None before fit() is called.
Methods
display_seperating_topics([topn])
Presents the separating topics (one topic if no domains are given) in a pandas DataFrame.
display_topics([topn, colors])
Presents all topics in a pandas DataFrame.
fit(texts, model_confidence[, …])
Fits the explaining LDA model and evaluates topics (possibly for each domain).
load(fname)
Loads a saved model from files with the prefix fname.
plot_topic_confidence_trends([save_fname, …])
Plots the topic-confidence trend line for all documents and for each domain.
plot_topics_dominance([save_fstr])
Plots the average dominance of each topic in each domain (and in all domains).
save(fname)
Saves the model to files with the prefix fname.
-
display_seperating_topics(topn=15)
Presents the separating topics (one topic if no domains are given) in a pandas DataFrame.
- Parameters
- topn : int
Number of words to present for each topic (default = 15).
- Returns
- pandas.DataFrame
The resulting DataFrame. Each row corresponds to a domain (including “All”) and consists of the domain name as the index, the topic number, its score, and topn entries holding the top topn words of the topic.
Notes
Topics are numbered from 1 to num_topics.
See fit() for an explanation of the score and its sign.
-
display_topics(topn=15, colors=None)
Presents all topics in a pandas DataFrame.
- Parameters
- topn : int
Number of words to present for each topic (default = 15).
- colors : dict of (str, list), optional
Instructions for coloring specific rows in specific colors. If n is in colors[c], then the font in the row of topic n will be colored in c.
- Returns
- pandas.io.formats.style.Styler
The resulting table. Each row corresponds to a topic and consists of the topic name (number) as the index and topn entries holding the top topn words of the topic. The table is captioned f"Top {topn} Words".
Notes
Topics are numbered from 1 to num_topics.
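As an illustration of the colors argument, the sketch below builds a (color : list of topic numbers) dictionary from a list shaped like the sep_topics attribute, coloring topics by the sign of their score. The helper name colors_from_sep_topics, the sample triplets, and the color names "green" and "red" are assumptions made for this example; which color specifications display_topics actually accepts is not stated here.

```python
def colors_from_sep_topics(sep_topics, positive="green", negative="red"):
    """Group separating topic numbers by the sign of their score.

    sep_topics: list of (domain_name, topic_number, topic_score) triplets.
    Returns a dict shaped like the `colors` argument: color -> topic numbers.
    """
    colors = {positive: [], negative: []}
    for _domain, topic, score in sep_topics:
        key = positive if score >= 0 else negative
        if topic not in colors[key]:  # avoid listing a topic twice per color
            colors[key].append(topic)
    return colors


# Hypothetical sample data in the documented triplet format.
sep_topics = [("All", 3, -1.2), ("Domain A", 1, 0.8), ("Domain B", 3, -0.5)]
colors = colors_from_sep_topics(sep_topics)
```

A topic whose score is positive in one domain and negative in another would appear under both colors; how that case is rendered is not specified, so it is best avoided or resolved by hand.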
-
fit(texts, model_confidence, domain_labels=None, domain_names=None)
Fits the explaining LDA model and evaluates topics (possibly for each domain). Support for domains is completely optional, although it may seem integral because the class was designed for domain-ruled data.
- Parameters
- texts : list of str
List (or array-like) of the classified texts (as strings). These are used as input to the LDA model and possibly to the explained model as well.
- model_confidence : list of float
List (or array-like) containing, for each text, the confidence of the explained model that the text is classified positively (1).
- domain_labels : list of int, optional
Optional list of the domain label for each entry. If given, should contain values in {0, 1, …, num_domains - 1}.
- domain_names : list of str, optional
Optional list of the domain names, ignored if domain_labels is not given. Should be of length numpy.max(domain_labels) + 1. The name at index i corresponds to domain i in domain_labels. If domain_labels is given and domain_names is not, simple numbering is used. Cannot contain “All”, as it is reserved for the union of all domains.
- Returns
- LDA_Explainer
self
- Raises
- ValueError
If any of the parameters given is not as specified.
- RuntimeError
If model is already fit (e.g., if fit() is called twice or if a loaded model is fit).
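The constraints on domain_labels and domain_names described above can be sketched as a small validation helper. This is only an illustration of the documented rules, not the library's code: the function name resolve_domain_names, the error messages, and the exact form of the default numbering (plain "0", "1", ...) are assumptions.

```python
def resolve_domain_names(domain_labels, domain_names=None):
    """Return one name per domain, following the documented constraints."""
    labels = list(domain_labels)
    if not labels or any(int(l) != l or l < 0 for l in labels):
        raise ValueError("domain_labels must be non-negative integers")
    num_domains = max(labels) + 1
    if domain_names is None:
        # Assumed default: simple numbering rendered as strings.
        return [str(i) for i in range(num_domains)]
    names = list(domain_names)
    if len(names) != num_domains:
        raise ValueError("domain_names must have length max(domain_labels) + 1")
    if "All" in names:
        raise ValueError('"All" is reserved for the union of all domains')
    return names


default_names = resolve_domain_names([0, 1, 1, 2])
given_names = resolve_domain_names([0, 1, 1, 2], ["news", "sports", "tech"])
```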
Notes
The texts are preprocessed (lower-casing and removing punctuation, stopwords, and words shorter than 3 characters) before fitting the LDA model, but the input for the model (if given) is passed as-is.
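A minimal sketch of preprocessing along these lines is shown below. The stopword list is a tiny hypothetical stand-in, and replacing punctuation with spaces (rather than deleting it) is an assumption; the actual stopword source and tokenization used by the class are not documented here.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "and", "was", "for", "with", "that", "this"}


def preprocess(text):
    """Lower-case, strip punctuation, drop stopwords and words shorter than 3."""
    # Replace every punctuation character with a space (an assumption).
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]


tokens = preprocess("The game was great, and the team played with energy!")
```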
If \(Z\) is the set of all topics, the separating topic (for each domain) is chosen by
\[z^{sep} = \arg\max_{z \in Z} \left| \sum_{i \in I_{test}} \hat{y}_i \theta_z^i \right|\]
where \(\hat{y}_i\) is the prediction of the explained model for the \(i^{th}\) document (1 for the positive class and -1 for the negative class) and \(\theta_z^i\) is the probability of topic \(z\) in document \(i\).
Note that the sign of the score (without the absolute value) is saved.
This definition induces symmetry between the positive and negative classes.
While model confidence values are required, currently only the predictions are used in practice. Model confidence can therefore be replaced with predictions (0 or 1).
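The selection rule can be sketched in a few lines of plain Python. This is an illustration of the formula rather than the class's implementation: the function name separating_topic, the 0.5 threshold used to turn a confidence into a prediction, and the choice to sum over all supplied documents are assumptions.

```python
def separating_topic(doc_topic_mx, model_confidence, threshold=0.5):
    """Return (topic_number, signed_score) of the separating topic.

    doc_topic_mx: rows are documents, columns are topic probabilities.
    model_confidence: confidence of the positive class for each document.
    threshold: assumed cut-off for turning a confidence into a prediction.
    """
    # Map each confidence to a prediction in {+1, -1}.
    y_hat = [1 if p >= threshold else -1 for p in model_confidence]
    num_topics = len(doc_topic_mx[0])
    # Signed score of topic z: sum over documents of y_hat_i * theta_z^i.
    scores = [
        sum(y * row[z] for y, row in zip(y_hat, doc_topic_mx))
        for z in range(num_topics)
    ]
    # The separating topic maximizes the absolute score; the sign is kept.
    best = max(range(num_topics), key=lambda z: abs(scores[z]))
    return best + 1, scores[best]  # topics are numbered from 1


# Hypothetical data: topic 1 dominates positives, topic 3 dominates negatives.
doc_topic_mx = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
model_confidence = [0.9, 0.8, 0.2, 0.1]
topic, score = separating_topic(doc_topic_mx, model_confidence)
```

In this toy data the third topic has the largest absolute score and a negative sign, meaning it is associated with the negative class.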
-
classmethod load(fname)
Loads a saved model from files with the prefix fname.
- Parameters
- fname : str
File name prefix, optionally preceded by a directory. See save() for an example.
- Returns
- LDA_Explainer
The loaded model.
- Raises
- FileNotFoundError
If one or more of the files is not found.
-
plot_topic_confidence_trends(save_fname=None, colors=None)
Plots the topic-confidence trend line for all documents and for each domain.
- Parameters
- save_fname : str, optional
The file name and path for the saved figure (in PNG format). If not specified, the figure will not be saved.
- colors : OrderedDict of (str, str) or list of str, optional
A (color_name : color) dictionary or just a list of colors. These colors determine the trend line colors of the domains. The “All” trend line is always black. If there are more domains than colors, colors are reused cyclically. (default = matplotlib.colors.TABLEAU_COLORS)
- Returns
- matplotlib.figure.Figure
The figure in which the trend lines are plotted.
Notes
The figure is generated as in [1]: If \(\theta^i_z\) is the probability of topic \(z\) in document \(i\), then for each \(j \in J = \{0.1, 0.2, ..., 1\}\) we take the average confidence of the explained model over \(I^j := \{i \mid \theta^i_z \in (j-0.1, j]\}\), i.e.,
\[f(\theta_z; j) = \frac{\sum_{i \in I^j} \hat{p}_i}{|I^j|}\]
where \(\hat{p}_i\) is the confidence of the model that document \(i\) belongs to the positive class.
If there are more than 10 domains and no color scheme is specified, colors will be reused for trend lines.
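The binning behind the trend line can be sketched as follows. The function name confidence_trend and the use of None for empty bins are assumptions made for this example; how the class handles empty bins, or documents whose topic probability is exactly 0, is not stated here.

```python
import math


def confidence_trend(theta_z, confidence):
    """Average confidence per topic-probability bin.

    Returns a list of 10 entries; entry k-1 covers the bin ((k-1)/10, k/10]
    and is None when no document falls into that bin.
    """
    sums = [0.0] * 10
    counts = [0] * 10
    for theta, p in zip(theta_z, confidence):
        # Index k with theta in ((k-1)/10, k/10]; rounding guards against
        # floating-point noise at bin edges (e.g. 0.7 * 10).
        k = math.ceil(round(theta * 10, 9))
        if 1 <= k <= 10:  # theta == 0 falls outside every half-open bin
            sums[k - 1] += p
            counts[k - 1] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical topic probabilities and positive-class confidences.
trend = confidence_trend([0.05, 0.08, 0.55, 1.0, 0.0], [0.2, 0.4, 0.9, 0.6, 0.5])
```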
References
- [1]
Oved, N., Feder, A. and Reichart, R., 2020. Predicting In-Game Actions from Interviews of NBA Players. Computational Linguistics, pp.1-46.
-
plot_topics_dominance(save_fstr=None)
Plots the average dominance of each topic in each domain (and in all domains). Draws a separate lollipop chart for each domain and one for all domains.
- Parameters
- save_fstr : str, optional
A formattable string for saving the figures. Must contain “%s”, e.g., “topic_dominance_%s.png”. Domain names will replace “%s”.
- Returns
- list of matplotlib.figure.Figure
List of the drawn figures.
Notes
Width of the figures is 6 * num_topics / 20 inches, height is 4 inches.
“.png” suffix is added to save_fstr if missing.
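The sizing and naming rules in these notes can be sketched as two small helpers. The function names are hypothetical, and raising ValueError when “%s” is missing is an assumption about how an invalid save_fstr would be handled.

```python
def dominance_figsize(num_topics):
    """Figure size in inches: width scales with the number of topics."""
    return (6 * num_topics / 20, 4)


def dominance_filename(save_fstr, domain_name):
    """Build the output file name for one domain from the format string."""
    if "%s" not in save_fstr:
        raise ValueError('save_fstr must contain "%s"')
    if not save_fstr.endswith(".png"):
        save_fstr += ".png"  # suffix is added when missing
    # str.replace avoids errors if the string holds other "%" characters.
    return save_fstr.replace("%s", domain_name)


size = dominance_figsize(50)
name = dominance_filename("topic_dominance_%s", "All")
```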
-
save(fname)
Saves the model to files with the prefix fname.
- Parameters
- fname : str
File name prefix, optionally preceded by a directory. For example, if fname = “./saved_models/explainer”, multiple files in “./saved_models/” will be written with the prefix “explainer”.
- Raises
- RuntimeError
If fit() was not called for the model earlier.
Notes
9 files are saved.
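Whether save() creates a missing target directory is not stated here, so a defensive approach is to create it before saving. The sketch below splits fname into its directory and prefix parts and creates the directory; the helper name prepare_save_prefix is hypothetical, and the call to save() is only indicated in a comment.

```python
import os
import tempfile


def prepare_save_prefix(fname):
    """Split fname into (directory, prefix) and create the directory if needed."""
    directory, prefix = os.path.split(fname)
    if directory:
        os.makedirs(directory, exist_ok=True)
    return directory, prefix


# Demonstrated inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "saved_models", "explainer")
    directory, prefix = prepare_save_prefix(fname)
    directory_exists = os.path.isdir(directory)
    # A fitted explainer could now be saved with: explainer.save(fname)
```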