This notebook is available here; its purpose is to demonstrate the functionality of the LDA_Explainer class. For further details, see the API documentation.
from LDA_Explanation.lda_explainer import LDA_Explainer
The data is taken from the CausaLM Datasets, and the confidence was calculated with a BERT sentiment classifier from Transformers. The specific CSV file we use is available here.
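For reference, here is a minimal sketch of how such a confidence column could be produced with the Transformers pipeline. The checkpoint name and the sign convention are assumptions for illustration; the notebook's actual classifier may differ.

from transformers import pipeline

# Assumed checkpoint for illustration; the notebook's actual BERT classifier may differ.
clf = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

def predict_with_confidence(text):
    """Return the predicted sentiment (+1 or -1) and the classifier's confidence."""
    out = clf(text, truncation=True)[0]   # e.g. {'label': 'POSITIVE', 'score': 0.98}
    y_hat = 1 if out['label'] == 'POSITIVE' else -1
    return y_hat, out['score']            # 'score' is the softmax probability of the predicted label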
The data is organized by domains (reviews for movies, books, electronics and kitchen products), but the module works the same without domains given (try passing domain_labels = None at initialization). The last domain (DVD) is dropped since it is too similar to the first (movies).
import pandas as pd  # For presenting the data, not necessary.
df = pd.read_csv('reviews_with_confidence.csv')
df = df[df['domain_label'] != 4]  # Drop the DVD domain (label 4).
df
explainer = LDA_Explainer(num_topics = 30).fit(
    texts = df['review'],
    model_confidence = df['confidence'],
    domain_labels = df['domain_label'],
    domain_names = ['Movies', 'Books', 'Electronics', 'Kitchen']
)
# explainer.save('./saved_models/lda30') # Creates multiple files in "./saved_models/" with the prefix "lda30".
explainer = LDA_Explainer.load('./saved_models/lda30')
Highlighted below are topics that seem to represent the domains. The highlighting is manual and was made specifically for this model; fitting a new model will result in different topics.
explainer.display_topics(
    topn = 13,  # 13 to fit nicely in a webpage
    colors = {  # Manual coloring
        'red': [2, 14],     # Movies
        'green': [10, 16],  # Books
        'blue': [4, 8, 9],  # Electronics
        'purple': [3, 21]   # Kitchen
    }
)
For each domain $d$ (and for all domains combined), we choose the separating topic $$ z^d = \arg\max_{z \in Z} \left| \sum_{i \in I^d} \hat{y}_i \theta_z^i \right|, $$ where $Z$ is the set of all topics, $I^d$ is the set of documents belonging to domain $d$, $\hat{y}_i$ is the prediction of the explained model ($1$ or $-1$), and $\theta_z^i$ is the probability (dominance) of topic $z$ in document $i$.
This measure is meant to encapsulate the separating ability of the topics (i.e., how much the presence of the topic in the document affects the prediction).
The score values in the following table are the scores of the separating topics with their sign, i.e., $\sum_{i \in I^d} \hat{y}_i \theta_{z^d}^i$.
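To make the definition concrete, here is a sketch of the arg-max in NumPy. The array names are illustrative; this is not the library's internal code.

import numpy as np

def separating_topic(theta, y_hat):
    """Pick the separating topic for one domain.

    theta: (n_docs, n_topics) matrix of topic dominances theta_z^i.
    y_hat: (n_docs,) array of model predictions, +1 or -1.
    Returns the topic index z^d and its signed score.
    """
    scores = y_hat @ theta                # one signed score per topic: sum_i y_hat_i * theta_z^i
    z_d = int(np.argmax(np.abs(scores)))  # maximize the absolute value
    return z_d, scores[z_d]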
explainer.display_seperating_topics(topn = 15)
As in Reichart et al. 2020 (see documentation), for each topic probability $j \in \{0.1, 0.2, \ldots, 1\}$, the confidence is averaged over $\big\{ i \in I^d : \theta^i_{z^d} \in (j - 0.1, j] \big\}$.
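A sketch of this binned averaging under the definitions above (the variable names are ours, not the library's):

import numpy as np

def binned_confidence(theta_zd, confidence, n_bins=10):
    """Average model confidence per topic-probability bin.

    theta_zd: per-document probability of the separating topic, in [0, 1].
    confidence: per-document model confidence.
    """
    theta_zd = np.asarray(theta_zd)
    confidence = np.asarray(confidence)
    means = []
    for j in range(1, n_bins + 1):
        hi = j / n_bins
        lo = hi - 1.0 / n_bins
        mask = (theta_zd > lo) & (theta_zd <= hi)  # the bin (j - 0.1, j] from the text
        means.append(confidence[mask].mean() if mask.any() else np.nan)
    return means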
figure = explainer.plot_topic_confidence_trends()
Finally, just because we can, we also plot the dominance of each topic.
figures = explainer.plot_topics_dominance()
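Assuming the returned objects are standard Matplotlib figures (which the snippets above do not guarantee), they can be saved to disk as usual:

# Assuming Matplotlib Figure objects; adjust if the library returns something else.
figure.savefig('topic_confidence_trends.png', dpi=150, bbox_inches='tight')
for i, fig in enumerate(figures):
    fig.savefig(f'topics_dominance_{i}.png', dpi=150, bbox_inches='tight')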