Visualize Word Embeddings Using Text Scatter Plots
This example shows how to visualize word embeddings using 2-D and 3-D t-SNE and text scatter plots.
Word embeddings map words in a vocabulary to real vectors. The vectors attempt to capture the semantics of the words, so that similar words have similar vectors. Some embeddings also capture relationships between words like "Italy is to France as Rome is to Paris". In vector form, this relationship is Italy - Rome + Paris = France.
Load Pretrained Word Embedding
Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.
emb = fastTextWordEmbedding
emb =
  wordEmbedding with properties:

     Dimension: 300
    Vocabulary: [1×999994 string]
Explore the word embedding using word2vec and vec2word. Convert the words Italy, Rome, and Paris to vectors using word2vec.
italy = word2vec(emb,"Italy");
rome = word2vec(emb,"Rome");
paris = word2vec(emb,"Paris");
Compute the vector given by italy - rome + paris. This vector encapsulates the semantic meaning of the word Italy, without the semantics of the word Rome, and also includes the semantics of the word Paris.
vec = italy - rome + paris
vec = 1×300 single row vector

    0.1606 -0.0690 0.1183 -0.0349 0.0672 0.0907 -0.1820 -0.0080 0.0320 -0.0936
    -0.0329 -0.1548 0.1737 -0.0937 -0.1619 0.0777 -0.0843 0.0066 0.0600 -0.2059
    -0.0268 0.1350 -0.0900 0.0314 0.0686 -0.0338 0.1841 0.1708 0.0276 0.0719
    -0.1667 0.0231 0.0265 -0.1773 -0.1135 0.1018 -0.2339 0.1008 0.1057 -0.1118
    0.2891 -0.0358 0.0911 -0.0958 -0.0184 0.0740 -0.1081 0.0826 0.0463 0.0043
Find the closest word in the embedding to vec using vec2word.
word = vec2word(emb,vec)
word = "France"
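To see which other words lie near the computed vector, vec2word can also return several neighbors together with their distances. The following is a minimal sketch that repeats the setup so it runs on its own; the neighborhood size of 5 and the variable names k, nearestWords, and dist are illustrative assumptions, not part of the original example.

```matlab
% Setup repeated from the steps above (requires the fastText support package).
emb = fastTextWordEmbedding;
vec = word2vec(emb,"Italy") - word2vec(emb,"Rome") + word2vec(emb,"Paris");

% Ask for the k nearest vocabulary words instead of only the closest one.
k = 5;  % illustrative choice
[nearestWords,dist] = vec2word(emb,vec,k);

% List each candidate word next to its distance to vec.
table(nearestWords(:),dist(:),'VariableNames',{'Word','Distance'})
```

Inspecting the distances gives a rough sense of how clearly the top match stands out from the runners-up.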
Create 2-D Text Scatter Plot
Visualize the word embedding by creating a 2-D text scatter plot using tsne and textscatter.
Convert the first 5000 words to vectors using word2vec. V is a matrix in which each row is a word vector of length 300.
words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
size(V)
ans = 1×2

        5000         300
Embed the word vectors in two-dimensional space using tsne. This function may take a few minutes to run. To display the convergence information, set the 'Verbose' name-value pair to 1.
XY = tsne(V);
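Because t-SNE starts from a random initialization, the layout generally differs from run to run. The sketch below repeats the setup, fixes the random seed, and turns on the convergence display. Fixing the seed with rng is an addition here rather than part of the original example, and the axis limits used later for zooming were chosen for the original run, so they may not frame the same words.

```matlab
% Setup repeated from the steps above (requires the fastText support package).
emb = fastTextWordEmbedding;
words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);

rng('default')             % assumption: fix the seed so the layout is repeatable
XY = tsne(V,'Verbose',1);  % print convergence information while optimizing
size(XY)                   % one row of 2-D coordinates per word
```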
Plot the words at the coordinates specified by XY in a 2-D text scatter plot. For readability, textscatter by default does not label every input word; it displays markers in place of the omitted labels.
figure
textscatter(XY,words)
title("Word Embedding t-SNE Plot")
Zoom in on a section of the plot.
xlim([-18 -5])
ylim([11 21])
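The proportion of words that textscatter labels can be adjusted with the 'TextDensityPercentage' option. The following sketch repeats the setup and raises the label density; the value 90 and the plot title are illustrative choices, and higher values may produce overlapping labels.

```matlab
% Setup repeated from the steps above (requires the fastText support package).
emb = fastTextWordEmbedding;
words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
XY = tsne(V);

% Label a larger share of the points than the default setting does.
figure
ts = textscatter(XY,words,'TextDensityPercentage',90);  % 90 is an illustrative value
title("Word Embedding t-SNE Plot (Denser Labels)")
```

Setting the value to 100 labels as many words as possible, while lower values replace more labels with markers.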
Create 3-D Text Scatter Plot
Visualize the word embedding by creating a 3-D text scatter plot using tsne and textscatter3.
Convert the first 5000 words to vectors using word2vec. V is a matrix in which each row is a word vector of length 300.
words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
size(V)
ans = 1×2

        5000         300
Embed the word vectors in a three-dimensional space using tsne by specifying the number of dimensions to be three. This function may take a few minutes to run. To display the convergence information, set the 'Verbose' name-value pair to 1.
XYZ = tsne(V,'NumDimensions',3);
Plot the words at the coordinates specified by XYZ in a 3-D text scatter plot.
figure
ts = textscatter3(XYZ,words);
title("3-D Word Embedding t-SNE Plot")
Zoom in on a section of the plot.
xlim([12.04 19.48])
ylim([-2.66 3.40])
zlim([10.03 14.53])
Perform Cluster Analysis
Convert the first 5000 words to vectors using word2vec. V is a matrix in which each row is a word vector of length 300.
words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);
size(V)
ans = 1×2

        5000         300
Discover 25 clusters using kmeans.
cidx = kmeans(V,25,'dist','sqeuclidean');
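The k-means result depends on randomly chosen starting centroids, so cluster assignments can change between runs. The sketch below is one way to make the clustering more stable; the fixed seed, the 'Replicates' value of 5, and the sumd output are additions for illustration. It writes the option name out in full as 'Distance', on the assumption that the 'dist' used above is an abbreviation of that name.

```matlab
% Setup repeated from the steps above (requires the fastText support package).
emb = fastTextWordEmbedding;
words = emb.Vocabulary(1:5000);
V = word2vec(emb,words);

rng('default')  % assumption: fix the seed so the clustering is repeatable

% Run k-means several times from different starting centroids and keep
% the solution with the lowest total within-cluster distance.
[cidx,~,sumd] = kmeans(V,25,'Distance','sqeuclidean','Replicates',5);

% Total within-cluster sum of squared distances for the returned solution.
sum(sumd)
```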
Visualize the clusters in a text scatter plot using the 2-D t-SNE data coordinates calculated earlier.
figure
textscatter(XY,words,'ColorData',categorical(cidx));
title("Word Embedding t-SNE Plot")
Zoom in on a section of the plot.
xlim([13 24])
ylim([-47 -35])
See Also
readWordEmbedding | textscatter | textscatter3 | word2vec | vec2word | wordEmbedding | tokenizedDocument