Main Content

Visualize Word Embeddings Using Text Scatter Plots

这个例子展示了如何可视化词嵌入s using 2-D and 3-D t-SNE and text scatter plots.

Word embeddings map words in a vocabulary to real vectors. The vectors attempt to capture the semantics of the words, so that similar words have similar vectors. Some embeddings also capture relationships between words like "Italy is to France as Rome is to Paris". In vector form, this relationship is Italy - Rome + Paris = France .

Load Pretrained Word Embedding

Load a pretrained word embedding usingfastTextWordEmbedding. This function requires Text Analytics Toolbox™ Modelfor fastText English 16 Billion Token Word Embeddingsupport package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding
emb = wordEmbedding with properties: Dimension: 300 Vocabulary: [1×999994 string]

Explore the word embedding usingword2vecandvec2word. Convert the wordsItaly,Rome, andParisto vectors usingword2vec.

italy = word2vec(emb,"Italy"); rome = word2vec(emb,"Rome"); paris = word2vec(emb,"Paris");

Compute the vector given byitaly - rome + paris. This vector encapsulates the semantic meaning of the wordItaly, without the semantics of the wordRome, and also includes the semantics of the wordParis.

vec = italy - rome + paris
vec =1×300 single row vector0.1606 -0.0690 0.1183 -0.0349 0.0672 0.0907 -0.1820 -0.0080 0.0320 -0.0936 -0.0329 -0.1548 0.1737 -0.0937 -0.1619 0.0777 -0.0843 0.0066 0.0600 -0.2059 -0.0268 0.1350 -0.0900 0.0314 0.0686 -0.0338 0.1841 0.1708 0.0276 0.0719 -0.1667 0.0231 0.0265 -0.1773 -0.1135 0.1018 -0.2339 0.1008 0.1057 -0.1118 0.2891 -0.0358 0.0911 -0.0958 -0.0184 0.0740 -0.1081 0.0826 0.0463 0.0043

Find the closest words in the embedding tovecusingvec2word.

word = vec2word(emb,vec)
word = "France"

Create 2-D Text Scatter Plot

Visualize the word embedding by creating a 2-D text scatter plot usingtsneandtextscatter.

Convert the first 5000 words to vectors usingword2vec.Vis a matrix of word vectors of length 300.

words = emb.Vocabulary(1:5000); V = word2vec(emb,words); size(V)
ans =1×25000 300

Embed the word vectors in two-dimensional space usingtsne. This function may take a few minutes to run. If you want to display the convergence information, then set the'Verbose'name-value pair to 1.

XY = tsne(V);

Plot the words at the coordinates specified byXYin a 2-D text scatter plot. For readability,textscatter, by default, does not display all of the input words and displays markers instead.

figure textscatter(XY,words) title("Word Embedding t-SNE Plot")

Zoom in on a section of the plot.

xlim([-18 -5]) ylim([11 21])

Create 3-D Text Scatter Plot

Visualize the word embedding by creating a 3-D text scatter plot usingtsneandtextscatter.

Convert the first 5000 words to vectors usingword2vec.Vis a matrix of word vectors of length 300.

words = emb.Vocabulary(1:5000); V = word2vec(emb,words); size(V)
ans =1×25000 300

Embed the word vectors in a three-dimensional space usingtsneby specifying the number of dimensions to be three. This function may take a few minutes to run. If you want to display the convergence information, then you can set the'Verbose'name-value pair to 1.

XYZ = tsne(V,'NumDimensions',3);

Plot the words at the coordinates specified by XYZ in a 3-D text scatter plot.

figure ts = textscatter3(XYZ,words); title("3-D Word Embedding t-SNE Plot")

Zoom in on a section of the plot.

xlim([12.04 19.48]) ylim([-2.66 3.40]) zlim([10.03 14.53])

Perform Cluster Analysis

Convert the first 5000 words to vectors usingword2vec.Vis a matrix of word vectors of length 300.

words = emb.Vocabulary(1:5000); V = word2vec(emb,words); size(V)
ans =1×25000 300

Discover 25 clusters usingkmeans.

cidx = kmeans(V,25,'dist','sqeuclidean');

Visualize the clusters in a text scatter plot using the 2-D t-SNE data coordinates calculated earlier.

figure textscatter(XY,words,'ColorData',categorical(cidx)); title("Word Embedding t-SNE Plot")

Zoom in on a section of the plot.

xlim([13 24]) ylim([-47 -35])

See Also

||||||

Related Topics