
fastTextWordEmbedding

Pretrained fastText word embedding

Description


emb = fastTextWordEmbedding returns a 300-dimensional pretrained word embedding for 1 million English words.

This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, the function provides a download link.

Examples


Download and install the Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding support package.

Type fastTextWordEmbedding at the command line.

fastTextWordEmbedding

If the Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding support package is not installed, then the function provides a link to the required support package in the Add-On Explorer. To install the support package, click the link, and then click Install. Check that the installation is successful by typing emb = fastTextWordEmbedding at the command line.

emb = fastTextWordEmbedding
emb = 
  wordEmbedding with properties:

     Dimension: 300
    Vocabulary: [1×1000000 string]

If the required support package is installed, then the function returns a wordEmbedding object.
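As a quick check, which is an illustrative addition rather than part of the original example, you can confirm that a given word appears in the embedding vocabulary by searching the Vocabulary property. The word used here is arbitrary.

% Check whether a word is in the (case-sensitive) vocabulary.
tf = ismember("france",emb.Vocabulary);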

Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding
emb = 
  wordEmbedding with properties:

     Dimension: 300
    Vocabulary: [1×1000000 string]

Map the words "Italy", "Rome", and "Paris" to vectors using word2vec.

italy = word2vec(emb,"Italy");
rome = word2vec(emb,"Rome");
paris = word2vec(emb,"Paris");

Map the vector italy - rome + paris to a word using vec2word.

word = vec2word(emb,italy - rome + paris)
word = "France"
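The vec2word function can also return several of the nearest words at once. The call below is a sketch extending the example, assuming the optional third argument k and the second output of distances that vec2word supports; it is not output from the original page.

% Return the five closest words to the analogy vector, with their distances.
[words,dist] = vec2word(emb,italy - rome + paris,5);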

Convert an array of tokenized documents to sequences of word vectors using a pretrained word embedding.

Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;

Load the factory reports data and create a tokenizedDocument array.

filename ="factoryReports.csv"; data = readtable(filename,'TextType','string'); textData = data.Description; documents = tokenizedDocument(textData);
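As an optional check that is not part of the original example, you can preview a few of the tokenized documents before converting them.

% Display the first three tokenized documents.
documents(1:3)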

Convert the documents to sequences of word vectors using doc2sequence. By default, the doc2sequence function left-pads the sequences to have the same length. When converting large collections of documents using a high-dimensional word embedding, padding can require large amounts of memory. To prevent the function from padding the data, set the 'PaddingDirection' option to 'none'. Alternatively, you can control the amount of padding using the 'Length' option.

sequences = doc2sequence(emb,documents,'PaddingDirection','none');

View the sizes of the first 10 sequences. Each sequence is a D-by-S matrix, where D is the embedding dimension and S is the number of word vectors in the sequence.

sequences(1:10)
ans = 10×1 cell array
    {300×10 single}
    {300×11 single}
    {300×11 single}
    {300×6  single}
    {300×5  single}
    {300×10 single}
    {300×8  single}
    {300×9  single}
    {300×7  single}
    {300×13 single}
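If your downstream model requires sequences of equal length, you can instead pad or truncate every sequence to a fixed size with the 'Length' option mentioned above. This sketch uses an arbitrary illustrative length of 10.

% Left-pad or truncate every sequence to exactly 10 word vectors.
sequencesFixed = doc2sequence(emb,documents,'Length',10);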

Output Arguments


Pretrained word embedding, returned as a wordEmbedding object.

Version History

Introduced in R2018a