forked from tmikolov/word2vec
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
ilyasu
committed
Jul 30, 2013
1 parent
e2bd524
commit efc5e10
Showing
1 changed file
with
29 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
|
||
We provide an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram model (SG). | ||
|
||
Given a text corpus, the word2vec program learns a vector for every word using the Continuous | ||
Bag-of-Words or the Skip-Gram model. The user needs to specify the following: | ||
- desired vector dimensionality | ||
- the size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model | ||
- Whether hierarchical sampling is used | ||
- Whether negative sampling is used, and if so, how many negative samples should be used | ||
- A threshold for downsampling frequent words | ||
- Number of threads to use | ||
- Whether to save the vectors in a text format or a binary format | ||
|
||
|
||
Thus the programs require a very modest number of parameter. In particular, learning rates | ||
need not be selected. | ||
|
||
The file demo-word.sh downloads a small (100MB) text corpus, and trains a 200-dimensional CBOW model | ||
with a window of size 5, negative sampling with 5 negative samples, a downsampling of 1e-3, 12 threads, and binary files. | ||
|
||
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 5 -negative 5 -hs 0 -sample 1e-3 -threads 12 -binary 1 | ||
|
||
|
||
Then, to evaluate the fidelity of our vectors, we can run the command, which will run | ||
a battery of tests on the vectors to determine their fidelity. The tests evaluate | ||
the vectors' ability to perform linear analogies. | ||
|
||
./distance vectors.bin | ||
|