Use Gensim to train Word2Vec Model

Table of contents
  1. Use Gensim to train a word2vec model
  2. Task 1 Analogy Prediction
  3. Task 2 Clustering Task

View source code on Github page

Use Gensim to train a word2vec model

Use the Gensim library to train a word embedding model on the text8 dataset. The training is based on the skip-gram architecture.

Reference:

[1] https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/

[2] https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors

# import modules and set up logging
from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load up unzipped corpus from http://mattmahoney.net/dc/text8.zip
sentences = word2vec.Text8Corpus('data/text8')
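# train the model; default window=5
# note: Word2Vec defaults to sg=0 (CBOW); pass sg=1 to train with skip-gram
# (gensim >= 4.0 renames the size parameter to vector_size)
model = word2vec.Word2Vec(sentences, size=200)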
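To make the role of the window concrete, here is a minimal sketch of how skip-gram turns a token stream into (center, context) training pairs. The function name `skipgram_pairs` and the toy sentence are made up for illustration. Gensim's real implementation also shrinks the window randomly and subsamples frequent words, which this sketch ignores.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for each token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Toy sentence (hypothetical); text8 itself is one long lowercase token stream
tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

With `window=1` each inner word contributes two pairs and each edge word one. The skip-gram model is then trained to predict the context word from the center word.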

Task 1 Analogy Prediction

  • Input: a pair of words (a, b) and a third word c
  • Output: the fourth word d such that “a is to b as c is to d”

For example

  • “man -> woman” => “king -> queen”
  • “Japan -> Japanese” => “Australia -> Australian”
  • “Paris -> France” => “Beijing -> China”

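The task can be stated mathematically:

Given word vectors $w_a, w_b, w_c$, find a word vector $w_d$ satisfying:

$$w_b - w_a \approx w_d - w_c$$

Then we solve

$$d = \arg\max_{x \in V \setminus \{a, b, c\}} \cos(w_b - w_a + w_c,\ w_x)$$

where $V$ is the model vocabulary.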
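The search can be sketched on a handful of hand-made vectors. Everything below (the 2-d `toy_vectors`, the helper names) is hypothetical and only illustrates the arg-max over cosine similarity. It is not how the trained model stores its data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d "embeddings": axis 0 ~ royalty, axis 1 ~ gender
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}


def toy_analogy(a, b, c, vectors):
    """Return the word whose vector is closest (by cosine) to w_b - w_a + w_c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = toy_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

Here the target vector is `[1.0, -1.0]`, which coincides with the toy vector for "queen".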

# Use gensim api
test_groups = [
    ('man', 'woman', 'king'),
    ('small', 'smaller', 'big'),
    ('italy', 'italian', 'china'),
    ('japan', 'tokyo', 'china'),
    ('cool', 'coolest', 'cold'),
    ('dark', 'darkest', 'easy'),
    ('listening', 'listened', 'moving'),
    ('looking', 'looked', 'swimming'),
    ('playing', 'played', 'taking'),
    ('increase', 'increases', 'decrease'),
    ('predict', 'predicts', 'shuffle'),
    ('provide', 'provides', 'search'),
    ('say', 'says', 'speak'),
    ('Austria', 'Austrian', 'Sweden'),
    ('Cambodia', 'Cambodian', 'Australia'),
    ('paying', 'paid', 'striking'),
    ('running', 'ran', 'taking'),
    ('selling', 'sold', 'thinking'),
    ('shrinking', 'shrank', 'jumping')
]

for a, b, c in test_groups:
    d = model.wv.most_similar(positive=[b.lower(), c.lower()], negative=[a.lower()], topn=1)[0][0]
    print("%s -> %s like %s -> %s" % (a, b, c, d))
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
japan -> tokyo like china -> shanghai
cool -> coolest like cold -> kargil
dark -> darkest like easy -> customers
listening -> listened like moving -> penetrated
looking -> looked like swimming -> rides
playing -> played like taking -> took
increase -> increases like decrease -> decreases
predict -> predicts like shuffle -> gutter
provide -> provides like search -> tutorial
say -> says like speak -> speaks
Austria -> Austrian like Sweden -> swedish
Cambodia -> Cambodian like Australia -> canada
paying -> paid like striking -> noticeable
running -> ran like taking -> took
selling -> sold like thinking -> thought
shrinking -> shrank like jumping -> kicking
words = model.wv.vocab.keys()  # all words in the vocabulary
word2vec = model.wv  # look up a word vector with word2vec[word]; note this rebinds the imported word2vec module name
# Define a naive most_similar() function
import numpy as np

def cal_cosine(vec_a, vec_b):
    """
    Compute the cosine similarity between vec_a and vec_b

    Input: two word vectors vec_a and vec_b
    Output: the cosine similarity of the two vectors
    """
    numerator = np.dot(vec_a, vec_b)
    denominator = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return numerator / denominator

def find_analogy(a, b, c):
    """
    Find the analogy word

    Input: a pair of words <a, b> and a third word c
    Output: the fourth word d such that "a is to b as c is to d"
    """
    a, b, c = a.lower(), b.lower(), c.lower()  # lowercase all inputs
    target = word2vec[b] - word2vec[a] + word2vec[c]

    d = None
    max_cosine = -1  # cosine similarity is never below -1
    for word in words:
        if word in [a, b, c]:
            continue
        cosine = cal_cosine(target, word2vec[word])
        # keep the word with the largest cosine similarity seen so far
        if cosine > max_cosine:
            max_cosine = cosine
            d = word
    return d

for w in test_groups:
    print("%s -> %s like %s -> %s" % (w[0], w[1], w[2], find_analogy(w[0], w[1], w[2])))
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
japan -> tokyo like china -> beijing
cool -> coolest like cold -> falklands
dark -> darkest like easy -> difficult
listening -> listened like moving -> moves
looking -> looked like swimming -> perished
playing -> played like taking -> took
increase -> increases like decrease -> decreases
predict -> predicts like shuffle -> lodewijk
provide -> provides like search -> summarizes
say -> says like speak -> speaks
Austria -> Austrian like Sweden -> swedish
Cambodia -> Cambodian like Australia -> canada
paying -> paid like striking -> caught
running -> ran like taking -> took
selling -> sold like thinking -> realized
shrinking -> shrank like jumping -> kicking
# Testing data is downloaded from https://code.google.com/archive/p/word2vec/source/default/source
# load the testing data set
total_line = 0
count = 0
no_key_num = 0
with open("data/questions-words.txt", "r") as f:
    for line in f:
        v = line.strip().split(" ")
        if v[0] == ':':
            # lines beginning with ":" are section headers; skip them
            continue
        try:
            pred_word = model.wv.most_similar(positive=[v[1].lower(), v[2].lower()], negative=[v[0].lower()], topn=1)[0][0]
        except KeyError:
            # a word does not appear in the model vocabulary
            no_key_num += 1
        else:
            # v[3] is the right answer; it is compared as written, so a
            # capitalized answer such as "Greece" never matches the lowercase
            # prediction (compare against v[3].lower() to count those)
            if pred_word == v[3]:
                count += 1
            total_line += 1

acc = count / total_line
print("The accuracy on the test data = %s" % acc)
print("%s pairs of data were not used because of no matched word in the model" % no_key_num)
The accuracy on the test data = 0.160793194874061
1440 pairs of data were not used because of no matched word in the model
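One detail worth keeping in mind when scoring by hand: text8 is entirely lowercase, while questions-words.txt contains capitalized entries such as "Athens Greece Baghdad Iraq". The sketch below lowercases the whole line before comparing. The function `score_analogies`, the lookup-table predictor, and the sample lines are hypothetical stand-ins so the example runs without a trained model.

```python
def score_analogies(lines, predict):
    """Score 'a b c d' lines, skipping ': section' headers.

    Returns (correct, total, skipped); skipped counts lines where
    predict raised KeyError (a word missing from the vocabulary).
    """
    correct = total = skipped = 0
    for line in lines:
        parts = line.strip().lower().split()
        if not parts or parts[0] == ":":
            continue
        a, b, c, expected = parts
        try:
            predicted = predict(a, b, c)
        except KeyError:
            skipped += 1
            continue
        total += 1
        correct += predicted == expected
    return correct, total, skipped


# Hypothetical stand-in for model.wv.most_similar(...): a lookup table
toy_answers = {
    ("athens", "greece", "baghdad"): "iraq",
    ("man", "woman", "king"): "queen",
}


def toy_predict(a, b, c):
    return toy_answers[(a, b, c)]  # raises KeyError for unknown triples


sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    ": family",
    "man woman king queen",
    "boy girl uncle aunt",
]
correct, total, skipped = score_analogies(sample, toy_predict)
print(correct, total, skipped)
```

With a real model, `toy_predict` would be replaced by a small wrapper around `model.wv.most_similar`.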
# Gensim API to evaluate the analogy prediction task
model.wv.evaluate_word_analogies(questions="data/questions-words.txt")
C:\Users\yelbee\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `accuracy` (Method will be removed in 4.0.0, use self.evaluate_word_analogies() instead).
  """Entry point for launching an IPython kernel.
2020-04-05 17:06:55,747 : INFO : capital-common-countries: 34.2% (173/506)
2020-04-05 17:07:01,033 : INFO : capital-world: 18.7% (271/1452)
2020-04-05 17:07:02,009 : INFO : currency: 12.7% (34/268)
2020-04-05 17:07:07,788 : INFO : city-in-state: 11.0% (173/1571)
2020-04-05 17:07:08,987 : INFO : family: 81.4% (249/306)
2020-04-05 17:07:11,881 : INFO : gram1-adjective-to-adverb: 12.3% (93/756)
2020-04-05 17:07:12,983 : INFO : gram2-opposite: 17.3% (53/306)
2020-04-05 17:07:18,473 : INFO : gram3-comparative: 63.9% (805/1260)
2020-04-05 17:07:20,422 : INFO : gram4-superlative: 34.4% (174/506)
2020-04-05 17:07:24,023 : INFO : gram5-present-participle: 30.6% (304/992)
2020-04-05 17:07:30,591 : INFO : gram6-nationality-adjective: 56.1% (769/1371)
2020-04-05 17:07:36,784 : INFO : gram7-past-tense: 27.0% (359/1332)
2020-04-05 17:07:40,974 : INFO : gram8-plural: 43.6% (433/992)
2020-04-05 17:07:43,601 : INFO : gram9-plural-verbs: 35.1% (228/650)
2020-04-05 17:07:43,602 : INFO : total: 33.6% (4118/12268)

Above, we implemented our own naive analogy search based on cosine similarity. It runs more slowly than the Gensim API. Most predictions of the two methods agree, for example:

man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese

However, some predictions differ:

Naive Method: japan -> tokyo like china -> (beijing)
Gensim API Method: japan -> tokyo like china -> (shanghai)

Naive Method: selling -> sold like thinking -> (realized)
Gensim API Method: selling -> sold like thinking -> (thought)

In the first case, the Naive Method is right: Beijing is the capital of China, just as Tokyo is the capital of Japan. The second pair asks for the past tense of a verb, and there the Gensim API Method gives the correct answer (thought).
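One plausible reason the two methods can disagree is normalization. As far as I can tell, Gensim's most_similar scales each input vector to unit length before adding and subtracting, while the naive method combines the raw vectors. The sketch below uses made-up 2-d vectors (all names are hypothetical) to show that this alone can change the winner.

```python
import math


def unit(v):
    """Scale v to unit L2 length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(u, v):
    return sum(x * y for x, y in zip(unit(u), unit(v)))


def best_match(query, vectors, exclude):
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Hypothetical vectors; "c" is much longer than "a" and "b"
vecs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [10.0, 0.0],
    "near_c": [1.0, 0.1],
    "near_b": [0.1, 1.0],
}
a, b, c = vecs["a"], vecs["b"], vecs["c"]

# Naive method: combine raw vectors
raw_query = [y - x + z for x, y, z in zip(a, b, c)]
# Normalized variant: combine unit-length vectors
unit_query = [y - x + z for x, y, z in zip(unit(a), unit(b), unit(c))]

raw_pick = best_match(raw_query, vecs, exclude={"a", "b", "c"})
unit_pick = best_match(unit_query, vecs, exclude={"a", "b", "c"})
print(raw_pick, unit_pick)
```

In the raw combination the long vector "c" dominates the query, so the result stays close to "c". After normalization all three inputs carry equal weight.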

Finally, we evaluated the model on the Google analogy test set (questions-words.txt). Gensim's evaluation reports a total accuracy of 33.6% (4118/12268).

Task 2 Clustering Task

In this task, we use the K-Means algorithm to cluster the words into 100 groups.

Input:

  • $N$ vectors $x_1, x_2, \cdots, x_N \in \mathbb{R}^n$
  • $k$: the number of clusters we want

Output:

  • $c_i (i=1,2,\cdots,N)$: the cluster that $x_i$ belongs to
  • $z_j (j=1,2, \cdots, k)$: the representative vector of each cluster

Initialization: Initialize $z_1,z_2, \cdots, z_k$ by choosing $k$ vectors from $x_1, x_2, \cdots, x_N$ randomly

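Step 1: Given $z_1,z_2, \cdots, z_k$, compute

$$c_i = \arg\min_{j \in \{1, \cdots, k\}} \|x_i - z_j\|_2^2, \quad i = 1, 2, \cdots, N$$

and define

$$G_j = \{\, i : c_i = j \,\}, \quad j = 1, 2, \cdots, k$$

Step 2: Given $G_1,G_2,\cdots,G_k$, compute

$$z_j = \frac{1}{|G_j|} \sum_{i \in G_j} x_i, \quad j = 1, 2, \cdots, k$$

Go back to Step 1 until convergence.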
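The two alternating steps can be sketched in plain Python. This is a minimal illustration on toy 2-d points, not the scikit-learn implementation used in this post. The function name `kmeans`, the seed, and the data are assumptions made for the example.

```python
import random


def sq_dist(u, v):
    """Squared L2 distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initialization: pick k of the input points as the first centers
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest center
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Step 2: move every center to the mean of its group
        for j in range(k):
            group = [p for p, c in zip(points, assignment) if c == j]
            if group:  # keep the old center if a group is empty
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return assignment, centers


# Two well-separated toy blobs
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

Which blob gets label 0 depends on the random initialization, so only the grouping is meaningful, not the label values.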

# here you load vectors for each word in your model
word2vec_vectors = model.wv.vectors

# Build a dict that maps each vector index to its word
ind_to_word = {model.wv.vocab[word].index : word for word in model.wv.vocab}
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, random_state=0).fit(word2vec_vectors)
kmeans.labels_
array([92,  4,  4, ...,  6,  6,  6])
# Build a cluster structure
# Dict{cluster_num : [word list]}
clusters = {}
for index, n in enumerate(kmeans.labels_):
    if n not in clusters:
        clusters[n] = [ind_to_word[index]]
    else:
        clusters[n].append(ind_to_word[index])
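The same grouping can be written a little more compactly with `collections.defaultdict`. The sketch below uses a made-up label list and index-to-word mapping in place of `kmeans.labels_` and `ind_to_word`, so it runs on its own.

```python
from collections import defaultdict


def group_by_label(labels, index_to_word):
    """Map each cluster label to the list of words assigned to it."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index_to_word[index])
    return dict(groups)


# Hypothetical stand-ins for kmeans.labels_ and ind_to_word
toy_labels = [2, 0, 0, 1, 2]
toy_index_to_word = {0: "the", 1: "of", 2: "and", 3: "one", 4: "in"}

toy_clusters = group_by_label(toy_labels, toy_index_to_word)
print(toy_clusters)
```

Words stay in index order inside each cluster. For a Gensim vocabulary that is frequency order, which is why the first five words of each cluster tend to be its most frequent ones.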
for n, cluster in clusters.items():
    print("%s -> %s" % (n, cluster[:5]))
92 -> ['the', 'in', 'for', 'on', 'first']
4 -> ['of', 'and', 'with', 'making', 'permanent']
73 -> ['one', 'zero', 'nine', 'two', 'eight']
10 -> ['a', 'as', 'or', 'an', 'being']
38 -> ['to', 'can', 'may', 'would', 'will']
21 -> ['is', 'means', 'uses', 'includes', 'remains']
78 -> ['s', 'mark', 'don', 'brown', 'ray']
42 -> ['was', 'became', 'left', 'began', 'took']
77 -> ['by', 'under', 'initially', 'subsequently', 'jordan']
1 -> ['that', 'it', 'this', 'which', 'also']
20 -> ['are', 'use', 'include', 'related', 'believe']
84 -> ['from', 'according', 'addition', 'giving', 'dedicated']
97 -> ['his', 'he', 'her', 'him', 'she']
52 -> ['be', 'make', 'become', 'take', 'run']
79 -> ['at', 'city', 'home', 'near', 'center']
61 -> ['have', 'has', 'had', 'having']
93 -> ['were', 'war', 'against', 'military', 'force']
37 -> ['other', 'their', 'some', 'all', 'such']
96 -> ['its', 'power', 'due', 'control', 'development']
81 -> ['more', 'most', 'very', 'less', 'particularly']
51 -> ['been', 'made', 'led', 'developed', 'included']
99 -> ['used', 'known', 'called', 'found', 'considered']
72 -> ['there', 'held', 'created', 'produced', 'established']
68 -> ['american', 'john', 'james', 'william', 'david']
48 -> ['time', 'years', 'year', 'day', 'times']
32 -> ['see', 'links', 'external', 'list', 'information']
74 -> ['than', 'about', 'over', 'around', 'every']
26 -> ['world', 'states', 'u', 'national', 'group']
86 -> ['b', 'd', 'born', 'actor', 'author']
89 -> ['people', 'population', 'groups', 'living', 'species']
33 -> ['united', 'british', 'canada', 'spanish', 'australia']
56 -> ['system', 'computer', 'systems', 'data', 'standard']
64 -> ['state', 'law', 'court', 'rights', 'act']
94 -> ['history', 'century', 'modern', 'old', 'greek']
45 -> ['up', 'out', 'right', 'line', 'back']
90 -> ['english', 'name', 'language', 'term', 'word']
54 -> ['well', 'much', 'common', 'popular', 'important']
19 -> ['e', 'c', 'x', 'g', 't']
71 -> ['government', 'party', 'members', 'parliament', 'elected']
95 -> ['m', 'km', 'square', 'miles', 'feet']
66 -> ['university', 'school', 'college', 'education', 'medical']
67 -> ['life', 'work', 'way', 'view', 'nature']
34 -> ['like', 'black', 'white', 'red', 'blue']
30 -> ['including', 'best', 'famous', 'art', 'writers']
76 -> ['example', 'form', 'set', 'numbers', 'function']
14 -> ['french', 'german', 'italian', 'russian', 'prize']
49 -> ['general', 'president', 'former', 'leader', 'chief']
85 -> ['high', 'level', 'low', 'rate', 'higher']
87 -> ['based', 'originally', 'test', 'via', 'multi']
0 -> ['now', 'principal', 'cook', 'resident', 'venice']
47 -> ['de', 'l', 'al', 'la', 'san']
41 -> ['music', 'style', 'band', 'album', 'rock']
11 -> ['great', 'possibly', 'lives', 'bad', 'apparently']
27 -> ['south', 'north', 'area', 'west', 'east']
91 -> ['series', 'film', 'character', 'story', 'films']
31 -> ['game', 'player', 'games', 'team', 'play']
57 -> ['country', 'europe', 'france', 'england', 'germany']
80 -> ['king', 'ii', 'roman', 'empire', 'emperor']
9 -> ['book', 'works', 'published', 'books', 'text']
22 -> ['political', 'others', 'social', 'movement', 'anti']
25 -> ['church', 'god', 'christian', 'jewish', 'religious']
40 -> ['theory', 'science', 'natural', 'research', 'study']
55 -> ['using', 'single', 'type', 'lines', 'color']
70 -> ['human', 'cause', 'effects', 'health', 'blood']
5 -> ['point', 'field', 'above', 'position', 'range']
50 -> ['public', 'us', 'company', 'economic', 'production']
98 -> ['man', 'men', 'children', 'person', 'women']
69 -> ['york', 'london', 'california', 'county', 'founded']
62 -> ['house', 'official', 'member', 'minister', 'council']
24 -> ['water', 'energy', 'material', 'cell', 'chemical']
46 -> ['original', 'version', 'released', 'video', 'release']
53 -> ['air', 'service', 'fire', 'aircraft', 'nuclear']
65 -> ['space', 'earth', 'light', 'image', 'star']
88 -> ['said', 'claim', 'claims', 'stated', 'says']
18 -> ['along', 'across', 'ice', 'cold', 'upper']
58 -> ['show', 'television', 'uk', 'radio', 'live']
63 -> ['january', 'march', 'december', 'july', 'june']
59 -> ['areas', 'parts', 'cities', 'outside', 'currently']
15 -> ['terms', 'forms', 'cases', 'elements', 'events']
13 -> ['japanese', 'etc', 'except', 'unlike', 'respectively']
3 -> ['class', 'bond', 'composition', 'partial', 'representing']
7 -> ['whose', 'raised', 'executed', 'serving', 'attacked']
83 -> ['my', 'big', 'dead', 'cover', 'dark']
16 -> ['irish', 'becoming', 'historically', 'amongst', 'elsewhere']
29 -> ['action', 'defense', 'police', 'acts', 'intelligence']
17 -> ['food', 'animals', 'gold', 'animal', 'iron']
28 -> ['co', 'am', 'na', 'die', 'ch']
8 -> ['award', 'grand', 'race', 'fame', 'super']
6 -> ['previously', 'fourteen', 'handful', 'polls', 'fortunes']
75 -> ['cycle', 'formation', 'secondary', 'producing', 'naturally']
60 -> ['regarding', 'concerning', 'describing', 'aristotle', 'oral']
23 -> ['semi', 'showing', 'apart', 'gates', 'whilst']
35 -> ['honor', 'adam', 'abraham', 'muhammad', 'passage']
82 -> ['partially', 'locally', 'carefully', 'lacking', 'attraction']
12 -> ['moreover', 'consequently', 'besides', 'repeatedly', 'likewise']
36 -> ['independently', 'comprises', 'marking', 'excluding', 'astronomers']
39 -> ['associate', 'publishers', 'eds', 'chair', 'wesley']
44 -> ['thirteen', 'eighteen', 'culminating', 'seventy', 'narrowly']
2 -> ['ironically', 'nice', 'sixteen', 'stamp', 'nights']
43 -> ['onwards', 'popularly', 'atta', 'eighty', 'variously']

How do we pick the best clusters, i.e. how do we judge whether a cluster is good or bad? We use the squared L2 distance between each vector in a cluster and the cluster's center vector.

Define the score of cluster $j$ as

$$\text{score}_j = \sum_{i \in G_j} \|x_i - z_j\|_2^2$$

Hence, the lower the score, the better the cluster.
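As a small worked example of this score, the sketch below compares a tight and a loose toy cluster. The vectors and helper names are invented for illustration and use plain lists rather than the model's NumPy arrays.

```python
def centroid(members):
    """Component-wise mean of the member vectors."""
    return [sum(col) / len(members) for col in zip(*members)]


def cluster_score(members, center):
    """Sum of squared L2 distances from each member vector to the center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(vec, center))
        for vec in members
    )


# Two hypothetical clusters with four members each
tight = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
loose = [[0.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 4.0]]

scores = {
    "tight": cluster_score(tight, centroid(tight)),
    "loose": cluster_score(loose, centroid(loose)),
}
ranking = sorted(scores, key=scores.get)  # best (lowest score) first
print(scores, ranking)
```

Because the score is a sum rather than a mean, a cluster with few members tends to score lower than a larger cluster of similar tightness. That is worth remembering when reading the ranking.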

for n, cluster in clusters.items():
    scores = 0
    for word in cluster:
        # the smaller the score, the better the cluster
        scores += sum(np.square(word2vec[word] - kmeans.cluster_centers_[n]))

    # Original clusters => Dict{cluster_num : [word list]}
    # Now it becomes Dict{cluster_num : ([word list], score)}
    clusters[n] = (cluster, scores)
sorted(clusters.items(), key=lambda x: x[1][1])[:5]  # sorted by score, ascending
[(63,
  (['january',
    'march',
    'december',
    'july',
    'june',
    'november',
    'april',
    'september',
    'august',
    'october',
    'february'],
   165.74816360414897)),
 (61, (['have', 'has', 'had', 'having'], 586.166584790995)),
 (81,
  (['more',
    'most',
    'very',
    'less',
    'particularly',
    'too',
    'highly',
    'enough',
    'relatively',
    'quite',
    'extremely',
    'somewhat',
    'increasingly',
    'fairly'],
   1517.4123485378213)),
 (86,
  (['b',
    'd',
    'born',
    'actor',
    'author',
    'writer',
    'singer',
    'actress',
    'composer',
    'poet',
    'musician',
    'artist',
    'politician',
    'philosopher',
    'mathematician',
    'painter',
    'journalist',
    'footballer',
    'novelist'],
   2115.24780004326)),
 (47,
  (['de',
    'l',
    'al',
    'la',
    'san',
    'paris',
    'et',
    'le',
    'el',
    'der',
    'des',
    'bwv',
    'del',
    'da',
    'di',
    'il',
    'du',
    'en',
    'ma',
    'santa',
    'te',
    'les',
    'fran',
    'ne',
    'juan',
    'und',
    'sur'],
   2692.603341114255))]

According to the above results, we pick the four best clusters:

  • Cluster#63 =>
    {'january','march','december','july','june','november','april','september','august','october','february'}
    • score = 165.75
    • This cluster contains the months
  • Cluster#61 =>
    {'have', 'has', 'had', 'having'}
    • score = 586.17
    • This cluster contains inflected forms of the verb "have"
  • Cluster#81 =>
    {'more','most','very','less','particularly','too','highly','enough','relatively','quite','extremely','somewhat','increasingly','fairly'}
    • score = 1517.41
    • This cluster contains adverbs of degree
  • Cluster#86 =>
    {'b','d','born','actor','author','writer','singer','actress','composer','poet','musician','artist','politician','philosopher','mathematician','painter','journalist','footballer','novelist'}
    • score = 2115.25
    • This cluster contains many occupations