Use Gensim to train a word2vec model
Use the Gensim library to train a word embedding model on the text8
dataset. Training is based on the skip-gram
architecture.
Reference:
[1] https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/
# import modules and set up logging
# train the skip-gram model; default window=5
Task 1 Analogy Prediction
- Input: a pair of words (a, b) and a third word c
- Output: the fourth word d such that “a is to b as c is to d”
For example
- “man -> woman” => “king -> queen”
- “Japan -> Japanese” => “Australia -> Australian”
- “Paris -> France” => “Beijing -> China”
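The task can be stated mathematically.
Given word vectors $w_a, w_b, w_c$, find a word vector $w_d$ satisfying, in the standard vector-offset formulation:
$$w_b - w_a \approx w_d - w_c$$
Then we solve, over the vectors $w$ of all vocabulary words other than a, b, and c:
$$w_d = \arg\max_{w} \cos(w, \; w_b - w_a + w_c)$$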
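A minimal sketch of this analogy search, assuming the usual vector-offset formulation (pick the word whose vector is most cosine-similar to $w_b - w_a + w_c$). The toy 2-D vectors and the names `cosine`, `predict_analogy`, and `toy_vectors` are made up for illustration and are not taken from the trained model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_analogy(vectors, a, b, c):
    """Return the word d whose vector is closest (by cosine) to w_b - w_a + w_c.

    The three query words themselves are excluded from the candidates.
    """
    target = [wb - wa + wc
              for wa, wb, wc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "apple": [0.5, 0.5],
}

print(predict_analogy(toy_vectors, "man", "woman", "king"))  # queen
```

Gensim's `most_similar` normalizes each vector to unit length before combining them, which this sketch does not do; that is one possible reason a naive implementation and the library can disagree on some analogies.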
# Use the Gensim API
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
japan -> tokyo like china -> shanghai
cool -> coolest like cold -> kargil
dark -> darkest like easy -> customers
listening -> listened like moving -> penetrated
looking -> looked like swimming -> rides
playing -> played like taking -> took
increase -> increases like decrease -> decreases
predict -> predicts like shuffle -> gutter
provide -> provides like search -> tutorial
say -> says like speak -> speaks
Austria -> Austrian like Sweden -> swedish
Cambodia -> Cambodian like Australia -> canada
paying -> paid like striking -> noticeable
running -> ran like taking -> took
selling -> sold like thinking -> thought
shrinking -> shrank like jumping -> kicking
words = model.wv.vocab.keys() # all words in the vocabulary
# Define a naive most_similar() function
man -> woman like king -> queen
small -> smaller like big -> bigger
italy -> italian like china -> chinese
japan -> tokyo like china -> beijing
cool -> coolest like cold -> falklands
dark -> darkest like easy -> difficult
listening -> listened like moving -> moves
looking -> looked like swimming -> perished
playing -> played like taking -> took
increase -> increases like decrease -> decreases
predict -> predicts like shuffle -> lodewijk
provide -> provides like search -> summarizes
say -> says like speak -> speaks
Austria -> Austrian like Sweden -> swedish
Cambodia -> Cambodian like Australia -> canada
paying -> paid like striking -> caught
running -> ran like taking -> took
selling -> sold like thinking -> realized
shrinking -> shrank like jumping -> kicking
# Testing data is downloaded from https://code.google.com/archive/p/word2vec/source/default/source
The accuracy on the test data = 0.160793194874061
1440 pairs of data were not used because of no matched word in the model
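The evaluation loop behind these two numbers can be sketched as follows. This is an illustration under stated assumptions, not the original code: it assumes the test file uses the `questions-words.txt` layout (section headers starting with `:` and four whitespace-separated words per question), that words are lowercased to match the lowercase text8 vocabulary, and that questions containing an out-of-vocabulary word are skipped. The names `evaluate_analogies` and `lookup_predict` are hypothetical, and the predictor is a lookup table standing in for a real model.

```python
def evaluate_analogies(lines, vocab, predict):
    """Score analogy questions given as 'a b c d' lines.

    Lines starting with ':' are section headers and are ignored.
    Questions with any word outside `vocab` are skipped and counted.
    Returns (accuracy, number_of_skipped_questions).
    """
    correct = total = skipped = 0
    for line in lines:
        parts = line.lower().split()
        if len(parts) != 4 or parts[0].startswith(":"):
            continue  # header, blank, or malformed line
        a, b, c, expected = parts
        if any(word not in vocab for word in parts):
            skipped += 1
            continue
        total += 1
        if predict(a, b, c) == expected:
            correct += 1
    accuracy = correct / total if total else 0.0
    return accuracy, skipped

sample_lines = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Berlin Germany",
    ": family",
    "boy girl king queen",
]
sample_vocab = {"athens", "greece", "berlin", "germany",
                "boy", "girl", "king", "queen"}

# Stand-in predictor (hypothetical): one right answer, one wrong answer.
answers = {
    ("athens", "greece", "berlin"): "germany",
    ("boy", "girl", "king"): "princess",
}

def lookup_predict(a, b, c):
    return answers[(a, b, c)]

accuracy, skipped = evaluate_analogies(sample_lines, sample_vocab, lookup_predict)
print(accuracy, skipped)  # 0.5 1
```

In practice `predict` would wrap the naive analogy function, and `vocab` would be the model's vocabulary.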
# Gensim API to evaluate the analogy prediction task
C:\Users\yelbee\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `accuracy` (Method will be removed in 4.0.0, use self.evaluate_word_analogies() instead).
"""Entry point for launching an IPython kernel.
2020-04-05 17:06:55,747 : INFO : capital-common-countries: 34.2% (173/506)
2020-04-05 17:07:01,033 : INFO : capital-world: 18.7% (271/1452)
2020-04-05 17:07:02,009 : INFO : currency: 12.7% (34/268)
2020-04-05 17:07:07,788 : INFO : city-in-state: 11.0% (173/1571)
2020-04-05 17:07:08,987 : INFO : family: 81.4% (249/306)
2020-04-05 17:07:11,881 : INFO : gram1-adjective-to-adverb: 12.3% (93/756)
2020-04-05 17:07:12,983 : INFO : gram2-opposite: 17.3% (53/306)
2020-04-05 17:07:18,473 : INFO : gram3-comparative: 63.9% (805/1260)
2020-04-05 17:07:20,422 : INFO : gram4-superlative: 34.4% (174/506)
2020-04-05 17:07:24,023 : INFO : gram5-present-participle: 30.6% (304/992)
2020-04-05 17:07:30,591 : INFO : gram6-nationality-adjective: 56.1% (769/1371)
2020-04-05 17:07:36,784 : INFO : gram7-past-tense: 27.0% (359/1332)
2020-04-05 17:07:40,974 : INFO : gram8-plural: 43.6% (433/992)
2020-04-05 17:07:43,601 : INFO : gram9-plural-verbs: 35.1% (228/650)
2020-04-05 17:07:43,602 : INFO : total: 33.6% (4118/12268)
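The `total` line in the log is the pooled accuracy over all questions, not the mean of the per-section percentages. A small sketch that re-derives it from the section lines above (timestamps and the `INFO` prefix are stripped here for brevity; the variable names are illustrative):

```python
import re

log_text = """
capital-common-countries: 34.2% (173/506)
capital-world: 18.7% (271/1452)
currency: 12.7% (34/268)
city-in-state: 11.0% (173/1571)
family: 81.4% (249/306)
gram1-adjective-to-adverb: 12.3% (93/756)
gram2-opposite: 17.3% (53/306)
gram3-comparative: 63.9% (805/1260)
gram4-superlative: 34.4% (174/506)
gram5-present-participle: 30.6% (304/992)
gram6-nationality-adjective: 56.1% (769/1371)
gram7-past-tense: 27.0% (359/1332)
gram8-plural: 43.6% (433/992)
gram9-plural-verbs: 35.1% (228/650)
"""

pattern = re.compile(r"(?P<name>[\w-]+): [\d.]+% \((?P<correct>\d+)/(?P<total>\d+)\)")

# Map each section name to its (correct, total) counts.
sections = {
    m.group("name"): (int(m.group("correct")), int(m.group("total")))
    for m in pattern.finditer(log_text)
}

total_correct = sum(c for c, _ in sections.values())
total_questions = sum(t for _, t in sections.values())
pooled_accuracy = total_correct / total_questions
mean_of_sections = sum(c / t for c, t in sections.values()) / len(sections)

print(f"{total_correct}/{total_questions} = {100 * pooled_accuracy:.1f}%")
print(f"mean of per-section accuracies = {100 * mean_of_sections:.1f}%")
```

The pooled figure reproduces the logged 4118/12268. Sections with many questions, such as `city-in-state`, weigh more heavily in it than small ones such as `family`.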
Above, we implemented our own naive method, which computes cosine similarity directly, for the analogy prediction task. It runs slower than the Gensim API method. Most predictions of the two methods agree, for example:
man -> woman like king -> queen
However, some predictions differ, for example:
Naive Method: japan -> tokyo like china -> (beijing)
In the first case, the Naive Method is right, because Beijing is the capital of China just as Tokyo is the capital of Japan. The second pair tests the past tense of a verb, and there the Gensim API Method is more correct.
Finally, we test our model on the Google Analogy Test Set.
Task 2 Clustering Task
In this task, we use the K-Means algorithm to cluster the words into 100 groups.
Input:
- $N$ vectors $x_1, x_2, \cdots, x_N \in \mathbb{R}^n$
- $k$: the number of clusters we want
Output:
- $c_i (i=1,2,\cdots,N)$: the cluster that $x_i$ belongs to
- $z_j (j=1,2, \cdots, k)$: the representative vector of each cluster
Initialization: Initialize $z_1,z_2, \cdots, z_k$ by choosing $k$ vectors from $x_1, x_2, \cdots, x_N$ randomly
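Step 1: Given $z_1,z_2, \cdots, z_k$, compute
$$c_i = \arg\min_{j \in \{1,\cdots,k\}} \|x_i - z_j\|_2, \quad i=1,2,\cdots,N$$
and define
$$G_j = \{\, i : c_i = j \,\}, \quad j=1,2,\cdots,k$$
Step 2: Given $G_1,G_2,\cdots,G_k$, compute
$$z_j = \frac{1}{|G_j|} \sum_{i \in G_j} x_i, \quad j=1,2,\cdots,k$$
Go back to Step 1 until convergence.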
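The two alternating steps can be sketched in plain Python as below. This is a didactic version of the standard K-Means (Lloyd) iteration, assuming nearest-representative assignment and mean updates; it is not the scikit-learn implementation used later, and `k_means`, `toy_points`, and the `seed` parameter are names chosen for this example.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(points, k, max_iter=100, seed=0):
    """Return (assignments c_i, representatives z_j) for the given points."""
    rng = random.Random(seed)
    # Initialization: choose k of the input vectors at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest representative.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed
        assignments = new_assignments
        # Step 2: move each representative to the mean of its group.
        for j in range(k):
            group = [p for p, c in zip(points, assignments) if c == j]
            if group:  # an empty group keeps its previous representative
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return assignments, centers

# Two well-separated toy blobs in 2-D.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(toy_points, k=2)
print(labels)
print(centers)
```

The cluster ids themselves are arbitrary; only the grouping matters. On real word vectors a library implementation is preferable, since it adds better initialization and vectorized distance computation.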
# here you load vectors for each word in your model
from sklearn.cluster import KMeans
kmeans.labels_
array([92, 4, 4, ..., 6, 6, 6])
# Build a cluster structure
for n, cluster in clusters.items():
92 -> ['the', 'in', 'for', 'on', 'first']
4 -> ['of', 'and', 'with', 'making', 'permanent']
73 -> ['one', 'zero', 'nine', 'two', 'eight']
10 -> ['a', 'as', 'or', 'an', 'being']
38 -> ['to', 'can', 'may', 'would', 'will']
21 -> ['is', 'means', 'uses', 'includes', 'remains']
78 -> ['s', 'mark', 'don', 'brown', 'ray']
42 -> ['was', 'became', 'left', 'began', 'took']
77 -> ['by', 'under', 'initially', 'subsequently', 'jordan']
1 -> ['that', 'it', 'this', 'which', 'also']
20 -> ['are', 'use', 'include', 'related', 'believe']
84 -> ['from', 'according', 'addition', 'giving', 'dedicated']
97 -> ['his', 'he', 'her', 'him', 'she']
52 -> ['be', 'make', 'become', 'take', 'run']
79 -> ['at', 'city', 'home', 'near', 'center']
61 -> ['have', 'has', 'had', 'having']
93 -> ['were', 'war', 'against', 'military', 'force']
37 -> ['other', 'their', 'some', 'all', 'such']
96 -> ['its', 'power', 'due', 'control', 'development']
81 -> ['more', 'most', 'very', 'less', 'particularly']
51 -> ['been', 'made', 'led', 'developed', 'included']
99 -> ['used', 'known', 'called', 'found', 'considered']
72 -> ['there', 'held', 'created', 'produced', 'established']
68 -> ['american', 'john', 'james', 'william', 'david']
48 -> ['time', 'years', 'year', 'day', 'times']
32 -> ['see', 'links', 'external', 'list', 'information']
74 -> ['than', 'about', 'over', 'around', 'every']
26 -> ['world', 'states', 'u', 'national', 'group']
86 -> ['b', 'd', 'born', 'actor', 'author']
89 -> ['people', 'population', 'groups', 'living', 'species']
33 -> ['united', 'british', 'canada', 'spanish', 'australia']
56 -> ['system', 'computer', 'systems', 'data', 'standard']
64 -> ['state', 'law', 'court', 'rights', 'act']
94 -> ['history', 'century', 'modern', 'old', 'greek']
45 -> ['up', 'out', 'right', 'line', 'back']
90 -> ['english', 'name', 'language', 'term', 'word']
54 -> ['well', 'much', 'common', 'popular', 'important']
19 -> ['e', 'c', 'x', 'g', 't']
71 -> ['government', 'party', 'members', 'parliament', 'elected']
95 -> ['m', 'km', 'square', 'miles', 'feet']
66 -> ['university', 'school', 'college', 'education', 'medical']
67 -> ['life', 'work', 'way', 'view', 'nature']
34 -> ['like', 'black', 'white', 'red', 'blue']
30 -> ['including', 'best', 'famous', 'art', 'writers']
76 -> ['example', 'form', 'set', 'numbers', 'function']
14 -> ['french', 'german', 'italian', 'russian', 'prize']
49 -> ['general', 'president', 'former', 'leader', 'chief']
85 -> ['high', 'level', 'low', 'rate', 'higher']
87 -> ['based', 'originally', 'test', 'via', 'multi']
0 -> ['now', 'principal', 'cook', 'resident', 'venice']
47 -> ['de', 'l', 'al', 'la', 'san']
41 -> ['music', 'style', 'band', 'album', 'rock']
11 -> ['great', 'possibly', 'lives', 'bad', 'apparently']
27 -> ['south', 'north', 'area', 'west', 'east']
91 -> ['series', 'film', 'character', 'story', 'films']
31 -> ['game', 'player', 'games', 'team', 'play']
57 -> ['country', 'europe', 'france', 'england', 'germany']
80 -> ['king', 'ii', 'roman', 'empire', 'emperor']
9 -> ['book', 'works', 'published', 'books', 'text']
22 -> ['political', 'others', 'social', 'movement', 'anti']
25 -> ['church', 'god', 'christian', 'jewish', 'religious']
40 -> ['theory', 'science', 'natural', 'research', 'study']
55 -> ['using', 'single', 'type', 'lines', 'color']
70 -> ['human', 'cause', 'effects', 'health', 'blood']
5 -> ['point', 'field', 'above', 'position', 'range']
50 -> ['public', 'us', 'company', 'economic', 'production']
98 -> ['man', 'men', 'children', 'person', 'women']
69 -> ['york', 'london', 'california', 'county', 'founded']
62 -> ['house', 'official', 'member', 'minister', 'council']
24 -> ['water', 'energy', 'material', 'cell', 'chemical']
46 -> ['original', 'version', 'released', 'video', 'release']
53 -> ['air', 'service', 'fire', 'aircraft', 'nuclear']
65 -> ['space', 'earth', 'light', 'image', 'star']
88 -> ['said', 'claim', 'claims', 'stated', 'says']
18 -> ['along', 'across', 'ice', 'cold', 'upper']
58 -> ['show', 'television', 'uk', 'radio', 'live']
63 -> ['january', 'march', 'december', 'july', 'june']
59 -> ['areas', 'parts', 'cities', 'outside', 'currently']
15 -> ['terms', 'forms', 'cases', 'elements', 'events']
13 -> ['japanese', 'etc', 'except', 'unlike', 'respectively']
3 -> ['class', 'bond', 'composition', 'partial', 'representing']
7 -> ['whose', 'raised', 'executed', 'serving', 'attacked']
83 -> ['my', 'big', 'dead', 'cover', 'dark']
16 -> ['irish', 'becoming', 'historically', 'amongst', 'elsewhere']
29 -> ['action', 'defense', 'police', 'acts', 'intelligence']
17 -> ['food', 'animals', 'gold', 'animal', 'iron']
28 -> ['co', 'am', 'na', 'die', 'ch']
8 -> ['award', 'grand', 'race', 'fame', 'super']
6 -> ['previously', 'fourteen', 'handful', 'polls', 'fortunes']
75 -> ['cycle', 'formation', 'secondary', 'producing', 'naturally']
60 -> ['regarding', 'concerning', 'describing', 'aristotle', 'oral']
23 -> ['semi', 'showing', 'apart', 'gates', 'whilst']
35 -> ['honor', 'adam', 'abraham', 'muhammad', 'passage']
82 -> ['partially', 'locally', 'carefully', 'lacking', 'attraction']
12 -> ['moreover', 'consequently', 'besides', 'repeatedly', 'likewise']
36 -> ['independently', 'comprises', 'marking', 'excluding', 'astronomers']
39 -> ['associate', 'publishers', 'eds', 'chair', 'wesley']
44 -> ['thirteen', 'eighteen', 'culminating', 'seventy', 'narrowly']
2 -> ['ironically', 'nice', 'sixteen', 'stamp', 'nights']
43 -> ['onwards', 'popularly', 'atta', 'eighty', 'variously']
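The listing above shows each cluster id followed by its first five words. One way to build such a structure from a word list and the matching label array is sketched below; `build_clusters` and the toy inputs are illustrative, with labels chosen to agree with the first lines of the listing.

```python
from collections import defaultdict

def build_clusters(words, labels):
    """Group words by cluster label, keeping the original word order."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[int(label)].append(word)
    return dict(clusters)

# Toy input: words in vocabulary order and one cluster label per word.
toy_words = ["the", "of", "and", "in", "one", "zero"]
toy_labels = [92, 4, 4, 92, 73, 73]

toy_clusters = build_clusters(toy_words, toy_labels)
for n, cluster in toy_clusters.items():
    print(n, "->", cluster[:5])
```

Because dictionaries preserve insertion order, clusters are printed in the order in which their first word appears, which is why the listing starts with cluster 92 rather than cluster 0.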
We need a way to pick the best clusters, i.e. to judge whether a cluster is good or bad. We evaluate a cluster by the L2 distance between each of its vectors and the cluster's center vector.
Define score
Hence, the lower the score, the better the cluster.
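A minimal sketch of such a score, under the assumption that it is the sum of Euclidean (L2) distances from each member vector to the cluster center. The exact formula (sum, mean, or squared distances) is an assumption here, and `cluster_score` is a hypothetical helper name.

```python
import math

def cluster_score(vectors, center):
    """Sum of L2 distances from each member vector to the cluster center."""
    return sum(
        math.sqrt(sum((x - c) ** 2 for x, c in zip(vector, center)))
        for vector in vectors
    )

def mean_vector(vectors):
    """Component-wise mean, used here as the cluster center."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# A tight toy cluster and a loose one.
tight = [[1.0, 1.0], [1.0, 2.0]]
loose = [[0.0, 0.0], [6.0, 8.0]]

tight_score = cluster_score(tight, mean_vector(tight))
loose_score = cluster_score(loose, mean_vector(loose))
print(tight_score, loose_score)  # 1.0 10.0
```

With a summed score, clusters with few members tend to score lower simply because fewer distances are added. Dividing by the cluster size is one alternative if clusters of very different sizes are to be compared.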
for n, cluster in clusters.items():
sorted(clusters.items(), key = lambda x: x[1][1])[:5] # sorted by score, ascending
[(63,
(['january',
'march',
'december',
'july',
'june',
'november',
'april',
'september',
'august',
'october',
'february'],
165.74816360414897)),
(61, (['have', 'has', 'had', 'having'], 586.166584790995)),
(81,
(['more',
'most',
'very',
'less',
'particularly',
'too',
'highly',
'enough',
'relatively',
'quite',
'extremely',
'somewhat',
'increasingly',
'fairly'],
1517.4123485378213)),
(86,
(['b',
'd',
'born',
'actor',
'author',
'writer',
'singer',
'actress',
'composer',
'poet',
'musician',
'artist',
'politician',
'philosopher',
'mathematician',
'painter',
'journalist',
'footballer',
'novelist'],
2115.24780004326)),
(47,
(['de',
'l',
'al',
'la',
'san',
'paris',
'et',
'le',
'el',
'der',
'des',
'bwv',
'del',
'da',
'di',
'il',
'du',
'en',
'ma',
'santa',
'te',
'les',
'fran',
'ne',
'juan',
'und',
'sur'],
2692.603341114255))]
Based on the above results, we pick the top 4 clusters:
Cluster#63
=>{'january','march','december','july','june','november','april','september','august','october','february'}
- score = 165.75
- This cluster contains months
Cluster#61
=>{'have', 'has', 'had', 'having'}
- score = 586.17
- This cluster contains different forms of the verb "have"
Cluster#81
=>{'more','most','very','less','particularly','too','highly','enough','relatively','quite','extremely','somewhat','increasingly','fairly'}
- score = 1517.41
- These are adverbs of degree
Cluster#86
=>{'b','d','born','actor','author','writer','singer','actress','composer','poet','musician','artist','politician','philosopher','mathematician','painter','journalist','footballer','novelist'}
- score = 2115.25
- This cluster contains many occupations