ILE clustering: no clusters with high frequency words?

Explain or discuss the fact that ILE (Identical Lexical Entries) cluster formation takes place only with min_word_count (MWC) = 1 -- UPP Project plan.

"Gutenberg Children Books" corpus, new "LG-E-noQuotes" dataset (GC_LGEnglish_noQuotes_fullyParsed.ull),
trash filter off: min_word_count = 1; max_sentence_length off; Link Grammar 5.5.1.

This notebook is shared as static ILE-clustering-research-GCB-LG-E-noQuotes-2019-04-19.html.

Basic settings

In [1]:
import os, sys, time, numpy as np, pandas as pd
from collections import Counter
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC, kwa, test_stats
from src.grammar_learner.read_files import check_dir, check_mst_files
from src.grammar_learner.write_files import list2file
from src.grammar_learner.widgets import html_table
from src.grammar_learner.preprocessing import filter_links
tmpath = module_path + '/tmp/'; check_dir(tmpath, True, 'none')
print(UTC(), ':: module_path:', module_path)
2019-04-20 13:12:11 UTC :: module_path: /home/obaskov/94/ULL

Corpus test settings

In [2]:
corpus = 'GCB' # 'Gutenberg-Children-Books-Caps' 
dataset = 'LG-E-noQuotes'
input_parses = module_path + '/data/GCB/LG-E-noQuotes/'
kwargs = {
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           ,
    'min_word_count':   1           ,
    'word_space'    :   'discrete'  ,
    'clustering'    :   'group'     ,
    'cluster_range' :   [0]         ,
    'top_level'     :   0.01        ,
    'grammar_rules' :   2           ,
    'max_disjuncts' :   1000000     ,
    'stop_words'    :   []          ,
    'tmpath'        :   tmpath      ,
    'verbose'       :   '+'      ,
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        }
rp = module_path + '/data/' + corpus + '/LG-E-noQuotes'
cp = rp  # corpus path = reference_path
out_dir = module_path + '/output/' \
    + 'ILE-clustering-GCB-LG-E-noQuotes-' + str(UTC())[:10]
kwargs['output_grammar'] = out_dir
check_dir(out_dir, True)
print(UTC(), '\n', out_dir)
2019-04-20 13:12:12 UTC 
 /home/obaskov/94/ULL/output/ILE-clustering-GCB-LG-E-noQuotes-2019-04-20

Parses ⇒ links ⇒ words

In [3]:
files, re01 = check_mst_files(input_parses, 'max')
kwargs['input_files'] = files; files
Out[3]:
['/home/obaskov/94/ULL/data/GCB/LG-E-noQuotes/GC_LGEnglish_noQuotes_fullyParsed.ull']
In [4]:
links, re02 = filter_links(files, **kwargs)
print(len(links), 'unique links (pairs of linked words in parses)')
links[['word', 'link', 'count']].head()
433259 unique links (pairs of linked words in parses)
Out[4]:
word link count
0 , and+ 6601
1 , but+ 1471
2 it was+ 1405
3 not did- 822
4 he and+ 766
In [5]:
words = links[['word', 'count']].groupby('word').agg({'count': 'sum'}).reset_index()
print(len(words), 'unique words in links (pairs of linked words in parses)')
words.head()
22067 unique words in links (pairs of linked words in parses)
Out[5]:
word count
0 ! 190
1 $ 6
2 & 6
3 ' 193
4 'd 339

Word counts: dict: {word: number_of_observations}

In [6]:
word_counts = words.set_index('word').to_dict()['count']
print(len(word_counts), 'words total,\n',
      len([w for w,c in word_counts.items() if c < 2]), 'words observed only once,\n', 
      len([w for w,c in word_counts.items() if c == 2]), 'words observed twice,\n', 
      len([w for w,c in word_counts.items() if c > 2]), 'words observed more than twice')
22067 words total,
 8614 words observed only once,
 3271 words observed twice,
 10182 words observed more than twice

Links ⇒ disjuncts ⇒ clusters

In [7]:
df = links[['word', 'link', 'count']].copy()
df['disjuncts'] = [[x] for x in df['link']]
del df['link']
df = df.groupby('word').agg({'disjuncts': 'sum', 'count': 'sum'}).reset_index()
df['words'] = [[x] for x in df['word']]
del df['word']
df['disjuncts'] = df['disjuncts'].apply(lambda x: tuple(sorted(x)))
df[['words', 'disjuncts', 'count']].head()
Out[7]:
words disjuncts count
0 [!] (!-, !- & crack+ & !+, !- & musket-shots+ & fl... 190
1 [$] (price- & 1.25+, price- & 1.50+) 6
2 [&] (grosset- & dunlap+, marshall- & company+, mar... 6
3 ['] (,- & struggles- & gainst+, an- & sheep-bells+... 193
4 ['d] (anybody- & have+, emily- & been+, he- & and+,... 339

Disjuncts

In [8]:
dj_list = df['disjuncts'].tolist()
djset = set(df['disjuncts'].tolist())
print(len(djset), 'unique disjuncts')
18939 unique disjuncts
In [9]:
rules = df[['words', 'disjuncts']].groupby('disjuncts')['words'].apply(sum) \
    .reset_index().copy().rename(columns = {'words': 'cluster_words'})
print(len(rules), 'grammar rules after clustering disjuncts')
18939 grammar rules after clustering disjuncts
In [10]:
rules[['cluster_words', 'disjuncts']].head()
Out[10]:
cluster_words disjuncts
0 [ahem] (!+ & grandma+,)
1 [no] (!+ & one+, ,+ & and+, ,+ & anne+, ,+ & beast+...
2 [!] (!-, !- & crack+ & !+, !- & musket-shots+ & fl...
3 [crack] (!-, 's- & a-, ability- & to-, and- & that- & ...
4 [@number@] (!-, 's- & chain-, 's- & daughter-, 's- & fort...

Clusters longer than 1 word

In [11]:
cluster_list = [c for c in rules['cluster_words'].tolist() if len(c) > 1]
print(len(cluster_list), 'clusters contain 2 or more words;', 
      len(rules) - len(cluster_list), '"clusters" are single-word')
661 clusters contain 2 or more words; 18278 "clusters" are single-word
In [12]:
print('First 12 clusters:\n')
for c in cluster_list[:12]: print(c)
First 12 clusters:

['1.25', '1.50']
['beckwiths', 'marooners']
['balancin', 'blind-man', 'bumpsterhausen', 'ceylon', 'clarkman', 'ewald', 'georgie', 'gerald', 'gleeson', 'hobson', 'holloway', 'italy', 'janey', 'jeannie', 'luella', 'malley', 'matey', 'midge', 'minot', 'morison', 'netty', 'niagara', 'strutt', 'theer', 'toady', 'toff', 'tom-and-kate', 'ventnor', 'webster']
['bruk', 'deadliest', 'forty-three', 'living-address', 'nemesis', 'occurrences', 'peculiarities', 'relict', 'riddles', 'setness', 'seventy-six']
['up-wind', 'upwind']
['aylmer', 'bask', 'cold-eyed', 'geneva', 'long-concealed', 'morrice-dancers', 'odour-freighted', 'paperarello', 'pistils', 'plaster-worker', 'postage-stamps', 'psychological', 'saddles', 'sniffle', 'treasons', 'volaterrae']
['anguished', 'deep-set']
['1871-72', 'anne-girl', 'apes', 'bev', 'bibles', 'blackberries', 'boot', 'bournemouth', 'bridles', 'bristling', 'carefree', 'centaurs', 'chillon', 'davie', 'depression', 'double-seated', 'dry-eyed', 'e.c', 'eb', 'ede', 'editeur', 'fishy-like', 'flecked', 'fleetly', 'germs', 'goldenly-glad', 'icelandic', 'isobel', 'langsyne', 'ld', 'litill', 'long-bearded', 'lubber', "ma'am", 'minion', 'misled', 'nets', 'panthers', 'plantain', 'privation', 'roofed', 'sar', 'scared-like', 'set-lipped', 'shrined', 'sirs', 'stags', 'stratagems', 'studded', 'talon', 'tamarinds', 'theatres', 'u.s.a.', 'unintimidated', 'virginal', 'vol', 'wilfy', 'winnie', 'yawning']
['esme', 'millie']
['brocades', 'cinnamon']
['prey', 'purply-black']
['holmes', 'langley']
In [13]:
cluster_sizes = Counter([len(c) for c in cluster_list])
print('Cluster sizes observed more than once:')
display(html_table([['Cluster size', 'Number of clusters']] + 
                   sorted([[s,n] for s,n in cluster_sizes.items() if n > 1], 
                          key = lambda x : x[1], reverse = True)))
Cluster sizes observed more than once:
Cluster size  Number of clusters
2             352
3             102
4             63
5             32
6             25
7             17
9             9
13            6
10            5
11            4
15            4
14            4
8             4
16            3
26            3
12            3
19            3
21            2
In [14]:
print('Cluster sizes observed only once:\n', sorted([s for s,n in cluster_sizes.items() if n < 2]))
Cluster sizes observed only once:
 [17, 20, 22, 25, 29, 34, 36, 38, 39, 40, 43, 49, 51, 54, 55, 59, 73, 162, 173, 417]
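
As a consistency check, the two size listings above can be summed and compared with the totals reported elsewhere in this notebook (661 multi-word clusters, 3789 unique words in clusters). This is a minimal sketch: the (size, number) pairs are typed in by hand from the Out[13] table and the Out[14] list, and the variable names are introduced here for illustration only.

```python
# (cluster size, number of clusters) pairs as read from the Out[13] table
repeated_sizes = [(2, 352), (3, 102), (4, 63), (5, 32), (6, 25), (7, 17),
                  (9, 9), (13, 6), (10, 5), (11, 4), (15, 4), (14, 4),
                  (8, 4), (16, 3), (26, 3), (12, 3), (19, 3), (21, 2)]
# Cluster sizes that occur exactly once, copied from the Out[14] output
single_sizes = [17, 20, 22, 25, 29, 34, 36, 38, 39, 40,
                43, 49, 51, 54, 55, 59, 73, 162, 173, 417]

# Every size in the second list stands for exactly one cluster.
n_clusters = sum(n for _, n in repeated_sizes) + len(single_sizes)
# Each word belongs to one cluster, so the sizes add up to the word total.
n_words = sum(s * n for s, n in repeated_sizes) + sum(single_sizes)
print(n_clusters, 'multi-word clusters,', n_words, 'words in them')
```

Both sums agree with the counts printed by cells 11 and 15.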

Words in clusters

In [15]:
# unique words in clusters:
words_in_clusters = set([w for c in cluster_list for w in c])
print(len(words_in_clusters), 'unique words in clusters', 
      '-- 100 randomly chosen samples:\n\n', list(words_in_clusters)[:100])
3789 unique words in clusters -- 100 randomly chosen samples:

 ['well-shaped', 'nutting', 'jack-sparrow', 'minx', "prigio's", 'fameuse', 'setters', 'tapping', 'compass-needle', 'stags', 'hamper', 'heightened', 'queerer', 'nantes', 'nyamatsanes', 'neighing', 'piave', 'plodded', 'bishop', 'stuttered', 'conquests', 'steak', 'braved', 'unintimidated', 'fiercer', 'thirty-nine', 'resident', 'vinegar', 'light-heartedness', 'restful', 'wind-ruddy', 'toff', 'pathways', 'carney', 'feasible', 'realism', 'crowsfeet', 'outrageous', 'pook', 'lilac-bush', 'ill-health', 'quick-spread', 'county', 'horoscopes', 'klondike', 'speculative', 'daredevil', 'thomson', 'puny', 'bournemouth', 'carnations', 'notebooks', 'vol', 'corroding', 'interfering', 'kootenay', 'callender', 'fag-end', 'mirrored', 'testily', 'invader', 'lantern-light', 'majestic', 'aforethought', 'prop', 'picnicked', 'neatness', 'govor', 'crisply', 'sun-struck', 'prometheus', 'prescription', 'annual', 'lackadiasical', 'shopped', 'complied', 'half-drowned', "ha'nted", 'robe-edge', 'buckwheats', 'chrissie', 'kettley', 'workers', 'archives', 'gul-rukh', 'curtainless', 'gooseberries', 'stilled', 'bagdad', 'impracticable', 'chary', 'earth-stars', 'kilter', 'elopements', 'out-at-elbows', 'pianno', 'friendliness', 'afflictions', 'cedar', 'heaviest']
In [16]:
# Distinct corpus-wide observation counts of the words that appear in clusters
wcs = set([word_counts[w] for w in words_in_clusters]); wcs
Out[16]:
{1, 2, 3, 4, 5, 6, 7, 9}
In [17]:
clustered_word_counts = Counter([word_counts[w] for w in words_in_clusters])
clustered_word_counts
Out[17]:
Counter({1: 3416, 2: 289, 3: 49, 6: 9, 5: 11, 4: 12, 7: 1, 9: 2})

1: 3416 means that 3416 of the words found in clusters are observed once in the corpus,
2: 289 -- 289 words are observed twice, ... 9: 2 -- the 2 most frequent clustered words are observed 9 times each.

In [18]:
print(str(int(round(clustered_word_counts[1]/len(words_in_clusters)*100,0))) + 
      '% of words in clusters are observed only once in the whole input corpus:\n',
      clustered_word_counts[1], 'once-observed words of', 
      len(words_in_clusters), 'total unique words in clusters.')
90% of words in clusters are observed only once in the whole input corpus:
 3416 once-observed words of 3789 total unique words in clusters.
In [19]:
clustered_words_counts = {w: word_counts[w] for c in cluster_list for w in c}
frequent_words_counts = {w:c for w,c in clustered_words_counts.items() if c > 2}
print('Number of observations of clustered words in the whole corpus',
      'for words observed more than twice:\n\n', frequent_words_counts)
Number of observations of clustered words in the whole corpus for words observed more than twice:

 {'1.50': 5, 'marooners': 4, 'malley': 3, 'ventnor': 3, 'anguished': 3, 'deep-set': 4, 'editeur': 3, "ma'am": 9, 'esme': 6, 'leroux': 3, 'xi': 9, 'xiv': 6, 'xix': 6, 'xv': 6, 'xxiii': 6, 'xxiv': 6, 'pined': 3, 'fretted': 3, 'haughty': 3, 'moped': 4, 'curdken': 3, 'andy': 4, 'gestures': 3, 'lettered': 3, 'pest': 3, 'triumphs': 3, 'unprofitable': 3, 'tatters': 3, 'quizzically': 3, 'aloofness': 4, 'bread-and-butter': 3, 'lamented': 4, 'profoundly': 3, 'unnecessary': 3, 'unsuspicious': 3, 'incisive': 5, 'winning': 3, 'hotels': 3, 'juncture': 5, 'foolscap': 5, 'seven-league': 3, 'xxv': 5, 'xxvii': 3, 'xxviii': 3, 'xxx': 3, 'xxxi': 3, 'hazel-nut': 4, 'many-furred': 3, 'nut-brown': 3, 'blackest': 4, 'attendance': 3, 'january': 3, 'manitoba': 6, 'vancouver': 3, 'trice': 3, 'daytime': 7, 'grate': 3, 'bazar': 5, 'center': 3, 'shifty': 3, 'sealskin': 6, 'shaws': 3, 'thankfulness': 3, 'yore': 3, 'checkers': 3, 'spices': 5, 'sinclairs': 3, 'enderly': 5, 'petersen': 4, 'sagely': 3, 'stony': 3, 'intervening': 3, "wendy's": 3, 'easiest': 4, 'funniest': 3, 'handsomest': 5, 'kindest': 4, 'lowest': 5, 'merest': 4, 'merriest': 3, 'pleasantest': 5, 'quickest': 3, 'jolliest': 6, 'softest': 3}

Clustering patterns

In [20]:
patterns = Counter([tuple(sorted(set([word_counts[w] for w in c]))) for c in cluster_list])
patterns
Out[20]:
Counter({(1, 5): 3,
         (1, 4): 3,
         (1, 2, 3): 6,
         (1,): 478,
         (1, 2): 120,
         (3, 4): 1,
         (1, 2, 3, 9): 1,
         (2, 6): 1,
         (2,): 14,
         (1, 3): 8,
         (6, 9): 1,
         (2, 3, 4): 2,
         (2, 3): 9,
         (1, 2, 3, 4): 1,
         (2, 4): 2,
         (3,): 1,
         (3, 5): 1,
         (1, 2, 5): 1,
         (1, 2, 3, 5): 1,
         (2, 3, 6): 1,
         (1, 2, 3, 7): 1,
         (2, 5): 2,
         (1, 6): 1,
         (1, 2, 3, 4, 5): 1,
         (3, 6): 1})

Comment: each key is the sorted set of observation counts found among a cluster's words; (1, 5): 3 means that 3 clusters consist of words observed in the input corpus either once or 5 times.
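
How a pattern key is built can be seen on a small made-up example. The sketch below repeats the expression from cell 20 inside a hypothetical helper, `count_patterns`; the toy word counts and clusters are invented for illustration and are not taken from the corpus.

```python
from collections import Counter

# Hypothetical toy data: corpus-wide observation counts per word.
toy_word_counts = {'a': 1, 'c': 5, 'b': 1, 'd': 1, 'e': 5,
                   'g': 1, 'h': 1, 'f': 2, 'i': 1}
# Hypothetical multi-word clusters; each word belongs to one cluster.
toy_clusters = [['a', 'c'], ['b', 'd', 'e'], ['g', 'h'], ['f', 'i']]

def count_patterns(clusters, word_counts):
    """Count clusters by the sorted set of their words' observation counts."""
    return Counter(tuple(sorted({word_counts[w] for w in c}))
                   for c in clusters)

toy_patterns = count_patterns(toy_clusters, toy_word_counts)
print(toy_patterns)
```

Here ['a', 'c'] and ['b', 'd', 'e'] both map to (1, 5): the pattern records which counts occur in a cluster, not how many words carry each count.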

Clusters of words observed more than once in the input corpus

In [21]:
once_observed_words = {w:c for w,c in word_counts.items() if c < 2}.keys()
filtered_cluster_list = [l for l in [[w for w in c if w not in once_observed_words] 
                         for c in cluster_list] if len(l) > 1]
print(len(cluster_list) - len(filtered_cluster_list),
      'clusters of', len(cluster_list), 
      'contain fewer than 2 words observed more than once in the input corpus,\n',
      len(filtered_cluster_list),
      'clusters contain 2 or more words observed more than once in the input corpus:')
for l in filtered_cluster_list: print(l)
587 clusters of 661 contain fewer than 2 words observed more than once in the input corpus,
 74 clusters contain 2 or more words observed more than once in the input corpus:
['bumpsterhausen', 'clarkman', 'holloway', 'malley', 'toady', 'ventnor']
['anguished', 'deep-set']
['editeur', "ma'am", 'roofed']
['esme', 'millie']
['brocades', 'cinnamon']
['p64.jpg', 'p99.jpg']
['xi', 'xiv', 'xix', 'xv', 'xxiii', 'xxiv']
['milkmaid', 'scraper']
['classics', 'jingling', 'jukes', 'oyster-shops', 'pined', 'railroad-hacks', 'stocks', 'tegumai', 'turley', '|marilla']
['....n', 'clucked', 'creaked', 'dukes', 'fretted', 'haughty', 'matured', 'moped', 'threshed', 'twittered']
['homeless', 'shingled']
['derision', 'mogarzea', 'wattle']
['curdken', 'untrue']
['andy', 'attributes', 'chow-chow', 'clambered', 'commented', 'crackling', 'fiddled', 'fume', 'gestures', 'lettered', 'magog', 'magsie', 'mature', 'militza', 'omnibuses', 'pathways', 'pest', 'precipitated', 'scheming', 'stilled', 'sue', 'top-heavy', 'triumphs', 'unprofitable']
['fisher-folk', 'praises']
['delirium', 'peshawur', 'seclusion', 'tatters', 'temperament']
['fiercer', 'piercingly', 'quizzically']
['aloofness', 'bread-and-butter', 'camels']
['lamented', 'snowshoeing']
['disdainfully', 'pleadingly']
['girlishly', 'profoundly']
['unnecessary', 'unsuspicious']
['fidgety', 'obedient']
['incisive', 'winning']
['bombay', 'mhow', 'nucklao', 'wish-ton-wish']
['boat-side', 'hotels']
['beauty-spots', 'dangerously']
['quixotic', 'totting']
['fidget', 'leak']
['foolscap', 'leather-bound']
['inlaid', 'innermost']
['assault', 'contraries', 'degrees', 'proxy', 'waltzing']
['xxv', 'xxvii', 'xxviii', 'xxx', 'xxxi', 'xxxii']
['many-furred', 'poor-spirited']
['bulging', 'horror-stricken']
['dark-brown', 'nut-brown']
['cloverside', 'heartsease']
['buda', 'luncheon']
['ragged-looking', 'snippy']
['iron-gray', 'wavy']
['blackest', 'whitest']
['attendance', 'denial', 'midstream', 'supplication', 'winnipeg']
['anatomy', 'hopelessness', 'january', 'norroway', 'switzerland']
['campden', 'fractions', 'manitoba', 'vancouver']
['daytime', 'grate', 'lurch']
['bazar', 'turret-room']
['center', 'mellowness']
['goldfish', 'spinning-top', 'traitor']
['gold-bearded', 'grey-haired', 'white-haired']
['braddon', 'brompton', 'carney']
['birch-bark', 'blefuscu', 'borkum', 'hamel', 'koumongoé', 'lehon', 'mesopotamia', 'mugger-ghaut', 'pinks', 'shaws', 'thankfulness', 'yore']
['checkers', 'croquet']
['civilization', 'khanhiwara', 'missions', 'navarre', 'taram-tāq', 'wales']
['brandy', 'impunity', 'shields', 'spices', 'unconcern']
['countryside', 'indies']
['arches', 'sinclairs', 'spy-glass']
['caucasus', 'gruagach']
['dial', 'divan']
['burnleys', 'gear', 'mary-annish']
['brush-grown', 'brushy']
['hospitably', 'rock-firm']
['bitterest', 'boldest', 'driest', 'easiest', 'fattest', 'fullest', 'funniest', 'grimmest', 'handsomest', 'kindest', 'laziest', 'lowest', 'merest', 'merriest', 'oddest', 'pleasantest', 'quickest', 'thinnest', 'tiniest', 'ugliest']
['jolliest', 'softest']
['penningtons', 'spencers']
['allans', 'verdure']
['garden-engine', 'lateness']
['greece', 'hack', 'marsden', 'procrastinate', 'saharunpore']
['byre', 'treadmill']
['bath-time', 'dense', 'gliding', 'twenty-eight']
['nugget', 'perambulator']
['bothering', 'shunning']
['chippings', 'trifled']
['gusto', 'honker']
['door-mat', 'water-butt']

Clusters of words observed more than twice in the input corpus

In [22]:
less_observed_words = {w:c for w,c in word_counts.items() if c < 3}.keys()
filtered_cluster_list = [l for l in [[w for w in c if w not in less_observed_words] 
                         for c in cluster_list] if len(l) > 1]
print(len(filtered_cluster_list),
      'clusters of words, observed more than twice in the input corpus:\n')
for l in filtered_cluster_list: print(l)
15 clusters of words, observed more than twice in the input corpus:

['malley', 'ventnor']
['anguished', 'deep-set']
['editeur', "ma'am"]
['xi', 'xiv', 'xix', 'xv', 'xxiii', 'xxiv']
['fretted', 'haughty', 'moped']
['andy', 'gestures', 'lettered', 'pest', 'triumphs', 'unprofitable']
['aloofness', 'bread-and-butter']
['unnecessary', 'unsuspicious']
['incisive', 'winning']
['xxv', 'xxvii', 'xxviii', 'xxx', 'xxxi']
['manitoba', 'vancouver']
['daytime', 'grate']
['shaws', 'thankfulness', 'yore']
['easiest', 'funniest', 'handsomest', 'kindest', 'lowest', 'merest', 'merriest', 'pleasantest', 'quickest']
['jolliest', 'softest']

Clusters of words observed more than 3x in the input corpus

In [23]:
less_observed_words = {w:c for w,c in word_counts.items() if c < 4}.keys()
filtered_cluster_list = [l for l in [[w for w in c if w not in less_observed_words] 
                         for c in cluster_list] if len(l) > 1]
print(len(filtered_cluster_list),
      'clusters of words, observed more than 3x in the input corpus:\n')
for l in filtered_cluster_list: print(l)
2 clusters of words, observed more than 3x in the input corpus:

['xi', 'xiv', 'xix', 'xv', 'xxiii', 'xxiv']
['easiest', 'handsomest', 'kindest', 'lowest', 'merest', 'pleasantest']

Highlights:

The unfiltered "Gutenberg Children Books" corpus, "LG-E-noQuotes" dataset, contains
22067 unique words, of which:

  • 8614 words are observed only once in the whole (unfiltered) corpus,
  • 3271 words are observed twice,
  • 10182 words are observed more than twice (46% of the vocabulary).

"Identical Lexical Entries" (ILE) clustering yields 18939 grammar rules (clusters):

  • 18278 "clusters" are single-word,
  • 661 clusters contain 2 or more words, 3789 unique words in total,
    3416 (90%) of which are observed only once in the corpus:
    • 587 clusters contain fewer than 2 words observed more than once
      (478 of them consist solely of once-observed words),
    • 74 clusters contain 2 or more words observed more than once,
    • 15 clusters contain 2 or more words observed more than twice,
    • 2 clusters contain 2 or more words observed more than 3x.
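
The cluster counts in these highlights can be cross-checked against the pattern table in Out[20]. The sketch below copies that Counter by hand and splits the 661 multi-word clusters into three groups; the variable names are introduced here for illustration. Cell 21 removes once-observed words and keeps a cluster only if at least two words remain, so its figure of 587 covers the 478 clusters made up solely of once-observed words plus 587 - 478 = 109 mixed clusters in which a single word is observed more than once.

```python
from collections import Counter

# Copied from Out[20]: {observation counts present in a cluster: clusters}
patterns = Counter({
    (1, 5): 3, (1, 4): 3, (1, 2, 3): 6, (1,): 478, (1, 2): 120,
    (3, 4): 1, (1, 2, 3, 9): 1, (2, 6): 1, (2,): 14, (1, 3): 8,
    (6, 9): 1, (2, 3, 4): 2, (2, 3): 9, (1, 2, 3, 4): 1, (2, 4): 2,
    (3,): 1, (3, 5): 1, (1, 2, 5): 1, (1, 2, 3, 5): 1, (2, 3, 6): 1,
    (1, 2, 3, 7): 1, (2, 5): 2, (1, 6): 1, (1, 2, 3, 4, 5): 1, (3, 6): 1})

total = sum(patterns.values())
# Clusters in which every word is observed exactly once
only_once = patterns[(1,)]
# Clusters without any once-observed word
no_once = sum(n for p, n in patterns.items() if 1 not in p)
# Clusters mixing once-observed words with more frequent ones
mixed = total - only_once - no_once
print(total, 'clusters:', only_once, 'once-only,',
      mixed, 'mixed,', no_once, 'without once-observed words')

# Percentages quoted in the highlights
share_once = round(3416 / 3789 * 100)     # clustered words observed once
share_gt_two = round(10182 / 22067 * 100) # words observed more than twice
print(share_once, share_gt_two)
```

The pattern counts add up to 661, and the 74 clusters listed by cell 21 must come from the mixed group and the group without once-observed words.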