Clustering: KLE (k-means), MLE (mean shift) -- 2018-11-10

This notebook is shared as the static LE-clustering-KLE-MLE-2018-11-10_.html.
Data: LE-clustering-KLE-MLE-2018-11-10_.

Basic settings

In [1]:
import os, sys, time
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC
from src.grammar_learner.read_files import check_dir
from src.grammar_learner.write_files import list2file
from src.grammar_learner.widgets import html_table
from src.grammar_learner.pqa_table import table_rows
tmpath = module_path + '/tmp/'
check_dir(tmpath, True, 'none')
table = []
long_table = []
start = time.time()
print(UTC(), ':: module_path =', module_path)
2018-11-10 10:08:18 UTC :: module_path = /home/obaskov/94/language-learning

Corpus test settings

In [2]:
out_dir = module_path + '/output/LE-clustering-KLE-MLE-' + str(UTC())[:10] + '_'
# corpus = 'CDS-caps-br-text+brent9mos'
corpus = 'CDS-caps-br-text'
# dataset = 'LG-English'
dataset = 'LG-English-clean-clean'  # only 100% parsed sentences
runs = (1,1)
if runs != (1,1): out_dir += '-multi'
kwargs = {
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           ,
    'min_word_count':   1           ,
    'min_link_count':   1           ,
    'max_words'     :   100000      ,
    'max_features'  :   100000      ,
    'min_co-occurrence_count':  1   ,
    'word_space'    :   'sparse'    ,
    'clustering'    :   ('agglomerative', 'ward'),
    'cluster_range' :   200         ,
    'clustering_metric' : ('silhouette', 'cosine'),
    'grammar_rules' :   2           ,
    'max_disjuncts' :   100000      ,
    'tmpath'        :   tmpath      , 
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        ,
    'verbose'       :   'min'       }
lines = [
    [33, corpus , 'LG-English'                     ,0,0, 'none'  ], 
    [34, corpus , 'LG-English'                     ,0,0, 'rules' ], 
    [35, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'none'  ], 
    [36, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'rules' ]]
# rp = module_path + '/data/CDS-caps-br-text+brent9mos/LG-English'
# rp = module_path + '/data/CDS-caps-br-text/LG-English'  # shorter test 81025
rp = module_path + '/data/CDS-caps-br-text/LG-English-clean-clean'
cp = rp  # corpus path = reference_path :: use 'gold' parses as test corpus
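The 'clustering_metric' setting above, ('silhouette', 'cosine'), names the score used to judge cluster quality. The snippet below is only an illustrative sketch of that metric with scikit-learn, not the Grammar Learner's internal code; the toy matrix X and labels are made up for the example.

import numpy as np
from sklearn.metrics import silhouette_score

# Two tight groups of 2-d vectors and a clean 2-cluster labelling (hypothetical data).
X = np.array([[1.0, 0.0], [0.9, 0.1],
              [0.0, 1.0], [0.1, 0.9]])
labels = [0, 0, 1, 1]

# Cosine silhouette, as in 'clustering_metric': ('silhouette', 'cosine');
# values near 1.0 indicate well-separated clusters.
print(silhouette_score(X, labels, metric='cosine'))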

Tests with "clean" training and test sets

In [3]:
corpus = 'CDS-caps-br-text'
dataset = 'LG-English-clean-clean'  # only 100% parsed sentences
lines = [[33, corpus, 'LG-English-clean-clean',0, 0, 'none'], 
         [34, corpus, 'LG-English-clean-clean',0, 0, 'rules']]

Agglomerative clustering

In [4]:
%%capture
a, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(a)
In [5]:
display(html_table([header] + a))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dALEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dALEd | 180 | --- | 99% | 97% | 0.98
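For reference, the ('agglomerative', 'ward') setting corresponds to Ward-linkage agglomerative clustering with a fixed number of clusters. A minimal scikit-learn sketch follows; the random matrix X is a stand-in for the word space, and the exact preprocessing inside grammar_learner is not shown here.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # hypothetical dense word vectors

# Ward linkage and 200 clusters, matching 'clustering': ('agglomerative', 'ward')
# and 'cluster_range': 200 in the settings above.
ward = AgglomerativeClustering(n_clusters=200, linkage='ward')
labels = ward.fit_predict(X)
print('clusters:', len(set(labels)))   # always 200 here -- the count is fixed up front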

K-means clustering, 200 clusters

In [6]:
%%capture
kwargs['clustering'] = ('k-means', 'kmeans++', 10)
k, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(k)
In [7]:
display(html_table([header] + k))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dKLEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dKLEd | 182 | --- | 99% | 97% | 0.98
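The ('k-means', 'kmeans++', 10) setting reads as k-means with k-means++ initialisation; taking the third element as the number of restarts (n_init) is an assumption. A rough scikit-learn analogue on a hypothetical word-vector matrix:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # hypothetical word vectors

# k-means++ seeding, 10 restarts, 200 clusters as in the settings above.
km = KMeans(n_clusters=200, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)
print('clusters:', len(set(labels)))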

Mean shift clustering

In [8]:
%%capture
kwargs['clustering'] = ('mean_shift', 2)
m, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(m)
In [9]:
display(html_table([header] + m))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dMLEd | 309 | --- | 99% | 97% | 0.97
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dMLEd | 289 | --- | 99% | 97% | 0.97
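Unlike agglomerative clustering and k-means, mean shift does not take a target number of clusters: it finds the modes of the density itself, which is why the Rules column rises to 309 and 289 here. A small scikit-learn sketch on synthetic blobs follows; reading the 2 in ('mean_shift', 2) as a kernel-bandwidth parameter is an assumption.

from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Five synthetic 2-d blobs stand in for the word space (hypothetical data).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.5, random_state=0)

ms = MeanShift(bandwidth=2)                   # bandwidth chosen for the toy data
labels = ms.fit_predict(X)
print('clusters found:', len(set(labels)))    # decided by the data and bandwidth, not preset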

Training with the basic corpus

Both "br-text" and "brent9mos" corpora, no cleanup

In [10]:
corpus = 'CDS-caps-br-text+brent9mos'
dataset = 'LG-English'
rp = module_path + '/data/CDS-caps-br-text/LG-English-clean-clean'
cp = rp  # corpus path = reference_path
lines = [
    [33, corpus , 'LG-English'                     ,0,0, 'none'  ], 
    [34, corpus , 'LG-English'                     ,0,0, 'rules' ], 
    [35, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'none'  ], 
    [36, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'rules' ]]

Agglomerative clustering

In [11]:
%%capture
kwargs['clustering'] = ('agglomerative', 'ward')
a2, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(a2)
In [12]:
display(html_table([header] + a2))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dALEd | 200 | --- | 99% | 95% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dALEd | 170 | --- | 99% | 94% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dALEd | 200 | --- | 79% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dALEd | 200 | --- | 79% | 48% | 0.50

K-means clustering, 200 clusters

In [13]:
%%capture
kwargs['clustering'] = ('k-means', 'kmeans++', 10)
k2, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(k2)
In [14]:
display(html_table([header] + k2))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dKLEd | 200 | --- | 99% | 96% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dKLEd | 170 | --- | 99% | 93% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dKLEd | 200 | --- | 81% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dKLEd | 200 | --- | 80% | 48% | 0.50

Mean shift clustering

In [15]:
%%capture
kwargs['clustering'] = ('mean_shift', 2)
m2, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(m2)
In [16]:
display(html_table([header] + m2))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dMLEd | 1073 | --- | 99% | 97% | 0.98
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dMLEd | 1038 | --- | 99% | 97% | 0.97
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dMLEd | 1406 | --- | 73% | 46% | 0.49
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dMLEd | 1405 | --- | 73% | 46% | 0.49

All tests

In [17]:
display(html_table([header] + table))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dALEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dALEd | 180 | --- | 99% | 97% | 0.98
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dKLEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dKLEd | 182 | --- | 99% | 97% | 0.98
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dMLEd | 309 | --- | 99% | 97% | 0.97
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dMLEd | 289 | --- | 99% | 97% | 0.97
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dALEd | 200 | --- | 99% | 95% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dALEd | 170 | --- | 99% | 94% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dALEd | 200 | --- | 79% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dALEd | 200 | --- | 79% | 48% | 0.50
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dKLEd | 200 | --- | 99% | 96% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dKLEd | 170 | --- | 99% | 93% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dKLEd | 200 | --- | 81% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dKLEd | 200 | --- | 80% | 48% | 0.50
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dMLEd | 1073 | --- | 99% | 97% | 0.98
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dMLEd | 1038 | --- | 99% | 97% | 0.97
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dMLEd | 1406 | --- | 73% | 46% | 0.49
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dMLEd | 1405 | --- | 73% | 46% | 0.49
In [18]:
print(UTC(), ':: finished, elapsed', str(round((time.time()-start)/3600.0, 1)), 'hours')
table_str = list2file(table, out_dir + '/short_table.txt')
print('Results saved to', out_dir + '/short_table.txt')
2018-11-10 14:22:43 UTC :: finished, elapsed 4.2 hours
Results saved to /home/obaskov/94/language-learning/output/LE-clustering-KLE-MLE-2018-11-10_/short_table.txt