Clustering: KLE (k-means), MLE (mean shift) -- 2018-11-10

This notebook is shared as the static LE-clustering-KLE-MLE-2018-11-10_.html.
Data: LE-clustering-KLE-MLE-2018-11-10_.

Basic settings

In [1]:
import os, sys, time
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC
from src.grammar_learner.read_files import check_dir
from src.grammar_learner.write_files import list2file
from src.grammar_learner.widgets import html_table
from src.grammar_learner.pqa_table import table_rows
tmpath = module_path + '/tmp/'
check_dir(tmpath, True, 'none')
table = []
long_table = []
start = time.time()
print(UTC(), ':: module_path =', module_path)
2018-11-10 10:08:18 UTC :: module_path = /home/obaskov/94/language-learning

Corpus test settings

In [2]:
out_dir = module_path + '/output/LE-clustering-KLE-MLE-' + str(UTC())[:10] + '_'
# corpus = 'CDS-caps-br-text+brent9mos'
corpus = 'CDS-caps-br-text'
# dataset = 'LG-English'
dataset = 'LG-English-clean-clean'  # only 100% parsed sentences
runs = (1,1)
if runs != (1,1): out_dir += '-multi'
kwargs = {
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           ,
    'min_word_count':   1           ,
    'min_link_count':   1           ,
    'max_words'     :   100000      ,
    'max_features'  :   100000      ,
    'min_co-occurrence_count':  1   ,
    'word_space'    :   'sparse'    ,
    'clustering'    :   ('agglomerative', 'ward'),
    'cluster_range' :   200         ,
    'clustering_metric' : ('silhouette', 'cosine'),
    'grammar_rules' :   2           ,
    'max_disjuncts' :   100000      ,
    'tmpath'        :   tmpath      , 
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        ,
    'verbose'       :   'min'       }
lines = [
    [33, corpus , 'LG-English'                     ,0,0, 'none'  ], 
    [34, corpus , 'LG-English'                     ,0,0, 'rules' ], 
    [35, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'none'  ], 
    [36, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'rules' ]]
# rp = module_path + '/data/CDS-caps-br-text+brent9mos/LG-English'
# rp = module_path + '/data/CDS-caps-br-text/LG-English'  # shorter test 81025
rp = module_path + '/data/CDS-caps-br-text/LG-English-clean-clean'
cp = rp  # corpus path = reference_path :: use 'gold' parses as test corpus
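The 'clustering_metric' setting above, ('silhouette', 'cosine'), names the score used to judge cluster quality. The snippet below is only an illustrative sketch of that metric with scikit-learn, not the Grammar Learner's internal code; the toy matrix X and labels are made up for the example.

import numpy as np
from sklearn.metrics import silhouette_score

# Two tight groups of 2-d vectors and a clean 2-cluster labelling (hypothetical data).
X = np.array([[1.0, 0.0], [0.9, 0.1],
              [0.0, 1.0], [0.1, 0.9]])
labels = [0, 0, 1, 1]

# Cosine silhouette, as in 'clustering_metric': ('silhouette', 'cosine');
# values near 1.0 indicate well-separated clusters.
print(silhouette_score(X, labels, metric='cosine'))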

Tests with "clean" training and test sets

In [3]:
corpus = 'CDS-caps-br-text'
dataset = 'LG-English-clean-clean'  # only 100% parsed sentences
lines = [[33, corpus, 'LG-English-clean-clean',0, 0, 'none'], 
         [34, corpus, 'LG-English-clean-clean',0, 0, 'rules']]

Agglomerative clustering

In [4]:
%%capture
a, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(a)
In [5]:
display(html_table([header] + a))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dALEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dALEd | 180 | --- | 99% | 97% | 0.98
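For reference, the ('agglomerative', 'ward') setting corresponds to Ward-linkage agglomerative clustering with a fixed number of clusters. A minimal scikit-learn sketch follows; the random matrix X is a stand-in for the word space, and the exact preprocessing inside grammar_learner is not shown here.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # hypothetical dense word vectors

# Ward linkage and 200 clusters, matching 'clustering': ('agglomerative', 'ward')
# and 'cluster_range': 200 in the settings above.
ward = AgglomerativeClustering(n_clusters=200, linkage='ward')
labels = ward.fit_predict(X)
print('clusters:', len(set(labels)))   # always 200 here -- the count is fixed up front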

K-means clustering, 200 clusters

In [6]:
%%capture
kwargs['clustering'] = ('k-means', 'kmeans++', 10)
k, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(k)
In [7]:
display(html_table([header] + k))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dKLEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dKLEd | 182 | --- | 99% | 97% | 0.98
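The ('k-means', 'kmeans++', 10) setting reads as k-means with k-means++ initialisation; taking the third element as the number of restarts (n_init) is an assumption. A rough scikit-learn analogue on a hypothetical word-vector matrix:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((1000, 50))          # hypothetical word vectors

# k-means++ seeding, 10 restarts, 200 clusters as in the settings above.
km = KMeans(n_clusters=200, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X)
print('clusters:', len(set(labels)))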

Mean shift clustering

In [8]:
%%capture
kwargs['clustering'] = ('mean_shift', 2)
m, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(m)
In [9]:
display(html_table([header] + m))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dMLEd | 309 | --- | 99% | 97% | 0.97
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dMLEd | 289 | --- | 99% | 97% | 0.97
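Unlike agglomerative clustering and k-means, mean shift does not take a target number of clusters: it finds the modes of the density itself, which is why the Rules column rises to 309 and 289 here. A small scikit-learn sketch on synthetic blobs follows; reading the 2 in ('mean_shift', 2) as a kernel-bandwidth parameter is an assumption.

from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Five synthetic 2-d blobs stand in for the word space (hypothetical data).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.5, random_state=0)

ms = MeanShift(bandwidth=2)                   # bandwidth chosen for the toy data
labels = ms.fit_predict(X)
print('clusters found:', len(set(labels)))    # decided by the data and bandwidth, not preset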

Training with the basic corpus

Both "br-text" and "brent9mos" corpora, no cleanup

In [10]:
corpus = 'CDS-caps-br-text+brent9mos'
dataset = 'LG-English'
rp = module_path + '/data/CDS-caps-br-text/LG-English-clean-clean'
cp = rp  # corpus path = reference_path
lines = [
    [33, corpus , 'LG-English'                     ,0,0, 'none'  ], 
    [34, corpus , 'LG-English'                     ,0,0, 'rules' ], 
    [35, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'none'  ], 
    [36, corpus , 'R=6-Weight=6:R-mst-weight=+1:R' ,0,0, 'rules' ]]

Agglomerative clustering

In [11]:
%%capture
kwargs['clustering'] = ('agglomerative', 'ward')
a2, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(a2)
In [12]:
display(html_table([header] + a2))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dALEd | 200 | --- | 99% | 95% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dALEd | 170 | --- | 99% | 94% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dALEd | 200 | --- | 79% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dALEd | 200 | --- | 79% | 48% | 0.50

K-means clustering, 200 clusters

In [13]:
%%capture
kwargs['clustering'] = ('k-means', 'kmeans++', 10)
k2, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(k2)
In [14]:
display(html_table([header] + k2))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dKLEd | 200 | --- | 99% | 96% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dKLEd | 170 | --- | 99% | 93% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dKLEd | 200 | --- | 81% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dKLEd | 200 | --- | 80% | 48% | 0.50

Mean shift clustering

In [15]:
%%capture
kwargs['clustering'] = ('mean_shift', 2)
m2, _, header = table_rows(lines, out_dir, cp, rp, runs, **kwargs)
table.extend(m2)
In [16]:
display(html_table([header] + m2))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dMLEd | 1073 | --- | 99% | 97% | 0.98
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dMLEd | 1038 | --- | 99% | 97% | 0.97
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dMLEd | 1406 | --- | 73% | 46% | 0.49
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dMLEd | 1405 | --- | 73% | 46% | 0.49

All tests

In [17]:
display(html_table([header] + table))
Line | Corpus | Parsing | LW | RW | Gen. | Space | Rules | Silhouette | PA | PQ | F1
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dALEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dALEd | 180 | --- | 99% | 97% | 0.98
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dKLEd | 200 | --- | 99% | 98% | 0.98
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dKLEd | 182 | --- | 99% | 97% | 0.98
33 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | none | dMLEd | 309 | --- | 99% | 97% | 0.97
34 | CDS-caps-br-text | LG-English-clean-clean | --- | --- | rules | dMLEd | 289 | --- | 99% | 97% | 0.97
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dALEd | 200 | --- | 99% | 95% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dALEd | 170 | --- | 99% | 94% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dALEd | 200 | --- | 79% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dALEd | 200 | --- | 79% | 48% | 0.50
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dKLEd | 200 | --- | 99% | 96% | 0.96
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dKLEd | 170 | --- | 99% | 93% | 0.94
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dKLEd | 200 | --- | 81% | 48% | 0.50
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dKLEd | 200 | --- | 80% | 48% | 0.50
33 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | none | dMLEd | 1073 | --- | 99% | 97% | 0.98
34 | CDS-caps-br-text+brent9mos | LG-English | --- | --- | rules | dMLEd | 1038 | --- | 99% | 97% | 0.97
35 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | none | dMLEd | 1406 | --- | 73% | 46% | 0.49
36 | CDS-caps-br-text+brent9mos | R=6-Weight=6:R-mst-weight=+1:R | --- | --- | rules | dMLEd | 1405 | --- | 73% | 46% | 0.49
In [18]:
print(UTC(), ':: finished, elapsed', str(round((time.time()-start)/3600.0, 1)), 'hours')
table_str = list2file(table, out_dir + '/short_table.txt')
print('Results saved to', out_dir + '/short_table.txt')
2018-11-10 14:22:43 UTC :: finished, elapsed 4.2 hours
Results saved to /home/obaskov/94/language-learning/output/LE-clustering-KLE-MLE-2018-11-10_/short_table.txt