Iterative Clustering 2019-02-03: intermediate

First tests of an unstable iterative clustering prototype.
This notebook is temporarily shared as the static page _iterative_clustering_2019-02-03_.html;
the data are shared via the _iterative_clustering_2019-02-03_ folder.

Settings

In [1]:
import os, sys, time
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC, kwa
from src.grammar_learner.read_files import check_dir
from src.grammar_learner.widgets import html_table
from src.grammar_learner.pqa_table import params
from src.grammar_learner.incremental_clustering import iterate
tmpath = module_path + '/tmp/'
check_dir(tmpath, True, 'none')
table = []
start = time.time()
out_dir = module_path + '/output/_iterative_clustering_' + str(UTC())[:10] + '_'
print(UTC(), ':: out_dir:\n', out_dir)
2019-02-03 08:20:30 UTC :: out_dir:
 /home/obaskov/py/language-learning/output/_iterative_clustering_2019-02-03_
In [2]:
corpus = 'POC-English-Amb'
dataset = 'MST-fixed-manually'
input_path = module_path +'/data/'+ corpus +'/'+ dataset
ref_corpus = module_path + '/data/POC-English-Amb/MST-fixed-manually/poc-english_ex-parses-gold.txt'
kwargs = {
    # Corpora: 
    'corpus'        : 'POC-English-Amb',
    'dataset'       : 'MST-fixed-manually',
    # 'input_parses': input_path    , # paths are set by 'corpus' and 'dataset'
    'reference_path': ref_corpus    ,
    'corpus_path'   : ref_corpus    , # corpus path = reference path 
    'module_path'   : module_path   , # language-learning dir (default)
    # Word space:
    'stop_words'    :   []          , # trash filter off
    'min_word_count':   1           ,
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           , # disjunct-based word vector space
    'word_space'    :   'vectors'   , # "DRK"
    # Category learning:
    'clustering'    :   ('kmeans', 'kmeans++', 10),
    'cluster_range' :   (2,50,1,5)  ,
    'cluster_criteria'  : 'silhouette',
    'clustering_metric' : ('silhouette', 'cosine'),
    'categories_generalization' : 'off',
    # Grammar rules induction:
    'grammar_rules'         : 2     , # disjunct-based link grammar rules
    'rules_generalization'  : 'off' ,
    'rules_merge'           : 0.8   ,
    'rules_aggregation'     : 0.2   ,
    'top_level'             : 0.01  ,
    # Etc...:
    'out_path'      :   out_dir     ,
    'output_grammar':   out_dir     ,
    'tmpath'        :   tmpath      , 
    'verbose'       :   'min'       ,
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        ,
    'iterations'    :   7
}
print(UTC()) #, ':: input_path:\n', input_path, '\nout_path:', kwargs['out_path'])
2019-02-03 08:20:30 UTC
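
The test cells below modify kwargs in place, so each run inherits the overrides of the cells before it (for example, the Connectors-DRK-Disjuncts run still uses 'context' = 1 from the previous cell). A minimal sketch of making the effective settings of each run explicit, assuming only the keys and values shown in this notebook; the names base, overrides and effective_settings are hypothetical and not part of grammar_learner:

```python
# Subset of the kwargs defined above that the test cells change.
base = {
    'context': 2,
    'grammar_rules': 2,
    'word_space': 'vectors',
    'clustering': ('kmeans', 'kmeans++', 10),
    'cluster_range': (2, 50, 1, 5),
}

# Overrides in notebook order, keyed by the test headings.
overrides = [
    ('Connectors-DRK-Connectors', {'context': 1, 'grammar_rules': 1}),
    ('Connectors-DRK-Disjuncts', {'grammar_rules': 2}),
    ('Disjuncts-DRK-Disjuncts', {'context': 2}),
    ('Disjuncts-ILE-Disjuncts', {'word_space': 'discrete', 'clustering': 'group'}),
    ('Disjuncts-ALE-Disjuncts', {'word_space': 'sparse',
                                 'cluster_range': (2, 36, 1, 1),
                                 'clustering': ('agglomerative', 'ward')}),
]

def effective_settings(base, overrides):
    """Apply overrides cumulatively and return a settings snapshot per run."""
    current = dict(base)          # copy, so the base dict is never mutated
    snapshots = {}
    for name, changes in overrides:
        current = {**current, **changes}   # new dict for every run
        snapshots[name] = current
    return snapshots

effective = effective_settings(base, overrides)
for name, settings in effective.items():
    print(name, settings)
```

Keeping one snapshot per run makes it easier to see which settings produced which result table when cells are re-run out of order.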

Tests: "POC-English-Amb" corpus, "MST-fixed-manually" dataset

Connectors-DRK-Connectors

In [3]:
%%capture
kwargs['context'] = 1
kwargs['grammar_rules'] = 1
table, re = iterate(**kwargs)
In [4]:
display(html_table(table))
print(re['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          30          73%   0.78
2          16          73%   0.76
3          9           74%   0.76
4          5           78%   0.74
5          2           100%  0.67
_iterative_clustering_2019-02-03_/POC-English-Amb_MST-fixed-manually_cDRKc_no-gen
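
The slice [42:-12] used in the print calls depends on the length of the local path prefix ('/home/obaskov/py/language-learning/output/' is 42 characters), so it would print something different on another machine. A hedged sketch of a prefix-independent alternative using os.path.relpath; the helper name short_project_dir is hypothetical, and the 12 trailing characters that the notebook also strips are not shown in the output, so they are not handled here:

```python
import os

def short_project_dir(project_dir, output_root):
    """Return project_dir relative to output_root instead of slicing by fixed offsets."""
    return os.path.relpath(project_dir, output_root)

# Paths taken from the printed output in this notebook (assumed layout:
# the project directory lives under module_path + '/output').
output_root = '/home/obaskov/py/language-learning/output'
project_dir = (output_root
               + '/_iterative_clustering_2019-02-03_'
               + '/POC-English-Amb_MST-fixed-manually_cDRKc_no-gen')

print(short_project_dir(project_dir, output_root))
```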

Connectors-DRK-Disjuncts

In [5]:
%%capture
kwargs['grammar_rules'] = 2
table, re = iterate(**kwargs)
In [6]:
display(html_table(table))
print(re['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          30          100%  0.99
2          18          100%  0.99
3          9           100%  0.99
4          3           100%  0.87
_iterative_clustering_2019-02-03_/POC-English-Amb_MST-fixed-manually_cDRKd_no-gen

Disjuncts-DRK-Disjuncts

In [7]:
%%capture
kwargs['context'] = 2
table, re = iterate(**kwargs)
In [8]:
display(html_table(table))
print(re['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          30          100%  0.99
2          16          100%  0.99
3          10          100%  0.99
4          4           100%  0.95
5          2           100%  0.79
_iterative_clustering_2019-02-03_/POC-English-Amb_MST-fixed-manually_dDRKd_no-gen

Disjuncts-ILE-Disjuncts

In [9]:
%%capture
kwargs['word_space'] = 'discrete'
kwargs['clustering'] = 'group'
table, re = iterate(**kwargs)
In [10]:
display(html_table(table))
print(re['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          37          100%  1.0
2          37          100%  1.0
_iterative_clustering_2019-02-03_/POC-English-Amb_MST-fixed-manually_dILEd_no-gen

Disjuncts-ALE-Disjuncts

In [11]:
%%capture
kwargs['word_space'] = 'sparse'
kwargs['cluster_range'] = (2,36,1,1)
kwargs['clustering'] = ('agglomerative', 'ward')
table, re = iterate(**kwargs)
In [12]:
display(html_table(table))
print(re['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          21          100%  1.0
2          5           100%  0.87
_iterative_clustering_2019-02-03_/POC-English-Amb_MST-fixed-manually_dALWEd_no-gen
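
For a side-by-side reading of the five tests, the rows can be collected in one structure. The sketch below copies the values from the result tables in this notebook, read as (iteration, number of clusters, PA in percent, F1), and picks for each test the row with the fewest clusters whose F1 is not below the first iteration's F1. The selection rule and the function name are illustrative assumptions, not part of grammar_learner:

```python
# (iteration, n_clusters, PA %, F1) rows copied from the tables above,
# keyed by the test headings.
results = {
    'Connectors-DRK-Connectors': [(1, 30, 73, 0.78), (2, 16, 73, 0.76), (3, 9, 74, 0.76),
                                  (4, 5, 78, 0.74), (5, 2, 100, 0.67)],
    'Connectors-DRK-Disjuncts': [(1, 30, 100, 0.99), (2, 18, 100, 0.99),
                                 (3, 9, 100, 0.99), (4, 3, 100, 0.87)],
    'Disjuncts-DRK-Disjuncts': [(1, 30, 100, 0.99), (2, 16, 100, 0.99), (3, 10, 100, 0.99),
                                (4, 4, 100, 0.95), (5, 2, 100, 0.79)],
    'Disjuncts-ILE-Disjuncts': [(1, 37, 100, 1.0), (2, 37, 100, 1.0)],
    'Disjuncts-ALE-Disjuncts': [(1, 21, 100, 1.0), (2, 5, 100, 0.87)],
}

def fewest_clusters_without_f1_loss(rows):
    """Return the row with the fewest clusters whose F1 is at least the first row's F1."""
    first_f1 = rows[0][3]
    kept = [row for row in rows if row[3] >= first_f1]
    return min(kept, key=lambda row: row[1])  # ties resolve to the earliest iteration

for test, rows in results.items():
    iteration, n_clusters, pa, f1 = fewest_clusters_without_f1_loss(rows)
    print(f'{test}: iteration {iteration}, {n_clusters} clusters, PA {pa}%, F1 {f1}')
```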