Iterative Clustering 2019-02-09

This notebook checks "Incremental Iterative Clustering with Continuous Dimension Reduction", described on the last page of the "ULL 2019 - Report".
The notebook is shared as Iterative-Clustering-4-sentences-2019-02-09.html;
the output data is in Iterative-Clustering-4-sentences-2019-02-09.

Settings

In [1]:
import os, sys, time
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.grammar_learner.utl import UTC, kwa
from src.grammar_learner.read_files import check_dir
from src.grammar_learner.widgets import html_table
from src.grammar_learner.pqa_table import params
from src.grammar_learner.incremental_clustering import iterate
tmpath = module_path + '/tmp/'
check_dir(tmpath, True, 'none')
table = []
start = time.time()
out_dir = module_path + \
    '/output/Iterative-Clustering-4-sentences-' + str(UTC())[:10]
print(UTC(), ':: out_dir:\n', out_dir)
2019-02-10 09:02:21 UTC :: out_dir:
 /home/obaskov/94/language-learning/output/Iterative-Clustering-4-sentences-2018-02-09
In [2]:
corpus = '4_sentences'
dataset = 'sequence_1234'
input_path = module_path +'/data/'+ corpus +'/'+ dataset
ref_corpus = input_path
kwargs = {
    # Corpora: 
    'corpus'        : corpus        ,
    'dataset'       : dataset       ,
    # 'input_parses'  : input_path    , # paths are set by 'corpus' and 'dataset'
    # 'reference_path': ref_corpus    , # reference_path = input_parses
    # 'corpus_path'   : ref_corpus    , # corpus path = reference path 
    'module_path'   : module_path   , # language-learning dir (default)
    # Word space:
    'stop_words'    :   []          , # trash filter off
    'min_word_count':   1           ,
    'left_wall'     :   ''          ,
    'period'        :   False       ,
    'context'       :   2           , # disjunct-based word vector space
    'word_space'    :   'discrete'  , # "ILE"
    # Category learning:
    'clustering'    :   'group'     , # 
    'cluster_range' :   0           , # not used in ILE, can be used to mark dirs
    'cluster_criteria'  : 'silhouette',
    'clustering_metric' : ('silhouette', 'cosine'),
    'categories_generalization' : 'off',
    # Grammar rules induction:
    'grammar_rules'         : 2     , # disjunct-based link grammar rules
    'rules_generalization'  : 'off' ,
    'rules_merge'           : 0.8   ,
    'rules_aggregation'     : 0.2   ,
    'top_level'             : 0.01  ,
    # Etc...:
    'out_path'      :   out_dir     ,
    'output_grammar':   out_dir     ,
    'tmpath'        :   tmpath      , 
    'verbose'       :   'min'       ,
    'template_path' :   'poc-turtle',
    'linkage_limit' :   1000        ,
    'iterations'    :   12
}
if check_dir(input_path): print(UTC(), ':: input_path:\n', input_path)
2019-02-10 09:02:21 UTC :: input_path:
 /home/obaskov/94/language-learning/data/4_sentences/sequence_1234

4 sentences -- original parses

In [3]:
kwargs['cluster_range'] = 6  # just marking dir
with open(input_path + '/4_sentences_1234.ull', 'r') as f:
    lines = f.read().splitlines()
for line in lines: print(line)
tuna is a fish
1 tuna 2 is
2 is 3 a
3 a 4 fish

parrot is a bird
1 parrot 2 is
2 is 3 a
3 a 4 bird

parrot eats a seed
1 parrot 2 eats
2 eats 3 a
3 a 4 seed

parrot eats a corn
1 parrot 2 eats
2 eats 3 a
3 a 4 corn
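
The listing above shows the layout of a .ull parse file: each block starts with the sentence, followed by one link per line in the form "left index, left word, right index, right word", and blocks are separated by blank lines. A minimal sketch of reading that layout follows; `parse_ull` and `SAMPLE` are hypothetical names used for illustration and are not part of `grammar_learner`.

```python
from typing import List, Tuple

# (left index, left word, right index, right word)
Link = Tuple[int, str, int, str]

# First two blocks of the file printed above.
SAMPLE = """\
tuna is a fish
1 tuna 2 is
2 is 3 a
3 a 4 fish

parrot is a bird
1 parrot 2 is
2 is 3 a
3 a 4 bird
"""


def parse_ull(text: str) -> List[Tuple[List[str], List[Link]]]:
    """Split ULL text into (tokens, links) pairs, one per block."""
    parses = []
    for block in text.strip().split("\n\n"):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        tokens = lines[0].split()          # first line is the sentence
        links = []
        for ln in lines[1:]:               # remaining lines are links
            i, left, j, right = ln.split()
            links.append((int(i), left, int(j), right))
        parses.append((tokens, links))
    return parses


parses = parse_ull(SAMPLE)
for tokens, links in parses:
    print(" ".join(tokens), "->", [(l, r) for _, l, _, r in links])
```

The sketch assumes every link line has exactly four whitespace-separated fields, which holds for this corpus but may not hold for parses that contain walls or multi-token words.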

Connector-based word space and rules (cILEc)

In [4]:
%%capture
kwargs['context'] = 1        # connector-based word space
kwargs['grammar_rules'] = 1  # connector-based grammar rules
t1, re1 = iterate(**kwargs)
In [5]:
display(html_table(t1)); print(re1['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          6           100%  1.0
2          6           100%  1.0
Iterative-Clustering-4-sentences-2018-02-09/4_sentences_sequence_1234_cILEc_no-gen_6c
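
To make the connector-based word space concrete, the sketch below derives connectors from the links of the four original parses and groups words whose connector sets are identical. It assumes that the 'group' clustering setting merges words with identical context sets; that is an assumption about the behaviour, not code taken from `grammar_learner`, and `connector_sets` and `group_identical` are hypothetical helpers. Under that assumption the sketch yields six groups, which agrees with the N clusters value reported above.

```python
from collections import defaultdict

# Links of the four original parses as (left word, right word) pairs.
LINKS = [
    ("tuna", "is"), ("is", "a"), ("a", "fish"),
    ("parrot", "is"), ("is", "a"), ("a", "bird"),
    ("parrot", "eats"), ("eats", "a"), ("a", "seed"),
    ("parrot", "eats"), ("eats", "a"), ("a", "corn"),
]


def connector_sets(links):
    """Map each word to its set of connectors: 'w+' links right, 'w-' links left."""
    connectors = defaultdict(set)
    for left, right in links:
        connectors[left].add(right + "+")
        connectors[right].add(left + "-")
    return connectors


def group_identical(features):
    """Group words whose feature sets are exactly equal (assumed behaviour)."""
    groups = defaultdict(list)
    for word, feats in features.items():
        groups[frozenset(feats)].append(word)
    return sorted(sorted(words) for words in groups.values())


groups = group_identical(connector_sets(LINKS))
for group in groups:
    print(group)
print(len(groups), "groups")
```

Only the four nouns (fish, bird, seed, corn) share the single connector `a-`; tuna and parrot stay apart because tuna never links to eats in this dataset.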

Connector-based space, disjunct-based rules (cILEd)

In [6]:
%%capture
kwargs['context'] = 1        # connector-based word space
kwargs['grammar_rules'] = 2  # disjunct-based grammar rules
t2, re2 = iterate(**kwargs)
In [7]:
display(html_table(t2)); print(re2['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          6           100%  1.0
2          6           100%  1.0
Iterative-Clustering-4-sentences-2018-02-09/4_sentences_sequence_1234_cILEd_no-gen_6c

Disjunct-based word space and rules (dILEd)

In [8]:
%%capture
kwargs['context'] = 2        # disjunct-based word space
kwargs['grammar_rules'] = 2  # disjunct-based grammar rules
t3, re3 = iterate(**kwargs)
In [9]:
display(html_table(t3)); print(re3['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          6           100%  1.0
2          6           100%  1.0
Iterative-Clustering-4-sentences-2018-02-09/4_sentences_sequence_1234_dILEd_no-gen_6c
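
In the disjunct-based word space a word's feature is the whole conjunction of its connectors within one sentence rather than each connector separately. The sketch below builds such disjuncts from the original parses and groups words with identical disjunct sets. As before, the identical-set grouping is an assumption about the 'group' setting, and `disjunct_sets` is a hypothetical helper rather than the `grammar_learner` implementation.

```python
from collections import defaultdict

# One list of (left index, left word, right index, right word) links per sentence.
PARSES = [
    [(1, "tuna", 2, "is"), (2, "is", 3, "a"), (3, "a", 4, "fish")],
    [(1, "parrot", 2, "is"), (2, "is", 3, "a"), (3, "a", 4, "bird")],
    [(1, "parrot", 2, "eats"), (2, "eats", 3, "a"), (3, "a", 4, "seed")],
    [(1, "parrot", 2, "eats"), (2, "eats", 3, "a"), (3, "a", 4, "corn")],
]


def disjunct_sets(parses):
    """Map each word to the set of disjuncts it takes across all sentences."""
    result = defaultdict(set)
    for links in parses:
        left_conn = defaultdict(list)    # connectors pointing to the left
        right_conn = defaultdict(list)   # connectors pointing to the right
        for i, lw, j, rw in links:
            right_conn[(i, lw)].append((j, rw + "+"))
            left_conn[(j, rw)].append((i, lw + "-"))
        for key in set(left_conn) | set(right_conn):
            parts = [c for _, c in sorted(left_conn[key])]
            parts += [c for _, c in sorted(right_conn[key])]
            result[key[1]].add(" & ".join(parts))
    return result


disjuncts = disjunct_sets(PARSES)
groups = defaultdict(list)
for word, djs in disjuncts.items():
    groups[frozenset(djs)].append(word)   # identical disjunct sets share a group
groups = sorted(sorted(words) for words in groups.values())

for word in ("is", "eats"):
    print(word, sorted(disjuncts[word]))
print(len(groups), "groups")
```

Here "is" carries two disjuncts (one with tuna on the left, one with parrot) while "eats" carries only one, so the two verbs remain in separate groups.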

Modified dataset

The last sentence is modified: "parrot eats a corn" ⇒ "tuna eats a corn", so that tuna and parrot occur in the same contexts and can be grouped.

In [10]:
kwargs['dataset'] = 'modified'
kwargs['cluster_range'] = 4  # just marking dir
with open(module_path + '/data/4_sentences/modified/modified_1234.ull', 'r') as f:
    lines = f.read().splitlines()
for line in lines: print(line)
tuna is a fish
1 tuna 2 is
2 is 3 a
3 a 4 fish

parrot is a bird
1 parrot 2 is
2 is 3 a
3 a 4 bird

parrot eats a seed
1 parrot 2 eats
2 eats 3 a
3 a 4 seed

tuna eats a corn
1 tuna 2 eats
2 eats 3 a
3 a 4 corn
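
The effect of the modification can be checked by hand. The sketch below compares the original and modified link sets, again assuming that grouping merges words with identical connector sets; `links_for` and `group_by_connectors` are hypothetical helpers written for this comparison only.

```python
from collections import defaultdict


def links_for(last_subject):
    """Links of the four parses; only the subject of the last sentence varies."""
    return [
        ("tuna", "is"), ("is", "a"), ("a", "fish"),
        ("parrot", "is"), ("is", "a"), ("a", "bird"),
        ("parrot", "eats"), ("eats", "a"), ("a", "seed"),
        (last_subject, "eats"), ("eats", "a"), ("a", "corn"),
    ]


def group_by_connectors(links):
    """Group words whose connector sets are identical (assumed behaviour)."""
    connectors = defaultdict(set)
    for left, right in links:
        connectors[left].add(right + "+")
        connectors[right].add(left + "-")
    groups = defaultdict(list)
    for word, feats in connectors.items():
        groups[frozenset(feats)].append(word)
    return sorted(sorted(words) for words in groups.values())


original = group_by_connectors(links_for("parrot"))
modified = group_by_connectors(links_for("tuna"))
print(len(original), "groups:", original)
print(len(modified), "groups:", modified)
```

With the modified last sentence, tuna and parrot both link to "is" and "eats", and both verbs see tuna and parrot on the left, so the subjects merge and the verbs merge, leaving four groups instead of six.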

In [11]:
%%capture
kwargs['context'] = 2        # disjunct-based word space
kwargs['grammar_rules'] = 2  # disjunct-based grammar rules
t4, re4 = iterate(**kwargs)
In [12]:
display(html_table(t4)); print(re4['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          4           100%  1.0
2          4           100%  1.0
Iterative-Clustering-4-sentences-2018-02-09/4_sentences_modified_dILEd_no-gen_4c
In [13]:
%%capture
kwargs['context'] = 1        # connector-based word space
kwargs['grammar_rules'] = 1  # connector-based grammar rules
t5, re5 = iterate(**kwargs)
In [14]:
display(html_table(t5)); print(re5['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          4           100%  1.0
2          4           100%  1.0
Iterative-Clustering-4-sentences-2018-02-09/4_sentences_modified_cILEc_no-gen_4c
In [15]:
%%capture
kwargs['context'] = 1        # connector-based word space
kwargs['grammar_rules'] = 2  # disjunct-based grammar rules
t6, re6 = iterate(**kwargs)
In [16]:
display(html_table(t6)); print(re6['project_directory'][42:-12])
Iteration  N clusters  PA    F1
1          4           100%  1.0
2          4           100%  1.0
Iterative-Clustering-4-sentences-2018-02-09/4_sentences_modified_cILEd_no-gen_4c