Identify Clusters

This procedure offers a means of provisionally identifying nodal words in a given text, according to the density of their co-occurrences within a selected unit of context. The result provides the basis for a provisional vocabulary list for use with Hamlet II - Joint Frequencies. For this process to work convincingly, it is generally advisable first to apply a suitable stoplist to the text file, so that commonly occurring words which are unlikely to be significant in determining its main content are ignored. The procedure automatically disregards numerals, equates upper and lower case initial letters, and offers optional manual lemmatization, which reduces the number of entries to be considered in searching for collocations.
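The general idea can be illustrated with a minimal Python sketch. It is not Hamlet II's own implementation: the function name nodal_candidates, the choice of the sentence as the unit of context, and the scoring (summing each word's co-occurrence counts with every other word) are all assumptions made for illustration. Lower-casing whole words is likewise a simplification of equating initial letters.

```python
import re
from collections import Counter
from itertools import combinations

def nodal_candidates(text, stoplist=(), top_n=10):
    """Rank words by how often they co-occur with other words in a unit.

    Hypothetical helper: the unit of context is assumed to be a sentence,
    and the score is a plain sum of pairwise co-occurrence counts.
    """
    stop = {w.lower() for w in stoplist}
    # Assumed unit of context: text between sentence-ending punctuation.
    units = re.split(r"[.!?]+", text)
    pair_counts = Counter()
    for unit in units:
        # Lower-casing merges case variants; the pattern matches letters
        # only, so numerals are never collected as words.
        words = {w for w in re.findall(r"[^\W\d_]+", unit.lower())
                 if w not in stop}
        # Count each unordered pair of distinct words once per unit.
        for a, b in combinations(sorted(words), 2):
            pair_counts[(a, b)] += 1
    # A word's score is the total of all pair counts it takes part in.
    score = Counter()
    for (a, b), n in pair_counts.items():
        score[a] += n
        score[b] += n
    return score.most_common(top_n)
```

Calling nodal_candidates(text, stoplist=["the", "to"]) returns (word, score) pairs, highest score first. Words with high scores co-occur densely with others and are candidates for the provisional vocabulary list.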

Words additionally disregarded during an application of this routine can also be saved separately, or optionally added to an existing stoplist, so that a few successive applications will quickly develop general stoplists for use with specific languages and bodies of text.
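How a stoplist grows over successive applications can be sketched as follows. The function merge_stoplist is hypothetical, and it treats a stoplist simply as a collection of words; the actual stoplist file format used by Hamlet II is not described here and is not assumed.

```python
def merge_stoplist(existing, disregarded):
    """Return a sorted stoplist combining old entries with newly disregarded words.

    Hypothetical helper: entries are lower-cased and blank entries dropped,
    so repeated applications never introduce duplicates.
    """
    merged = {w.strip().lower() for w in existing if w.strip()}
    merged.update(w.strip().lower() for w in disregarded if w.strip())
    return sorted(merged)
```

Each run passes the previous result as existing, so the list accumulates the words rejected in every session.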

 ID_Clusters image

  • If the text to be read is in a language other than English, use the pull-down menu to apply the correct lexicographic conventions.

  • Stoplists can be selected and edited from the corresponding menu item. Alternatively, use the full vocabulary list editor to maintain your stoplists in detail.

  • A log file is displayed periodically, providing details of the procedures followed and ending with a list of the nodal items provisionally identified. These can be edited as required, viewed by cluster analysis and/or plotted using MINISSA, and finally saved to form the basis of a provisional vocabulary list for use in Hamlet II - Joint Frequencies and other procedures.