This file is a deprecated introduction to the corpkit Python API. It still exists because it contains a lot of useful information and advanced examples that are not found elsewhere. It is deprecated because better documentation is available at ReadTheDocs.
- What's in here?
- Installation
- Quickstart
- More detailed examples
- Working with coreferences
- Building corpora
- Concordancing
- Systemic functional stuff
- Keywording
- Parallel processing
- More complex queries and plots
- Contact
- Cite
Essentially, the module contains classes, methods and functions for building and interrogating corpora, then manipulating or visualising the results.
First, there's a `Corpus()` class, which models a corpus of CoreNLP XML, lists of tokens, or plaintext files, creating subclasses for subcorpora and corpus files. To use it, simply feed it a path to a directory containing .txt files, or subfolders containing .txt files.
>>> from corpkit import Corpus
>>> unparsed = Corpus('path/to/data')
With the `Corpus()` class, the following attributes are available:

| Attribute | Purpose |
|---|---|
| `corpus.subcorpora` | List of subcorpus objects, with indexing/slicing methods |
| `corpus.features` | Corpus features (characters, clauses, words, tokens, process types, passives, etc.) |
| `corpus.postags` | Distribution of parts of speech |
| `corpus.wordclasses` | Distribution of word classes |
as well as the following methods:

| Method | Purpose |
|---|---|
| `corpus.parse()` | Create a parsed version of a plaintext corpus |
| `corpus.tokenise()` | Create a tokenised version of a plaintext corpus |
| `corpus.interrogate()` | Interrogate the corpus for lexicogrammatical features |
| `corpus.concordance()` | Concordance via lexis and/or grammar |
Once you've defined a Corpus, you can move around it very easily:
### corpus containing annual subcorpora of NYT articles
>>> corpus = Corpus('data/NYT-parsed')
>>> list(corpus.subcorpora)[:3]
### [<corpkit.corpus.Subcorpus instance: 1987>,
### <corpkit.corpus.Subcorpus instance: 1988>,
### <corpkit.corpus.Subcorpus instance: 1989>]
>>> corpus.subcorpora[0].path, corpus.subcorpora[0].datatype
### ('/Users/daniel/Work/risk/data/NYT-parsed/1987', 'parse')
>>> corpus.subcorpora.c1989.files[10:13]
### [<corpkit.corpus.File instance: NYT-1989-01-01-10-1.txt.xml>,
### <corpkit.corpus.File instance: NYT-1989-01-01-10-2.txt.xml>,
### <corpkit.corpus.File instance: NYT-1989-01-01-11-1.txt.xml>]
Most attributes, and the `.interrogate()` and `.concordance()` methods, can also be called on `Subcorpus` and `File` objects. `File` objects also have a `.read()` method.
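For example, here is a minimal sketch of reading a single file's raw contents (the file index and slice length are just illustrative):

### grab one file from a subcorpus and print the start of its contents
>>> f = corpus.subcorpora.c1989.files[0]
>>> print f.read()[:500]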
The `interrogate()` method can:

- Use Tregex, regular expressions or wordlists to search parse trees, dependencies, token lists or plain text for complex lexicogrammatical phenomena
- Search for, exclude and show word, lemma, POS tag, semantic role, governor, dependent, index (etc) of a token
- N-gramming
- Two-way UK-US spelling conversion
- Output Pandas DataFrames that can be easily edited and visualised
- Use parallel processing to search for a number of patterns, or search for the same pattern in multiple corpora
- Restrict searches to particular speakers in a corpus
- Works on collections of corpora, corpora, subcorpora, single files, or slices thereof
- Quickly save to and load from disk with `save()` and `load()`

The `concordance()` method is equivalent to `interrogate()`, but returns a DataFrame of concordance lines. Concordancing can:

- Return any combination and order of words, lemmas, indices, functions, or POS tags
- Produce editable and saveable output
- Output to LaTeX, CSV or string with `format()`
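As a hedged sketch of the saving and formatting options above (the query and save name are illustrative, and `load()` is assumed to be importable from the top-level corpkit namespace):

>>> from corpkit import *
### interrogate and concordance with the same query
>>> data = corpus.interrogate({W: r'ing$'})
>>> lines = corpus.concordance({W: r'ing$'})
### pickle the interrogation to disk, then reload it later
>>> data.save('ing_words')
>>> data = load('ing_words')
### format a few concordance lines for printing
>>> print lines.format(n=5, window=40)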
The code below demonstrates the complex kinds of queries that can be handled by the `interrogate()` and `concordance()` methods:
### import * mostly so that we can access global variables like G, P, V
### otherwise, use 'w' instead of W, 'p' instead of P, etc.
>>> from corpkit import *
### select parsed corpus
>>> corpus = Corpus('data/postcounts-parsed')
### import process type lists and closed class wordlists
>>> from corpkit.dictionaries import *
### match tokens with governor that is in relational process wordlist,
### and whose function is `nsubj(pass)` or `csubj(pass)`:
>>> criteria = {GL: processes.relational.lemmata, F: r'^.subj'}
### exclude tokens whose part-of-speech is verbal,
### or whose word is in a list of pronouns
>>> exc = {P: r'^V', W: wordlists.pronouns}
# interrogate, returning slash-delimited function/lemma
>>> data = corpus.interrogate(criteria, exclude=exc, show=[F,L])
>>> lines = corpus.concordance(criteria, exclude=exc, show=[F,L])
### show results
>>> print data, lines.format(n=10, window=40, columns=[L,M,R])
Output sample:
nsubj/thing nsubj/person nsubj/problem nsubj/way nsubj/son
01 296 168 134 69 73
02 233 147 88 70 70
03 250 160 95 80 67
04 247 205 88 93 71
05 275 193 68 75 61
0 nk nsubj/it cop/be ccomp/sad advmod/when nsubj/person aux/do neg/not advcl/look ./at prep_at/w
1 /my dobj/Fluoxetine advmod/now mark/that nsubj/spring ccomp/be advmod/here ./, ./but nsubj/I a
2 y mark/because expl/there advcl/be det/a nsubj/woman ./across det/the prep_across/hall ./from
3 num/114 ccomp/pound ./, mark/so det/any nsubj/med nsubj/I rcmod/take aux/can advcl/have de
4 nsubj/Kat ./, root/be nsubj/you dep/taper ./off ./
5 /to xcomp/explain prep_from/what det/the nsubj/mark ./on poss/my prep_on/arm ./, conj_and/ne
6 det/the amod/first ./and conj_and/third nsubj/hospital nsubj/I rcmod/be advmod/at root/have num
7 e dobj/tv mark/while det/the amod/second nsubj/hospital nsubj/I cop/be rcmod/IP prep/at pcomp/in
8 nsubj/Ben ./, mark/if nsubj/you cop/be advcl/unhap
9 h ./of prep_of/sleep advmod/when det/the nsubj/reality advcl/be ./, nsubj/everyone ccomp/need n
The `corpus.interrogate()` method returns an `Interrogation` object. These have the following attributes:

| Attribute | Contains |
|---|---|
| `interrogation.results` | Pandas DataFrame of counts in each subcorpus |
| `interrogation.totals` | Pandas Series of totals for each subcorpus/result |
| `interrogation.query` | A dict of the values used to generate the interrogation |
and methods:

| Method | Purpose |
|---|---|
| `interrogation.edit()` | Get relative frequencies, merge/remove results/subcorpora, calculate keywords, sort using linear regression, etc. |
| `interrogation.visualise()` | Visualise results via matplotlib |
| `interrogation.save()` | Save data as pickle |
| `interrogation.quickview()` | Show top results and their absolute/relative frequency |
These methods have been monkey-patched to Pandas' DataFrame and Series objects, as well, so any slice of a result can be edited or plotted easily.
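For instance (the column names here are purely illustrative, and `interrogation` stands for any result object from the tables above), a Pandas slice of the results can be plotted or edited directly:

### plot two columns of the results DataFrame without any intermediate step
>>> interrogation.results[['he', 'she']].visualise('Gendered pronouns', kind='line')
### a single column (a Series) works the same way
>>> interrogation.results['he'].visualise('He')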
- Remove, keep or merge interrogation results or subcorpora using indices, words or regular expressions (see below)
- Sort results by name or total frequency
- Use linear regression to figure out the trajectories of results, and sort by the most increasing, decreasing or static values
- Show the p-value for linear regression slopes, or exclude results above p
- Work with absolute frequency, or determine ratios/percentage of another list:
  - determine the total number of verbs, or the total number of verbs that are *be*
  - determine the percentage of verbs that are *be*
  - determine the percentage of *be* verbs that are *was*
  - determine the ratio of *was*/*were* ...
- etc.
- Plot more advanced kinds of relative frequency: for example, find all proper nouns that are subjects of clauses, and plot each word as a percentage of all instances of that word in the corpus (see below)
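To make the ratio/percentage options above concrete, here is a hedged sketch (the query is illustrative):

### count all verbs, showing lemma forms
>>> verbs = corpus.interrogate({P: r'^VB'}, show=L)
### what percentage of verbs are 'be'?
>>> be_pct = verbs.edit('%', verbs.totals, just_entries=['be'])
>>> print be_pct.results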
- Plot using Matplotlib
- Plot anything you like: words, tags, counts for grammatical features ...
- Create line charts, bar charts, pie charts, etc. with the `kind` argument
- Use `subplots=True` to produce individual charts for each result
- Customisable figure titles, axes labels, legends, image size, colormaps, etc.
- Use TeX if you have it
- Use log scales if you really want
- Use a number of chart styles, such as `ggplot`, `fivethirtyeight` or `seaborn-talk` (if you've got `seaborn` installed)
- Save images to file, as `.pdf` or `.png`
- Experimental interactive plots (hover-over text, interactive legends) using mpld3
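A hedged sketch combining a few of these plotting options (the title and labels are illustrative, and `interrogation` again stands for any result object):

### bar chart with a named style, custom y label and one subplot per result
>>> interrogation.visualise('Top results by subcorpus', kind='bar',
...    style='fivethirtyeight', y_label='Absolute frequency',
...    subplots=True, figsize=(10, 4))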
There are quite a few helper functions for making regular expressions, making new projects, and so on, with more documentation forthcoming. Also included are some lists of words and dependency roles, which can be used to match functional linguistic categories. These are explained in more detail here.
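One such helper is `as_regex()`, used again in a comment further below; this is a minimal sketch, assuming it can be imported from the top-level corpkit namespace:

### collapse a wordlist into a single regular expression
>>> from corpkit import as_regex
>>> from corpkit.dictionaries import wordlists
>>> pattern = as_regex(wordlists.pronouns, boundaries='line')
### pattern can now be used as a search value, e.g. {W: pattern}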
You can get corpkit running by downloading or cloning this repository, or via `pip`.
Hit 'Download ZIP' and unzip the file. Then `cd` into the newly created directory and install:
cd corpkit-master
# might need sudo:
python setup.py install
Clone the repo, `cd` into it and run the setup:
git clone https://github.com/interrogator/corpkit.git
cd corpkit
# might need sudo:
python setup.py install
# might need sudo:
pip install corpkit
# or, for a local install:
# pip install --user corpkit
corpkit should install all the necessary dependencies, including pandas, NLTK, matplotlib, etc., as well as some NLTK data files.
Once you've got corpkit, and a folder containing text files, you're ready to go:
### import everything
>>> from corpkit import *
### Make corpus object from path to subcorpora/text files
>>> unparsed = Corpus('data/nyt/years')
### parse it, return the new parsed corpus object
>>> corpus = unparsed.parse()
### search corpus for modal auxiliaries and plot the top results
>>> corpus.interroplot('MD')
Output:
`interroplot()` is just a demo method that does three things in order:

- uses `interrogate()` to search the corpus for a (regex- or Tregex-based) query
- uses `edit()` to calculate the relative frequencies of each result
- uses `visualise()` to show the top seven results
Here's an example of the three methods at work:
### make tregex query: head of NP in PP containing 'of' in NP headed by risk word:
>>> q = r'/NN.?/ >># (NP > (PP <<# /(?i)of/ > (NP <<# (/NN.?/ < /(?i).?\brisk.?/))))'
### search trees, exclude 'risk of rain', output lemma
>>> risk_of = corpus.interrogate({T: q}, exclude={W: '^rain$'}, show=L)
### alternative syntax which may be easier when there's only a single search criterion:
# >>> risk_of = corpus.interrogate(T, q, exclude={W: '^rain$'}, show=L)
### use edit() to turn absolute into relative frequencies
>>> to_plot = risk_of.edit('%', risk_of.totals)
### plot the results
>>> to_plot.visualise('Risk of (noun)', y_label='Percentage of all results',
... style='fivethirtyeight')
Output:
In the example above, parse trees are searched, a particular match is excluded, and lemmata are shown. These three arguments (`search`, `exclude` and `show`) are the core of the `interrogate()` and `concordance()` methods.
The `search` and `exclude` arguments need a `dict`, with the thing to be searched as keys and the search pattern as values. Here is a list of available keys for plaintext, tokenised and parsed corpora:

| Key | Gloss |
|---|---|
| `W` | Word |
| `L` | Lemma |
| `I` | Index of token in sentence |
| `N` | N-gram |
For parsed corpora, there are many other possible keys:

| Key | Gloss |
|---|---|
| `P` | Part-of-speech tag |
| `X` | Word class |
| `G` | Governor word |
| `GL` | Governor lemma form |
| `GP` | Governor POS |
| `GF` | Governor function |
| `D` | Dependent word |
| `DL` | Dependent lemma form |
| `DP` | Dependent POS |
| `DF` | Dependent function |
| `F` | Dependency function |
| `R` | Distance from 'root' |
| `T` | Tree |
| `S` | Predefined general stats |
Allowable combinations are subject to common sense: if you're searching trees, you can't also search governors or dependents, and if you're searching an unparsed corpus, you can't search for information provided by the parser. Here are some example `search`/`exclude` values:

| search/exclude | Gloss |
|---|---|
| `{W: r'^p'}` | Tokens starting with 'p' |
| `{L: r'any'}` | Any lemma (often equivalent to `r'.*'`) |
| `{G: r'ing$'}` | Tokens with governor word ending in 'ing' |
| `{F: funclist}` | Tokens whose dependency function matches a str in `funclist` |
| `{D: r'^br', GL: r'^have$'}` | Tokens with dependent starting with 'br' and 'have' as governor lemma |
| `{I: '0', F: '^nsubj$'}` | Sentence-initial tokens with the role of nsubj |
| `{T: r'NP !<<# /NN.?/'}` | NPs with non-nominal heads |
If you'd prefer, you can make a `dict` to handle dependent and governor information, instead of using keys like `GL` or `DF`. The following searches produce the same output:
>>> crit = {W: r'^friend$',
... D: {F: 'amod',
... W: 'great'}}
>>> crit = {W: r'^friend$', DF: 'amod', D: 'great'}
By default, all `search` criteria must match, but any `exclude` criterion is enough to exclude a match. This behaviour can be changed with the `searchmode` and `excludemode` arguments:
### get words that end in 'ing' OR are nominal:
>>> out = corpus.interrogate({W: 'ing$', P: r'^N'}, searchmode='any')
### get any word, but exclude words that end in 'ing' AND are nominal:
>>> out = corpus.interrogate({W: 'any'}, exclude={W: 'ing$', P: r'^N'}, excludemode='all')
The `show` argument wants a list of keys you'd like to return for each result. The order will be respected. If you only want one thing, a `str` is OK. One additional possibility is `C`, which returns the number of occurrences only.
| show | Return |
|---|---|
| `W` | `'champions'` |
| `[W]` | `'champions'` |
| `L` | `'champion'` |
| `P` | `'NNS'` |
| `X` | `'Noun'` |
| `T` | `'(np (jj prevailing) (nns champions))'` (depending on Tregex query) |
| `[P, W]` | `'NNS/champions'` |
| `[W, P]` | `'champions/NNS'` |
| `[I, L, R]` | `'2/champion/1'` |
| `[L, D, F]` | `'champion/prevailing/nsubj'` |
| `[G, GL, I]` | `'are/be/2'` |
| `[GL, GF, GP]` | `'be/root/vb'` |
| `[L, L]` | `'champion/champion'` |
| `[C]` | `24` |
Again, common sense dictates what is possible. When searching trees, only trees, words, lemmata, POS and counts can be returned. If showing trees, you can't show anything else. If you use `C`, you can't use anything else.
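As a small illustrative sketch, a count-only tree search might look like this (the query is arbitrary):

### count noun phrases per subcorpus; show=C returns only the number of matches
>>> np_counts = corpus.interrogate({T: r'NP'}, show=C)
>>> print np_counts.totals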
One major challenge in corpus linguistics is the fact that pronouns stand in for other words. Parsing provides coreference resolution, which maps pronouns to the things they denote. You can enable this kind of parsing by specifying the `dcoref` annotator at parse time:

>>> ops = 'tokenize,ssplit,pos,lemma,parse,ner,dcoref'
>>> corpus = unparsed.parse(operations=ops)
If you have done this, you can use `coref=True` while interrogating to allow coreferents to be mapped together:
>>> corpus.interrogate(query, coref=True)
So, if you wanted to find all the processes a certain entity is engaged in, you can get a more complete result with:
>>> from corpkit.dictionaries import roles
>>> corpus.interrogate({W: 'clinton', GF: roles.process}, coref=True)
This will count *support* in *Clinton supported the independence of Kosovo*, and also potentially *authorize* in *He authorized the use of force*. You can also toggle the `representative=True` and `non_representative=True` arguments if you want to distinguish between copula and non-copula coreference.
>>> corpus.interrogate({W: 'clinton', GF: roles.process}, coref=True, representative=False)
corpkit's `Corpus()` class contains `parse()` and `tokenise()`, methods for creating parsed and/or tokenised corpora. The main thing you need is a folder containing either text files, or subfolders that contain text files. Stanford CoreNLP is required to parse corpora. If you don't have it, corpkit can download and install it for you. If you're tokenising, you'll need to make sure you have NLTK's tokeniser data. You can then run:
>>> unparsed = Corpus('path/to/unparsed/files')
### to parse, you can set a path to corenlp
>>> corpus = unparsed.parse(corenlppath='Downloads/corenlp')
### to tokenise, point to nltk:
# >>> corpus = unparsed.tokenise(nltk_data_path='Downloads/nltk_data')
which creates the parsed/tokenised corpora and returns `Corpus()` objects representing them. When parsing, you can also optionally pass in a string of annotators, as per the CoreNLP documentation:
>>> ans = 'tokenize,ssplit,pos'
### you can also set memory and turn off copula head parsing,
### or multiprocess the parsing job (though you'll want a big machine)
>>> corpus = unparsed.parse(operations=ans, memory_mb=3000,
... copula_head=False, multiprocess=4)
Something novel about corpkit is that it can work with corpora containing speaker IDs (scripts, transcripts, logs, etc.), like this:
JOHN: Why did they change the signs above all the bins?
SPEAKER23: I know why. But I'm not telling.
If you use:
>>> corpus = unparsed.parse(speaker_segmentation=True)
This will:
- Detect any IDs in any file
- Create a duplicate version of the corpus with IDs removed
- Parse this 'cleaned' corpus
- Add an XML tag to each sentence with the name of the speaker
- Return the parsed corpus as a `Corpus()` object
When interrogating or concordancing, you can then pass in a keyword argument to restrict searches to one or more speakers:
>>> s = ['BRISCOE', 'LOGAN']
>>> npheads = corpus.interrogate(T, r'/NN.?/ >># NP', just_speakers=s)
This makes it possible to not only investigate individual speakers, but to form an understanding of the overall tenor/tone of the text as well: Who does most of the talking? Who is asking the questions? Who issues commands?
When your data is parsed, `Corpus` objects draw on the CoreNLP XML to keep everything seamlessly connected:
>>> corp = Corpus('data/CHT-parsed')
>>> corp.subcorpora['2013'].files[1].document.sentences[4235].parse_string
### '(ROOT (FRAG (CC And) (NP (NP (RB not) (RB just)) (NP (NP (NNP Metrione) ... '
>>> corp.subcorpora['1997'].files[0].document.sentences[3509].tokens[30].word
### 'linguistics'
Once you have a parsed `Corpus()` object, enter `corpus.features` to interrogate the corpus for some basic frequencies:
>>> corpus = Corpus('data/sessions-parsed')
>>> corpus.features
Output:
Characters Tokens Words Closed class words Open class words Clauses Sentences Unmodalised declarative Mental processes Relational processes Interrogative Passives Verbal processes Modalised declarative Open interrogative Imperative Closed interrogative
01 26873 8513 7308 4809 3704 2212 577 280 156 98 76 35 39 26 8 2 3
02 25844 7933 6920 4313 3620 2270 266 130 195 109 29 19 35 11 5 1 3
03 18376 5683 4877 3067 2616 1640 330 174 132 68 30 40 29 8 12 6 1
04 20066 6354 5366 3587 2767 1775 319 174 176 83 33 30 20 9 9 4 1
05 23461 7627 6217 4400 3227 1978 479 245 154 93 45 51 28 20 5 3 1
06 19164 6777 5200 4151 2626 1684 298 111 165 83 43 56 14 10 6 6 2
07 22349 7039 5951 4012 3027 1947 343 183 195 82 29 30 38 12 5 5 0
08 26494 8760 7124 4960 3800 2379 545 263 170 87 66 36 32 10 6 5 4
09 23073 7747 6193 4524 3223 2056 310 149 164 88 21 26 22 10 5 3 0
10 20648 6789 5608 3817 2972 1795 437 265 139 101 34 34 39 18 5 3 2
11 25366 8533 6899 4925 3608 2207 457 230 203 116 39 48 47 15 10 4 0
12 16976 5742 4624 3274 2468 1567 258 135 183 72 23 43 22 4 3 1 6
13 25807 8546 6966 4768 3778 2345 477 257 200 124 45 50 36 15 12 3 2
Features such as relational/mental/verbal processes are difficult to locate automatically, so these counts are perhaps best seen as approximations. Even so, this data can be very helpful when using `edit()` to generate relative frequencies, for example.
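For example, here is a hedged sketch using a column of the features data as the denominator in `edit()` (the query is illustrative, and it assumes the features DataFrame can be indexed by the column names shown above):

### count modal auxiliaries in each subcorpus
>>> modals = corpus.interrogate({T: r'MD'})
### express them per 100 clauses, using the features data as denominator
>>> rel = modals.edit('%', corpus.features['Clauses'])
>>> rel.visualise('Modal density', y_label='Modals per 100 clauses')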
Unlike most concordancers, which are based on plaintext corpora, corpkit can concordance grammatically, using the same kinds of `search`, `exclude` and `show` values as `interrogate()`.
>>> subcorpus = corpus.subcorpora.c2005
### C is added above to make a valid variable name from an int
### can also be accessed as corpus.subcorpora['2005']
### or corpus.subcorpora[index]
>>> query = r'/JJ.?/ > (NP <<# (/NN.?/ < /\brisk/))'
### T option for tree searching
>>> lines = subcorpus.concordance(T, query, window=50, n=10, random=True)
Output (a Pandas `DataFrame`):
0 hedge funds or high-risk stocks obviously poses a greater risk to the pension program than a portfolio of
1 contaminated water pose serious health and environmental risks
2 a cash break-even pace '' was intended to minimize financial risk to the parent company
3 Other major risks identified within days of the attack
4 One seeks out stocks ; the other monitors risks
5 men and women in Colorado Springs who were at high risk for H.I.V. infection , because of
6 by the marketing consultant Seth Godin , to taking calculated risks , in the opinion of two longtime business
7 to happen '' in premises '' where there was a high risk of fire
8 As this was match points , some of them took a slight risk at the second trick by finessing the heart
9 said that the agency 's continuing review of how Guidant treated patient risks posed by devices like the
You can also concordance via dependencies:
### match words starting with 'st' filling function of nsubj
>>> criteria = {W: r'^st', F: r'nsubj$'}
### show function, pos and lemma (in that order)
>>> lines = subcorpus.concordance(criteria, show=[F, P, L])
>>> lines.format(window=30, n=10, columns=[L,M,R])
Output:
0 ime ./:/; cc/CC/and det/DT/the nsubj/NN/stock conj:and/VBZ/be advmod/RB/hist
1 vmod/RB/even compound/NN/sleep nsubj/NNS/study ./,/, appos/NNS/evaluation cas
2 od:poss/NNS/veteran case/POS/' nsubj/NN/study ccomp/VBZ/suggest mark/IN/that
3 det/DT/a nsubj/NN/study case/IN/in nmod:poss/NN/today
4 cc/CC/but det/DT/the nsubj/NN/study root/VBD/find mark/IN/that cas
5 pound/NN/a amod/JJ/preliminary nsubj/NN/study case/IN/of nmod:of/NNS/woman c
6 case/IN/for nmod:for/WDT/which nsubj/NNS/statistics acl:relcl/VBD/be xcomp/JJ/avai
7 amod/JJR/earlier nsubj/NNS/study aux/VBD/have root/VBN/show mar
8 ay det/DT/the amod/JJR/earlier nsubj/NNS/study aux/VBD/do neg/RB/not ccomp/VB
9 /there root/VBP/be det/DT/some nsubj/NNS/strategy ./:/- dep/JJS/most case/IN/of
You can search tokenised corpora or plaintext corpora for regular expressions or lists of words to match. The two queries below will return identical results:
>>> r_query = r'^fr?iends?$'
>>> l_query = ['friend', 'friends', 'fiend', 'fiends']
>>> lines = subcorpus.concordance({W: r_query})
>>> lines = subcorpus.concordance({W: l_query})
If you really wanted, you could then go on to use the `concordance()` output as a dictionary, or extract keywords and n-grams from it, or keep or remove certain results with `edit()`. If you want to give the GUI a try, you can colour-code and create thematic categories for concordance lines as well.
Because I mostly use systemic functional grammar, there is also a simple tool for distinguishing between process types (relational, mental, verbal) when interrogating a corpus. If you add words to the lists in `dictionaries/process_types.py`, corpkit will get their inflections automatically.
>>> from corpkit.dictionaries import processes
### match nsubj with verbal process as governor
>>> crit = {F: '^nsubj$', G: processes.verbal}
### return lemma of the nsubj
>>> sayers = corpus.interrogate(crit, show=L)
### have a look at the top results
>>> sayers.quickview(n=20)
Output:
0: he (n=24530)
1: she (n=5558)
2: they (n=5510)
3: official (n=4348)
4: it (n=3752)
5: who (n=2940)
6: that (n=2665)
7: i (n=2062)
8: expert (n=2057)
9: analyst (n=1369)
10: we (n=1214)
11: report (n=1103)
12: company (n=1070)
13: which (n=1043)
14: you (n=987)
15: researcher (n=987)
16: study (n=901)
17: critic (n=826)
18: person (n=802)
19: agency (n=798)
20: doctor (n=770)
First, let's try removing the pronouns using `edit()`. The quickest way is to use the editable wordlists stored in `dictionaries/wordlists`:
>>> from corpkit.dictionaries import wordlists
>>> prps = wordlists.pronouns
# alternative approaches:
# >>> prps = [0, 1, 2, 4, 5, 6, 7, 10, 13, 14, 24]
# >>> prps = ['he', 'she', 'you']
# >>> prps = as_regex(wl.pronouns, boundaries='line')
# or, by re-interrogating:
# >>> sayers = corpus.interrogate(crit, show=L, exclude={W: wordlists.pronouns})
### give edit() indices, words, wordlists or regexes to keep, remove or merge
>>> sayers_no_prp = sayers.edit(skip_entries=prps, skip_subcorpora=[1963])
>>> sayers_no_prp.quickview(n=10)
Output:
0: official (n=4342)
1: expert (n=2055)
2: analyst (n=1369)
3: report (n=1098)
4: company (n=1066)
5: researcher (n=987)
6: study (n=900)
7: critic (n=825)
8: person (n=801)
9: agency (n=796)
Great. Now, let's sort the entries by trajectory, and then plot:
### sort with edit()
### use scipy's linregress to sort by 'increase', 'decrease', 'static', 'turbulent' or P
### other sort_by options: 'name', 'total', 'infreq'
>>> sayers_no_prp = sayers_no_prp.edit('%', sayers.totals, sort_by='increase')
### make an area chart with custom y label
>>> sayers_no_prp.visualise('Sayers, increasing', kind='area',
... y_label='Percentage of all sayers')
Output:
We can also merge subcorpora. Let's look for changes in gendered pronouns:
>>> merges = {'1960s': r'^196',
... '1980s': r'^198',
... '1990s': r'^199',
... '2000s': r'^200',
... '2010s': r'^201'}
>>> sayers = sayers.edit(merge_subcorpora=merges)
### now, get relative frequencies for he and she
### SELF calculates percentage after merging/removing etc has been performed,
### so that he and she will sum to 100%. Pass in `sayers.totals` to calculate
### he/she as percentage of all sayers
>>> genders = sayers.edit('%', SELF, just_entries=['he','she'])
### and plot it as a series of pie charts, showing totals on the slices:
>>> genders.visualise('Pronominal sayers in the NYT', kind='pie',
... subplots=True, figsize=(15,2.75), show_totals='plot')
Output:
Woohoo, a decreasing gender divide!
As I see it, there are two main problems with keywording, as typically performed in corpus linguistics. First is the reliance on 'balanced'/'general' reference corpora, which are obviously a fiction. Second is the idea of stopwords. Essentially, when most people calculate keywords, they use stopword lists to automatically filter out words that they think will not be of interest to them. These words are generally closed class words, like determiners, prepositions, or pronouns. This is not a good way to go about things: the relative frequencies of I, you and one can tell us a lot about the kinds of language in a corpus. More seriously, stopwords mean adding subjective judgements about what is interesting language into a process that is useful precisely because it is not subjective or biased.
So, what to do? Well, first, don't use 'general reference corpora' unless you really really have to. With corpkit, you can use your entire corpus as the reference corpus, and look for keywords in subcorpora. Second, rather than using lists of stopwords, simply do not send all words in the corpus to the keyworder for calculation. Instead, try looking for key predicators (rightmost verbs in the VP), or key participants (heads of arguments of these VPs):
### just heads of participants' lemma form (no pronouns, though!)
>>> part = r'/(NN|JJ).?/ >># (/(NP|ADJP)/ $ VP | > VP)'
>>> p = corpus.interrogate(T, part, show=L)
When using `edit()` to calculate keywords, there are a few default parameters that can be easily changed:

| Keyword argument | Function | Default setting | Type |
|---|---|---|---|
| `threshold` | Remove words occurring fewer than n times in the reference corpus | `False` | `'high'`/`'medium'`/`'low'`, `True`/`False`, or int |
| `calc_all` | Calculate keyness for words in both reference and target corpus, rather than just the target corpus | `True` | `True`/`False` |
| `selfdrop` | Attempt to remove target data from reference data when calculating keyness | `True` | `True`/`False` |
Let's have a look at how these options change the output:
### SELF as reference corpus uses p.results
>>> options = {'selfdrop': False,
... 'calc_all': False,
... 'threshold': False}
>>> for k, v in options.items():
... key = p.edit('keywords', SELF, k=v)
... print key.results.ix['2011'].order(ascending=False)
Output:
| #1: default | | #2: no selfdrop | | #3: no calc_all | | #4: no threshold | |
|---|---|---|---|---|---|---|---|
| risk | 1941.47 | risk | 1909.79 | risk | 1941.47 | bank | 668.19 |
| bank | 1365.70 | bank | 1247.51 | bank | 1365.70 | crisis | 242.05 |
| crisis | 431.36 | crisis | 388.01 | crisis | 431.36 | obama | 172.41 |
| investor | 410.06 | investor | 387.08 | investor | 410.06 | demiraj | 161.90 |
| rule | 316.77 | rule | 293.33 | rule | 316.77 | regulator | 144.91 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| clinton | -37.80 | tactic | -35.09 | hussein | -25.42 | clinton | -87.33 |
| vioxx | -38.00 | vioxx | -35.29 | clinton | -37.80 | today | -89.49 |
| greenspan | -54.35 | greenspan | -51.38 | vioxx | -38.00 | risky | -125.76 |
| bush | -153.06 | bush | -143.02 | bush | -153.06 | bush | -253.95 |
| yesterday | -162.30 | yesterday | -151.71 | yesterday | -162.30 | yesterday | -268.29 |
As you can see, slight variations on keywording give different impressions of the same corpus!
A key strength of corpkit's approach to keywording is that you can generate new keyword lists without re-interrogating the corpus. We can use some Pandas syntax to do this more quickly.
>>> yrs = ['2011', '2012', '2013', '2014']
>>> keys = p.results.ix[yrs].sum().edit('keywords', p.results.drop(yrs),
... threshold=False)
>>> print keys.results
Output:
bank 1795.24
obama 722.36
romney 560.67
jpmorgan 527.57
rule 413.94
dimon 389.86
draghi 349.80
regulator 317.82
italy 282.00
crisis 243.43
putin 209.51
greece 208.80
snowden 208.35
mf 192.78
adoboli 161.30
... or track the keyness of a set of words over time:
>>> twords = ['terror', 'terrorism', 'terrorist']
>>> terr = p.edit(K, SELF, merge_entries={'terror': twords})
>>> print terr.results.terror
Output:
1963 -2.51
1987 -3.67
1988 -16.09
1989 -6.24
1990 -16.24
... ...
Name: terror, dtype: float64
Naturally, we can use `visualise()` for our keywords too:

>>> terr.results.terror.visualise('Terror* as Participant in the \emph{NYT}',
...    kind='area', stacked=False, y_label='L/L Keyness')
>>> politicians = ['bush', 'obama', 'gore', 'clinton', 'mccain',
... 'romney', 'dole', 'reagan', 'gorbachev']
>>> k.results[politicians].visualise('Keyness of politicians in the \emph{NYT}',
... num_to_plot='all', y_label='L/L Keyness', kind='area', legend_pos='center left')
If you still want to use a standard reference corpus, you can do that (and a dictionary version of the BNC is included). For the reference corpus, `edit()` recognises `dict`s, `DataFrame`s, `Series`, files containing `dict`s, or paths to plain text files or trees.
### arbitrary list of common/boring words
>>> from corpkit.dictionaries import stopwords
>>> print p.results.ix['2013'].edit(K, 'bnc.p', skip_entries=stopwords).results
>>> print p.results.ix['2013'].edit(K, 'bnc.p', calc_all=False).results
Output (not so useful):
#1 #2
bank 5568.25 bank 5568.25
person 5423.24 person 5423.24
company 3839.14 company 3839.14
way 3537.16 way 3537.16
state 2873.94 state 2873.94
... ...
three -691.25 ten -199.36
people -829.97 bit -205.97
going -877.83 sort -254.71
erm -2429.29 thought -255.72
yeah -3179.90 will -679.06
`interrogate()` can also do parallel processing. You can generally improve the speed of an interrogation by setting the `multiprocess` argument:
### set num of parallel processes manually
>>> data = corpus.interrogate({T: r'/NN.?/ >># NP'}, multiprocess=3)
### set num of parallel processes automatically
>>> data = corpus.interrogate({T: r'/NN.?/ >># NP'}, multiprocess=True)
Multiprocessing is particularly useful, however, when you are interested in multiple corpora, speaker IDs, or search queries. The sections below explain how.
To parallel-process multiple corpora, first wrap them up as a `Corpora()` object. To do this, you can pass in:

- a list of paths
- a list of `Corpus()` objects
- a single path string that contains corpora
>>> from corpkit.corpus import Corpora
>>> corpora = Corpora('./data') # path containing corpora
>>> corpora
### <corpkit.corpus.Corpora instance: 6 items>
### interrogate by parallel processing, 4 at a time
>>> output = corpora.interrogate(T, r'/NN.?/ < /(?i)^h/', show=L, multiprocess=4)
The output of a multiprocessed interrogation will generally be a `dict` with corpus/speaker/query names as keys. The main exception to this is if you use `show=C`, which will concatenate results from each query into a single `Interrogation` object, using corpus/speaker/query names as column names.
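For instance, a minimal sketch of working with that `dict`, building on the `output` object above:

### each value is an ordinary Interrogation object
>>> for name, interro in output.items():
...     print name, interro.totals.sum()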
Passing in a list of speaker names will also trigger multiprocessing:
>>> from corpkit.dictionaries import wordlists
>>> spkrs = ['MEYER', 'JAY']
>>> each_speaker = corpus.interrogate(W, wordlists.closedclass, just_speakers=spkrs)
There is also `just_speakers='each'`, which will be automatically expanded to include every speaker name found in the corpus.
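A hedged sketch of that option (the query is illustrative):

### one search per speaker; the result is a dict keyed by speaker name
>>> per_speaker = corpus.interrogate(W, wordlists.closedclass, just_speakers='each')
>>> per_speaker.keys()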
You can also run a number of queries over the same corpus in parallel. There are two ways to do this.
### method one
>>> query = {'Noun phrases': r'NP', 'Verb phrases': r'VP'}
>>> phrases = corpus.interrogate(T, query, show=C)
### method two
>>> query = {'-ing words': {W: r'ing$'}, '-ed verbs': {P: r'^V', W: r'ed$'}}
>>> patterns = corpus.interrogate(query, show=L)
Let's try multiprocessing with multiple queries, showing count (i.e. returning a single results DataFrame). We can look at different risk processes (e.g. risk, take risk, run risk, pose risk, put at risk) using constituency parses:
>>> q = {'risk': r'VP <<# (/VB.?/ < /(?i).?\brisk.?\b/)',
... 'take risk': r'VP <<# (/VB.?/ < /(?i)\b(take|takes|taking|took|taken)+\b/) < (NP <<# /(?i).?\brisk.?\b/)',
... 'run risk': r'VP <<# (/VB.?/ < /(?i)\b(run|runs|running|ran)+\b/) < (NP <<# /(?i).?\brisk.?\b/)',
... 'put at risk': r'VP <<# /(?i)(put|puts|putting)\b/ << (PP <<# /(?i)at/ < (NP <<# /(?i).?\brisk.?/))',
... 'pose risk': r'VP <<# (/VB.?/ < /(?i)\b(pose|poses|posed|posing)+\b/) < (NP <<# /(?i).?\brisk.?\b/)'}
# show=C will collapse results from each search into single dataframe
>>> processes = corpus.interrogate(T, q, show=C)
>>> proc_rel = processes.edit('%', processes.totals)
>>> proc_rel.visualise('Risk processes')
Output:
Next, let's find out what kinds of noun lemmas are subjects of any of these risk processes:
### a query to find heads of nps that are subjects of risk processes
>>> query = r'/^NN(S|)$/ !< /(?i).?\brisk.?/ >># (@NP $ (VP <+(VP) (VP ( <<# (/VB.?/ < /(?i).?\brisk.?/) ' \
... r'| <<# (/VB.?/ < /(?i)\b(take|taking|takes|taken|took|run|running|runs|ran|put|putting|puts)/) < ' \
... r'(NP <<# (/NN.?/ < /(?i).?\brisk.?/))))))'
>>> noun_riskers = corpus.interrogate(T, query, show=L)
>>> noun_riskers.quickview(10)
Output:
0: person (n=195)
1: company (n=139)
2: bank (n=80)
3: investor (n=66)
4: government (n=63)
5: man (n=51)
6: leader (n=48)
7: woman (n=43)
8: official (n=40)
9: player (n=39)
We can use `edit()` to make some thematic categories:
### get everyday people
>>> p = ['person', 'man', 'woman', 'child', 'consumer', 'baby', 'student', 'patient']
### get business, gov, institutions
>>> i = ['company', 'bank', 'investor', 'government', 'leader', 'president', 'officer',
... 'politician', 'institution', 'agency', 'candidate', 'firm']
>>> merges = {'Everyday people': p, 'Institutions': i}
>>> them_cat = noun_riskers.edit('%', noun_riskers.totals,
... merge_entries=merges,
... sort_by='total',
... skip_subcorpora=1963,
... just_entries=merges.keys())
### plot result
>>> them_cat.visualise('Types of riskers', y_label='Percentage of all riskers')
Output:
Let's also find out what percentage of the time some nouns appear as riskers:
### find any head of an np not containing risk
>>> query = r'/NN.?/ >># NP !< /(?i).?\brisk.?/'
>>> noun_lemmata = corpus.interrogate(T, query, show=L)
### get some key terms
>>> people = ['man', 'woman', 'child', 'baby', 'politician',
... 'senator', 'obama', 'clinton', 'bush']
>>> selected = noun_riskers.edit('%', noun_lemmata.results,
... just_entries=people, just_totals=True, threshold=0, sort_by='total')
### make a bar chart:
>>> selected.visualise('Risk and power', num_to_plot='all', kind='bar',
... x_label='Word', y_label='Risker percentage', fontsize=15)
Output:
With a bit of creativity, you can do some pretty awesome data-viz, thanks to Pandas and Matplotlib. The following plots require only one interrogation:
>>> modals = corpus.interrogate(T, 'MD < __', show=L)
### simple stuff: make relative frequencies for individual or total results
>>> rel_modals = modals.edit('%', modals.totals)
### trickier: make an 'others' result from low-total entries
>>> low_indices = range(7, modals.results.shape[1])
>>> each_md = modals.edit('%', modals.totals, merge_entries={'other': low_indices},
... sort_by='total', just_totals=True, keep_top=7)
### complex stuff: merge results
>>> entries_to_merge = [r'(^w|\'ll|\'d)', r'^c', r'^m', r'^sh']
>>> modals = modals.edit(merge_entries=entries_to_merge)
### complex stuff: merge subcorpora
>>> merges = {'1960s': r'^196',
... '1980s': r'^198',
... '1990s': r'^199',
... '2000s': r'^200',
... '2010s': r'^201'}
>>> modals = modals.edit(merge_subcorpora=merges)
### make relative, sort, remove what we don't want
>>> modals = modals.edit('%', modals.totals, keep_stats=False,
... just_subcorpora=merges.keys(), sort_by='total', keep_top=4)
### show results
>>> print rel_modals.results, each_md.results, modals.results
Output:
would will can could ... need shall dare shalt
1963 22.326833 23.537323 17.955615 6.590451 ... 0.000000 0.537996 0.000000 0
1987 24.750614 18.505132 15.512505 11.117537 ... 0.072286 0.260228 0.014457 0
1988 23.138986 19.257117 16.182067 11.219364 ... 0.091338 0.060892 0.000000 0
... ... ... ... ... ... ... ... ... ...
2012 23.097345 16.283186 15.132743 15.353982 ... 0.029499 0.029499 0.000000 0
2013 22.136269 17.286522 16.349301 15.620351 ... 0.029753 0.029753 0.000000 0
2014 21.618357 17.101449 16.908213 14.347826 ... 0.024155 0.000000 0.000000 0
[29 rows x 17 columns]
would 23.235853
will 17.484034
can 15.844070
could 13.243449
may 9.581255
should 7.292294
other 7.290155
Name: Combined total, dtype: float64
would/will/'ll... can/could/ca may/might/must should/shall/shalt
1960s 47.276395 25.016812 19.569603 7.800941
1980s 44.756285 28.050776 19.224476 7.566817
1990s 44.481957 29.142571 19.140310 6.892708
2000s 42.386571 30.710739 19.182867 7.485681
2010s 42.581666 32.045745 17.777845 7.397044
Now, some intense plotting:
### exploded pie chart
>>> each_md.visualise('Pie chart of common modals in the NYT', explode=['other'],
... num_to_plot='all', kind='pie', colours='Accent', figsize=(11,11))
### bar chart, transposing and reversing the data
>>> modals.results.iloc[::-1].T.iloc[::-1].visualise('Modals use by decade', kind='barh',
... x_label='Percentage of all modals', y_label='Modal group')
### stacked area chart
>>> rel_modals.results.drop('1963').visualise('An ocean of modals', kind='area',
...    stacked=True, colours='summer', figsize=(8, 10), num_to_plot='all',
... legend_pos='lower right', y_label='Percentage of all modals')
Output:
Twitter: @interro_gator
McDonald, D. (2015). corpkit: a toolkit for corpus linguistics. Retrieved from https://www.github.com/interrogator/corpkit. DOI: http://doi.org/10.5281/zenodo.28361