The purpose of the package polmineR is to offer a toolset for the interactive analysis of corpora using R. Apart from performance and usability, key considerations for developing the package are:
To provide a library with standard tasks such as concordances/keyword-in-context, cooccurrence statistics, feature extraction, or preparation of term-document-matrics.
To keep the original text accessible and to offer a seamless integration of qualitative and quantitative steps in corpus analysis that facilitates validation.
To make the creation and analysis of subcorpora (‘partitions’) as easy as possible. A particular strength of the package is to support research on synchronic and diachronic change of language use.
To offer performance for users with a standard infrastructure. The package is developed based on the idea of a three-tier software design. Corpus data are managed and indexed using the Corpus Workbench (CWB), which serves as a backend.
To support sharing consolidated and documented data, following the ideas of reproducible research.
The polmineR package supplements R packages that are already widely used for text mining. The CRAN NLP task view is a good place to learn about relevant packages. The polmineR package is intended to be an interface between the Corpus Workbench (CWB), an efficient system for storing and querying large corpora, and existing packages for text mining and text statistics.
Apart from the speed of text processing, the Corpus Query Processor (CQP) and the CQP syntax provide a powerful and widely used syntax to query corpora. This is not an unique idea. Using a combination of R and the CWB implies a software architecture you will also find in the TXM project, or with CQPweb. The polmineR package offers a library with the grammer of corpus analysis below a graphical user interface (GUI). It is a toolset to perform simple tasts efficiently as well as to implement complex workflows.
Advanced users will benefit from acquiring a good understanding of the Corpus Workbench. The Corpus Encoding Tutorial is an authoritative text for that. The vignette of the rcqp package, albeit archived, includes an excellent explanation of the CWB data-model. The inferface used now to use CWB/CQP functionality is the RcppCWB package.
A basic issue to understand is the difference between s- and p-attributes. The CWB distinguishes structural attributes (s-attributes) that will contain the metainformation that can be used to generate subcorpora/partitions, and positional attributes (p-attributes). Typically, the p-attributes will be ‘word’, ‘pos’ (for part-of-speech) and ‘lemma’ (for the lemmatized word form).
The polmineR package is loaded just like any other package.
Upon loading the package, the package version is reported. As polmineR is under active development, please check whether a more recent version is available at CRAN. Development versions are available at GitHub.
In addition, you will see an information on the session registry, which needs some further explanation.
Indexed corpus data may be stored at different locations on your machine. CWB users will usually have a data directory with subdirectories for every single corpus. But corpus data may also reside within R packages, or anywhere else.
It is not necessary to move indexed corpora to one single location. The only recommendation is to have them on a device that can be accessed sufficiently fast. Corpora are not fully loaded into memory, but information is retrieved from disk on a ‘on demand’-basis. Thus, storing corpus data on a SSD may be faster than a hard drive.
The CWB will look up information on the corpora in a directory called
registry that is defined by the environment variable
CORPUS_REGISTRY. Starting with version v0.7.9, the polmineR
package creates a temporary registry directory in the temporary session
directory. To get the path of the session registry directory, call
registry(). The output is the session registry you have
seen when loading polmineR.
The session registry directory combines the registry files describing
the corpora polmineR knows about. Upon loading
polmineR, the files in the registry directory defined by the
environment variable CORPUS_REGISTRY are copied to the session registry
directory. To see whether the environment variable CORPUS_REGISTRY is
set, use the
See the annex for an explanation how to set the CORPUS_REGISTRY environment variable for the current R session, or permanently.
If you want to use a corpus wrapped into a R data package, call
use() with the name of the R package. The function will add
the registry files describing the corpora in the package to the session
registry directory introduced before.
In the followings examples, the REUTERS corpus included in the polmineR package will be used for demonstration purposes. It is a sample of Reuters articles that is included in the tm package (cp. http://www.daviddlewis.com/resources/testcollections/reuters21578/), and may already be known to some R users.
corpus()-method can be used to check which corpora
are described in the registry and accessible. The REUTERS corpus in our
case (note that the names of CWB corpora are always written upper case).
In addition to the English REUTERS corpus, a small subset of the
GermaParl corpus (“GERMAPARLMINI”) is included in the polmineR
## corpus encoding type template size ## 1 MIGPARL latin1 plpr TRUE 38692237 ## 2 GERMAPARL_PARLACLARIN_III utf8 plpr FALSE 271077449 ## 3 GERMAPARL latin1 plpr TRUE 101013708 ## 4 EUROPARL-IT latin1 <NA> FALSE 38966669 ## 5 EUROPARL-NL latin1 <NA> FALSE 39535069 ## 6 EUROPARL-DE latin1 <NA> FALSE 37461789 ## 7 UNGA utf8 <NA> FALSE 43081472 ## 8 RAMOMEDIA_FAZ latin1 <NA> TRUE 102363840 ## 9 GERMAPARLMINI latin1 plpr TRUE 222201 ## 10 REUTERS latin1 <NA> TRUE 4050 ## 11 RAMOMEDIA_SZ utf8 <NA> FALSE 226222025 ## 12 EUROPARL-EN latin1 <NA> FALSE 39431862 ## 13 EUROPARL-ES latin1 <NA> FALSE 40461850 ## 14 EUROPARL-FR latin1 <NA> FALSE 43988004
Many methods in the polmineR package use default settings that are set in the general options settings. Following a convention, settings relevant for the polmineR package simplystart with ‘polmineR.’ Inspect the settings as follows:
Several methods (such as kwic, or cooccurrences) will use these
settings, if no explicit other value is provided. You can see this in
the usage section of help pages (
?kwic, for instance). To
change settings, this is how.
options("polmineR.left" = 5) options("polmineR.right" = 5) options("polmineR.mc" = FALSE)
Core analytical tasks are implemented as methods (S4 class system), i.e. the bevaviour of the methods changes depending on the object that is supplied. Almost all methods can be applied to corpora (indicated by a length-one character vector) as well as partitions (subcorpora). As a quick start, methods applied to corpora are explained first.
The kwic method applied to the name of a corpus will return a KWIC object. Output will be shown in the viewer pane of RStudio. Technically, a htmlwidget is prepared which offers some convenient functionality.
<- kwic("REUTERS", "oil")k
You can include metadata from the corpus into the kwic display using the ‘s_attributes’ argument. Let us start with one s-attribute.
<- kwic("REUTERS", "oil", s_attributes = "places")k
But you can display any number of s-attributes.
<- kwic("REUTERS", "oil", s_attributes = c("id", "places"))k
You can also use the CQP query syntax for formulating queries. That way, you can find multi-word expressions, or match in a manner you may know from using regular expressions.
<- kwic("REUTERS", '"oil" "price.*"')k
Explaining the CQP syntax goes beyon this vignette. Consult the CQP tutorial to learn more about the CQP syntax.
You can count one or several hits in a corpus.
<- count("REUTERS", "Kuwait") cnt <- count("REUTERS", c("Kuwait", "USA", "Bahrain")) cnt <- count("REUTERS", c('"United" "States"', '"Saudi" "Arabia.*"'), cqp = TRUE)cnt
dispersion()-method to get dispersions of counts
accross one (or two) dimensions.
<- dispersion("REUTERS", query = "oil", s_attribute = "id", progress = FALSE)oil
<- dispersion( saudi_arabia "REUTERS", query = '"Saudi" "Arabia.*"', s_attribute = "id", cqp = TRUE, progress = FALSE )
Note that it is a data.table that is returned. You can proceed to a visualisation easily.
barplot(height = saudi_arabia[["count"]], names.arg = saudi_arabia[["id"]], las = 2)
To analyse the neighborhood of a token, or the match for a CQP query,
<- cooccurrences("REUTERS", query = "oil") oil <- cooccurrences("REUTERS", query = '"Saudi" "Arabia.*"', left = 10, right = 10) sa <- subset(oil, rank_ll <= 5)top5
In an interactive session, simply type
top5 in the
terminal, and the output will be shown in the data viewer. To inspect
the output in the viewer pane, you can coerce the object to a
htmlwidget. This is also a good way how to include the table in a
For further operations, get the the table with the statistical
results by applying the
## word word_id count_partition count_coi count_ref exp_coi exp_ref ## 1 prices 12 47 27 20 9.229607 37.770393 ## 2 crude 14 20 13 7 3.927492 16.072508 ## 3 industry 78 10 8 2 1.963746 8.036254 ## 4 recent 356 6 5 1 1.178248 4.821752 ## 5 on 169 17 8 9 3.338369 13.661631 ## ll rank_ll ## 1 33.045786 1 ## 2 19.616220 2 ## 3 16.968527 3 ## 4 11.331186 4 ## 5 6.505618 5
Working with partitions (i.e. subcorpora) based on s-attributes is an important feature of the ‘polmineR’ package. So if we want to work with the articles in the REUTERS corpus related to Kuaweit in 2006:
<- partition("REUTERS", places = "kuwait", regex = TRUE)kuwait
To get some basic information about the partition that has been set up, the ‘show’-method can be used. It is also called when you simply type the name of the partition object.
## ** partition object **
## corpus: REUTERS
## s-attributes: places = kuwait
## cpos: 3 pairs of corpus positions ## size: 660 tokens ## count: not available
To evaluate s-attributes, regular expressions can be used.
<- partition("REUTERS", places = "saudi-arabia", regex = TRUE) saudi_arabia s_attributes(saudi_arabia, "id")
##  "242" "248" "273" "349" "352"
If you work with a flat XML structure, the order of the provided s-attributes may be relevant for speeding up the set up of the partition. For a nested XML, it is important that with the order, you move from ancestors to childs. For further information, see the documentation of the partition-function.
The cooccurrences-method can be applied to partition-objects.
<- partition("REUTERS", places = "saudi-arabia", regex = TRUE) saudi_arabia <- cooccurrences(saudi_arabia, "oil", p_attribute = "word", left = 10, right = 10)oil
Note that is is possible to provide a query that uses the full CQP
syntax. The statistical analysis of collocations to the query can be
accessed as the slot “stat” of the context object. Alternatively, you
can get the table with the statistics using
<- as.data.frame(oil) df 1:5, c("word", "ll", "rank_ll")]df[
## word ll rank_ll ## 1 prices 5.383314 1 ## 2 by 5.081723 2 ## 3 Gulf 4.944806 3 ## 4 crude 2.968707 4 ## 5 last 2.756452 5
To understand the occurance of a phenomenon, the distribution of query results across one or two dimensions will often be interesing. This is done via the ‘distribution’ function. The query may use the CQP syntax.
<- dispersion(saudi_arabia, query = 'oil', s_attribute = "id", progress = FALSE) q1 <- dispersion(saudi_arabia, query = c("oil", "barrel"), s_attribute = "id", progress = FALSE)q2
To identify the specific vocabulary of a corpus of interest, a statistical test based (chi square, or log likelihood) can be performed.
<- partition("REUTERS", places = "saudi-arabia", regex = TRUE) saudi_arabia <- enrich(saudi_arabia, p_attribute = "word") saudi_arabia <- features(saudi_arabia, "REUTERS", included = TRUE) saudi_arabia_features <- subset(saudi_arabia_features, rank_chisquare <= 10.83 & count_coi >= 5) saudi_arabia_features_min saudi_arabia_features_min
To extract the statistical information, you can also use the
<- as.data.frame(saudi_arabia_features_min) df <- df[,c("word", "count_coi", "count_ref", "chisquare")]df_min
For many applications, term-document matrices are the point of departure. The tm class TermDocumentMatrix serves as an input to several R packages implementing advanced text mining techniques. Obtaining this input from a corpus imported to the CWB will usually involve setting up a partitionBundle and then applying a method to get the matrix.
<- corpus("REUTERS") %>% partition_bundle(s_attribute = "id", progress = FALSE) articles <- count(articles, p_attribute = "word") articles_count <- as.TermDocumentMatrix(articles_count, col = "count", verbose = FALSE) tdm class(tdm) # to see what it is
##  "TermDocumentMatrix" "simple_triplet_matrix"
## <<TermDocumentMatrix (terms: 1192, documents: 20)>> ## Non-/sparse entries: 2409/21431 ## Sparsity : 90% ## Maximal term length: 15 ## Weighting : term frequency (tf)
<- as.matrix(tdm) # turn it into an ordinary matrix m c("oil", "barrel"),]m[
## Docs ## Terms 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 ## oil 5 12 2 1 0 6 4 2 5 6 5 4 4 4 3 4 5 ## barrel 2 0 1 1 0 3 0 0 0 2 2 0 1 0 0 1 1 ## Docs ## Terms 543 704 708 ## oil 2 3 1 ## barrel 1 0 0
A key consideration of the polmineR package is to offer tools for
combining quantitative and qualitative approaches to text analysis. Use
html()-method, or the
return to the full text. In this example, we define a maximum height for
the output, which is useful when including full text output in a
<- partition("REUTERS", id = "248") P <- html(P, height = "250px") H H
The package includes many features that go beyond this vignette. It is a key aim in the project to develop respective documentation in the vignette and the man pages for the individual functions further. Feedback is very welcome!
The environment variable “CORPUS_REGISTRY” can be set as follows in R:
Sys.setenv(CORPUS_REGISTRY = "C:/PATH/TO/YOUR/REGISTRY")
To set the environment variable CORPUS_REGISTRY permanently, see the
instructions R offer how to find the file ‘.Renviron’ or
‘.Renviron.site’ when calling the help for the startup