PolMine/polmineR: Caterpillar Mambo

New Features

The corpus class has been reshaped to become the default point of departure for most workflows. All core methods are now available for the corpus class and have been newly implemented where necessary, e.g. the show()- and size()-methods.

The constructor method for a corpus object, the corpus() method, now checks whether the character vector with the corpus ID refers to an available corpus and whether all letters are upper case, and issues informative warnings and error messages.

The s_attributes()-method for corpus objects has been reworked: it decodes binary files directly, without relying on the corpus library functions, which is significantly faster.

The Corpus reference class is now obsolete after the introduction of the S4 corpus class. To maintain functionality not covered otherwise, new generics get_info and show_info have been introduced and defined for the corpus class.

Methods available for the subcorpus class have been expanded so that this class can supersede the partition class. The newly available methods are cpos(), count(), p_attributes(), s_attributes(), get_token_stream(), and size(). Technically, there is a virtual slice class from which subcorpus inherits (methods are called via callNextMethod()).

A new subset()-method for the corpus and subcorpus classes has been introduced to generate subcorpora (i.e. subcorpus objects). It outperforms the partition() method. The subset()-method for corpus and subcorpus objects will be the default way to work with non-standard evaluation in a manner that feels "R-ish" (#40).

The zoom()-method, which had been introduced experimentally, has been dropped again in favor of the subset()-method to get subcorpus objects from corpus and subcorpus objects.

A set of experimental methods for an initial check of the feasibility of a non-standard evaluation approach to generating subcorpora has been dropped (methods $, ==, !=, zoom for the corpus class).
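The corpus()/subset() workflow described above can be sketched as follows. This is a minimal illustration, not taken from the release notes: it assumes that the GERMAPARLMINI sample corpus shipped with polmineR is available and that it carries the s-attributes date and speaker with the values used here.

```r
library(polmineR)
use("polmineR")  # make the sample corpora shipped with the package available

# Instantiate a corpus object; the ID is upper case and must refer to an
# available corpus (assumption: GERMAPARLMINI is installed with the package)
gparl <- corpus("GERMAPARLMINI")
size(gparl)  # number of tokens in the corpus

# Non-standard evaluation: the s-attribute 'date' is used like a column name
day <- subset(gparl, date == "2009-11-10")
size(day)

# subset() is also defined for subcorpus objects, so calls can be chained
merkel <- subset(day, speaker == "Angela Dr. Merkel")
count(merkel, query = "Integration")
```

The result of subset() is a subcorpus object, so methods such as size() and count() apply to it in the same way as to the corpus it was derived from.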
To facilitate the transition from the partition class (inheriting from the textstat class) to the subcorpus class (inheriting from the textstat class), there is a new coerce()-method to turn a partition object into a subcorpus object.

A new remote_corpus-class is the basis for accessing remote corpora. A remote_subcorpus can be derived from a remote_corpus. Methods available for remote corpora and subcorpora remain limited at this stage.

Consolidation of the class system: for all the S4 classes in the package, multiple contains have been checked and removed.

The subcorpus_bundle class now inherits from partition_bundle. This is not intended to be a long-term solution, but it facilitates the implementation of new workflows based on the subcorpus class rather than the partition class.

Calling the polmineR shiny app via polmineR did not have safeguards if the suggested packages shiny and shinythemes were not installed. Now there is a conditional installation of the packages required for running the shiny app.

The somewhat odd class CorpusOrSubcorpus has been removed.

The ngrams-method now applies to corpus and subcorpus objects.

The pipe operator of the magrittr package is now imported, and magrittr has moved from a suggested package to a required package.

The label()-method, present for a while, is now superseded by an edit()-method. It will call a shiny gadget using either DataTables or Handsontable.

The former Labels reference class has been turned into an S4 class, because the desired reference logic can also be achieved with a data.table in a slot of the labels class.

The table-slot of the kwic class has been renamed as stat slot (a data.table), so that the kwic class can now inherit from the textstat class.

The enrich()-method for objects of class kwic now includes a new argument extra that adds extra tokens to the left of the windows for concordances, so that qualitative inspections of query hits can work with more context.
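As a hedged sketch of the transition path from partition to subcorpus mentioned above: the example assumes that the new coerce method is reached through the standard S4 as() call, and that the GERMAPARLMINI sample corpus with an s-attribute date is available. Neither detail is spelled out in the release notes.

```r
library(polmineR)
use("polmineR")  # make the sample corpora shipped with the package available

# Old-style workflow: create a partition object
p <- partition("GERMAPARLMINI", date = "2009-11-11", verbose = FALSE)

# Assumption: the coerce method is invoked via the generic S4 as() mechanism
sc <- as(p, "subcorpus")

is(sc, "subcorpus")   # the result should be a subcorpus object
size(p) == size(sc)   # both objects should cover the same number of tokens
```

Coercing existing partition objects this way would allow older code to be migrated step by step instead of rewriting every partition() call at once.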
The as.TermDocumentMatrix()- and as.DocumentTermMatrix()-methods are now also defined for kwic objects. They work exactly the same as for the context class. To avoid having to write new methods, a new neighborhood virtual class has been introduced. The aforementioned methods are defined for the virtual class and are available for context and kwic class objects.

Added CQP functionality to the count tab and the dispersion tab in the shiny app.

There is now a basic implementation of get_token_stream() for a partition_bundle object.

The Cooccurrences()-method is now available for subcorpus-objects (#88).

There is a new coerce method to turn a kwic-object into a context-object. The neighborhood virtual class could be discarded again, and a bug could be removed that left an enrich()-operation for kwic objects (argument p_attribute) ineffectual (#103).

Minor changes

Added a new argument regex to the cpos()-method (for corpus objects), which will interpret the argument query as a regular expression. This may be faster than taking query as an outright CQP query.

The configure-script in the package that would adjust paths in the registry files for the corpora included in the package for documentation and testing purposes has been removed. Having switched to a temporary registry directory, it has lost its function.

The version of the data.table package now required is 1.12.2, because previous versions did not allow adding columns to a new data.table.

Implemented the possibility to use multiple queries in the dispersion-method (#92).

To keep up with the renaming of functions and arguments in the package, "sAttributes" and "pAttributes" in the polmineR shiny app have been renamed ("s_attributes" and "p_attributes", respectively).

The shiny app module for kwic output will not show p_attribute and positivelist by default.

The format()-method is used to create proper output for the cooccurrences in the shiny app.
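The new regex argument of the cpos()-method can be illustrated with a short sketch. It assumes the REUTERS sample corpus shipped with polmineR and the default p-attribute "word"; the query strings are made up for illustration, and no claim about relative speed is made here.

```r
library(polmineR)
use("polmineR")  # make the sample corpora shipped with the package available

reuters <- corpus("REUTERS")

# Interpret the query as a regular expression on the p-attribute
hits_regex <- cpos(reuters, query = "oil.*", regex = TRUE)

# For comparison: a query with the same intent, written in CQP syntax
hits_cqp <- cpos(reuters, query = '"oil.*"', cqp = TRUE)

# Both calls return a two-column matrix of corpus positions (left, right)
head(hits_regex)
nrow(hits_regex)
nrow(hits_cqp)
```

The regex variant skips the CQP query parser, which is why the release note expects it to be the cheaper option for simple single-token patterns.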
User names that include non-ASCII characters were a persistent problem on Windows machines (#66). The solution now is to check for non-ASCII characters in the path to the data directory, and to use the "old" short DOS path if necessary. The worker is a modified registry()-function.

The ordering of the table for the ll-method had been somewhat mixed up, which is repaired now. Tokens with NA values for the ll-test will show up at the end of the table.

The registry_move()-function, used only internally at this stage, is exported now so that it can be used by other packages.

The return value of the get_token_stream()-method for regions objects was a data.table. The behavior is now in line with the other get_token_stream() methods.

The tempcorpus()-method and the tempcorpus class have been removed from the package, having become fully deprecated.

The summary()-method for partition-class objects has been turned into a method for the count-class, to eliminate an inconsistency. The example of a workflow has been moved to the documentation object for the count-class.

The browse()-method has not proven to be useful and has been removed from the package. A new browse()-function is introduced that throws a warning if browse is called nevertheless.

A refactoring of the split()-method for partition-objects improved the readability of the code, but the performance gain is minimal.

A new kwic_bundle-class has been introduced; a list of kwic objects can be turned into this new class using as.bundle.

The context()-method will again take character vectors as input for the arguments left and right, to expand to the left and right boundaries of the designated region (#87).

The way messages are printed has been reworked to make it easy to implement notifications in the shiny environment.

Default highlighting when a positivelist is supplied has been removed from the kwic()-method. This ensures that subsequent highlighting operations can assign new colors (#38).
Implemented the feature request for dispersion() that results are reported for all values of structural attributes, including those with zero matches (#104).

Performance improved for the cpos-method for matrix, which unfolds a matrix with regions of corpus positions; this is useful for operations that require many calls.

The count-method for partition_bundle has been reworked and is much faster and more memory efficient.

as.TermDocumentMatrix() for partition_bundle has been optimized to work efficiently with large corpora.

Introduction of a context,matrix-method to have a unified auxiliary function to create contexts.

The as.corpusEnc()-function uses the localeToCharset()-function from the utils package to determine the charset of input strings. On RStudio Server, we have seen cases where the return value is NA. In that case, it will be assumed that the locale is UTF-8.

Functionality to highlight terms in the kwic display has been restored for the shiny app.

Bug fixes

Removed a bug in the context()/kwic() method that led to superfluous words in the right context.

Removed a bug that occurred with the as.data.frame()-method for kwic-objects when no metadata were added.

The count()-method for partition_bundle-objects did not perform iconv() if necessary - this has been corrected.

Indexing the concordances of a kwic object did not reduce the cpos table accordingly. This has been corrected.

The as.speeches()-method failed to handle situations correctly when a speaker occurring in the corpus contributed only a single region to the entire corpus (#86). This behavior has been debugged.

Counting over a partition_bundle started to throw a warning that an argument arrives at the cpos()-method that is not used. The cause of the warning message has been removed, and an additional unit test has been introduced to recognize issues with the count-method (#90).

The kwic()-method threw an error when trimming the matches by using a positivelist or a stoplist resulted in no remaining matches.
The kwic()-method will now return a NULL object and keep issuing a warning if no matches remain after filtering (#91).

Chaining subsetting calls on a corpus/subcorpus omitted filling the s_attribute slot of the subcorpus object, resulting in false results when counting over subcorpora. Fixed.

Started to remove bugs in the shiny app: kwic starts to work again (bug: slot table has been replaced by stat).

The part of the shiny app for dispersions did not work at all - it has been repaired, exposing more functionality of dispersion() (#62).

In the as.speeches()-method, the argument verbose was not used (#64) - this had been addressed when solving issue #86.

Telling messages when sending out emails - on success and error - have been added (#61).

A shortcoming in the coerce method to turn a subcorpus into a String was removed: a semicolon was not recognized as a punctuation mark. This makes decoding subcorpora as Annotation more robust. The respective unit test has been updated.

Calling read() on a kwic object works again (#84).

Checks for the as.VCorpus() method that failed are now ok (#77). The reason was that get_token_stream() assumed implicitly that a p-attribute "pos" is present, which is not the case for the REUTERS test corpus.

A minor bug in the s_attributes-method was removed that would make retrieving the metadata for the first struc (index 0) of an s-attribute impossible.

Fixed an issue for as.DocumentTermMatrix that started to occur with the introduction of the subcorpus_bundle class (#100).

Removed a bug in the kwic-method for character that prevented using different values for right and left context (#101).

Removed a bug that occurred when using as.DocumentTermMatrix() on a corpus stated by corpus ID / length-one character vector (#105).

Removed a bug from the kwic,character-method and the context,corpus-method that would result in odd behavior when either the left or right context is 0.
An endemic encoding issue for full text output on Windows machines (latin1 encoding) has been solved by internally replacing markdown::markdownToHTML with a direct call to markdown::renderMarkdown. On this occasion, some overhead in preparing fulltext output has been removed.

A bug that prevented getting extra left and right context for kwic objects has been removed (#102).

The as.TermDocumentMatrix()-method for neighborhood-objects unintentionally returned a DocumentTermMatrix; this bug is removed now.

Documentation

Extended documentation for the pmi()-method and the t_test()-method.

New s_attributes()-method for the corpus-class.

The documentation for the corpus-class has been rewritten entirely, and the documentation for the remote_corpus-class has been integrated, whereas methods applicable to the remote_corpus-class were integrated into the documentation objects for the respective methods.

The documentation for the get_token_stream()-method has been reworked and expanded thoroughly (#65). On this occasion, test coverage for the method has been improved significantly. (Everything is tested now apart from parallelization.)

Related Organizations

University of Zurich
Switzerland
