Databases for text analysis

doctable is a Python package for parsing, storing, and accessing text documents and models in databases.

pip install doctable

doctable consists of two Python classes: DocTable and DocParser. DocTable is an object-oriented interface for working with SQL tables. Instantiate a DocTable object to create a new database table or connect to an existing one, then use the .insert(), .select(), .update(), and .delete() methods to manipulate its contents. DocParser is a class with static methods for tokenization and parsetree extraction with spaCy, plus other methods that distribute parsing functions and insert the results into a DocTable (or another format).
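To make the object-oriented table pattern concrete, here is a minimal sketch built only on Python's standard-library sqlite3 module. It is not the doctable API: the class name MiniDocTable, its constructor arguments, its method signatures, and the table columns are all hypothetical and chosen only to mirror the four operations named above.

```python
import sqlite3


class MiniDocTable:
    """Hypothetical stand-in, NOT the doctable API: wraps one SQLite table."""

    def __init__(self, fname=":memory:", tabname="documents"):
        self.tabname = tabname
        self.conn = sqlite3.connect(fname)
        # Create the table on first use; reuse it if it already exists.
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {tabname} "
            "(id INTEGER PRIMARY KEY, title TEXT UNIQUE, body TEXT)"
        )

    def insert(self, title, body):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                f"INSERT INTO {self.tabname} (title, body) VALUES (?, ?)",
                (title, body),
            )

    def select(self, title=None):
        query = f"SELECT title, body FROM {self.tabname}"
        params = ()
        if title is not None:
            query += " WHERE title = ?"
            params = (title,)
        return self.conn.execute(query, params).fetchall()

    def update(self, title, body):
        with self.conn:
            self.conn.execute(
                f"UPDATE {self.tabname} SET body = ? WHERE title = ?",
                (body, title),
            )

    def delete(self, title):
        with self.conn:
            self.conn.execute(
                f"DELETE FROM {self.tabname} WHERE title = ?", (title,)
            )


# Exercise all four operations on an in-memory database.
table = MiniDocTable()
table.insert("nss_2017", "first draft")
table.update("nss_2017", "final text")
rows = table.select("nss_2017")
table.delete("nss_2017")
remaining = table.select()
```

Passing a file path instead of ":memory:" would make the same object connect to an existing table on later runs, which is the create-or-connect behavior described above.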

Example: US National Security Strategy Documents

This demonstration shows a typical workflow using DocTable and DocParser. We use DocParser to download, tokenize, and create parsetrees for 17 national security strategy documents. DocParser parallelizes the parsing, and a custom DocTable class provides efficient storage.

See NSS Example »

Example: Gutenberg Texts

Here we show how to parse a large corpus of more than 60,000 books.

DocParser Overview »

Devin J. Cornell

Devin uses computational methods to study cultural processes through which organizations and individuals produce and are shaped by meaning.

Research Blog
Twitter @devin_cornell

DocParser Class Examples

Static methods for tokenizing and creating parsetrees using the spaCy text analysis package. The class also includes methods that assist with parallel text preprocessing and with preparing the results for storage in a DocTable database.

Distribute Class Examples

Class for distributing text processing tasks across multiple processes for insertion into databases. Works similarly to multiprocessing.Pool() but handles pipes differently to support passing larger data and chunk-level processing.
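For comparison, the sketch below shows chunk-level processing with the standard-library multiprocessing.Pool that the description refers to. It does not use or reproduce the Distribute class; the helper names chunked, parse_chunk, and parse_parallel are hypothetical, and a whitespace split stands in for a real parser.

```python
from multiprocessing import Pool


def chunked(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def parse_chunk(texts):
    """Tokenize one chunk of documents (whitespace split as a placeholder)."""
    return [text.lower().split() for text in texts]


def parse_parallel(texts, chunk_size=2, workers=2):
    """Parse chunks in worker processes and flatten results in input order."""
    chunks = chunked(texts, chunk_size)
    with Pool(processes=workers) as pool:
        # Each worker receives a whole chunk, so per-task overhead is paid
        # once per chunk rather than once per document.
        parsed_chunks = pool.map(parse_chunk, chunks)
    return [tokens for chunk in parsed_chunks for tokens in chunk]


if __name__ == "__main__":
    docs = ["A national strategy", "Text as data", "Parse and store"]
    print(parse_parallel(docs))
```

Sending whole chunks to each worker also gives a natural point at which a worker could write its results to a database in one batch, instead of returning everything to the parent process.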