Databases for text analysis

doctable is a Python package for parsing, storing, and accessing text documents/models in databases.

pip install doctable

doctable consists of several classes for working with text data:

  • DocTable provides an interface over database engines (like SQLite). Treat each DocTable like a single database table (see the sketch after this list).
  • Pipeline provides a way to create text processing pipelines that can be executed in parallel over large corpora. Use it with the pipeline components in doctable.parse, as well as other custom functions.
  • ParseTree is a compact format for storing parsetree data from spaCy without the data overhead of spaCy models.
  • DocBootstrap allows you to conveniently bootstrap text data for use in text models.
  • Distribute does multiprocessing similar to multiprocessing.Pool(), but allows more datatypes to be passed to processes via IPC and distributes data in chunks instead of individual elements. It also features methods for chunk-level processing and insertion into DocTables.
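
As a rough sketch of the DocTable pattern described above (the schema tuples, column types, and method names here are assumptions for illustration, not a verbatim reference to the doctable API):

    import doctable

    # Hypothetical schema: each DocTable instance wraps a single database table.
    # The (type, name) tuple format and column types are assumed for illustration.
    schema = (
        ('integer', 'id', dict(primary_key=True, autoincrement=True)),
        ('string', 'title'),
        ('pickle', 'parsetree'),  # e.g. a compact ParseTree object per document
    )

    db = doctable.DocTable(schema=schema, target='documents.db')

    # Insert one row per document, then query rows back.
    db.insert({'title': 'NSS-2017'})
    for row in db.select():
        print(row['title'])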

Example: US National Security Strategy Documents


This demonstration shows a typical workflow using DocTable and DocParser. We use DocParser to download, tokenize, and create parsetrees for 17 national security strategy documents, taking advantage of its parallelized parsing workflow, and we use a custom DocTable class for efficient storage.
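
A minimal sketch of that kind of parallelized parsing step, using spaCy and Python's multiprocessing directly (the corpus below is a placeholder, and the storage step into a DocTable is only noted in a comment):

    import multiprocessing
    import spacy

    nlp = None  # loaded once per worker process

    def init_worker():
        # Load the spaCy model inside each worker instead of pickling it across processes.
        global nlp
        nlp = spacy.load('en_core_web_sm')

    def parse_doc(text):
        # Tokenize one document and return plain Python objects that are cheap to store.
        return [tok.text for tok in nlp(text)]

    if __name__ == '__main__':
        texts = ['Placeholder text for one strategy document.'] * 4  # stand-in corpus
        with multiprocessing.Pool(4, initializer=init_worker) as pool:
            tokenized = pool.map(parse_doc, texts)
        # Each tokenized document would then be inserted into a DocTable for storage.
        print(len(tokenized), 'documents parsed')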

See Intro to DocTable Example »


See Intro to ParseTrees Example »


Devin J. Cornell


Devin uses computational methods to study cultural processes through which organizations and individuals produce meaning.

Webpage
Research Blog
Twitter @devin_cornell

Distribute Class Examples

Class for distributing text processing tasks across multiple processes, typically for insertion into databases. It works similarly to multiprocessing.Pool(), but handles pipes differently to pass larger data between processes and to support chunk-level processing.
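
The chunk-level idea can be sketched with the standard library alone; this illustrates the pattern rather than the Distribute interface itself:

    import multiprocessing

    def process_chunk(chunk):
        # Each worker receives a whole chunk of documents, processes them,
        # and returns results that could be inserted into a DocTable in one call.
        return [text.lower().split() for text in chunk]

    def make_chunks(items, n_chunks):
        # Split the data into roughly equal chunks, one per worker.
        size = max(1, len(items) // n_chunks)
        return [items[i:i + size] for i in range(0, len(items), size)]

    if __name__ == '__main__':
        texts = ['First document.', 'Second document.', 'Third document.', 'Fourth document.']
        with multiprocessing.Pool(2) as pool:
            results = pool.map(process_chunk, make_chunks(texts, n_chunks=2))  # one task per chunk
        tokenized = [row for chunk in results for row in chunk]
        print(tokenized)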