Databases for text analysis

doctable is a Python package for parsing, storing, and accessing text documents and models in databases.

pip install doctable

doctable consists of several classes for working with text data:

  • DocTable is an object-oriented interface for querying and manipulating a single database table. See: overview, docs.
  • ParsePipeline manages parallelized parsing pipelines built from custom functions or pre-built doctable components. See: overview, docs, components.
  • DocBootstrap makes it easy to bootstrap documents for estimating statistical models. See: overview.
  • ParseTree is a compact structure for storing parse-tree data from spaCy without the data overhead of the spaCy models themselves.
  • Distribute performs multiprocessing similar to multiprocessing.Pool(), but allows more datatypes to be passed to processes via IPC and distributes data in chunks instead of individual elements. It also provides a method for processing data in chunks and inserting the results into DocTables. See: overview.
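
To make the chunk-based idea behind Distribute concrete, here is a minimal sketch that uses only the standard library's multiprocessing.Pool. It is not doctable's API: the names make_chunks, count_tokens, and map_chunks are hypothetical helpers introduced for illustration, and the whitespace tokenizer stands in for a real parser.

```python
from multiprocessing import Pool


def make_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size elements."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def count_tokens(chunk):
    """Toy chunk-level function: count whitespace-separated tokens per text."""
    return [len(text.split()) for text in chunk]


def map_chunks(chunk_func, items, chunk_size=2, workers=2):
    """Apply chunk_func to whole chunks, in worker processes when workers > 1."""
    chunks = make_chunks(items, chunk_size)
    if workers > 1:
        # Each worker receives an entire chunk, not a single element.
        with Pool(workers) as pool:
            results = pool.map(chunk_func, chunks)
    else:
        # Serial fallback, handy for debugging.
        results = [chunk_func(chunk) for chunk in chunks]
    # Flatten per-chunk results into one list, preserving input order.
    return [value for chunk_result in results for value in chunk_result]


if __name__ == "__main__":
    texts = ["a b c", "d e", "f", "g h i j", "k l"]
    print(map_chunks(count_tokens, texts, chunk_size=2, workers=2))
```

Sending whole chunks reduces the number of inter-process messages, and a chunk-level function can also insert its results into a database in one batch rather than row by row.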

Example: US National Security Strategy Documents

This demonstration shows a typical DocTable workflow: creating a new DocTable, inserting NSS document text and metadata, and parsing the text for storage in the table.
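
The create, insert, and parse steps can be sketched with the standard library's sqlite3 module. This is an illustration of the workflow's shape only, not doctable's API: the table name nssdocs, its columns, the parse and build_table helpers, and the placeholder records are all assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical placeholder records; not real NSS text or metadata.
docs = [
    {"year": 2001, "text": "First placeholder document. It has two sentences."},
    {"year": 2002, "text": "Second placeholder document."},
]


def parse(text):
    """Toy parser: whitespace-tokenize, strip punctuation, lowercase."""
    return [tok.strip(".,").lower() for tok in text.split()]


def build_table(conn, docs):
    """Create the table, insert raw text and metadata, then store parsed tokens."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nssdocs ("
        "id INTEGER PRIMARY KEY, year INTEGER, text TEXT, tokens TEXT)"
    )
    # Step 1: insert document text and metadata.
    for doc in docs:
        conn.execute(
            "INSERT INTO nssdocs (year, text) VALUES (?, ?)",
            (doc["year"], doc["text"]),
        )
    # Step 2: parse the stored text and write the tokens back as JSON.
    rows = conn.execute("SELECT id, text FROM nssdocs").fetchall()
    for doc_id, text in rows:
        conn.execute(
            "UPDATE nssdocs SET tokens = ? WHERE id = ?",
            (json.dumps(parse(text)), doc_id),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_table(conn, docs)
    for year, tokens in conn.execute("SELECT year, tokens FROM nssdocs ORDER BY id"):
        print(year, json.loads(tokens))
```

The linked NSS example shows the same steps using DocTable itself.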

See NSS Example »

