pip install doctable
doctable consists of several classes for working with text data:
- DocTable is an object-oriented interface for querying and manipulating a single database table.
- ParsePipeline manages parallelized parsing pipelines built from custom functions or pre-built doctable components.
- DocBootstrap makes it easy to bootstrap documents for estimating statistical models.
- ParseTree is a compact structure for storing parse tree data from spaCy without the data overhead of spaCy models.
- Distribute provides multiprocessing similar to multiprocessing.Pool(), but allows more datatypes to be passed to processes via IPC and distributes data in chunks instead of individual elements. It also features a method for chunk-processing data for insertion into DocTables.
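As a rough illustration of the document bootstrapping that DocBootstrap is described as supporting, the sketch below resamples a small corpus with replacement and recomputes a statistic on each draw. It uses only the Python standard library and is not doctable's API; the names `bootstrap_statistic` and `mean_length`, along with the toy documents, are hypothetical and chosen for this example.

```python
import random

def bootstrap_statistic(docs, statistic, n_draws=100, seed=0):
    """Resample docs with replacement n_draws times and apply statistic to each sample."""
    rng = random.Random(seed)  # seeded so repeated runs give the same draws
    estimates = []
    for _ in range(n_draws):
        # Each bootstrap sample has the same size as the original corpus.
        sample = [rng.choice(docs) for _ in range(len(docs))]
        estimates.append(statistic(sample))
    return estimates

def mean_length(sample):
    """Average number of tokens per document in a sample."""
    return sum(len(doc) for doc in sample) / len(sample)

# Toy tokenized documents (hypothetical data for illustration only).
docs = [
    ["security", "strategy"],
    ["national", "security", "policy"],
    ["strategy"],
]

estimates = bootstrap_statistic(docs, mean_length, n_draws=200, seed=42)
print(min(estimates), max(estimates))
```

The spread of `estimates` gives a sense of how sensitive the statistic is to which documents happen to be in the corpus; a real statistical model would take the place of `mean_length`.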
Example: US National Security Strategy Documents
This demonstration shows a typical DocTable workflow: creating a new DocTable, inserting NSS document text and metadata, and parsing the text for storage in the table.
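The general shape of that workflow (create a table, insert text with metadata, parse, store the parsed result) can be sketched with Python's built-in `sqlite3` module. This is a minimal stand-in under stated assumptions, not doctable's API: the table name `nss_docs`, its columns, the `tokenize` helper, and the placeholder records are all hypothetical.

```python
import json
import sqlite3

def tokenize(text):
    """Toy parser: lowercase the text and split on whitespace."""
    return text.lower().split()

# Hypothetical placeholder records (year, text); not real NSS content.
records = [
    (1990, "Placeholder text for the first document"),
    (2000, "Placeholder text for the second document"),
]

# Step 1: create a new single-table database (in memory for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nss_docs ("
    "id INTEGER PRIMARY KEY, year INTEGER, text TEXT, tokens TEXT)"
)

# Step 2: insert document text and metadata.
conn.executemany("INSERT INTO nss_docs (year, text) VALUES (?, ?)", records)

# Step 3: parse each document and store the tokens as JSON in the same row.
for doc_id, text in conn.execute("SELECT id, text FROM nss_docs").fetchall():
    conn.execute(
        "UPDATE nss_docs SET tokens = ? WHERE id = ?",
        (json.dumps(tokenize(text)), doc_id),
    )
conn.commit()

# Read the parsed data back, keyed by year.
parsed = {
    year: json.loads(tokens)
    for year, tokens in conn.execute("SELECT year, tokens FROM nss_docs")
}
print(parsed[1990])
```

Storing the parsed tokens next to the raw text keeps each document and its derived data in one row, which is the pattern the demonstration describes.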
See the NSS example.