pip install doctable
doctable consists of two Python classes: DocTable and DocParser. DocTable is an object-oriented interface for working with SQL tables. Instantiate a DocTable object to create a database table or connect to an existing one, then use the .insert(), .select(), .update(), and .delete() methods to manipulate its contents. DocParser is a class with static methods for tokenization and parsetree extraction with spaCy, plus other methods that distribute parsing functions and insert the results into a DocTable (or another format).
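The four operations named above map onto ordinary SQL statements. The sketch below illustrates them with Python's standard sqlite3 module rather than with doctable itself, so none of these calls should be read as DocTable's actual signatures; the table name, column names, and rows are hypothetical.

```python
import sqlite3

# Plain sqlite3 stand-in, NOT doctable's API.
# The "docs" table, its columns, and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")

# insert: add two rows
conn.executemany(
    "INSERT INTO docs (title, year) VALUES (?, ?)",
    [("first document", 2002), ("second document", 2017)],
)

# select: read the rows back, oldest first
rows = conn.execute("SELECT title, year FROM docs ORDER BY year").fetchall()

# update: rename the 2017 row
conn.execute("UPDATE docs SET title = ? WHERE year = ?", ("renamed document", 2017))

# delete: drop everything before 2010
conn.execute("DELETE FROM docs WHERE year < ?", (2010,))

remaining = conn.execute("SELECT title, year FROM docs").fetchall()
print(remaining)
conn.close()
```

An object-oriented wrapper such as DocTable would typically hide the SQL strings behind method calls, but the underlying sequence of operations is the same.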
This demonstration shows a typical workflow using DocTable and DocParser. We use DocParser to download, tokenize, and create parsetrees for 17 national security strategy documents. We take advantage of DocParser to create a parallelized workflow for parsing, and a custom DocTable class for efficient storage. See NSS Example »
Here we show how to parse a large corpus of more than 60 thousand books.
DocParser Overview »
Static methods for tokenizing and creating parsetrees using the spaCy text analysis package. It also includes methods that assist with large-scale parallel text preprocessing and with preparing results for storage in a DocTable database.
Class for distributing text processing tasks across multiple processes for insertion into databases. It works similarly to multiprocessing.Pool() but handles pipes differently to support passing larger data and chunk-level processing.
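As a rough illustration of chunk-level processing, the sketch below uses the standard multiprocessing.Pool that the description compares against, not doctable's own distribution class. The helper names (chunked, tokenize_chunk, parse_parallel) are hypothetical, and whitespace splitting stands in for a real spaCy parser.

```python
from multiprocessing import Pool


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def tokenize_chunk(texts):
    """Stand-in parser: whitespace-tokenize every text in one chunk."""
    return [text.split() for text in texts]


def parse_parallel(texts, chunk_size=2, workers=2):
    """Send whole chunks to worker processes and flatten the results in order."""
    with Pool(processes=workers) as pool:
        # pool.map preserves input order, so chunks come back in sequence.
        per_chunk = pool.map(tokenize_chunk, chunked(texts, chunk_size))
    return [tokens for chunk in per_chunk for tokens in chunk]


if __name__ == "__main__":
    corpus = ["a short text", "another one", "and a third"]
    print(parse_parallel(corpus))
```

Handing each worker a whole chunk rather than a single document reduces the number of messages passed between processes, and it gives each worker a natural unit to write to a database in one batch.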