DocTable Examples: Select

Here I show how to select data from a DocTable. We cover object-oriented conditional selects emulating the WHERE SQL clause, as well as some reduce functions.

In [1]:
import random
import pandas as pd
import numpy as np
import sys
sys.path.append('..')
import doctable as dt
In [2]:
schema = (
    ('integer','id',dict(primary_key=True, autoincrement=True)),
    ('string','name', dict(nullable=False)),
    ('integer','age'),
    ('boolean', 'is_old'),
)
db = dt.DocTable(schema, tabname='mydocuments', verbose=True)
# defaults: #fname=':memory:', engine='sqlite', persistent_conn=True, new_db=True
# fname=':memory:' is special - it loads database into memory
print(db)
DocTable2 Query: SELECT count() AS count_1 
FROM mydocuments
 LIMIT :param_1
<DocTable2::mydocuments ct: 0>
In [3]:
N = 10
for i in range(N):
    age = random.random() # number in [0,1]
    is_old = age > 0.5
    db.insert({'name':'user_'+str(i), 'age':age, 'is_old':is_old}, verbose=False)
print(db)
DocTable2 Query: SELECT count() AS count_1 
FROM mydocuments
 LIMIT :param_1
<DocTable2::mydocuments ct: 10>

Regular Selects

These functions all return lists of ResultProxy objects. As such, they can be accessed using numerical indices or keyword indices. For instance, if one select output row is row=(1, 'user_0') (after selecting "id" and "user"), it can be accessed such that row[0]==row['id'] and row[1]==row['user'].

In [4]:
# the limit argument means the result will only return some rows.
# I'll use it for convenience in these examples.
# this selects all rows
db.select(limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments
 LIMIT :param_1
Out[4]:
[(1, 'user_0', 0.5302777968381023, True),
 (2, 'user_1', 0.43948787762004426, False)]
In [5]:
db.select(['id','name'], limit=1)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name 
FROM mydocuments
 LIMIT :param_1
Out[5]:
[(1, 'user_0')]
In [6]:
# can also select by accessing the column object (db['id']) itself
# this will be useful later with more complex queries
db.select([db['id'],db['name']], limit=1)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name 
FROM mydocuments
 LIMIT :param_1
Out[6]:
[(1, 'user_0')]
In [7]:
db.select_first()
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments
 LIMIT :param_1
Out[7]:
(1, 'user_0', 0.5302777968381023, True)
In [8]:
db.select('name',limit=5)
DocTable2 Query: SELECT mydocuments.name 
FROM mydocuments
 LIMIT :param_1
Out[8]:
['user_0', 'user_1', 'user_2', 'user_3', 'user_4']
In [9]:
db.select_first('age')
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
 LIMIT :param_1
Out[9]:
0.5302777968381023

Conditional Selects

In [10]:
db.select(where=db['id']==2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id = :id_1
Out[10]:
[(2, 'user_1', 0.43948787762004426, False)]
In [11]:
db.select(where=db['id']<3)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id < :id_1
Out[11]:
[(1, 'user_0', 0.5302777968381023, True),
 (2, 'user_1', 0.43948787762004426, False)]
In [12]:
# mod operator works too
db.select(where=(db['id']%2)==0, limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id % :id_1 = :param_1
 LIMIT :param_2
Out[12]:
[(2, 'user_1', 0.43948787762004426, False),
 (4, 'user_3', 0.398902981426976, False)]
In [13]:
# note parantheses to handle order of ops with overloaded bitwise ops
db.select(where= (db['id']>=2) & (db['id']<=4) & (db['name']!='user_2'))
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id >= :id_1 AND mydocuments.id <= :id_2 AND mydocuments.name != :name_1
Out[13]:
[(2, 'user_1', 0.43948787762004426, False),
 (4, 'user_3', 0.398902981426976, False)]
In [14]:
db.select(where=db['name'].in_(('user_2','user_3')))
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.name IN (:name_1, :name_2)
Out[14]:
[(3, 'user_2', 0.13789672737843706, False),
 (4, 'user_3', 0.398902981426976, False)]
In [15]:
db.select(where=db['id'].between(2,4))
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id BETWEEN :id_1 AND :id_2
Out[15]:
[(2, 'user_1', 0.43948787762004426, False),
 (3, 'user_2', 0.13789672737843706, False),
 (4, 'user_3', 0.398902981426976, False)]
In [16]:
# use of logical not operator "~"
db.select(where= ~(db['name'].in_(('user_2','user_3'))) & (db['id'] < 4))
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.name NOT IN (:name_1, :name_2) AND mydocuments.id < :id_1
Out[16]:
[(1, 'user_0', 0.5302777968381023, True),
 (2, 'user_1', 0.43948787762004426, False)]
In [17]:
# more verbose operators .and_, .or_, and .not_ are bound to the doctable package
db.select(where= dt.or_(dt.not_(db['id']==4)) & (db['id'] <= 2))
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id != :id_1 AND mydocuments.id <= :id_2
Out[17]:
[(1, 'user_0', 0.5302777968381023, True),
 (2, 'user_1', 0.43948787762004426, False)]
In [18]:
# now with simple computation
ages = db.select(db['age'])
mean_age = sum(ages)/len(ages)
db.select(db['name'], where=db['age']>mean_age, limit=2)
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
DocTable2 Query: SELECT mydocuments.name 
FROM mydocuments 
WHERE mydocuments.age > :age_1
 LIMIT :param_1
Out[18]:
['user_0', 'user_4']
In [19]:
# apply .label() method to columns
dict(db.select_first([db['age'].label('myage'), db['name'].label('myname')]))
DocTable2 Query: SELECT mydocuments.age AS myage, mydocuments.name AS myname 
FROM mydocuments
 LIMIT :param_1
Out[19]:
{'myage': 0.5302777968381023, 'myname': 'user_0'}

Column Operators

I bind the .min, .max, .count, .sum, and .mode methods to the column objects. Additionally, I move the .count method to a separate DocTable2 method.

In [20]:
db.select_first([db['age'].sum, db['age'].count, db['age']])
DocTable2 Query: SELECT sum(mydocuments.age) AS sum_1, count(mydocuments.age) AS count_1, mydocuments.age 
FROM mydocuments
 LIMIT :param_1
Out[20]:
(4.689441695567935, 10, 0.5302777968381023)
In [21]:
# with labels now
dict(db.select_first([db['age'].sum.label('sum'), db['age'].count.label('ct')]))
DocTable2 Query: SELECT sum(mydocuments.age) AS sum, count(mydocuments.age) AS ct 
FROM mydocuments
 LIMIT :param_1
Out[21]:
{'sum': 4.689441695567935, 'ct': 10}

ORDER BY, GROUP BY, LIMIT

These additional arguments have also been provided.

In [22]:
# the limit is obvious - it has been used throughout these examples
db.select(limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments
 LIMIT :param_1
Out[22]:
[(1, 'user_0', 0.5302777968381023, True),
 (2, 'user_1', 0.43948787762004426, False)]
In [23]:
# orderby clause
db.select(orderby=db['age'].desc(), limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments ORDER BY mydocuments.age DESC
 LIMIT :param_1
Out[23]:
[(9, 'user_8', 0.9428636080674738, True),
 (10, 'user_9', 0.7836575155583511, True)]
In [24]:
# compound orderby
db.select(orderby=(db['age'].desc(),db['is_old'].asc()), limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments ORDER BY mydocuments.age DESC, mydocuments.is_old ASC
 LIMIT :param_1
Out[24]:
[(9, 'user_8', 0.9428636080674738, True),
 (10, 'user_9', 0.7836575155583511, True)]
In [25]:
# can also use column name directly
# can only use ascending and can use only one col
db.select(orderby='age', limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments ORDER BY mydocuments.age
 LIMIT :param_1
Out[25]:
[(6, 'user_5', 0.04169970674398138, False),
 (3, 'user_2', 0.13789672737843706, False)]
In [26]:
# groupby clause
# returns first row of each group without any aggregation functions
db.select(groupby=db['is_old'])
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments GROUP BY mydocuments.is_old
Out[26]:
[(2, 'user_1', 0.43948787762004426, False),
 (1, 'user_0', 0.5302777968381023, True)]
In [27]:
# compound groupby (weird example bc name is unique - have only one cat var in this demo)
db.select(groupby=(db['is_old'],db['name']), limit=3)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments GROUP BY mydocuments.is_old, mydocuments.name
 LIMIT :param_1
Out[27]:
[(2, 'user_1', 0.43948787762004426, False),
 (3, 'user_2', 0.13789672737843706, False),
 (4, 'user_3', 0.398902981426976, False)]
In [28]:
# groupby clause using max aggregation function
# gets match age for both old and young groups
db.select(db['age'].max, groupby=db['is_old'])
DocTable2 Query: SELECT max(mydocuments.age) AS max_1 
FROM mydocuments GROUP BY mydocuments.is_old
Out[28]:
[0.43948787762004426, 0.9428636080674738]

SQL String Commands and Additional Clauses

For cases where DocTable2 does not provide a convenient interface, you may submit raw SQL commands. These may be a bit more unwieldly, but they offer maximum flexibility. They may be used either as simply an addition to the WHERE or arbitrary end clauses, or accessed in totality.

In [29]:
qstr = 'SELECT age,name FROM {} WHERE id=="{}"'.format(db.tabname, 1)
results = db.execute(qstr)
dict(list(results)[0])
DocTable2 Query: SELECT age,name FROM mydocuments WHERE id=="1"
Out[29]:
{'age': 0.5302777968381023, 'name': 'user_0'}
In [30]:
whrstr = 'is_old=="{}"'.format('1')
db.select(whrstr=whrstr, limit=2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE is_old=="1"
 LIMIT :param_1
Out[30]:
[(1, 'user_0', 0.5302777968381023, True),
 (5, 'user_4', 0.6563434119131013, True)]
In [31]:
# combine whrstr with structured query where clause
whrstr = 'is_old=="{}"'.format('1')
db.select(where=db['id']<=5, whrstr=whrstr)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id <= :id_1 AND is_old=="1"
Out[31]:
[(1, 'user_0', 0.5302777968381023, True),
 (5, 'user_4', 0.6563434119131013, True)]
In [32]:
# combine whrstr with structured query where clause
whrstr = 'is_old=="{}"'.format('1')
db.select(where=db['id']<=5, whrstr=whrstr)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id <= :id_1 AND is_old=="1"
Out[32]:
[(1, 'user_0', 0.5302777968381023, True),
 (5, 'user_4', 0.6563434119131013, True)]

Count Method and Get Next ID

.count() is a convenience method. Mostly the same could be accomplished by db.select_first(db['id'].count), but this requires no reference to a specific column.

.next_id() is especially useful if one hopes to enter the id (or any primary key column) into new rows manually. Especially useful because SQL engines don't provide new ids except when a single insert is performed.

In [33]:
db.count()
DocTable2 Query: SELECT count() AS count_1 
FROM mydocuments
 LIMIT :param_1
Out[33]:
10
In [34]:
db.count(db['age'] < 0.5)
DocTable2 Query: SELECT count() AS count_1 
FROM mydocuments 
WHERE mydocuments.age < :age_1
 LIMIT :param_1
Out[34]:
6

Select as Pandas Series and DataFrame

These are especially useful when working with metadata because Pandas provides robust descriptive and plotting features than SQL alone. Good for generating sample information.

In [35]:
# must provide only a single column
db.select_series(db['age']).head(2)
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
Out[35]:
0    0.530278
1    0.439488
dtype: float64
In [36]:
db.select_series(db['age']).quantile([0.025, 0.985])
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
Out[36]:
0.025    0.063344
0.985    0.921371
dtype: float64
In [37]:
db.select_df(['id','age']).head(2)
DocTable2 Query: SELECT mydocuments.id, mydocuments.age 
FROM mydocuments
Out[37]:
id age
0 1 0.530278
1 2 0.439488
In [38]:
db.select_df('age').head(2)
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
Out[38]:
age
0 0.530278
1 0.439488
In [39]:
# must provide list of cols (even for one col)
db.select_df([db['id'],db['age']]).corr()
DocTable2 Query: SELECT mydocuments.id, mydocuments.age 
FROM mydocuments
Out[39]:
id age
id 1.00000 0.42015
age 0.42015 1.00000
In [40]:
db.select_df([db['id'],db['age']]).describe().T
DocTable2 Query: SELECT mydocuments.id, mydocuments.age 
FROM mydocuments
Out[40]:
count mean std min 25% 50% 75% max
id 10.0 5.500000 3.027650 1.0000 3.250000 5.500000 7.750000 10.000000
age 10.0 0.468944 0.274453 0.0417 0.364492 0.422389 0.624827 0.942864
In [41]:
mean_age = db.select_series(db['age']).mean()
df = db.select_df([db['id'],db['age']])
df['old_grp'] = df['age'] > mean_age
df.groupby('old_grp').describe()
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
DocTable2 Query: SELECT mydocuments.id, mydocuments.age 
FROM mydocuments
Out[41]:
id age
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
old_grp
False 6.0 5.00 2.366432 2.0 3.25 5.0 6.75 8.0 6.0 0.296050 0.164948 0.041700 0.191678 0.375962 0.403693 0.439488
True 4.0 6.25 4.112988 1.0 4.00 7.0 9.25 10.0 4.0 0.728286 0.176534 0.530278 0.624827 0.720000 0.823459 0.942864
In [42]:
# more complicated groupby aggregation.
# calculates the variance both for entries above and below average age
mean_age = db.select_series(db['age']).mean()
df = db.select_df([db['name'],db['age']])
df['old_grp'] = df['age']>mean_age
df.groupby('old_grp').agg(**{
    'first_name':pd.NamedAgg(column='name', aggfunc='first'),
    'var_age':pd.NamedAgg(column='age', aggfunc=np.var),
})
DocTable2 Query: SELECT mydocuments.age 
FROM mydocuments
DocTable2 Query: SELECT mydocuments.name, mydocuments.age 
FROM mydocuments
Out[42]:
first_name var_age
old_grp
False user_1 0.027208
True user_0 0.031164

Select with Buffer

In cases where you have many rows or each row contains a lot of data, you may want to perform a select query which makes requests in chunks. This is performed using the SQL OFFSET command, and querying up to buffsize while yielding each returned row. This system is designed this way because the underlying sql engine buffers all rows retreived from a query, and thus there is no way to stream data into memory without this system.

NOTE: The limit keyword is incompatible with this method - it will return all results. A workaround is to use the approx_max_rows param, which will return at minimum this number of rows, at max the specified number of rows plus buffsize.

In [43]:
for row in db.select_chunk(chunksize=2, max_rows=3, where=(db['id']%2)==0):
    print(row)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id % :id_1 = :param_1
 LIMIT :param_2 OFFSET :param_3
(2, 'user_1', 0.43948787762004426, False)
(4, 'user_3', 0.398902981426976, False)
DocTable2 Query: SELECT mydocuments.id, mydocuments.name, mydocuments.age, mydocuments.is_old 
FROM mydocuments 
WHERE mydocuments.id % :id_1 = :param_1
 LIMIT :param_2 OFFSET :param_3
(6, 'user_5', 0.04169970674398138, False)