database - Programmatically determine required series of JOIN statements in SQL -
in standard database theory on referential integrity , inclusion dependencies, it's alluded there pspace algorithm taking known list of e.g. foreign key , primary key relationships plus candidate relationship , determining if candidate relationship logically implied known list. non-deterministic version of algorithm given in algorithm 9.1.6 of this book [pdf], not explicit deterministic version (and i'm guessing idea isn't somehow manually use savitch's theorem find algorithm).
a related question how programmatically ask database system question this:
given result set query 1 , result set query 2, there known-in-advance chain of primary / foreign key dependencies allow me join 2 result sets (possibly requring additional columns come derived chain of key relationships).
i'm thinking in context of postgres, mssql, or mysql, how ask database information programmatically.
i can query sp_help
, sp_fkeys
stored procedures, it's unclear how can provide set of columns given table , set of columns target table, , directly ask database if can derive, outputs of things sp_fkeys
, way join tables each other.
i'm more willing (happy even) write code myself if there known algorithms (based on theorem mentioned above, @ least math algorithm exists if it's not part of database systems directly).
my application ingest assortment of tables poorly maintained schemas, , metaprogramming during schema discovery phase, there not need human-in-the-loop try manually search chains of key relationships imply way 2 given tables can joined.
suppose database these 3 tables:
tablea | a_key1 | a_key2 | a_value | +---------+---------+---------+ | 1 | 'foo' | 300 | | 2 | 'bar' | 400 | tableb | b_key1 | b_key2 | b_value | +---------+------------+---------+ | 'foo' | 2012-12-01 | 'bob' | | 'bar' | 2012-12-02 | 'joe' | tablec | c_key1 | c_key2 | c_value | +------------+---------+---------+ | 2012-12-01 | 100 | 3.4 | | 2012-12-02 | 200 | 2.7 |
and suppose b_key1
foreign key a_key2
, c_key1
foreign key b_key2
, , the "key" columns of each table, together, form primary key. it's not important whether tables normalized (so, instance, fact b_key1
can identify row in tablea
though tablea
ostensibly has 2-tuple key -- unimportant , happens tables inherit in wild).
then in case, knowing key relationships in advance, know following join "supported" (in sense data types , value ranges must make sense under system of foreign keys):
select a.*, c.c_value tablea join tableb b on a.a_key2 = b.b_key1 join table c on b.b_key2 = c.c_key1
so if person came database 2 separate queries:
select * tablea select c_key1, c_value tablec
then database system ought able deduce joins required query brings these 2 alignment (as implied foreign keys, not other arbitrary ways of making them alignment).
my question is: search functionality produce series of required joins exist in major sql-compliant rdbmss?
one can imagine following idea discussed in answer , sucking metadata other programming language, deconstructing kind of graph of key relationships, , using known graph path-finding algorithm solve problem. homebrewed implementations of fraught errors , it's kind of problem algorithms can written exponentially slow. asking if exists in database system already, i'm trying leverage work done make smart , database-aware implementation if work done.
let set of input queries q, , assume each of these queries selects subset of columns single table. i'll assume that, when joining table t in q table outside q, want consider fks can formed columns present in query t; when joining table outside q table outside q, can use fk defined between these 2 tables.
in case suggest want know if there unique set of joins generate combined column set of input queries. , can tested efficiently:
- form graph g having v = {all tables} , e = {(u,v) | there fk relationship u v or v u}, added restriction edges incident on tables in q can use columns present in corresponding input queries.
- build spanning forest f on g.
- if pair of tables in q appear in different components of f know there no query satisfies our goal, , can stop.
- otherwise, there @ least 1 such query, involving or of tables in unique component x containing tables in q.
- to rid of pointless joins, delete edges in x not on path table in q other table in q (e.g. repeatedly "nibbling away" pendant edges table outside of q, until can no longer done).
- for each edge e remains in new, smaller tree x:
- create copy x_e of induced subgraph of v(x) in g, , delete edge e x_e.
- create spanning forest f_e x_e.
- if f_e consists of single component, can mean there second (i.e. distinct) way of joining tables in q avoids edge e, can stop.
- if point, every edge e in x necessary, , can conclude set of joins implied x unique way produce set of columns input query set q.
Comments
Post a Comment