From Spreadsheets To Relational Databases and Back: Abstract
From Spreadsheets To Relational Databases and Back: Abstract
and Back?
1 Introduction
Spreadsheet tools can be viewed as programming environments for non-profes-
sional programmers. These so-called “end-user” programmers vastly outnumber
professional programmers [28].
As a programming language, spreadsheets lack support for abstraction, test-
ing, encapsulation, or structured programming. As a result, they are error-prone.
In fact, numerous studies have shown that existing spreadsheets contain redun-
dancy and errors at an alarmingly high rate [24, 27, 29, 30].
Spreadsheets are applications created by single end-users, without planning
ahead of time for maintainability or scalability. Still, after their initial creation,
many spreadsheets turn out to be used for storing and processing increasing
amounts of data and supporting increasing numbers of users over long periods
of time. To turn such spreadsheets into database-backed multi-user applications
with high maintainability is not a smooth transition, but requires substantial
time and effort.
In this paper, we develop techniques for smooth transitions between spread-
sheets and relational databases. The basis of these techniques is the fundamental
insight that spreadsheets and relational databases are formally connected by a
data refinement relation. To find this relation we discover functional dependen-
cies in spreadsheet data by data mining techniques. These functional dependen-
cies can be exploited to derive a relational database schema. We then apply data
?
This work is partially funded by the Portuguese Science Foundation (FCT) under
grants SFRH/BD/30231/2006 and SFRH/BSAB/782/2008.
calculation laws to the derived schema in order to reconstruct a sequence of re-
finement steps that connects the relational database schema back to the tabular
spreadsheet. Each refinement step at the schema level is witnessed by bidirec-
tional conversion steps at the data level, allowing data to be converted from
spreadsheet to database and vice versa. Our approach is to employ techniques
for bidirectional transformation of types, values, functions, and constraints [32],
based on data refinement theory [23].
We have implemented data refinement rules for converting between tabular
and relational datatypes as a library in the functional programming language
Haskell [25]. On this library, frontends were fitted for the exchange formats
used by the spreadsheet systems Excel and Gnumeric. We have constructed two
tools (a batch and an interactive version) to read, optimize, refactor and query
spreadsheets. The tools get as argument a spreadsheet in the Excel or Gnumeric
format and they have two different code generators: the SQL code generator,
that produces SQL code to create and populate the corresponding relational
database, and an Excel/Gnumeric code generator that produces a (transformed)
spreadsheet.
This paper is organized as follows: Section 2 presents a motivating example
that is used throughout the paper. Section 3 briefly discusses relational databases
and functional dependencies. In Section 4 we define data refinements and frame-
work for constraint-aware two-level transformation. In Section 5 we present the
refinement rules to map databases into spreadsheets. In Section 6 we describe the
libraries and tools constructed to transform and refactor spreadsheets. Section 7
discusses related work and Section 8 contains the conclusions. In Appendix we
show the API of our library.
2 Motivating Example
Throughout the paper we will use a well-known example spreadsheet taken
from [8] and reproduced in Figure 1. This sheet stores information about a
housing renting system, gathering information about clients, owners and rents.
It also stores prices and dates of renting. The name of each column gives a clear
idea of the information it represents.
For the sake of argument, we extend this example with two additional columns,
named totalDays (that computes the days of renting by subtracting the column
Fig. 2. The spreadsheet after applying the third normal form refactoring.
After establishing a mapping between the original spreadsheet and a rela-
tional database schema, we may want to use SQL to query the spreadsheet.
Regarding the house renting information, one may want to know who are the
clients of the properties that where rented between January, 2000 and January
2002? Such queries are difficult to formulate in the spreadsheet environment. In
SQL, the above question can be formulated as follows:
Below we will demonstrate that the automatically derived mapping can be ex-
ploited to fire such SQL queries at the original or the optimized spreadsheet.
In the next sections, we will formalize the correspondence between spread-
sheets and relational schemas using data refinement rules. We will present formal
proofs that guarantee their correctness. Moreover, we will present a framework
that implements the transformation rules and includes frontends to well-known
spreadsheet systems. In fact, the example presented in this section was processed
by our framework.
les
p
Tu
Fig. 3. An example of a relation that represents part of our example.
on the key attributes. A relation respects the 2NF if it is in the first normal form
(1NF) and its non-key attributes are not functionally dependent on part of the
key. Finally, the 1NF is respected if each element of each row contains only one
element.
In order to define the RDB schema, we use the data mining algorithm Fun
[18] to compute the FD given a spreadsheet, and then database techniques,
namely Maier’s algorithm [16], to compute the RDB schema in the 3NF.
We have expressed Fun as the Haskell f un function. Next, we execute
this function with our running example (the arguments propSch and propRel
correspond to the first and remaining lines of the spreadsheet, respectively).
∗ ghci ifun propSch propRel
ownerNo * oName
clientNo * cName
totalDays * clientNo, cName
propertyNo * pAddress, rentPerDay, ownerNo, oName
pAddress * propertyNo, rentPerDay, ownerNo, oName
...
The FDs derived by the Fun algorithm depend heavily on the quantity and
quality of the data. Thus, for small samples of data, or data that exhibits
too many or too few dependencies, the Fun algorithm may not produce the
desired FDs. For instance, in our running example and considering only the
data shown on Figure 1, the Fun algorithm does not induce the following FD
clientNo, propertyNo * rentStart, rentFinish, total rent, totalDays.
Functional dependencies are the basis for defining the RDB schema. The Fun
algorithm, however, may compute redundant FDs which may have a negative
impact on the design of the RDB schema. In this section, we discuss character-
istics of spreadsheets that can be used to define a more precise set of functional
dependencies.
Spreadsheets use formulas to define the values of some elements in terms
of other elements. For example, in the house renting spreadsheet, the column
totalDays is computed by subtracting the column rentFinish to rentStart, and
it is usually written as follows G3 = F3 - E3. This formula states that the
values of F3 and E3 determine the value of G3, thus inducing the following
functional dependency: rentStart, rentF inish * totalDays. Note also that
totalDays is the primary key of a FD produced by the Fun algorithm, namely
totalDays * clientNo, cName. Primary keys, however, must be constants rather
than formulas. Thus, such FDs should be eliminated.
Formulas can have references to other formulas. Consider, for example, the
second formula of the running example I3 = G3 * H3, which defines the total
rent by multiplying the number of days by the value of the rent. Because G3 is
defined by another formula, the values that determine G3 also determine I3. As
a result, the two formulas induce the following FDs:
rentStart, rentFinish, rentPerDay * total rent
rentStart, rentFinish * totalDays
Functional dependencies induced by formulas are added to the ones computed
by the Fun algorithm. In genereal a spreadsheet formula of the form X0 =
f (X1 , . . . , Xn ) induces the following functional dependency: X1 , . . . , Xn * X0 .
In spreadsheet systems, formulas are usually introduced by copying them through
all the elements in a column, thus making the FD explicit in all the elements.
This may not always be the case and some elements can be defined otherwise
(e.g. by using a constant value or a different formula). In this case, no functional
dependency is induced.
Having computed the functional dependencies, we can now construct the schema
of the RDB. Maier in [16] defined an algorithm called synthesize that receives
a set of FDs and returns a relational database schema respecting the 3NF.
begin synthesize :
Input a set of FDs F
Output a complete database schema for F
1. find a reduced, minimum annular cover G for F ;
2. for each CFD (X1 , X2 , ..., Xn ) * Y in G, construct a relational schema
R = X1 X2 ...Xn Y with designated keys K = {X1 , X2 , ..., Xn };
3. return the set of relational schemas constructed in step 2.
end synthesize
4 Constraint-aware Rewriting
As we have explained before, the mapping between the spreadsheet and the
RDB models is performed through data refinements using the 2LT system. Thus,
before we present the data refinement rules to map spreadsheets into databases,
let us briefly describe data refinements and the 2LT system3 .
be propagated to the global datatype in which they are embedded, that is, if
A 6to F to
from B then F A 6F from F B where F is a functor that models the context
of the transformation. A functor captures i) the embedding of local datatypes
inside global datatypes and ii) the lifting of value-level functions to and f rom
on the local datatypes to value-level transformations on the global datatypes.
In the particular case where the refinement works in both directions we have an
isomorphism A ∼ = B.
A common example is that maps are the implementation for lists [9] –
A 6seq2index
?
list N * A – where seq2index creates a map (or finite function,
here represented as M ap) where the keys are the indexes of the elements of the
list. list just collects the elements in the map. For more details about data re-
finements the reader is referred to [17, 19, 23].
Although the refinement are from a type a to a type b, this can not be directed
encoded since the type b is only known when the transformation completes, so
the type b is represented as a view of the type a. A view means that given a
function to which transforms a type a into a type b and a vice versa function
from it is possible to construct b from a.
These functions are represented in a point-free style, that is, without any
variables. Its representation is accomplished by the following GADT:
data PF a where
π1 :: PF ((a, b) → a)
π2 :: PF ((a, b) → b)
list2set :: PF ([a ] → Set a)
ρ :: PF ((a * b) → Set b)
δ :: PF ((a * b) → Set a)
CompList :: PF ([(a, b)] → [(b, b)])
ListId :: PF ([(a, b)] → [(b, b)])
·? :: PF (a → b) → PF ([a ] → [b ])
·? :: PF (a → b) → PF (Set a → Set b)
·∧· :: PF (Pred a) → PF (Pred a) → PF (Pred a)
·◦· :: PF (b → c) → PF (a → b) → PF (a → c)
·4· :: PF (a → b) → PF (a → c) → PF (a → (b, c))
·×· :: PF (a → b) → PF (c → d ) → PF ((a, c) → (b, d ))
·⊆· :: PF (a → Set b) → PF (a → Set b) → PF (Pred a)
Tables2table :: PF ((a * b, c * d ) → (a * b, Maybe d ))
Table2tables :: PF ((a * b, Maybe d ) → (a * b, c * d ))
Table2sstable :: PF ((a * b) → [(a, b)])
Sstable2table :: PF ([(a, b)] → (a * b))
...
To represent datatypes with constraints the following new Type constructor
is used:
·· :: Type a → PF (a → Bool ) → Type a
Its first argument is the type to constraint and the second one is the PF function
representing the constraint.
Each refinement rule can only be applied to a specific datatype, for instance,
a refinement on the type A can not be applied to the type A × B . To allow this
some rule-based combinators were created:
Using combinators, rules can be combined in order to create a full rewrite sys-
tems.
It is possible to transform a table (a map with key of type A and values of type
B) into a sstable (a list of tuples) and vice-versa as long as there is a constraint
imposing that there exists a FD between the elements in the column of type A
and the column of type B:
T able2sstable
+
A*Bk 6 (A × B)?list2set◦compList⊆list2set◦listId
Sstable2table
Here, list2set transforms a list into a set, compList sees a list as a relation
and compose it with its inverse. listId is the list resulting from transforming the
id relation into a list. This definition of FD is based on the definition of FD
presented in [21]. From now on this invariant will be designated fd . The proof
of this data refinement can be found in [22].
The rule is implemented in Haskell as follows.
table2sstable :: Rule
table2sstable (a * b)i = return $ View rep [a × b]inv 0
where
inv 0 = trySimplify (i ◦ Sstable2table ∧ fd )
rep = Rep{to = Table2sstable, from = Sstable2table }
where trySimplify is a rule that simplifies the invariant.
If the relational table has already an invariant, then it is composed with
the invariant of the new type concerning with the FD. The resulting invariant
is then simplified. This is just part of the rule’s implementation since another
function is necessary to encode the rule when the initial argument does not have
an associated invariant.
Let us use this rule to map the table with information about clients of our
running example.
∗ ghci itable2sstable (clientNo * cName)
Just (View (Rephtoihfromi) [clientNo × cName]fd )
The result of this rule is a datatype modelling a spreadsheet that constains
the attribute of the database. The invariant fd guarantees that the functional
dependency is now present in the datatype. Moreoever, the returned to and from
functions are the migration functions needed to map the data between the two
models.
A pair of tables where the primary key of the first table is a foreign key to the
primary key of the second table, can be refined to a pair of sstables using the
following law:
((A * B) × (C * D))πA ◦δ◦π1 ⊆πC ◦δ◦π2
G
Sstables2tables 6 T ables2sstable
((A × B)?f d × (C × D)?f d )πA ◦list2set◦π1? ◦π1 ⊆πC ◦list2set◦π1? ◦π2
The invariant guarantees that exists a FK from the PK of the first table
to the PK of the second table. The πA projection has type πA : A → E and
πC : C → E. A must be a tuple of the form A1 × . . . × E × . . . × An and C of the
form C1 × . . . × E × . . . × Cn . This allows that just part of the PK of each table
is used in the FK definition. The proof of this refinement corresponds to apply
twice the refinement presented in Section 5.1. The invariant must be updated
to work on the new type. The Haskell function table2sstable implements the
rule.
table2sstable :: Rule
table2sstable ((a * b) × (c * d ))πA ◦δ◦π1 ⊆πC ◦δ◦π2 =
return $ View rep ([a × b]fd × [c × d ]fd )inv
where
inv = πA ◦ list2set ◦ π1? ◦ π1 ⊆ πC ◦ list2set ◦ π1? ◦ π2
rep = Rep{to = Table2sstable × Table2sstable,
from = Sstable2table × Sstable2table }
A particular instance of this refinement occurs when πA is the identity func-
tion. In this case all the attributes of the PK of the first table, are FKs to part
of the attributes of the PK of the second table. Another instance of this refine-
ment is when πC = id, that is, part of the attributes of the PK of the first table
reference all the attributes of the PK of the second table. Both cases are very
similar to the general one and they are not shown here. A final instance of this
rule presents πA and πC has the identity function meaning that all the attributes
of the PK of the first table are FKs to all the attributes of the PK of the second
table. In this case the refinement changes as we show next.
T ables2table
+
((A * B) × (A * C))δ◦π1 ⊆δ◦π2 ∼
= A * (C × B?)
k
T able2tables
The values of the second table have a ?4 because it can be the case that
some keys in the second table are not referenced in the first one. This happens
because it is not always true that all the keys have a foreign usage. The proof of
such an isomorphism is as follows.
4
This symbol means optional. it is also representable as A + 1. In Haskell it is
represented by the datatype Maybe x = Just x | Nothing.
A * (C × B?)
∼
= = 1 * A}
{ A? ∼ (1)
A * (C × (1 * B))
∼
= = ((A * D) × (A × B * C))π? ◦δ◦π2 ⊆δ◦π1 }
{ A * (D × (B * C)) ∼ (2)
1
Proofs of rules 1, 2, and 3 can be found in [9], [4], and [23]. A function in
Haskell was created to define this isomorphism.
tables2table :: Rule
tables2table ((a * b) × (a * c))δ◦π1 ⊆δ◦π2 =
return $ View rep (a * c × Maybe b)
where
rep = Rep{to = Tables2table, from = Table2tables }
Note that each of these rules has a dual one, that is, for each rule refining a
pair A × B there exists another one refining the pair B × A, with the appropriate
invariant.
Sstables2tables0 6 T ables2sstables0
((A × B)?f d × (C × D)?f d )πB ◦list2set◦π2? ◦π1 ⊆πC ◦list2set◦π1? ◦π2
The proof of this refinement corresponds again to apply twice the refinement
presented in Section 5.1. The invariant must be updated to work on the new
type too. The function tables2sstables implements this refinement.
tables2sstables 0 ((a * b) × (c * d ))πB ◦ρ◦π1 ⊆πC ◦δ◦π2 =
return $ View rep ([a × b]fd × [c × d ]fd )πB ◦list2set◦π2? ◦π1 ⊆πC ◦list2set◦π1? ◦π2
where
rep = Rep{to = Table2sstable × Table2sstable,
from = Sstable2table × Sstable2table }
This refinement has three other particular cases. One where πB = id, another
where πC = id, and finally when πB = πC = id. In these cases the refinement is
the same, only the invariant changes. The proof and implementation of this rule
are very similar to the general case and so not shown here.
Let us consider again our running example. In order to use this rule we
consider two FDs where all the PK of second table is referenced by part of the
non-key attributes of the first one.
∗ ghci ilet prop = propertyNo * pAddress × rentPerDay × ownerNo
∗ ghci ilet owner = ownerNo * oName
∗ ghci imaps2tables (prop × owner )πownerNo ◦ρ◦π1 ⊆δ◦π2
Just (View (Rephtoihfromi)
(prop 0 × owner 0 )πownerNo ◦list2set◦π1? ◦π1 ⊆list2set◦π1? ◦π2 )
where
prop 0 = [propertyNo × pAddress × rentPerDay × ownerNo]
owner 0 = [ownerNo × oName]
In this case the two initial pairs are transformed into two lists. The constraint
is updated to work with such structures.
The individual refinement rules can be combined into a compound rules and full
transformation systems using the strategy combinators shown in Section 4.2. In
particular, we define a compound rule to map a RDB to a spreadsheet:
rdb2ss :: Rule
rdb2ss = simplify B
(many ((aux tables2sstables) (aux tables2sstables 0 ))) B
(many (aux table2sstable))
where
aux r = ((once r ) B simplify) B many ((once r ) B simplify)
This function starts by simplifying the invariants. Then the tables2sstables rule
(defined in Section 5.2) is applied exhaustively (with aux ) to transform tables
into sstables. After that the tables2sstables 0 rule (Section 5.3) is applied. In a final
step, the remaining maps are transformed using the table2sstable (Section 5.1)
rule. After each rule has been applied a simplification step is executed. This
strategy requires the simplification of the invariants because pattern matching
is performed not only on the type representations, but also on the invariants.
The simplify rule is defined as follows:
simplify :: Rule
simplify = many (prodsrdb mapsrdb others myRules) B compr
The functions prodsrdb, mapsrdb, others, and myRules are functions that simplify
invariants. We include here the code of the first one. It simplifies products:
prod def :: Rule
prod def (Func (a × b) ) (f × g) = success "Prod-Def" ((f ◦ π1 )4(g ◦ π2 ))
prod def = fail "Cannot apply Prod-Def!"
Tools Two spreadsheet tools: A batch and a online tool that allow the users to
read, transform, refactor and query spreadsheets.
The SQL Backend: This backend generates SQL code which creates the database
according to the derived RDB schema. This is basically a simple SQL create
instruction based on the RDB schema. Furthermore, it produces SQL code to
insert the migrated data in the database, and, again, this corresponds to a SQL
insert instruction with the migrated data as argument. Because some values
of the spreadsheet are defined through formulas, we generate also a SQL trigger
that models the spreadsheet formulas which are used to update the database and
guarantee its integrity. Next, we present the trigger induced by the two formulas
of our running example:
create trigger ssformulas before insert on tbl
for each row begin
set new . totalDays = new . rentFinish − new . rentStart;
set new . total rent = new . rentPerDay ∗ new . totalDays;
end ;
The SQL code to create the database and insert the migrated data can be
obtained from the HaExcel webpage. We omit them here.
7 Related Work
8 Conclusions
In this paper, we have explored the relation between spreadsheets and relational
databases. In particular, we have made the following contributions.
1. We have extended the 2LT framework for two-level transformations with
constraint-aware
2. conversion rules between relational and tabular data structures. The correct-
ness of these rules is supported with proofs.
3. We have shown how these rules can be combined into a strategic rewrite
4. system that transforms a relational schema into a tabular schema and on
the fly derives conversion functions between these schemas.
5. We have combined this rewrite system with methods for discovering tables
in spreadsheets, deriving functional dependencies from the data contained in
them, and deriving relational schemas from these functional dependencies.
6. To these algorithmic components, we have connected importers and ex-
porters for SQL and spreadsheet formats.
7. We have shown how the resulting system can be employed to convert spread-
sheets to relational databases and back. This allows refactoring of spread-
sheets to reduce data redundancy, migration of spreadsheet applications to
database applications, and advanced querying of spreadsheets with SQL
queries.
Notwithstanding these contributions, our approach presents a number of limita-
tions that we hope to remove in future. For example, we are currently supporting
only set of commonly used formulas, which remains to be enlarged to a wider
range.
More importantly, the algorithm for inferring functional dependencies needs
improvement. This algorithm does not work well for small sets of data and is
sensitive to “accidental” patterns in the data. Some well-chosen heuristics and
limited user interaction could alleviate this problem.
In the two-level rewrite system, syntactic matching is performed on repre-
sentations of constraints. Such syntactic matching could be generalized to verifi-
cation of logical implication of the actual constraint and the required constraint.
With respect to formulas and queries, we are not yet exploiting some inter-
esting opportunities. For example, a formula or query expressed in terms of the
source spreadsheet may be composed with the backward conversion function to
obtain a query on the target database or refactored spreadsheet. Such migrations
of queries have been explored in [10].
Finally, we are currently ignoring the possibility of having repeated rows in
the source spreadsheet. Also the order of rows is not taken into account. Thus,
we are effectively treating spreadsheet tables as sets of rows rather than bags or
lists of rows. The data refinement theory that underpins our approach can be
exploited to support these alternative perspectives where order and repetition
are considered relevant.
Availability The HaExcel library and tools are available from the homepage of
the first author.
References
A API
-- type definitions
type R a = Attributes a
type Relation a = [[a ]]
data Attributes a = Atts{atts :: [Attribute a ]}
data Attribute a = Att{att :: a }
type FD a = (Attributes a, Attributes a)
type FDS a = [FD a ]
type CFD a = (LeftSets a, RightSets a)
type CFDS a = [CFD a ]
type LeftSets a = [Attributes a ]
type RightSets a = Attributes a
type LKS a = [LK a ]
data LK a = LK {candidate :: Candidate a, cardinality :: Int,
quasiClosure :: Attributes a, closure :: Attributes a }
type Candidate a = Attributes a
data Box where Box :: Type a → a → Box
-- functions to manipulate FDs
fun :: Monad m ⇒ R b → Relation a → m (LKs b) -- implementation of Fun
lks2fds :: Eq a ⇒ LKS a → FDS a -- transforms LKS into FDS
rel2rdb :: R a → Relation b → Box -- produces a RDB schema
synthesize :: FDS a → CFDS a -- implementation of synthesize
-- import/export functions
readSS :: FilePath → IO [Sheet Fml ] -- reads a SS file; returns tables
ss2boxes :: FilePath → IO [(Maybe Box , Maybe Box )] -- returns type tables
ss2rdb :: FilePath → IO [Maybe Box ] -- SS to RDB type
ss2SQL :: FilePath → FilePath → IO () -- SS to an SQL script
genSQL :: [(Maybe Box , Maybe Box )] → [[String ]] -- produces SQL queries
ss2DB :: FilePath → MySQLOptions → IO () -- SS to a MySQL DB
genDB :: [(Maybe Box , Maybe Box )] → MySQLOptions → IO () -- gives a DB
ss2SS :: FilePath → FilePath → IO () -- SS to a new SS
genSS :: [Maybe Box ] → Gmr 0 Workbook → Gmr 0 Workbook -- produces a SS
-- refinement law functions
table2sstable :: Rule -- rule from Section 5.1
tables2table Rule -- rule from Section 5.2
tables2sstables :: Rule -- rule from Section 5.3
rdb2ss :: Rule -- strategy: rule from Section 5.4