bugsql/cvs-verify.html

<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>db_validate.pl - Simple validation for DBD::CSV database creation</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:trevor@bert.muppets.liwing.de." />
</head>

<body style="background-color: white">


<!-- INDEX BEGIN -->
<div name="index">
<p><a name="__index__"></a></p>

<ul>

	<li><a href="#name">NAME</a></li>
	<li><a href="#synopsis">SYNOPSIS</a></li>
	<li><a href="#description">DESCRIPTION</a></li>
	<li><a href="#schema">Schema</a></li>
	<ul>

		<li><a href="#table_fields">Table Fields</a></li>
		<li><a href="#metadata">Metadata</a></li>
	</ul>

	<li><a href="#database_creation">DATABASE CREATION</a></li>
	<ul>

		<li><a href="#creating_a_schema">Creating a schema</a></li>
		<li><a href="#creating_the_raw_data_for_the_database">Creating the raw data for the database</a></li>
	</ul>

	<li><a href="#validating_an_existing_database">Validating an existing database</a></li>
	<li><a href="#copyright">COPYRIGHT</a></li>
	<li><a href="#author">AUTHOR</a></li>
	<li><a href="#bugs">BUGS</a></li>
	<li><a href="#see_also">SEE ALSO</a></li>
</ul>

<hr name="index" />
</div>
<!-- INDEX END -->

<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>db_validate.pl - Simple validation for DBD::CSV database creation</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
 perl db_validate.pl -s schema.txt -r raw_data.txt -d data_dir</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>DBD::CSV is a useful module if you are forced to work with CSV files as a
database.  Unfortunately, the module does no data validation.  This
program allows you to define a schema, along with UNIQUE fields and foreign
key constraints.  Further, basic data type validation for INT, CHAR, and
VARCHAR is supported.  Oddly enough, DBD::CSV, while only allowing those
datatypes to be used in a <code>CREATE table</code> statement, does not actually check
to see whether or not the data you are inserting matches those datatypes.</p>
<p>If one supplies the -v option on the command line, then this program validates
and existing database.  The -r option is then used to specify where the
raw_data will be written to.  The -n (&quot;no header&quot;) option may be used when
validating if you are validating a CSV file while does not have field names
listed as the first line of the file.</p>
<p>Much of the information provided in here is for those who wish to maintain and
extend this program.  These sections are marked with <strong>MAINTENANCE BEGIN</strong> and
<strong>MAINTENANCE END</strong> and may be skipped if you do not care about this.</p>
<p>
</p>
<hr />
<h1><a name="schema">Schema</a></h1>
<p>Each database to be created and validated must have an associated schema
created by the programmer.</p>
<p>The schema location is passed to the program via the command line -s switch.
The schema is then read into the <code>$schema</code> variable in <code>&amp;schema</code>. <code>$schema</code>
is a hash ref.  Every key in the hash that does not begin with an underscore
(_) is assumed to be a table name.  Each table is also a hashref.  Every key
in a table hash that does not begin with an underscore is assumed to be a
field name.  Keys beginning with underscores are metadata describing their
respectives tables.  These keys are <a href="#_field_order"><code>_field_order</code></a>, <a href="#_foreign_key"><code>_foreign_key</code></a>,
<a href="#_unique"><code>_unique</code></a>, and <code>_rec_count</code>.  These are used to validate the data in the
database.</p>
<p>The format of the schema file must be as follows (items enclosed in curly
braces are optional):</p>
<pre>
 [tablename]
 field DATATYPE {UNIQUE} {/regex/}
 _foreign_key tablename.field foreigntable foreigntable.key</pre>
<p>Currently, the only datatypes supported are INT, CHAR, and VARCHAR.  CHAR and
VARCHAR must have a number in parenthese following them which specifies the
maximum number of characters allowed.  For example:</p>
<pre>
 [users]
 user_id   INT UNIQUE
 user_name VARCHAR(10)</pre>
<p>In the schema, blank lines and lines starting with a sharp (#) are skipped.
Fields in the database will be in the order listed.  As soon as a new table
name is encountered, the previous table definition is assumed to be complete.</p>
<p>A regex may be supplied as the last item on the field definition line.  If a
field value does not match the regex, a warning will be written to the error
log.  For example to ensure that user ids are positive integers:</p>
<pre>
 [users]
 user_id INT UNIQUE /^\d+$/</pre>
<p>The regex <strong>must</strong> begin and end with a forward slash.  You can use a regex such
as <code>/.+/</code> to enforce a <code>NOT NULL</code> requirement.</p>
<p>Also note that you may write <code>INTEGER</code> instead of <code>INT</code>, if you prefer.</p>
<p>
</p>
<h2><a name="table_fields">Table Fields</a></h2>
<p>Each field in a table may be of type INT, VARCHAR, or CHAR.  These are the only
datatypes supported by DBD::CSV when creating tables.  These datatypes are
stored in <code>$int</code>, <code>$varchar</code>, and <code>$char</code>, respectively.  Let's examine the
following SQL <code>CREATE</code> statement and see how it's translated:</p>
<pre>
 CREATE TABLE users (
    user_id INTEGER UNIQUE,
    name    VARCHAR(64),
    area    CHAR(8)
 )</pre>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>The following is the minimum legal specification for this table, as defined in
the <code>$schema</code> variable.</p>
<pre>
 users =&gt; {
     user_id      =&gt; $int,
     name         =&gt; &quot;$varchar(64)&quot;,
     area         =&gt; &quot;$char(8)&quot;,
     _field_order =&gt; [qw/ user_id name area /]
 }</pre>
<p>Upon validating the data, and exception will be thrown if <code>user_id</code> does not
match the regular expression <code>/^-?\d+$/</code>.</p>
<p>The <code>name</code> would throw an exception if it exceeded the length of 64 and
<code>area</code> would throw an exception if the length did not match 8.  When the table
is created, the order of fields in the table would match the order specified in
<a href="#_field_order"><code>_field_order</code></a>.</p>
<p><strong>MAINTENANCE END</strong></p>
<p>This definition for this table would be created in the schema file as follows:</p>
<pre>
 [users]
 user_id INT
 name    VARCHAR(64)
 area    CHAR(8)</pre>
<p>The</p>
<p>
</p>
<h2><a name="metadata">Metadata</a></h2>
<p>As mentioned previously, there are four metadata keys used for tables.  These
are <a href="#_field_order"><code>_field_order</code></a>, <a href="#_foreign_key"><code>_foreign_key</code></a>, <a href="#_unique"><code>_unique</code></a>, and <code>_rec_count</code>.</p>
<dl>
<dt><strong><a name="_field_order" class="item">_field_order</a></strong></dt>

<dd>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>Since hashes have an effectively arbitrary order for their keys, the
<a href="#_field_order"><code>_field_order</code></a> key is used to specify the order of the fields when the tables
are created in the database.  This is an array reference in the <code>$schema</code>
variable:</p>
<pre>
 _field_order =&gt; [qw/ user_id name area /]</pre>
<p>If the tables fields and the field order do not match, an exception will be
thrown when the program attempts to create the table.</p>
<p><strong>MAINTENANCE END</strong></p>
</dd>
<dt><strong><a name="_unique" class="item">_unique</a></strong></dt>

<dd>
<p><code>DBD::CSV</code> does not actually allow for PRIMARY KEYS.  Rather than try to write
a bunch of arcane and effectively useless code to allow for this, we have a
<code>UNIQUE</code> specifier.  Internally, this is merely an array reference specifying
the keys in the table for which we do not allow duplicate values.</p>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>In the example of the <code>users</code> table above, the table specification is as
follows:</p>
<pre>
 users =&gt; {
     user_id      =&gt; $int,
     name         =&gt; &quot;$varchar(64)&quot;,
     area         =&gt; &quot;$char(8)&quot;,
     _field_order =&gt; [qw/ user_id name area /],
     _unique      =&gt; [qw/ user_id /]
 }</pre>
<p><strong>MAINTENANCE END</strong></p>
<p>To make a user_id UNIQUE, we do the following:</p>
<pre>
 [users]
 user_id INT UNIQUE
 name    VARCHAR(64)
 area    CHAR(8)</pre>
<p>After all data has been added to the database, the program will validate all
unique fields and an error will be written to an error log if any duplicate
values are found in these fields.</p>
</dd>
<dt><strong><a name="_foreign_key" class="item">_foreign_key</a></strong></dt>

<dd>
<p>Despite arguments by some diehard MySQL supporters :), any real database
supports foreign key constraints.  Naturally, DBD::CSV does not.  The
<a href="#_foreign_key"><code>_foreign_key</code></a> value for a table is an array ref of array refs.  Each inner
array has the following format:</p>
<pre>
 field foreign_table foreign_field</pre>
<p>After data is added to the database, foreign key validation will occur.  For
each table that has a <code>_foreign_field</code> key, the program will look up the value
in <code>field</code> and ensure that <code>foreign_table</code> has a corresponding valud in
<code>foreign_field</code>.  Further, the <code>foreign_field</code> must be designated as a
<a href="#_unique"><code>_unique</code></a> field.  If these conditions are not met, an entry will be added to
the error log.</p>
<p>Consider a lookup table for a music CD database.  Each CD could potentially
have several artists and each artist could be present on several CDs.  Assuming
we have two tables named <code>artists</code> and <code>CDs</code>, with <code>artist_id</code> and <code>CD_id</code>
respectively, we could define the lookup table as follows:</p>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>Internal representation</p>
<pre>
 cd_artist =&gt; {
     artist       =&gt; $int,
     cd           =&gt; $int,
     _field_order =&gt; [qw/ artist_id cd_id /],
     _foreign_key =&gt;    [
         [ qw/ artist artists artist_id / ],
         [ qw/ cd CDs CD_id / ]
     ]
 }</pre>
<p><strong>MAINTENANCE END</strong></p>
<pre>
 [cd_artist]
 artist INT
 cd     INT
 _foreign_key artist artists artist_id
 _foreign_key cd CDs CD_id</pre>
<p>Note that the tables will still be created if a foreign key constraint is
violated and there is nothing in the <code>DBD::CSV</code> module to prevent this.  This
is for advisory purposes only.  Your database will function even if the foreign
key is not defined as <code>UNIQUE</code>, but you should probably take a look at your
schema to look for problems.</p>
</dd>
<dt><strong><a name="_record_count" class="item">_record_count</a></strong></dt>

<dd>
<p>This is an internal field used to track how many records were written for each
table.  After data is added to the database, a list of the number of records
written to each table will be displayed to the screen.  A <strong>WARNING</strong> message
will occur for all tables that have no records written.</p>
</dd>
</dl>
<p>
</p>
<hr />
<h1><a name="database_creation">DATABASE CREATION</a></h1>
<p>
</p>
<h2><a name="creating_a_schema">Creating a schema</a></h2>
<p>Let's take a look at a simple database and see how it's all put together. We
have three tables: CDs, artist, and cd_artist.  The last is a lookup table.
What follows is not intended to be an complete database.</p>
<pre>
 [CDs]
 CD_id         INT /^\d++$/
 CD_name       VARCHAR(30) /.+/
 # date will be in YYYY
 year_released CHAR(4)</pre>
<pre>
 [artist]
 artist_id     INT UNIQUE /^\d+$/
 artist_name   VARCHAR(30) /.+/</pre>
<pre>
 [cd_artist]
 artist        INT
 cd            INT
 _foreign_key artist artist artist_id
 _foreign_key cd CDs CD_id</pre>
<p>Note that when this schema is read that the lines are split on whitespace.
<code>DBD::CSV</code> does not allow whitespace in a table name.</p>
<p>Further, I would not recommend the naming convention that I have used.  Foreign
key field names should (IMHO) match the field names of the tables that they
should match.  I have made them distinct so that it's easier to distinguish
them in this example.</p>
<p>
</p>
<h2><a name="creating_the_raw_data_for_the_database">Creating the raw data for the database</a></h2>
<p>Now that we have our schema, we need to create the raw data.  The data file
contains all of the tables and data that will be added to the database.  Each
table is separated by &quot;\n.\n&quot;.  The data for a table will be read in chunks
that is split on the separator.  I suppose that it's possible for someone to
have such a large amount of data that this could be a problem, but if you do,
you should be using a real database.  As with the schema file, lines composed
of all white space or beginning with a sharp will be discarded.  Fields must be
separated with a comma and if a field contains a comma, it must be quoted.</p>
<p>As each chunk is read, the program looks for the first line which contains data
and assumes this to be a table name.  You may use either of the two following
statements to declare a tablename:</p>
<pre>
 table=artist</pre>
<p>or simply:</p>
<pre>
 artist</pre>
<p>Here is a small raw data file:</p>
<pre>
 table=artist
 #artist_id, artist_name
 1,Ovid
 2,Yello
 .</pre>
<pre>
 CDs
 # CD_id, CD_name,date_released
 1,One Second,1987
 2,&quot;Baby&quot;,1991
 3,&quot;Ovid's Greatest Hits, Volume Zero&quot;,0000
 .</pre>
<pre>
 cd_artist
 # artist, cd
 1,3
 2,1
 4,2</pre>
<p>Let's say we save the schema in a file called <code>schema.txt</code> and the data in
a file called <code>raw.txt</code> and we want to write this to a database called
<code>stuff</code>.  We'd use the following command:</p>
<pre>
 perl db_validate.pl -s schema.txt -r raw.txt -d stuff</pre>
<p>If a directory called <code>stuff</code> does not exist, it will be created for us.
Issuing that command, we get the following:</p>
<pre>
 All tables successfully created.</pre>
<pre>
 ERROR in table: 'CDs'
 Field: 'Ovid's Greatest Hits, Volume Zero' does not match data type: 'VARCHAR(30)'.
 Record: 3,&quot;Ovid's Greatest Hits, Volume Zero&quot;,0000</pre>
<p>Be examing our schema, we see that we have defined <code>CD_name</code> as <code>VARCHAR(30)</code>.
The field in question is over 30 characters long, so we change <code>CD_name</code> in
the schema to <code>VARCHAR(40)</code>.  Then we rerun the command.  We get the following
output:</p>
<pre>
 All tables successfully created.</pre>
<pre>
 3 records added to table 'CDs'.
 2 records added to table 'artist'.
 3 records added to table 'cd_artist'.</pre>
<pre>
 All data successfully added to database.
 Now validating unique fields.
 Now validating foreign key constraints.
 Now validating data against regular expressions.</pre>
<pre>
 WARNING
 3 errors were found while validating the data.  Errors were written to 'err.log'</pre>
<p>Hmm... what errors?  Opening the error.log, we see:</p>
<pre>
 Foreign Key constraint violated.  Table 'cd_artist', field 'artist', value '4' Not found in table 'artist', field 'artist_id'
 Foreign Key constraint violated.  Table 'cd_artist', field 'cd'. 'CD_id' not defined as UNIQUE in 'CDs'
 Table: 'CDs', field 'CD_id' had invalid pattern '^\d++$'. Discarded.</pre>
<p>Oops.  Let's change those in our schema.</p>
<pre>
 CD_id         INT UNIQUE /^\d+$/</pre>
<p>The regular expression, in this case, may seem superfluous, but it does two things for us:</p>
<pre>
 1.  It effectively ensures that we will have at least one digit in the id.
 2.  It ensures that all ids are positive.</pre>
<p>Also, we see that there is no artist with an id of 4.  That should be changed to 2.
Rerunning the program this time creates all of the tables and generates absolutely
no errors.</p>
<p>
</p>
<hr />
<h1><a name="validating_an_existing_database">Validating an existing database</a></h1>
<p>If you have an existing CSV database, define a schema for the database.  Then,
run the program as normal using the <code>-v</code> switch.  This tells the program that
we will be validating an existing database rather than creating a new one from
scratch.  If you wish to validate a CSV file which does not have the field
names as the first line (i.e., the first line of the file is also data), then
use the <code>-n</code> option to tell the program that the CSV files has &quot;no headers&quot;.</p>
<p><strong>WARNING</strong>: You want to backup the database or CSV file before doing this!
This program will copy the database or CSV file to the raw_data file and, if
all goes well, it will drop your tables (or CSV file) before proceeding.  Then
it will attempt to re-add the data to the files.  If it does not successfully
add the data, your data may very well be corrupted.</p>
<p>Also, the files or files created by validating <strong>will</strong> have a header, as
<code>DBD::CSV</code> automatically adds them, due to the way the module works.  If this
is a problem, you should remove them yourself.</p>
<p>When you have completed validating a database, the raw data fill will not be
unlinked.  I have left the file there if it's necessary to debug your
application.</p>
<p>I strongly recommend using this feature if you have created an application that
uses DBD::CSV.  Use your application, put it through its paces (with all the
tests you wrote, right?) and then use this application to test for data
corruption.</p>
<p>
</p>
<hr />
<h1><a name="copyright">COPYRIGHT</a></h1>
<p>Copyright (c) 2001 Curtis &quot;Ovid&quot; Poe.  All rights reserved.
This program is free software; you may redistribute it and/or modify it under
the same terms as Perl itself</p>
<p>
</p>
<hr />
<h1><a name="author">AUTHOR</a></h1>
<p>Curtis &quot;Ovid&quot; Poe &lt;<a href="mailto:poec@yahoo.com">poec@yahoo.com</a>&gt;
Address bug reports and comments to: <a href="mailto:poec@yahoo.com.">poec@yahoo.com.</a></p>
<p>
</p>
<hr />
<h1><a name="bugs">BUGS</a></h1>
<p>2001/11/10 There are no known bugs at this time.  However, I modified this pretty
heavily to get it &quot;production ready&quot;.  Please let me know if there are any issues
with it.</p>
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="/DBD/CSV.html">the DBD::CSV manpage</a>, <em>DBI</em>, <a href="/Text/CSV_XS.html">the Text::CSV_XS manpage</a> and <a href="/SQL/Statement.html">the SQL::Statement manpage</a>.</p>

</body>

</html>