-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy pathcvs-verify.html
425 lines (408 loc) · 18.2 KB
/
cvs-verify.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>db_validate.pl - Simple validation for DBD::CSV database creation</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:[email protected]." />
</head>
<body style="background-color: white">
<!-- INDEX BEGIN -->
<div name="index">
<p><a name="__index__"></a></p>
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<li><a href="#schema">Schema</a></li>
<ul>
<li><a href="#table_fields">Table Fields</a></li>
<li><a href="#metadata">Metadata</a></li>
</ul>
<li><a href="#database_creation">DATABASE CREATION</a></li>
<ul>
<li><a href="#creating_a_schema">Creating a schema</a></li>
<li><a href="#creating_the_raw_data_for_the_database">Creating the raw data for the database</a></li>
</ul>
<li><a href="#validating_an_existing_database">Validating an existing database</a></li>
<li><a href="#copyright">COPYRIGHT</a></li>
<li><a href="#author">AUTHOR</a></li>
<li><a href="#bugs">BUGS</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<hr name="index" />
</div>
<!-- INDEX END -->
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>db_validate.pl - Simple validation for DBD::CSV database creation</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
perl db_validate.pl -s schema.txt -r raw_data.txt -d data_dir</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>DBD::CSV is a useful module if you are forced to work with CSV files as a
database. Unfortunately, the module does no data validation. This
program allows you to define a schema, along with UNIQUE fields and foreign
key constraints. Further, basic data type validation for INT, CHAR, and
VARCHAR is supported. Oddly enough, DBD::CSV, while only allowing those
datatypes to be used in a <code>CREATE table</code> statement, does not actually check
to see whether or not the data you are inserting matches those datatypes.</p>
<p>If one supplies the -v option on the command line, then this program validates
and existing database. The -r option is then used to specify where the
raw_data will be written to. The -n ("no header") option may be used when
validating if you are validating a CSV file while does not have field names
listed as the first line of the file.</p>
<p>Much of the information provided in here is for those who wish to maintain and
extend this program. These sections are marked with <strong>MAINTENANCE BEGIN</strong> and
<strong>MAINTENANCE END</strong> and may be skipped if you do not care about this.</p>
<p>
</p>
<hr />
<h1><a name="schema">Schema</a></h1>
<p>Each database to be created and validated must have an associated schema
created by the programmer.</p>
<p>The schema location is passed to the program via the command line -s switch.
The schema is then read into the <code>$schema</code> variable in <code>&schema</code>. <code>$schema</code>
is a hash ref. Every key in the hash that does not begin with an underscore
(_) is assumed to be a table name. Each table is also a hashref. Every key
in a table hash that does not begin with an underscore is assumed to be a
field name. Keys beginning with underscores are metadata describing their
respectives tables. These keys are <a href="#_field_order"><code>_field_order</code></a>, <a href="#_foreign_key"><code>_foreign_key</code></a>,
<a href="#_unique"><code>_unique</code></a>, and <code>_rec_count</code>. These are used to validate the data in the
database.</p>
<p>The format of the schema file must be as follows (items enclosed in curly
braces are optional):</p>
<pre>
[tablename]
field DATATYPE {UNIQUE} {/regex/}
_foreign_key tablename.field foreigntable foreigntable.key</pre>
<p>Currently, the only datatypes supported are INT, CHAR, and VARCHAR. CHAR and
VARCHAR must have a number in parenthese following them which specifies the
maximum number of characters allowed. For example:</p>
<pre>
[users]
user_id INT UNIQUE
user_name VARCHAR(10)</pre>
<p>In the schema, blank lines and lines starting with a sharp (#) are skipped.
Fields in the database will be in the order listed. As soon as a new table
name is encountered, the previous table definition is assumed to be complete.</p>
<p>A regex may be supplied as the last item on the field definition line. If a
field value does not match the regex, a warning will be written to the error
log. For example to ensure that user ids are positive integers:</p>
<pre>
[users]
user_id INT UNIQUE /^\d+$/</pre>
<p>The regex <strong>must</strong> begin and end with a forward slash. You can use a regex such
as <code>/.+/</code> to enforce a <code>NOT NULL</code> requirement.</p>
<p>Also note that you may write <code>INTEGER</code> instead of <code>INT</code>, if you prefer.</p>
<p>
</p>
<h2><a name="table_fields">Table Fields</a></h2>
<p>Each field in a table may be of type INT, VARCHAR, or CHAR. These are the only
datatypes supported by DBD::CSV when creating tables. These datatypes are
stored in <code>$int</code>, <code>$varchar</code>, and <code>$char</code>, respectively. Let's examine the
following SQL <code>CREATE</code> statement and see how it's translated:</p>
<pre>
CREATE TABLE users (
user_id INTEGER UNIQUE,
name VARCHAR(64),
area CHAR(8)
)</pre>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>The following is the minimum legal specification for this table, as defined in
the <code>$schema</code> variable.</p>
<pre>
users => {
user_id => $int,
name => "$varchar(64)",
area => "$char(8)",
_field_order => [qw/ user_id name area /]
}</pre>
<p>Upon validating the data, and exception will be thrown if <code>user_id</code> does not
match the regular expression <code>/^-?\d+$/</code>.</p>
<p>The <code>name</code> would throw an exception if it exceeded the length of 64 and
<code>area</code> would throw an exception if the length did not match 8. When the table
is created, the order of fields in the table would match the order specified in
<a href="#_field_order"><code>_field_order</code></a>.</p>
<p><strong>MAINTENANCE END</strong></p>
<p>This definition for this table would be created in the schema file as follows:</p>
<pre>
[users]
user_id INT
name VARCHAR(64)
area CHAR(8)</pre>
<p>The</p>
<p>
</p>
<h2><a name="metadata">Metadata</a></h2>
<p>As mentioned previously, there are four metadata keys used for tables. These
are <a href="#_field_order"><code>_field_order</code></a>, <a href="#_foreign_key"><code>_foreign_key</code></a>, <a href="#_unique"><code>_unique</code></a>, and <code>_rec_count</code>.</p>
<dl>
<dt><strong><a name="_field_order" class="item">_field_order</a></strong></dt>
<dd>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>Since hashes have an effectively arbitrary order for their keys, the
<a href="#_field_order"><code>_field_order</code></a> key is used to specify the order of the fields when the tables
are created in the database. This is an array reference in the <code>$schema</code>
variable:</p>
<pre>
_field_order => [qw/ user_id name area /]</pre>
<p>If the tables fields and the field order do not match, an exception will be
thrown when the program attempts to create the table.</p>
<p><strong>MAINTENANCE END</strong></p>
</dd>
<dt><strong><a name="_unique" class="item">_unique</a></strong></dt>
<dd>
<p><code>DBD::CSV</code> does not actually allow for PRIMARY KEYS. Rather than try to write
a bunch of arcane and effectively useless code to allow for this, we have a
<code>UNIQUE</code> specifier. Internally, this is merely an array reference specifying
the keys in the table for which we do not allow duplicate values.</p>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>In the example of the <code>users</code> table above, the table specification is as
follows:</p>
<pre>
users => {
user_id => $int,
name => "$varchar(64)",
area => "$char(8)",
_field_order => [qw/ user_id name area /],
_unique => [qw/ user_id /]
}</pre>
<p><strong>MAINTENANCE END</strong></p>
<p>To make a user_id UNIQUE, we do the following:</p>
<pre>
[users]
user_id INT UNIQUE
name VARCHAR(64)
area CHAR(8)</pre>
<p>After all data has been added to the database, the program will validate all
unique fields and an error will be written to an error log if any duplicate
values are found in these fields.</p>
</dd>
<dt><strong><a name="_foreign_key" class="item">_foreign_key</a></strong></dt>
<dd>
<p>Despite arguments by some diehard MySQL supporters :), any real database
supports foreign key constraints. Naturally, DBD::CSV does not. The
<a href="#_foreign_key"><code>_foreign_key</code></a> value for a table is an array ref of array refs. Each inner
array has the following format:</p>
<pre>
field foreign_table foreign_field</pre>
<p>After data is added to the database, foreign key validation will occur. For
each table that has a <code>_foreign_field</code> key, the program will look up the value
in <code>field</code> and ensure that <code>foreign_table</code> has a corresponding valud in
<code>foreign_field</code>. Further, the <code>foreign_field</code> must be designated as a
<a href="#_unique"><code>_unique</code></a> field. If these conditions are not met, an entry will be added to
the error log.</p>
<p>Consider a lookup table for a music CD database. Each CD could potentially
have several artists and each artist could be present on several CDs. Assuming
we have two tables named <code>artists</code> and <code>CDs</code>, with <code>artist_id</code> and <code>CD_id</code>
respectively, we could define the lookup table as follows:</p>
<p><strong>MAINTENANCE BEGIN</strong></p>
<p>Internal representation</p>
<pre>
cd_artist => {
artist => $int,
cd => $int,
_field_order => [qw/ artist_id cd_id /],
_foreign_key => [
[ qw/ artist artists artist_id / ],
[ qw/ cd CDs CD_id / ]
]
}</pre>
<p><strong>MAINTENANCE END</strong></p>
<pre>
[cd_artist]
artist INT
cd INT
_foreign_key artist artists artist_id
_foreign_key cd CDs CD_id</pre>
<p>Note that the tables will still be created if a foreign key constraint is
violated and there is nothing in the <code>DBD::CSV</code> module to prevent this. This
is for advisory purposes only. Your database will function even if the foreign
key is not defined as <code>UNIQUE</code>, but you should probably take a look at your
schema to look for problems.</p>
</dd>
<dt><strong><a name="_record_count" class="item">_record_count</a></strong></dt>
<dd>
<p>This is an internal field used to track how many records were written for each
table. After data is added to the database, a list of the number of records
written to each table will be displayed to the screen. A <strong>WARNING</strong> message
will occur for all tables that have no records written.</p>
</dd>
</dl>
<p>
</p>
<hr />
<h1><a name="database_creation">DATABASE CREATION</a></h1>
<p>
</p>
<h2><a name="creating_a_schema">Creating a schema</a></h2>
<p>Let's take a look at a simple database and see how it's all put together. We
have three tables: CDs, artist, and cd_artist. The last is a lookup table.
What follows is not intended to be an complete database.</p>
<pre>
[CDs]
CD_id INT /^\d++$/
CD_name VARCHAR(30) /.+/
# date will be in YYYY
year_released CHAR(4)</pre>
<pre>
[artist]
artist_id INT UNIQUE /^\d+$/
artist_name VARCHAR(30) /.+/</pre>
<pre>
[cd_artist]
artist INT
cd INT
_foreign_key artist artist artist_id
_foreign_key cd CDs CD_id</pre>
<p>Note that when this schema is read that the lines are split on whitespace.
<code>DBD::CSV</code> does not allow whitespace in a table name.</p>
<p>Further, I would not recommend the naming convention that I have used. Foreign
key field names should (IMHO) match the field names of the tables that they
should match. I have made them distinct so that it's easier to distinguish
them in this example.</p>
<p>
</p>
<h2><a name="creating_the_raw_data_for_the_database">Creating the raw data for the database</a></h2>
<p>Now that we have our schema, we need to create the raw data. The data file
contains all of the tables and data that will be added to the database. Each
table is separated by "\n.\n". The data for a table will be read in chunks
that is split on the separator. I suppose that it's possible for someone to
have such a large amount of data that this could be a problem, but if you do,
you should be using a real database. As with the schema file, lines composed
of all white space or beginning with a sharp will be discarded. Fields must be
separated with a comma and if a field contains a comma, it must be quoted.</p>
<p>As each chunk is read, the program looks for the first line which contains data
and assumes this to be a table name. You may use either of the two following
statements to declare a tablename:</p>
<pre>
table=artist</pre>
<p>or simply:</p>
<pre>
artist</pre>
<p>Here is a small raw data file:</p>
<pre>
table=artist
#artist_id, artist_name
1,Ovid
2,Yello
.</pre>
<pre>
CDs
# CD_id, CD_name,date_released
1,One Second,1987
2,"Baby",1991
3,"Ovid's Greatest Hits, Volume Zero",0000
.</pre>
<pre>
cd_artist
# artist, cd
1,3
2,1
4,2</pre>
<p>Let's say we save the schema in a file called <code>schema.txt</code> and the data in
a file called <code>raw.txt</code> and we want to write this to a database called
<code>stuff</code>. We'd use the following command:</p>
<pre>
perl db_validate.pl -s schema.txt -r raw.txt -d stuff</pre>
<p>If a directory called <code>stuff</code> does not exist, it will be created for us.
Issuing that command, we get the following:</p>
<pre>
All tables successfully created.</pre>
<pre>
ERROR in table: 'CDs'
Field: 'Ovid's Greatest Hits, Volume Zero' does not match data type: 'VARCHAR(30)'.
Record: 3,"Ovid's Greatest Hits, Volume Zero",0000</pre>
<p>Be examing our schema, we see that we have defined <code>CD_name</code> as <code>VARCHAR(30)</code>.
The field in question is over 30 characters long, so we change <code>CD_name</code> in
the schema to <code>VARCHAR(40)</code>. Then we rerun the command. We get the following
output:</p>
<pre>
All tables successfully created.</pre>
<pre>
3 records added to table 'CDs'.
2 records added to table 'artist'.
3 records added to table 'cd_artist'.</pre>
<pre>
All data successfully added to database.
Now validating unique fields.
Now validating foreign key constraints.
Now validating data against regular expressions.</pre>
<pre>
WARNING
3 errors were found while validating the data. Errors were written to 'err.log'</pre>
<p>Hmm... what errors? Opening the error.log, we see:</p>
<pre>
Foreign Key constraint violated. Table 'cd_artist', field 'artist', value '4' Not found in table 'artist', field 'artist_id'
Foreign Key constraint violated. Table 'cd_artist', field 'cd'. 'CD_id' not defined as UNIQUE in 'CDs'
Table: 'CDs', field 'CD_id' had invalid pattern '^\d++$'. Discarded.</pre>
<p>Oops. Let's change those in our schema.</p>
<pre>
CD_id INT UNIQUE /^\d+$/</pre>
<p>The regular expression, in this case, may seem superfluous, but it does two things for us:</p>
<pre>
1. It effectively ensures that we will have at least one digit in the id.
2. It ensures that all ids are positive.</pre>
<p>Also, we see that there is no artist with an id of 4. That should be changed to 2.
Rerunning the program this time creates all of the tables and generates absolutely
no errors.</p>
<p>
</p>
<hr />
<h1><a name="validating_an_existing_database">Validating an existing database</a></h1>
<p>If you have an existing CSV database, define a schema for the database. Then,
run the program as normal using the <code>-v</code> switch. This tells the program that
we will be validating an existing database rather than creating a new one from
scratch. If you wish to validate a CSV file which does not have the field
names as the first line (i.e., the first line of the file is also data), then
use the <code>-n</code> option to tell the program that the CSV files has "no headers".</p>
<p><strong>WARNING</strong>: You want to backup the database or CSV file before doing this!
This program will copy the database or CSV file to the raw_data file and, if
all goes well, it will drop your tables (or CSV file) before proceeding. Then
it will attempt to re-add the data to the files. If it does not successfully
add the data, your data may very well be corrupted.</p>
<p>Also, the files or files created by validating <strong>will</strong> have a header, as
<code>DBD::CSV</code> automatically adds them, due to the way the module works. If this
is a problem, you should remove them yourself.</p>
<p>When you have completed validating a database, the raw data fill will not be
unlinked. I have left the file there if it's necessary to debug your
application.</p>
<p>I strongly recommend using this feature if you have created an application that
uses DBD::CSV. Use your application, put it through its paces (with all the
tests you wrote, right?) and then use this application to test for data
corruption.</p>
<p>
</p>
<hr />
<h1><a name="copyright">COPYRIGHT</a></h1>
<p>Copyright (c) 2001 Curtis "Ovid" Poe. All rights reserved.
This program is free software; you may redistribute it and/or modify it under
the same terms as Perl itself</p>
<p>
</p>
<hr />
<h1><a name="author">AUTHOR</a></h1>
<p>Curtis "Ovid" Poe <<a href="mailto:[email protected]">[email protected]</a>>
Address bug reports and comments to: <a href="mailto:[email protected].">[email protected].</a></p>
<p>
</p>
<hr />
<h1><a name="bugs">BUGS</a></h1>
<p>2001/11/10 There are no known bugs at this time. However, I modified this pretty
heavily to get it "production ready". Please let me know if there are any issues
with it.</p>
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="/DBD/CSV.html">the DBD::CSV manpage</a>, <em>DBI</em>, <a href="/Text/CSV_XS.html">the Text::CSV_XS manpage</a> and <a href="/SQL/Statement.html">the SQL::Statement manpage</a>.</p>
</body>
</html>