Data Management and Use of Excel
Joe Hagan
School of Public Health
Data Management Guidelines
• Research projects often involve the
collection of a large volume of data
• The data then have to be processed and
analyzed
– results and summaries will be published or
presented in some form
• This requires a well-defined system of data
management
Stages of Data Management in a
Research Project
• The raw data have to be entered into the
computer, and checked for accuracy
• The data then have to be organized into
an appropriate form for analysis
– often in different ways, depending on the
analysis
• The data have to be archived
– remain available throughout subsequent
phases of a project, and afterwards
Software for handling data
• Database (DBMS) packages
– Access, EpiInfo
• Statistics packages
– SAS, SPSS, Stata
• Spreadsheet packages
– Excel, Lotus 1-2-3
• Word Processors
– Word, WordPerfect or even text editors like
Notepad
Free Software
EpiInfo (available for free download):
http://www.cdc.gov/epiinfo/

Allows users to:
• rapidly develop a questionnaire or form
• customize the data entry process
• check data (including double entry)
• analyze data
Layout of Data
• These various types of software all handle
"rectangles" of data
– Each row refers to a unique observation or
case (e.g. a patient/subject)
– Each column refers to a variable (e.g. gender,
type of insurance, procedure)
Simple Example of Data in Excel
Spreadsheets (e.g. Excel)
• Simplest to use
• Often automatically chosen
– Familiar
– Widespread
– Flexible
• Flexibility can result in poor data entry and
management
Other Software
• More consideration should be given to
alternative software for data entry
• Access forms can be developed that
facilitate easy and standardized data entry
– talk to statistician prior to data collection
• SPSS has special modules for data entry
• Access, SAS and SPSS have tools for
data checking
• Access relational databases easily created
Database Structure
• Flat: all the data exist at a single level and
can be held in one database table
– previous “simple” example
• Relational Databases: uses multiple,
linked tables to hold all the data
– one table contains the “key” variable that is
used to link information to other tables
– e.g. “Patient ID number” serves to link
demographics to clinical data
Table with a “Key” Variable
“Key” Variable is Link
Queries pull data together
Relational Databases
• Save memory space and data entry time
by reducing the amount of redundant
information
• Queries used to pull together information
from multiple tables linked by key
variable(s)
• Easily created in Access
– as widely available as Excel
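The linked-table idea can be sketched outside Access as well. The following is a minimal illustration using Python's built-in sqlite3 module, not the Access workflow shown on the slides; the table names, column names and values are hypothetical. It shows demographics stored once per patient and a query that pulls the two tables together through the key variable.

```python
import sqlite3

# Hypothetical tables and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demographics (
        patient_id INTEGER PRIMARY KEY,   -- the "key" variable
        gender     TEXT
    );
    CREATE TABLE visits (
        visit_id       INTEGER PRIMARY KEY,
        patient_id     INTEGER REFERENCES demographics(patient_id),
        procedure_name TEXT
    );
""")
conn.executemany("INSERT INTO demographics VALUES (?, ?)",
                 [(1, "F"), (2, "M")])
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)",
                 [(10, 1, "x-ray"), (11, 1, "blood test"), (12, 2, "x-ray")])

# The query joins the two tables on the key variable.
rows = conn.execute("""
    SELECT d.patient_id, d.gender, v.procedure_name
    FROM demographics AS d
    JOIN visits AS v ON v.patient_id = d.patient_id
    ORDER BY v.visit_id
""").fetchall()

for row in rows:
    print(row)
```

Gender is typed once per patient rather than once per visit, which is the reduction in redundant information described above.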
Using Excel
• Experience has shown that most
researchers use Excel for data entry
• When using Excel, there are some
strategies that should be used to:
– improve data quality
– make the data easier to analyze
– facilitate accurate data entry
Excel Bad Example
Problems in Excel Example
• Two of the names under the "species" heading
have been typed slightly differently for the same
species
• The variable "rcd" has observations that will
cause problems when the data are transferred to
a statistics package for analysis
– row 2 has two measurements entered in one cell
– in row 10, the cell reports that the plant is dead
instead of having a numerical value
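Problems of this kind can be found mechanically before the data reach a statistics package. The sketch below is one possible check, written in Python rather than Excel; the function name, the species labels and the sample rows are hypothetical and only mimic the kinds of errors described above.

```python
def find_entry_problems(rows, known_species):
    """Return (row_number, message) pairs for entries that would break an analysis."""
    problems = []
    for number, row in enumerate(rows, start=2):  # row 1 holds the headings
        if row["species"] not in known_species:
            problems.append((number, f"unrecognized species {row['species']!r}"))
        try:
            float(row["rcd"])  # fails for "12.7, 13.3" and for "dead"
        except ValueError:
            problems.append((number, f"non-numeric rcd {row['rcd']!r}"))
    return problems


# Hypothetical rows: two values in one cell, a spelling variant, and text in a numeric column.
rows = [
    {"species": "Species A", "rcd": "12.7, 13.3"},
    {"species": "Species a", "rcd": "14.1"},
    {"species": "Species A", "rcd": "dead"},
]
problems = find_entry_problems(rows, known_species={"Species A"})
for number, message in problems:
    print(f"row {number}: {message}")
```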
Avoiding Problems
• Many errors can be avoided by thinking
about the layout of the data in the
spreadsheet before starting data collection
• It is good to consider the analysis when
organizing the data in the spreadsheet
Worse Example
Good Example
Data Entry Recommendations
• Unique identifier
• Freezing panes
• Drop-down lists
• Data validation
• Adding comments to cells
• Formatting cells
• Forms
Unique Identifier
• One of the variables entered should give a
unique record number
– serves as an ID (only appears once)
– the good example above has an extra
column, named “plot” (has been calculated
as: plot=100*block+plotwb)
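The identifier formula and a uniqueness check can be sketched in a few lines of Python. The function names and sample records are hypothetical; the formula is the one given above, and it only yields unique values if plotwb (assumed here to be the plot number within a block) stays below 100.

```python
from collections import Counter


def plot_id(block, plotwb):
    """Unique plot number as defined on the slide: plot = 100*block + plotwb."""
    return 100 * block + plotwb


def duplicated_ids(ids):
    """Return the identifiers that appear more than once, in ascending order."""
    return sorted(i for i, count in Counter(ids).items() if count > 1)


# Hypothetical (block, plotwb) pairs.
records = [(1, 1), (1, 2), (2, 1), (2, 2)]
ids = [plot_id(block, plotwb) for block, plotwb in records]
print(ids)
print(duplicated_ids(ids))
```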
Freezing Panes in Excel
When entering data, it is useful to be
able to keep the headings of columns
always visible as you scroll down the
screen
1. Highlight one row below the row you
want to freeze
• often a column heading
2. Click “Window > Freeze Panes”
• undo by clicking “Window > Unfreeze Panes”
Drop-Down Lists
• When the same text string is entered
many times, typing/spelling errors
inevitably occur
• Drop-down lists can be used to avoid such
errors
• Ensures standardized and consistent data
entry
Creating Drop-Down Lists in Excel
First, type all choices to be included in
the drop-down list in a single column
– e.g., for the previous example, the five
species names for block 1 are entered into
cells D2:D6
– The list of choices must be maintained, so
the list should not be in the actual column
where data are entered
– You can “hide” the list later
Creating Drop-Down
After creating the list field:
1. Select the cells to have drop-down lists
applied
• Can apply the drop-down list to an entire column and
later “unapply” it to a specific part
2. Click “Data > Validation > Allow: > List”
3. For the “Source” of the list, highlight the
choices already typed in the list, then
click “OK”
Drop-Down Lists (continued)
• Once the drop-down list has been created,
selecting a cell in that column will bring up
an indicator triangle on the right side of the
cell
• Clicking on this will display the drop-down
list so that an appropriate selection can be
made from the list
Drop-Down Lists (continued)
Hiding List Fields for Drop-Down
Lists
• To hide the lists so only the actual data
fields are displayed:
“Format > Row > Hide”
• To unhide lists:
“Format > Row > Unhide”
• Save both versions
– can’t unhide lists after making changes to the
spreadsheet
Hiding List Fields (continued)
Unhiding List Fields
Data Validation
• Validation checks can be set on ranges of
cells within the spreadsheet
– could be an entire column/row, several
columns/rows, or just a single cell
• The validation rules apply when new data
are entered
Data Validation (continued)
• Range checks for numerical data can be
set up in Excel
• For the previous example, suppose the
measurements recorded for the variable
“rcd” are expected to fall between 10 and
26
Setting Up a Range Check
• Highlight the cells to which the range
check is to be applied
– cells E2 to E21 in the previous example
– only the data cells are highlighted, not the
variable name at the top
• If you want to apply the check to the entire column, you can
then remove the validation rule from the column heading
Example of Setting Up a Range
Check
1. Click “Data > Validation”
2. Select the “Settings” tab (if not already chosen)
3. For “Allow:” choose “Decimal” (or whatever is
appropriate)
4. For “Data” choose “Between” (or whatever is
appropriate)
5. Set the Minimum and Maximum
– Minimum = 10 and Maximum = 26 in the previous
example
6. Click “OK”
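The same rule can be expressed in code when data are checked outside Excel. This is a minimal sketch in Python, not Excel's own validation; the function name is hypothetical, and the defaults of 10 and 26 are the limits from the example above.

```python
def check_range(value, minimum=10.0, maximum=26.0):
    """Return None if value is an acceptable number, otherwise an error message."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return f"{value!r} is not a number"
    if not minimum <= number <= maximum:
        return f"{number} is outside {minimum} to {maximum}"
    return None


for entry in ["13.0", 30, "dead"]:
    message = check_range(entry)
    print(entry, "->", "accepted" if message is None else message)
```

Returning a message instead of raising an error plays the role of the "Error Alert": the caller decides whether to reject the entry or only warn.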
Setting Up a Range Check
(continued)
While the “Data Validation” dialog box is
being used to set up the range check,
you can also set up:
1. An Input Message
2. An Error Alert
Setting Up a Range Check with an
Input Message
Input Messages:
• Are displayed when a cell to which the
message has been applied is selected
• E.g. remind the data-entry person of the
range of values allowed (or expected)
– Any type of message could be used
Setting Up a Range Check with an
Input Message (continued)
To create an Input Message:
1. Click “Data > Validation”
2. Select the “Input Message” tab
3. Type in the desired “Title” and “Input
Message”
4. Click “OK”
Setting Up a Range Check with an
Input Message (continued)
Setting Up a Range Check with an
Error Alert Message
• Error Alert Messages are displayed when
a value outside the range is typed
• To create an Error Alert Message:
1. Click “Data > Validation”
2. Select the “Error Alert” tab
3. Select a “Style” from the drop-down list
4. Type in the desired “Title” and “Error
Message”
5. Click “OK”
Setting Up a Range Check with an
Error Alert Message (continued)
Adding Comments to Cells
• Excel has a facility for adding comments to
a cell
• The comments differ from values within
the cell
• Useful for any unusual observations or
questions concerning a particular data
value
Adding Comments to Cells
(continued)
• Recall the example with the data for plot
101 for “rcd” where two values (12.7, 13.3)
were entered on the data recording sheet
• Suppose the researchers decided to
calculate the mean of the two values and
add a comment to the cell
– If several plots had two values recorded, two
columns of “rcd” data could have been
entered with a third column used to calculate
the mean
Adding Comments to Cells
To add a comment to a cell:
1. Highlight the cell to which the comment
is to be added
2. Click “Insert > Comment”
3. Type in the desired comment
Adding Comments to Cells
(continued)
• After the comment is added:
– The cell will now show a red tab in the upper
right corner
– The comment will be displayed when the cell
is selected
Removing Comments from Cells
To remove a comment from a cell
(e.g., when a query has been resolved and
the correct value has been entered):
1. Highlight the cell
2. Right-click and choose “Delete Comment”
Formatting Cells
• Excel offers many formats that can be
applied to cells
• Only one example of one kind of number
format will be shown
– Note that many more formats are available
Formatting Cells (continued)
• Excel suppresses trailing zeros by
default
– e.g. “13.0” is displayed as “13”

• We can change this so that 1 (or more if
desired) decimal places are displayed
Formatting Cells (continued)
1. Highlight the cells to have decimals
displayed
2. Click “Format > Cells”
3. For “Category” select “Number”
4. For “Decimal places” choose “1”
5. Click “OK”
Formatting Columns (continued)
Formatting Cells (continued)
• Excel offers many other options for
formatting cells
• Only one example of one kind of number
format has been shown
• Explore the other options by clicking
“Format > Cells” and looking at all of the
tabs
– i.e. Alignment, Font, Border, Patterns and
Protection
Forms
• Data entry forms facilitate data entry
• Easier to enter data in a list
• Excel has a built-in Data Form
– 32 fields maximum
Forms for Larger Datasets
• If a data entry form is desired but you have
more than 32 fields:
1. Use Access
2. Use multiple Excel spreadsheets
• Can merge together later
• Be sure to include a unique identifier in each
spreadsheet so the spreadsheets can be linked together
3. Download a free enhanced data form:
http://j-walk.com/ss/dataform/index.htm
Forms in Excel
1. Select the fields (including column
headings) for which you want to use a
data entry form
2. Click “Data > Form”
3. Click “New” to add data to the next
observation
Forms in Excel (continued)
Note:
1. No entry field for calculated columns
• E.g. “Paid by Insurance” = “Cost” – “Out of
Pocket”
2. Ctrl + ; is a shortcut key to enter the
current “Date”
3. Drop-down lists created do not appear in
the form, but you will not be allowed to
enter something not in the drop-down list
Forms in Excel (continued)
• Much more sophisticated “UserForms”
(e.g. with drop-down lists) can be created
using the Visual Basic Editor
– Similar to Access

http://www.contextures.com/xluserform02.html
Data Auditing
To check data that:
1. Has already been entered
2. Has had validation rules (discussed
above) applied or changed after data
entry
Data Auditing (continued)
To audit data that has been entered and
then had validation rules applied:
1. Click “Tools > Formula Auditing > Show
Formula Auditing Toolbar”
2. On the “Formula Auditing” toolbar, click the
“Circle Invalid Data” icon
Data Auditing (continued)
• To remove the red circles from the invalid
data click the “Clear Validation Circles”
icon on the “Formula Auditing” toolbar
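For data held outside Excel, the effect of "Circle Invalid Data" can be approximated by listing the cell addresses that fail a rule. This is a rough Python sketch, not the Excel feature itself; the function name, the column letter and the sample values are hypothetical, and the 10 to 26 rule is the range check from the earlier example.

```python
def circle_invalid(values, first_row, column, is_valid):
    """Return spreadsheet-style addresses of already-entered values that fail the rule."""
    return [
        f"{column}{first_row + offset}"
        for offset, value in enumerate(values)
        if not is_valid(value)
    ]


# Hypothetical "rcd" values occupying cells E2 to E5.
rcd = [12.7, 31.0, 15.2, 9.5]
invalid_cells = circle_invalid(
    rcd, first_row=2, column="E", is_valid=lambda v: 10 <= v <= 26
)
print(invalid_cells)
```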


Metadata
Essential if the dataset is to be integrated
with datasets from other studies, or is to
be passed to someone else for analysis
• Where the data came from
• When the data were collected
• What the data represent
• Units of measurement used
Metadata (continued)
• Adding rows and columns to the
spreadsheet before the body of data can
be helpful
• The extra rows will store documentation
that provides background information
about the data
– i.e., the metadata
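When a sheet laid out this way is exported to text, the metadata rows have to be separated from the body of data before analysis. The sketch below assumes one particular layout that is not prescribed by the slides: "label,value" metadata rows, then one blank row, then the column headings and the data. The function name and sample content are hypothetical.

```python
import csv
import io


def split_metadata(text):
    """Split exported sheet text into (metadata dict, list of data rows)."""
    reader = csv.reader(io.StringIO(text))
    metadata = {}
    for row in reader:
        if not any(cell.strip() for cell in row):
            break  # the blank row ends the metadata block
        metadata[row[0]] = row[1]
    header = next(reader)  # column headings follow the blank row
    data = [dict(zip(header, row)) for row in reader]
    return metadata, data


# Hypothetical exported sheet.
sheet = (
    "Source,Field station A\n"
    "Collected,2005-06-01\n"
    "Units,rcd in mm\n"
    "\n"
    "plot,species,rcd\n"
    "101,Species A,13.0\n"
)
metadata, data = split_metadata(sheet)
print(metadata)
print(data)
```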
Input Areas of an Excel
Spreadsheet
Data Collection Form
Using Multiple Sheets
• An alternative to what is described above is to
put the Page Information on a separate sheet in
the Excel Workbook
• Convenient when there is a lot of information at
the dataset level
• May still have a small "Page" section in each
data sheet describing the type of measurements
entered in that sheet
Excel’s Limitations
• No easy facilities for skipping fields
conditional on the entry of initial codes
• Limited graphical capabilities
– Excel graphics intended for presentation
– No boxplots
– Lacking other exploratory techniques that
could assist in data scrutiny
• Can’t handle too many columns in one
sheet (256 per sheet in Excel 2003 and earlier)
Data Entry and Checking
The ultimate aim should be a fully-
documented archive of checked, correct,
reliable data that can be subjected to
scientific scrutiny without raising any
doubts in the minds of subsequent
researchers
Make Data Entry as Simple as
Possible
• In a replicated experiment it should never
be necessary to type variety names or
long treatment codes for each patient
– a single letter or number is usually sufficient
– then, either the data entry system can insert
the full code, or the full names may be
available in a separate, "look-up" file
• Simplifying the keying process will speed
the task, make it less tedious and hence
also less error-prone
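The "look-up" idea can be sketched as a small table that expands short keyed codes into full names after entry. This Python version is illustrative only; the codes, the treatment names and the function name are hypothetical.

```python
# Hypothetical look-up table: single-letter codes keyed at entry, full names stored once.
TREATMENTS = {"A": "control", "B": "low dose", "C": "high dose"}


def expand_codes(codes, lookup):
    """Replace each short code by its full name, refusing codes that are not in the table."""
    unknown = sorted(set(codes) - set(lookup))
    if unknown:
        raise KeyError(f"codes not in look-up table: {unknown}")
    return [lookup[code] for code in codes]


print(expand_codes(["A", "C", "A"], TREATMENTS))
```

Rejecting unknown codes, rather than passing them through, keeps a mistyped letter from silently becoming a new treatment.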
Enter Data ASAP
• The data should be entered as soon as
possible after data collection
– not so large and daunting as doing all at the
end
– helps checking, some checks can indicate
unusually large changes from the previous
value to allow immediate verification
– feedback of any problems that are noticed to
field data collectors can help maintain the
data quality
Double Entry
• The ideal way to ensure accurate data
entry
• Two different people enter all of the data
separately
– two different databases
• Software is used to identify discrepancies
between the two data sets
– inconsistencies resolved to create the final
database
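The comparison step can be sketched as follows. This is a minimal Python illustration of how software might flag discrepancies between two independent entries, not a description of any particular package; the data structure (record ID mapped to a dictionary of fields), the function name and the values are assumptions.

```python
def compare_entries(first, second):
    """Return (record_id, field, first_value, second_value) for every disagreement.

    A record present in only one entry is reported with field set to None.
    """
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a, b = first.get(record_id), second.get(record_id)
        if a is None or b is None:
            discrepancies.append((record_id, None, a, b))
            continue
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                discrepancies.append((record_id, field, a.get(field), b.get(field)))
    return discrepancies


# Hypothetical entries keyed by two different people.
entry_1 = {101: {"rcd": 13.0}, 102: {"rcd": 14.1}}
entry_2 = {101: {"rcd": 13.0}, 102: {"rcd": 41.1}, 103: {"rcd": 12.2}}
for item in compare_entries(entry_1, entry_2):
    print(item)
```

Each reported discrepancy is then resolved against the original recording sheet to create the final database.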
Only One “Master Copy”
(if Double Entry not used)
• Problems can arise if multiple copies are
kept of the same data in different formats
• Master copy will increase in size as data
accrues
– changes through the course of data entry
• Process should be documented
• A consistent "version-numbering" system
should be used by all people making
modifications to the data
Backing up Data
• Essential to develop a system for regular
"back-ups" (copies) of the data
– not backing up may result in losing data
• Back up copies of data should be made on
separate media from the original master
copy
– e.g., another computer, on CDs, on a network, etc.
• The back up copy should be dated
– date of last revision
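A dated back-up can be automated with a short script. The sketch below is one possible approach in Python; the function name and the file naming pattern are assumptions, and by default the date stamp is taken from the master file's last modification time as a stand-in for the date of last revision.

```python
import shutil
from datetime import date
from pathlib import Path


def back_up(master, backup_dir, on=None):
    """Copy the master file into backup_dir with a date stamp in the file name."""
    master = Path(master)
    backup_dir = Path(backup_dir)  # ideally on separate media from the master copy
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Default stamp: the date the master file was last modified.
    stamp = (on or date.fromtimestamp(master.stat().st_mtime)).isoformat()
    target = backup_dir / f"{master.stem}_{stamp}{master.suffix}"
    shutil.copy2(master, target)
    return target
```

For example, `back_up("master.xls", "E:/backups")` would create a file such as `master_2005-06-01.xls` in the back-up folder (paths hypothetical).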
Data Checking
• Checking is done both at the time of
keying and afterwards
• The logical checking phase should be
done by trained staff who understand the
nature of the data
Logical Checking
• Checks to rule out illogical data
– e.g. pregnant males, or minimum greater than
maximum temperature, clinic visits recorded
as dates in the future, range checks, etc.
• Usually involves preliminary analyses,
plotting, etc.
• Reasoned decisions can be made about
what to do with unusual observations
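Several of the logical checks listed above can be written as simple rules applied to each record. This Python sketch is illustrative only; the field names and the sample record are hypothetical, and a real project would add the checks that trained staff consider relevant to its data.

```python
from datetime import date


def logical_errors(record, today):
    """Return a list of descriptions of illogical values found in one record."""
    errors = []
    if record["gender"] == "M" and record["pregnant"]:
        errors.append("pregnant male")
    if record["temp_min"] > record["temp_max"]:
        errors.append("minimum temperature above maximum")
    if record["visit_date"] > today:
        errors.append("visit date in the future")
    return errors


# Hypothetical record containing all three problems.
record = {
    "gender": "M",
    "pregnant": True,
    "temp_min": 38.2,
    "temp_max": 36.9,
    "visit_date": date(2031, 1, 1),
}
print(logical_errors(record, today=date(2030, 1, 1)))
```

The function only reports problems; deciding what to do with an unusual observation remains a reasoned decision for the research team.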
Audit trail
• Complete record of changes to the data
and decisions made about the data and
the analysis
– like a log book
• Requirement of the scientific method
– must ensure the data management work is
repeatable
• Facilitates subsequent writing of reports
on the data and answering data queries
Audit trail (continued)
• Important to record everything you do at
the time that you do it
– recollections are always poor at a later stage
• When errors are found and changes are
made to the master copy of the data, a
note should be made
– old and new values recorded
Audit trail (continued)
• Keep notes on the analyses done
– including the preliminary analyses done for
checking purposes
• Write down the names of all files created
– including back-ups
• Every entry in the log-book should be
dated and initialed
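One way to make sure a correction is never made without a log entry is to do both in the same step. The sketch below is a hypothetical Python helper, not a prescribed tool; the function name, the log fields and the sample values are assumptions chosen to match the points above (old and new values, date, initials).

```python
from datetime import date


def correct_value(data, log, record_id, field, new_value, reason, initials, on=None):
    """Change one value in the master data and append a dated, initialed log entry."""
    old_value = data[record_id][field]
    data[record_id][field] = new_value
    log.append({
        "date": (on or date.today()).isoformat(),
        "initials": initials,
        "record": record_id,
        "field": field,
        "old": old_value,
        "new": new_value,
        "reason": reason,
    })


# Hypothetical correction: two readings on the recording sheet replaced by their mean.
master = {101: {"rcd": 12.7}}
log = []
correct_value(master, log, 101, "rcd", 13.0,
              reason="mean of 12.7 and 13.3", initials="JH", on=date(2005, 6, 1))
print(master)
print(log)
```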
Archiving Data
• All data and programs from a research
project must be archived in such a way
that they are safe and can be accessed by
a subsequent user
• Use a consistent directory structure and
naming convention for computer files
Archiving Data (continued)
• The archive should give access to all the
information about the study
– during the project, information is located in many
places (e.g. the computer, on paper and other media
and in the minds of the research team)
• The archive need not all be computerized, but it
should include all the relevant information
– The source/location of information not archived
electronically should be recorded in the electronic
archive
Archiving Data (continued)
If a proper archiving scheme is not used,
when researchers leave:
• They might take the only copy of their part
of the data (the data is lost)
• Knowledge of the study protocol is lost
resulting in great difficulty when new
investigators join the project
Confidential Data
• Good idea to password protect confidential
data files
– warn analyst that file is protected
• Patient names, SSNs, etc. should always
be removed
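Removing identifying columns before a file is shared can be done mechanically. This is a minimal Python sketch; the column names "name" and "ssn", the function name and the sample row are assumptions, and dropping these columns does not by itself guarantee that a dataset is anonymous.

```python
# Assumed names of the identifying columns; adjust to the actual dataset.
IDENTIFYING_FIELDS = frozenset({"name", "ssn"})


def deidentify(rows, identifying=IDENTIFYING_FIELDS):
    """Return copies of the rows without the identifying fields."""
    return [
        {key: value for key, value in row.items() if key not in identifying}
        for row in rows
    ]


# Hypothetical record.
patients = [{"patient_id": 1, "name": "Jane Doe", "ssn": "000-00-0000", "gender": "F"}]
shared = deidentify(patients)
print(shared)
```

The study ID is kept so that the shared file can still be linked back to the protected master copy by authorized staff.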
Prevent Modification of Data
• To prevent others from modifying the data
(they can still save changes under a different
file name):

“Tools” > “Options” > “Security” >
“Password to Modify” > [enter password]
Prevent Viewing of Data
• To prevent others from viewing the data:

“Tools” > “Options” > “Security” >
“Password to Open” > [enter password]
References
University of Reading, Statistical
Services Center:
http://www.reading.ac.uk/ssc/

Microsoft Office Applications:
http://www.contextures.com/index.html
