Social Informatics Data Grid
Cyberinfrastructure for Collaborative Research in the
      Neural, Social and Behavioral Sciences

                Bennett I. Bertenthal
                 Indiana University
               bbertent@indiana.edu
Infrastructure for Social and Behavioral Sciences
Goal:
  Compare, measure, and search for patterns in structured,
  semi-structured, and heterogeneous data sets.


Challenge:
  Integrate information over time, place, and types of data


Needs:
  (1) Data interface (shared datasets & databases)
  (2) Service interface (shared tools for analysis)
  (3) Intellectual interface (shared problems & theories)
Primary Objectives

• Develop prototype of core facility for collecting multiple
  measures of time-synchronized data

• Develop integrated tools for storage, retrieval,
  annotation, and analyses of multiple data sets at
  different time scales

• Develop scripts for parallelizing code to run on grid
  clusters
What is SIDGrid?
Social Informatics Data Grid
             • A general purpose architecture
               for streaming data applications
               (e.g., video, audio, time series)
             • Built on well-established
               database, multimedia, web,
               and grid services standards
             • Time alignment in distributed
               heterogeneous datasets
                – Software and hardware based
                – Integrated with existing laboratory
                  time stamping and registration
                  techniques
             • Scalable
                – Number of datasets
                – Types of data
                – Multiple end user applications
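
A minimal sketch of software-based time alignment, assuming each stream carries its own sorted timestamps and that the clock offset between streams is already known. The function name, the `offset` parameter, and the sample data are hypothetical illustrations, not SIDGrid code.

```python
from bisect import bisect_left

def align_to_reference(ref_times, stream_times, stream_values, offset=0.0):
    """For each reference timestamp, pick the nearest sample of another stream.

    offset is an assumed clock correction (seconds) added to the stream's
    timestamps to bring them onto the reference clock.
    """
    shifted = [t + offset for t in stream_times]  # must be sorted ascending
    aligned = []
    for t in ref_times:
        i = bisect_left(shifted, t)
        # Candidates: the sample just before and the sample just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(shifted)]
        nearest = min(candidates, key=lambda j: abs(shifted[j] - t))
        aligned.append(stream_values[nearest])
    return aligned

# Hypothetical data: video frames at 10 Hz as the reference clock, and a
# sensor sampled at 4 Hz whose clock runs 0.05 s behind the video clock.
video_times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
sensor_times = [-0.05, 0.20, 0.45]
sensor_values = ["a", "b", "c"]
print(align_to_reference(video_times, sensor_times, sensor_values, offset=0.05))
```

Nearest-sample matching is only one option; interpolation or windowed averaging may suit continuous signals better.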
[Diagram: server and client]
Client Side
• Leveraging existing efforts for annotation and analysis of multimodal data
   – Familiarity and interoperability
        • ELAN (Max Planck Institute for Psycholinguistics, The Netherlands)
        • TalkBank (Carnegie Mellon University, US)
        • Digital Replay System (University of Nottingham, UK)
   – XML, Java
   – Cross-platform interoperability
• Adding SIDGrid functionality to ELAN
   – Minimally intrusive
        • Avoid complicated co-development with the ELAN team
   – Browsing SIDGrid data
   – Additional data types
   – Upload / download to SIDGrid server
66 GB 5 mov 2 wav …
368 GB 23 mov 6 wav …
 5 GB 1 mov 0 wav …
21 GB 3 mov 12 wav …
 4 GB 9 mov 1 wav …
 4 GB 4 mov 1 wav …
 1 GB 0 mov 2 wav …
945 GB 1 mov 66 wav …
 8 GB 3 mov 0 wav …
20 GB 13 mov 2 wav …
Server Side
.mov   .wav   .eaf     GB
  10      0      0     45
   4     30      0     20
   2      2      1      3
  12    100      9    200
   1      1      1      1
   6      2      0     12
 400      0      1   1001
   0    666      1    312
   0      0     13    0.1
   0      0      0    0.0
  18      4      0     66
Search and Query (4,000 projects)
• Data Files
  –   Names
  –   Keywords
  –   Attributes (keyword-value)
  –   Date
  –   Type (Elan, Chat)
• Contents of Files
  – Metadata
  – Tier
  – Annotations
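
The file-level search criteria above can be sketched as a simple filter over catalog records. This is an illustration only: the `DataFile` record, the `search` function, and the sample catalog are hypothetical, and the slides do not describe how SIDGrid implements its query service.

```python
from dataclasses import dataclass, field

@dataclass
class DataFile:
    name: str
    file_type: str                                   # e.g. "Elan" or "Chat"
    keywords: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)   # keyword-value pairs

def search(files, name_contains=None, keyword=None, file_type=None, **attrs):
    """Return the files that match every criterion supplied (AND semantics)."""
    hits = []
    for f in files:
        if name_contains and name_contains.lower() not in f.name.lower():
            continue
        if keyword and keyword not in f.keywords:
            continue
        if file_type and f.file_type != file_type:
            continue
        # Every requested attribute must be present with the requested value.
        if any(f.attributes.get(k) != v for k, v in attrs.items()):
            continue
        hits.append(f)
    return hits

# Hypothetical catalog entries, for illustration only.
catalog = [
    DataFile("session01.eaf", "Elan", {"gesture"}, {"language": "English"}),
    DataFile("session02.cha", "Chat", {"gesture", "speech"}, {"language": "English"}),
    DataFile("session03.eaf", "Elan", {"speech"}, {"language": "Dutch"}),
]
matches = search(catalog, keyword="gesture", file_type="Elan", language="English")
print([f.name for f in matches])
```

Searching file contents (metadata, tiers, annotations) would need the files parsed first; that step is not shown here.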
Server Side
• Web services
  – Query
  – Data download / upload

• Portal interface
   – Security
   – Data and metadata browsing
   – Preview
   – Tags, attributes
   – Projects
   – Groups
   – Search
   – Data transformation using grid resources
Science Gateway
What Is The TeraGrid? (circa 2006)
• 75 Teraflops (trillion calculations per second)
   = 12,500 times faster than all 6 billion humans on earth each doing
     one calculation per second
• 30 Gigabits per second to large sites
   = 20-30 times major university connections
   = 30,000 times my home broadband
   = 1 full-length feature film per second
• 16 supercomputers: 9 different types, multiple sizes
• World’s fastest network
• Globus Toolkit and other middleware providing single login,
  application management, data movement, web services
• Map labels: LA, Starlight, Atlanta, SDSC, TACC, NCSA, PU, IU, PSC, ORNL, ANL
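
The comparisons on this slide can be checked with a few lines of arithmetic. The figures come from the slide; the variable names are illustrative.

```python
# Compute: 75 teraflops versus 6 billion people doing one calculation per second.
calcs_per_second = 75 * 10**12
humans = 6 * 10**9
ratio = calcs_per_second // humans
print(ratio)  # 12500

# Network: 30 gigabits per second, expressed in gigabytes per second.
link_gbps = 30
gigabytes_per_second = link_gbps / 8
print(gigabytes_per_second)  # 3.75

# Implied comparison points, derived from the slide's own multipliers.
home_broadband_mbps = link_gbps * 1000 / 30_000
print(home_broadband_mbps)  # 1.0
```

The "one film per second" figure holds if a feature film is taken to be a few gigabytes, which is an assumption; the slide gives no file size.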
Scripts for Running Jobs on Grid

• Matlab (high-level language and interactive environment for performing
   computationally intensive tasks)
• R (software environment for statistical computing and graphics)
• Praat (software for acoustic analysis)
• FreeSurfer (automated tools for reconstruction of the brain’s cortical
   surface from structural MRI data)
• AFNI (programs for processing, analyzing, and displaying fMRI data)
• SUMA (adds cortical surface-based functional imaging analysis to the
   AFNI suite of programs)
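
One common way to parallelize such scripts is to split the input files into independent batches and run one job per batch. The sketch below only builds the per-job command lines; the script name `analyze.R` and the file names are hypothetical, and submission to a grid scheduler is not shown because the slides do not describe SIDGrid's mechanism.

```python
def chunk(items, n_jobs):
    """Split items into at most n_jobs contiguous, nearly equal batches."""
    size, extra = divmod(len(items), n_jobs)
    batches, start = [], 0
    for j in range(n_jobs):
        # The first `extra` batches take one additional item each.
        end = start + size + (1 if j < extra else 0)
        if start < end:            # skip empty batches when n_jobs > len(items)
            batches.append(items[start:end])
        start = end
    return batches

def job_commands(script, inputs, n_jobs):
    """Build one Rscript command line per batch of input files."""
    return [f"Rscript {script} " + " ".join(batch) for batch in chunk(inputs, n_jobs)]

# Hypothetical inputs, for illustration only.
recordings = ["s1.wav", "s2.wav", "s3.wav", "s4.wav", "s5.wav"]
for cmd in job_commands("analyze.R", recordings, n_jobs=2):
    print(cmd)
```

This pattern fits analyses where each file can be processed independently; analyses that need all data at once require a different decomposition.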
Advantages of Grid Computing

• Vastly expanded computing and storage
• Reduced effort as needs scale up
• Improved resource utilization; lower costs
• Facilities and models for collaboration
• Sharing of tools, data, procedures, and
  protocols
• Recording, assessment and reuse of complex
  tasks
Lessons Learned

• Fast prototyping vs production quality software
   – After one year of development, no product available for user
     feedback
   – Optimal design vs practical design
• Public vs private website
   – Need for dissemination
   – Need for security and protection of user groups and data
• Tools for diverse user groups with varying degrees of
  technical expertise
   – Non-intuitive interface with minimal user support
       • Importance of user manuals, technical support, and FAQs
• Multiple levels of privacy and confidentiality dictated by
  type of data and informed consent
If you build it, will they come?

• Dissemination of SIDGrid
   – Website and movie
   – Invited workshops at UofC and IU
   – Pre-conference workshops
• Start-up is time consuming
   – Scale of most projects conducted by social scientists does not
     justify time to learn web services and tools
   – Added value for larger, collaborative projects requires shift in
     goals and organization of research
• Resistance to data sharing
   – Original proposal required that all data stored on SIDGrid
     servers would be publicly available
Objections to Data Sharing

•   It’s my data!
•   Protection of confidentiality and anonymity
•   Need to first establish standards for coding and analysis
•   Reporting of misleading and confusing findings
•   Raw data but not coded data should be shared
    – Annotation and coding are very time consuming and should not
      become available to others
• If availability of web and software tools were contingent
  on sharing data, most users would opt out
Questions
