TextTech - Final Report - ACunningham
Project Report
The objective of our project, Text Encoding for Course Data, was to give students the ability to
query for information about university courses. Such functions are useful for students when
planning their class schedules and sorting through candidate courses. Though we use a relational
database built with SQLite via Python, this paper explores the question of how using a graph
database might affect the data structure and functionality of our search engine. I will begin by
question.
Collect
We began by choosing a small database of course data with various parameters that
would be relevant to students, such as the title, course number, location, etc. Our database
consists of course data from “Courses from Reed College”, which we found in the University
of Washington’s XML Data Repository¹ (a collection of publicly available datasets used for
research purposes). The dataset specified the following course parameters: registration
number, subject, course number, section, title, units, instructor, days, place, and start and end
times. However, we narrowed this set of parameters during the encoding process.
Prepare
To read the dataset and create a database, we first converted our XML dataset to a
dictionary, then inserted its entries into the database with SQLite (passing INSERT
statements as a parameter to the execute() method). The database was divided into three
tables: courses, time, and place, because the original data structure had
already separated course information from metadata like time and place. Because the original
1
http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/courses/reed.xml
Text Technology - Summer 2023 Andrea Cunningham (3594623)
database contained eleven parameters per course, we decided to encode only eight of the most
relevant ones (registration number, subject, title, units,
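The conversion and insertion steps described above might look roughly like the following minimal sketch. It is not the project's actual code: the XML tag names, the sample course, and the helper names course_to_dict and build_database are assumptions made for illustration, and only the four parameters named in the text above appear in the courses table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical one-course sample. The tag names and values are assumptions
# made for this sketch and are not taken from the report.
SAMPLE_XML = """
<root>
  <course>
    <reg_num>10577</reg_num>
    <subj>ANTH</subj>
    <title>Introduction to Anthropology</title>
    <units>1.0</units>
    <time><start_time>03:10PM</start_time><end_time>04:30PM</end_time></time>
    <place><building>ELIOT</building><room>414</room></place>
  </course>
</root>
"""


def course_to_dict(course):
    """Flatten one <course> element into a plain dictionary."""
    return {
        "reg_num": course.findtext("reg_num"),
        "subject": course.findtext("subj"),
        "title": course.findtext("title"),
        "units": course.findtext("units"),
        "start_time": course.findtext("time/start_time"),
        "end_time": course.findtext("time/end_time"),
        "building": course.findtext("place/building"),
        "room": course.findtext("place/room"),
    }


def build_database(xml_text):
    """Create the three tables in memory and fill them from the XML text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE courses "
        "(reg_num TEXT PRIMARY KEY, subject TEXT, title TEXT, units TEXT)"
    )
    conn.execute('CREATE TABLE "time" (reg_num TEXT, start_time TEXT, end_time TEXT)')
    conn.execute("CREATE TABLE place (reg_num TEXT, building TEXT, room TEXT)")
    for course in ET.fromstring(xml_text.strip()).iter("course"):
        row = course_to_dict(course)
        # Each INSERT statement is passed to execute() with named placeholders;
        # keys in `row` that a statement does not name are simply not used.
        conn.execute(
            "INSERT INTO courses VALUES (:reg_num, :subject, :title, :units)", row
        )
        conn.execute(
            'INSERT INTO "time" VALUES (:reg_num, :start_time, :end_time)', row
        )
        conn.execute("INSERT INTO place VALUES (:reg_num, :building, :room)", row)
    conn.commit()
    return conn
```

Calling build_database(SAMPLE_XML) returns an open connection that can be queried with ordinary SELECT statements. Placeholders are used instead of string formatting so that values from the XML file are never interpreted as SQL.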
Access
Finally, we used XPath expressions to retrieve specific information from the
XML-encoded course data. To provide an interface for users to enter their
queries, we built a web application with the Django framework. A user can type a
query into the search bar and retrieve any of the eight parameters from the XML-encoded
course data file. Another useful feature of the application is the ability to filter courses by the
time boundaries in which they occur, which preserves something of the database’s original metadata
structure. The web application then displays the results. A query could simply be
an instructor’s name or a course subject; the result is a list of matching courses,
each with its eight attributes. Figure 1 shows an example of a search query containing
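As a rough illustration of the XPath lookup and the time filter described in this section, the sketch below uses Python's built-in xml.etree.ElementTree, which supports a limited subset of XPath. The tag names, instructor names, and times in the sample are hypothetical, and find_courses, parse_time, and within are illustrative helper names rather than functions from our application. The Django layer is omitted.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical sample; tag names, instructors, and times are assumptions.
SAMPLE_XML = """
<root>
  <course>
    <subj>ANTH</subj><title>Introduction to Anthropology</title>
    <instructor>Brightman</instructor>
    <time><start_time>03:10PM</start_time><end_time>04:30PM</end_time></time>
  </course>
  <course>
    <subj>BIOL</subj><title>Bacterial Pathogensis</title>
    <instructor>Smith</instructor>
    <time><start_time>09:00AM</start_time><end_time>09:50AM</end_time></time>
  </course>
  <course>
    <subj>BIOL</subj><title>Seminar in Biology</title>
    <instructor>Jones</instructor>
    <time><start_time>01:10PM</start_time><end_time>02:30PM</end_time></time>
  </course>
</root>
"""


def find_courses(root, field, value):
    """Return <course> elements whose child `field` has exactly the text `value`.

    Sketch only: `value` is pasted into the expression unescaped, so a value
    containing a single quote would break the query.
    """
    return root.findall(f".//course[{field}='{value}']")


def parse_time(text):
    """Parse a clock time such as '03:10PM' into a datetime.time."""
    return datetime.strptime(text, "%I:%M%p").time()


def within(course, earliest, latest):
    """True if the course starts no earlier than `earliest` and ends by `latest`."""
    start = parse_time(course.findtext("time/start_time"))
    end = parse_time(course.findtext("time/end_time"))
    return parse_time(earliest) <= start and end <= parse_time(latest)


root = ET.fromstring(SAMPLE_XML.strip())
biology = [c.findtext("title") for c in find_courses(root, "subj", "BIOL")]
afternoon = [
    c.findtext("title") for c in root.iter("course") if within(c, "12:00PM", "05:00PM")
]
```

Here biology holds the titles of the two BIOL courses and afternoon holds the titles of the courses that fall entirely between noon and 5 PM. In a web application, the field and value would come from the search form rather than being hard-coded.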
A graph database would be constructed based on the relations between entities (in this case,
courses), whereas our relational database simply organizes entities and their parameters into
columns. A graph database might be conducive to our objective because many
entities have overlapping features, and the parameters of our dataset range from general to
unique. For example, the subject ‘CHEM’ would be considered a general parameter in that
many courses in the database could share this attribute, while the course title “Field Biology of
category. Therefore, one could build a constellation of entities (courses) and their relations
(for example, “Bacterial Pathogensis” and “Seminar in Biology” both belong under “BIOL”). A benefit
of such a data structure is that it could reduce complexity: connecting additional entities
through relations would likely require much less code than in our relational design, which in
turn could speed up the querying process and allow for a much larger dataset. A downside,
however, may be sparseness of data. Some entities may not have enough relations with other
entities, leaving them as outliers.
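To make the idea of a constellation of courses and relations more concrete, the following is a minimal in-memory sketch rather than a real graph database. The CourseGraph class, the relation name HAS_SUBJECT, and the CHEM course title are hypothetical; the two BIOL titles are the ones mentioned above.

```python
from collections import defaultdict


class CourseGraph:
    """Tiny undirected graph: each node maps to a set of (relation, neighbour) pairs."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_relation(self, source, relation, target):
        # Store the edge in both directions so it can be followed either way.
        self.edges[source].add((relation, target))
        self.edges[target].add((relation, source))

    def neighbours(self, node, relation):
        """Nodes connected to `node` by the given relation."""
        return sorted(t for r, t in self.edges.get(node, ()) if r == relation)

    def related_courses(self, course):
        """Courses that share at least one attribute node with `course`."""
        related = set()
        for _, attribute in self.edges.get(course, ()):
            for _, other in self.edges.get(attribute, ()):
                if other != course:
                    related.add(other)
        return sorted(related)


graph = CourseGraph()
graph.add_relation("Bacterial Pathogensis", "HAS_SUBJECT", "BIOL")
graph.add_relation("Seminar in Biology", "HAS_SUBJECT", "BIOL")
# Hypothetical course title, used only to show a sparsely connected node.
graph.add_relation("Organic Chemistry", "HAS_SUBJECT", "CHEM")

biology_courses = graph.neighbours("BIOL", "HAS_SUBJECT")
shares_with_seminar = graph.related_courses("Seminar in Biology")
isolated = graph.related_courses("Organic Chemistry")  # empty list: an outlier
```

Adding a new kind of relation, such as an instructor, takes only another add_relation call and no schema change, which is the kind of reduced complexity discussed above. The empty result for the single CHEM course illustrates the sparseness concern: a node with no shared attributes has nothing to connect it to the rest of the graph.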