Building Your First Data Science Application in MongoDB

www.centralinventions.com
Building Your First Data
Science App in MongoDB
Robyn Allen | @enrobyn | MDBW 2017

Outline
Motivation
Math app summary
Database schema
PyMongo queries

Math app goals
Create a framework for student-accessible data science
Improve STEM learning outcomes
Increase math literacy!!!!!

Data
Difficulty
Speed of response
Timestamp
Result
Science
aptitude?
confidence?
fatigue?
improvement?

Is 2073 divisible by 3?
operand1
operator
operand2

operand1: 2073
operator: "%"
operand2: 3
user_guess
correct

Schema detail
{
...,
{
"operand1" : 2073,
"user_guess" : false,
"correct" : false,
"operand2" : 3,
"start_time": NumberLong("1497055796831"),
"operator" : "%",
"end_time" : NumberLong("1497055798985")
},
...
}
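
The stored fields are enough to recompute everything the app records about one attempt. A minimal sketch in plain Python, assuming the timestamps are milliseconds since the epoch and that user_guess holds the student's yes/no answer to "is operand1 divisible by operand2?" (the helper name grade is hypothetical, not part of the app):

```python
# One problem document, copied from the schema detail above.
problem = {
    "operand1": 2073,
    "operator": "%",
    "operand2": 3,
    "user_guess": False,
    "correct": False,
    "start_time": 1497055796831,
    "end_time": 1497055798985,
}

def grade(p):
    """Return (truth, is_correct, elapsed_ms) for one divisibility problem."""
    truth = p["operand1"] % p["operand2"] == 0    # is operand1 divisible by operand2?
    is_correct = p["user_guess"] == truth         # did the guess agree with the truth?
    elapsed_ms = p["end_time"] - p["start_time"]  # assumed: epoch milliseconds
    return truth, is_correct, elapsed_ms

truth, is_correct, elapsed_ms = grade(problem)
```

Here the student answered "no" to a true statement after roughly 2.2 seconds, which agrees with the stored "correct": false.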

{
...,
"all_problems" : [
{
"operand1" : 2073,
"correct" : false,
"operand2" : 3,
"operator" : "%",
},
{
"operand1" : 77,
"correct" : true,
"operand2" : 4,
"start_time" : NumberLong("1497055827450"),
"operator" : "%",
},

{
"_id" : ObjectId("593b42366523ec06eed182b9"),
"session_start" : NumberLong("1497055796716"),
"uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661",
"subtopic" : "divisibility superpowers",
"all_problems" : [{
"operand1" : 14,
"correct" : true,
"operand2" : 5,
"operator" : "%",
},
{
"operand1" : 24,
"correct" : true,
"user_guess" : true,
"operand2" : 2,
"operator" : "%",
},
{
"operand1" : 69,
"correct" : true,
"operand2" : 2,
"operator" : "%",
},
{
"operand1" : 26,
"correct" : true,
"operand2" : 5,
"operator" : "%",
},
{
"operand1" : 67,
"correct" : true,
"operand2" : 2,
"operator" : "%",
},
{
"operand1" : 31,
"correct" : true,
"operand2" : 4,
"operator" : "%",
}
],

MongoDB quick-look
MongoDB is a NoSQL database
Data is stored in documents
The schema can change! (even between documents)
PyMongo is the recommended Python driver

Document model
Database
Collection
Document(s)
Collection
Document(s)
documents < collections < databases

from pymongo import MongoClient
from secure import MONGO_USERNAME, MONGO_PASSWORD

# SET UP AND AUTHENTICATE THE CONNECTION
# (Database.authenticate() was removed in PyMongo 4, so the
# credentials are passed to MongoClient instead)
client = MongoClient("localhost", 27017,
                     username=MONGO_USERNAME,
                     password=MONGO_PASSWORD,
                     authSource="aprender",
                     authMechanism="SCRAM-SHA-1")
db = client["aprender"]
# collections
mathcards = db["mathcards"]
users = db["users"]

# find_one() returns one mathcard --> DICTIONARY
this_card = db.mathcards.find_one()

# find() returns 1+ mathcard(s) --> CURSOR
all_cards = db.mathcards.find()
for card in all_cards:
    for key in card.keys():
        problem_data = card[key]
        do_some_stuff(problem_data)

Queries, projections, etc. are documents
A document is like a Python dictionary
Example:
{ "uuid": some_uuid }

Usage:
.find({ "uuid": some_uuid })

# get data from a certain user
some_uuid = "urn:uuid:3f810ea0-3d27-43cc-87d7-0501635b3000"
my_data = db.mathcards.find(
    { "uuid": some_uuid }
)

'''cards greater than or equal to a certain
timestamp
'''
todays_cards = db.mathcards.find(
{ "session_start":
{ "$gte": 1493753538942 }
} )
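
Because a query is just a dictionary, its meaning can be illustrated without a server. The sketch below is a plain-Python stand-in that handles only equality and $gte; the function matches and the sample sessions are hypothetical and are not part of PyMongo.

```python
def matches(doc, query):
    """Tiny illustration of query semantics: equality and $gte only."""
    for key, cond in query.items():
        if key not in doc:
            return False
        if isinstance(cond, dict) and "$gte" in cond:
            # {"key": {"$gte": value}} means doc[key] >= value
            if not doc[key] >= cond["$gte"]:
                return False
        elif doc[key] != cond:
            # {"key": value} means doc[key] == value
            return False
    return True

# Made-up sessions; only the fields used by the queries are included.
sessions = [
    {"uuid": "a", "session_start": 1493753538000},
    {"uuid": "b", "session_start": 1497055796716},
]
query = {"session_start": {"$gte": 1493753538942}}
recent = [s["uuid"] for s in sessions if matches(s, query)]
```

Only session "b" started at or after the cutoff, so it is the only one kept.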

MongoDB Aggregation Pipelines
A list of one or more stages
Very similar to UNIX pipes
Documents pass from one stage to the next

OVERALL CONCEPT OF THE AGGREGATION PIPELINE
ALL DOCUMENTS -> $match -> SOME DOCUMENTS -> $group -> RESULTS OF INTEREST

# task1a
all docs
-> $match: docs w/ specified start time
-> $count: the total number of docs which entered this stage
-> RESULTS

# task2a
all docs
-> $match: docs w/ specified start time
-> $project: docs w/ new info (number of probs solved, by session)

$match example
pipeline1 = [
    # "session_start" is the name of the key; {"$gte": this_morning} is the criteria
    {"$match": { "session_start": { "$gte": this_morning } } },
]
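
The pipelines in this deck use this_morning without defining it. One possible definition, assuming session_start is stored as milliseconds since the epoch (as the sample documents suggest) and that "morning" means midnight UTC; the helper name start_of_day_ms is hypothetical:

```python
from datetime import datetime, timezone

def start_of_day_ms(moment):
    """Epoch milliseconds for midnight (UTC) on the day containing `moment`."""
    midnight = moment.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)

# Assumption: "this morning" is the start of the current UTC day.
this_morning = start_of_day_ms(datetime.now(timezone.utc))

# Same pipeline shape as pipeline1 above.
pipeline1 = [{"$match": {"session_start": {"$gte": this_morning}}}]
```

A local-time cutoff may be more appropriate for a classroom app; that would only change the time zone passed to astimezone.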

.aggregate() syntax
cursor1 = db.mathcards.aggregate(pipeline1)
for doc in cursor1:
    print(doc)

Let’s code!!!
github.com/enrobyn/pymongo-tutorial

$count example
pipeline1a = [
    {"$match": { "session_start": { "$gte": this_morning } } },
    {"$count": "total_sessions_today" }
]
# task1a

$project example
pipeline1b = [
    {"$match": { "session_start": { "$gte": this_morning } } },
    # "session_start" is the name of the key; 1 is the display flag
    {"$project": {"session_start": 1}}
]
# task1b

$project
pipeline2a = [
    {"$match": { "session_start": { "$gte": this_morning } } },
    {"$project": {"session_probs" :
        {"$size": "$___________"} } }
]
# task2a

Hint for task2a: the blank is the name of the list of problems.

$project
pipeline2a = [
    {"$match": { "session_start": { "$gte": this_morning } } },
    {"$project": {"session_probs" :
        {"$size": "$all_problems"} } }
]
# task2a

$group
pipeline2b = [
    {"$match" : {"session_start" : {"$gte" : this_morning } } },
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "total_problems" : {"$sum" : "_____________"}
    }}
]
# task2b

$group
pipeline2b = [
    {"$match" : {"session_start" : {"$gte" : this_morning } } },
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "total_problems" : {"$sum" : "$session_probs"}
    }}
]
# task2b
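
To see what each stage of pipeline2b contributes, the same computation can be written in plain Python over a list of dictionaries. This only illustrates the stage semantics on made-up sample data with an assumed cutoff; it is not how MongoDB executes the pipeline.

```python
this_morning = 1497052800000  # assumed cutoff, epoch milliseconds

# Made-up sessions; only the fields the pipeline touches are included.
sessions = [
    {"session_start": 1497055796716, "all_problems": [{}, {}, {}]},
    {"session_start": 1497060000000, "all_problems": [{}, {}]},
    {"session_start": 1490000000000, "all_problems": [{}]},  # before the cutoff
]

# $match: keep sessions that started at or after the cutoff
matched = [s for s in sessions if s["session_start"] >= this_morning]

# $project with $size: one number per session
# (the _id that $project keeps by default is omitted here)
projected = [{"session_probs": len(s["all_problems"])} for s in matched]

# $group with _id None and $sum: a single output document
grouped = {"_id": None,
           "total_problems": sum(d["session_probs"] for d in projected)}
```

Grouping on an _id of None puts every incoming document into one group, which is why the result is a single total.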

$group
pipeline2b = [
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "total_problems" : {"$sum" : "$session_probs"}
    }}
]
# task3

$avg
pipeline4 = [
    {"$match" : {"session_start" : {"$gte" : this_morning } } },
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "avg_num_probs" : {"______": "__________"}
    }}
]
# task4

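$avg
pipeline4 = [
    {"$match" : {"session_start" : {"$gte" : this_morning } } },
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "avg_num_probs" : {"$avg": "$session_probs"}
    }}
]
# task4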

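$stdDevSamp
pipeline4 = [
    {"$match" : {"session_start" : {"$gte" : this_morning } } },
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "std_dev_num_probs" :
            {"_________": "_________"}
    }}
]
# task5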

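$stdDevSamp
pipeline4 = [
    {"$match" : {"session_start" : {"$gte" : this_morning } } },
    {"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
    {"$group" : {
        "_id" : None,
        "std_dev_num_probs" :
            {"$stdDevSamp": "$session_probs"}
    }}
]
# task5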
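
$avg and $stdDevSamp correspond to the ordinary mean and the sample standard deviation. A quick sanity check in plain Python with made-up per-session counts, using the standard library's statistics module:

```python
import statistics

# Made-up values of session_probs, one per session.
session_probs = [2, 4, 6]

avg_num_probs = statistics.mean(session_probs)       # analogous to $avg
std_dev_num_probs = statistics.stdev(session_probs)  # sample std dev, like $stdDevSamp
```

statistics.stdev divides by n - 1, which is the "sample" convention that $stdDevSamp also uses; the population variant is statistics.pstdev.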

Individual work time
Search for tasks in the .py file
Take a moment to write one or more pipeline stages
Check end of file comments if stuck

Multi-stage aggregation pipelines
task6: Response time by operand2 [2,3,4,5,6,9] for one user
task7: Percent accuracy (“score”) by operand2 for one user
task8: Retrieve, for one user, operand2 w/ lowest score
task9: Retrieve, for one user, operand2 w/ fastest time
task10: Retrieve operand2 which challenged the most users

# task6
all docs
-> $match: docs w/ a certain uuid
-> $unwind: now every array element in all_problems is a doc!
-> $project: add two fields to the docs (suppress others)
-> $group: group by operand2, get time spent
-> RESULTS

$unwind # task6
pipeline6 = [
    { "$match" : ___ },            # $match on uuid_of_interest
    { "$unwind" : ___ },           # $unwind array of problems
    { "$project": {
        "operand2": ___,           # use dot notation
        "time_spent": ___          # compute time spent
        }
    },
    { "$group": {
        "_id": ___,                # group on operand2
        "avg_time_spent": ___      # compute $avg
        }
    }
]

$unwind # task6
pipeline6 = [
    { "$match" : {"uuid" : uuid_of_interest } },
    { "$unwind" : "$all_problems" },
    { "$project": {
        "operand2": "$all_problems.operand2",
        "time_spent": {"$subtract": ["$all_problems.end_time",
                                     "$all_problems.start_time"]},
        "session_start": 1,
        "_id": 0}
    },
    { "$group": {
        "_id": {"operand2": "$operand2"},
        "avg_time_spent": {"$avg": "$time_spent"},
        }
    }
]
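
The $unwind, $project and $group stages of pipeline6 can be mirrored in plain Python to check the logic on a small example. The session below is made up, and the code only illustrates what the stages compute; it is not how MongoDB runs them.

```python
from collections import defaultdict

# Made-up session for one user; times are epoch milliseconds.
session = {
    "uuid": "urn:uuid:example",
    "all_problems": [
        {"operand2": 2, "start_time": 1000, "end_time": 3000},
        {"operand2": 3, "start_time": 4000, "end_time": 9000},
        {"operand2": 2, "start_time": 9500, "end_time": 10500},
    ],
}

# $unwind + $project: one small record per problem
records = [
    {"operand2": p["operand2"], "time_spent": p["end_time"] - p["start_time"]}
    for p in session["all_problems"]
]

# $group on operand2 with $avg
times = defaultdict(list)
for r in records:
    times[r["operand2"]].append(r["time_spent"])
avg_time_spent = {op2: sum(ts) / len(ts) for op2, ts in times.items()}
```

With this data the two problems for operand2 = 2 average 1500 ms and the single problem for operand2 = 3 takes 5000 ms.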

$addFields # task7
pipeline7a = [
    { "$match" : ... },
    { "$unwind" : ... },
    { "$group": {
        "_id": ___,                # $group on operand2
        "total_attempted": ___,    # $sum
        "total_correct": ___
        }
    },
    { "$addFields": {
        "percent_accuracy": ___
        }
    }
]

$group for task7 # task7
{"$group": {
    "_id": "$all_problems.operand2",
    "total_attempted": {"$sum": 1},
    "total_correct":
        {"$sum": { "$cond":
            ["$all_problems.correct", 1, 0]
        } }
    }
}

$addFields for task7 # task7
{"$addFields": {
    "percent_accuracy": {"$divide":
        ["$total_correct",
         "$total_attempted"]
    }
    }
}
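
The $group and $addFields stages for task7 amount to counting attempts, counting correct answers, and dividing. A plain-Python sketch on made-up data (the tuple layout is an assumption chosen for brevity):

```python
from collections import Counter

# Made-up unwound problems as (operand2, correct) pairs.
attempts = [(3, True), (3, False), (3, True), (3, True), (5, True), (5, False)]

# {"$sum": 1}: count every attempt per operand2
total_attempted = Counter(op2 for op2, _ in attempts)

# $sum of $cond [correct, 1, 0]: count only the correct attempts
total_correct = Counter(op2 for op2, ok in attempts if ok)

# $addFields with $divide: total_correct / total_attempted
percent_accuracy = {
    op2: total_correct[op2] / total_attempted[op2] for op2 in total_attempted
}
```

Despite its name, percent_accuracy is a fraction between 0 and 1; multiply by 100 for a percentage.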

task8 hints # task8
We want to find the operand2
which had the lowest score...
What stage(s) could you add to
pipeline7 in order to solve this?

$sort and $limit # task8
pipeline7.extend([
    {"$sort": {"percent_accuracy": 1}},
    {"$limit": 1}
])

task9 hints # task9
We want to find the operand2
which had the fastest time...
What stage(s) could you add to
pipeline6 in order to solve this?

$sort and $limit # task9
pipeline6.extend([
    {"$sort": {"avg_time_spent": 1}},
    {"$limit": 1}
])
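
An ascending $sort followed by {"$limit": 1} simply keeps the document with the smallest value. In plain Python, on made-up $group output:

```python
# Made-up $group output: one document per operand2.
results = [
    {"_id": {"operand2": 2}, "avg_time_spent": 1500.0},
    {"_id": {"operand2": 5}, "avg_time_spent": 900.0},
    {"_id": {"operand2": 3}, "avg_time_spent": 5000.0},
]

# {"$sort": {"avg_time_spent": 1}} then {"$limit": 1}
fastest = sorted(results, key=lambda d: d["avg_time_spent"])[:1]
```

Sorting with -1 instead of 1 would return the slowest operand2.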

Check out pipeline7a... # task10
Hint: Add a $sort stage to 7a
{"$sort" : {"percent_accuracy": 1} }

Results! # task10
{'percent_accuracy': 0.7724425887265136, 'total_correct': 740,
'total_attempted': 958, '_id': {'op2': 3}}

Conclusion
PyMongo = easy to learn
You can learn PyMongo
The aggregation pipeline enables you to run data science code
efficiently on your database servers without needing to move any
data

Resources
Asya Kamsky's talk! 4:30PM WED. in Grand Ballroom
“Powerful Analysis with the Aggregation Pipeline”
MongoDB University (free!)
https://university.mongodb.com/
Aggregation Pipeline Quick Reference
https://docs.mongodb.com/manual/meta/aggregation-quick-reference/
MongoDB Day-long conferences

Operators useful for aggregation pipelines
$sort
$group
$map
$addFields
$let
$cond
$min
$max
$unwind
$limit
$project
$match
$push
$addToSet
$first
$sum
$eq
$divide
$multiply
$gt $lt $gte $lte

$cond
{"$cond": {
    "if": {"$and": [ {"$eq": ["$$nth.label", "tryHarder"]},
                     {"$eq": ["$$nth.user_response", "yes"]},
                     {"$eq": ["$$nplus1.label", "tryHarder"]},
                     {"$eq": ["$$nplus1.user_response", "yes"]}
                   ]},
    "then": True,
    "else": False
}}
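
Read as ordinary Python, the $cond expression above is a four-way "and". The sketch assumes that $$nth and $$nplus1 are variables bound elsewhere in the pipeline (for example by $let or $map) to two consecutive entries of the interventions array; the function name is hypothetical.

```python
def both_accepted_try_harder(nth, nplus1):
    """Plain-Python reading of the $cond / $and / $eq expression above."""
    return (nth["label"] == "tryHarder"
            and nth["user_response"] == "yes"
            and nplus1["label"] == "tryHarder"
            and nplus1["user_response"] == "yes")

# Made-up intervention entries with only the two fields the expression reads.
accepted = {"label": "tryHarder", "user_response": "yes"}
declined = {"label": "tryHarder", "user_response": "no"}
```

The function returns True only when both consecutive interventions are "tryHarder" prompts that the user answered with "yes".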

{"$eq": ["$$nth.label", "tryHarder"]},
{"$eq": ["$$nth.user_response", "yes"]},
{"$eq": ["$$nplus1.label", "tryHarder"]},
{"$eq": ["$$nplus1.user_response", "yes"]}
$eq

{"$and": [ {},{},{},{} ]}
$and

{
"_id" : ObjectId("593b42366523ec06eed182b9"),
"session_start" : NumberLong("1497055796716"),
"uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661",
"subtopic" : "divisibility superpowers",
"all_problems" : [...],
"interventions" : [
{
"deploy_time" : NumberLong("1497055817986"),
"label" : "tryHarder",
"user_response" : "no",
"dismiss_time" : 2939
}
]
}