www.centralinventions.com
Building Your First Data
Science App in MongoDB
Robyn Allen | @enrobyn | MDBW 2017
Outline
Motivation
Math app summary
Database schema
PyMongo queries
Math Literacy
+
Data Literacy
Math app goals
Create a framework for student-accessible data science
Improve STEM learning outcomes
Increase math literacy!!!!!
screenshot
Data
Difficulty
Speed of response
Timestamp
Result
Science
aptitude?
confidence?
fatigue?
improvement?
Is 2073 divisible by 3?
operand1: 2073
operator: "%"
operand2: 3
user_guess
correct
schema
detail
{
...,
{
"operand1" : 2073,
"user_guess" : false,
"correct" : false,
"operand2" : 3,
"start_time": NumberLong("1497055796831"),
"operator" : "%",
"end_time" : NumberLong("1497055798985")
},
...
}
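A minimal sketch of how one such problem record might be assembled in plain Python before it is stored. The helper name make_problem is hypothetical, and the meaning of the fields (user_guess is the student's yes/no answer, correct records whether that answer matched) is inferred from the sample values rather than stated on the slides.

```python
def make_problem(operand1, operand2, user_guess, start_time, end_time):
    """Build one problem record shaped like the schema detail above.

    Hypothetical helper: assumes user_guess is the student's answer to
    "is operand1 divisible by operand2?" and times are epoch milliseconds.
    """
    divisible = (operand1 % operand2 == 0)
    return {
        "operand1": operand1,
        "operator": "%",
        "operand2": operand2,
        "user_guess": user_guess,
        "correct": user_guess == divisible,
        "start_time": start_time,
        "end_time": end_time,
    }


problem = make_problem(2073, 3, False, 1497055796831, 1497055798985)
# Speed of response is simply the difference between the two timestamps.
response_ms = problem["end_time"] - problem["start_time"]
print(problem["correct"], response_ms)
```

Storing both timestamps rather than a precomputed duration keeps the raw data available for later questions (fatigue over a session, for example).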
{
...,
"all_problems" : [
{
"operand1" : 2073,
"user_guess" : false,
"correct" : false,
"operand2" : 3,
"start_time": NumberLong("1497055796831"),
"operator" : "%",
"end_time" : NumberLong("1497055798985")
},
{
"operand1" : 77,
"correct" : true,
"user_guess" : false,
"operand2" : 4,
"start_time" : NumberLong("1497055827450"),
"operator" : "%",
"end_time" : NumberLong("1497055828629")
},
...
],
...
}
{
"_id" : ObjectId("593b42366523ec06eed182b9"),
"session_start" : NumberLong("1497055796716"),
"uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661",
"subtopic" : "divisibility superpowers",
"all_problems" : [{
"operand1" : 14,
"correct" : true,
"user_guess" : false,
"operand2" : 5,
"start_time" : NumberLong("1497055834697"),
"operator" : "%",
"end_time" : NumberLong("1497055835953")
},
{
"operand1" : 24,
"correct" : true,
"user_guess" : true,
"operand2" : 2,
"start_time" : NumberLong("1497055828630"),
"operator" : "%",
"end_time" : NumberLong("1497055830491")
},
{
"operand1" : 69,
"correct" : true,
"user_guess" : false,
"operand2" : 2,
"start_time" : NumberLong("1497055824300"),
"operator" : "%",
"end_time" : NumberLong("1497055825997")
},
{
"operand1" : 26,
"correct" : true,
"user_guess" : false,
"operand2" : 5,
"start_time" : NumberLong("1497055796831"),
"operator" : "%",
"end_time" : NumberLong("1497055798985")
},
{
"operand1" : 67,
"correct" : true,
"user_guess" : false,
"operand2" : 2,
"start_time" : NumberLong("1497055814628"),
"operator" : "%",
"end_time" : NumberLong("1497055816652")
},
{
"operand1" : 31,
"correct" : true,
"user_guess" : false,
"operand2" : 4,
"start_time" : NumberLong("1497055802959"),
"operator" : "%",
"end_time" : NumberLong("1497055804802")
}
],
...
}
MongoDB quick-look
MongoDB is a NoSQL database
Data is stored in documents
The schema can change! (even between documents)
PyMongo is the recommended Python driver
Document model
Database
    Collection
        Document(s)
    Collection
        Document(s)
    ...
documents < collections < databases
from pymongo import MongoClient
# SET UP THE CONNECTION
client = MongoClient("localhost", 27017)
db = client["aprender"]
# collections (they live inside the database, not on the client)
mathcards = db["mathcards"]
users = db["users"]
from pymongo import MongoClient
from secure import MONGO_USERNAME, MONGO_PASSWORD
# SET UP THE CONNECTION
client = MongoClient("localhost", 27017)
db = client["aprender"]
mathcards = db["mathcards"]
users = db["users"]
# AUTHENTICATE THE CONNECTION
# (Database.authenticate() was removed in PyMongo 4.0; on newer drivers,
# pass username, password and authSource to MongoClient instead)
client.aprender.authenticate(MONGO_USERNAME,
                             MONGO_PASSWORD,
                             mechanism='SCRAM-SHA-1')
# find_one() returns one mathcard --> DICTIONARY (or None if nothing matches)
this_card = db.mathcards.find_one()
# find() returns 0+ mathcard(s) --> CURSOR
all_cards = db.mathcards.find()
for card in all_cards:
    for key in card.keys():
        problem_data = card[key]
        do_some_stuff(problem_data)
Queries, projections, etc. are documents
A document is like a Python dictionary
Example:
{ "uuid": some_uuid }
Usage:
.find({ "uuid": some_uuid })
# get data from a certain user
some_uuid = "urn:uuid:3f810ea0-3d27-43cc-87d7-0501635b3000"
my_data = db.mathcards.find(
{ "uuid": some_uuid }
)
# cards with session_start greater than or equal to a certain timestamp
todays_cards = db.mathcards.find(
    { "session_start":
        { "$gte": 1493753538942 }
    } )
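The timestamps in these queries are integer milliseconds since the epoch. A small sketch of how a "start of today" cutoff could be computed follows; the helper names to_epoch_ms and start_of_day_ms are hypothetical, and treating the day boundary as UTC is an assumption (the slides do not say which time zone the app uses).

```python
from datetime import datetime, timezone


def to_epoch_ms(dt):
    """Convert an aware datetime to integer milliseconds since the epoch."""
    return int(dt.timestamp() * 1000)


def start_of_day_ms(now):
    """Epoch milliseconds for midnight at the start of `now`'s day."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return to_epoch_ms(midnight)


# A fixed example instant keeps the result reproducible (UTC is an assumption).
example_now = datetime(2017, 6, 10, 0, 49, 56, tzinfo=timezone.utc)
this_morning = start_of_day_ms(example_now)

# The query document is an ordinary dictionary and can be built ahead of time.
query = {"session_start": {"$gte": this_morning}}
print(query)
```

In live code, datetime.now(timezone.utc) would replace the fixed example instant.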
AGGREGATION PIPELINES
MongoDB Aggregation Pipelines
A list of one or more stages
Very similar to UNIX pipes
Documents pass from one stage to the next
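To make the UNIX-pipe analogy concrete, here is a toy sketch in plain Python where each stage is a generator that consumes the previous stage's documents. It does not talk to MongoDB. The function names and the sample documents are made up for illustration and only mimic the simplest behaviour of $match, $project and $count.

```python
def match(docs, predicate):
    """Pass along only the documents that satisfy predicate (like $match)."""
    return (doc for doc in docs if predicate(doc))


def project(docs, fields):
    """Keep only the named fields of each document (like a simple $project)."""
    return ({name: doc[name] for name in fields} for doc in docs)


def count(docs, label):
    """Toy version of $count: always yields one document with the total."""
    yield {label: sum(1 for _ in docs)}


# Toy documents; the values are made up for illustration.
sessions = [
    {"uuid": "a", "session_start": 100},
    {"uuid": "b", "session_start": 250},
    {"uuid": "a", "session_start": 300},
]

# Each stage only sees what the previous stage let through.
recent = match(sessions, lambda doc: doc["session_start"] >= 200)
result = list(count(project(recent, ["session_start"]), "total_sessions"))
print(result)
```

The difference in a real pipeline is that the stages run inside the database server, so only the final result travels back to Python.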
OVERALL CONCEPT OF THE AGGREGATION PIPELINE
ALL DOCUMENTS -> $match -> SOME DOCUMENTS -> $group -> RESULTS OF INTEREST
# task1a
all docs -> $match -> docs w/ specified start time -> $count -> RESULTS
result: the total number of docs which entered this stage
# task2a
all docs -> $match -> docs w/ specified start time -> $project
result: docs w/ new info (number of probs solved, by session)
$match example
pipeline1 = [
    {"$match": { "session_start": { "$gte": this_morning } } },
]
# "session_start" is the name of the key; { "$gte": this_morning } is the criteria
.aggregate() syntax
cursor1 = db.mathcards.aggregate(pipeline1)
for doc in cursor1:
    print(doc)
Let’s code!!!
github.com/enrobyn/pymongo-tutorial
$count example
pipeline1a = [
{"$match": { "session_start": { "$gte":
this_morning } } },
{"$count": "total_sessions_today" }
]
# task1a
$project example
pipeline1b = [
    {"$match": { "session_start": { "$gte": this_morning } } },
    {"$project": {"session_start": 1}}
]
# task1b
# "session_start" is the name of the key; 1 is the display flag
$project
pipeline2a = [
{"$match": { "session_start": { "$gte":
this_morning } } },
{"$project": {"session_probs" :
{"$size": "$___________"} } }
]
# task2a
name of the list of problems
$project
pipeline2a = [
{"$match": { "session_start": { "$gte":
this_morning } } },
{"$project": {"session_probs" :
{"$size": "$all_problems"} } }
]
# task2a
$all_problems
$group
pipeline2b = [
{"$match" : {"session_start" : {"$gte" :
this_morning } } },
{"$project" : {"session_probs" : {"$size" :
"$all_problems"} } },
{"$group" : {
"_id" : None,
"total_problems" : {"$sum" :
"_____________"}
}}
]
# task2b
$group
pipeline2b = [
{"$match" : {"session_start" : {"$gte" :
this_morning } } },
{"$project" : {"session_probs" : {"$size" :
"$all_problems"} } },
{"$group" : {
"_id" : None,
"total_problems" : {"$sum" :
"$session_probs"}
}}
]
# task2b
$group
pipeline3 = [
{"$project" : {"session_probs" : {"$size" :
"$all_problems"} } },
{"$group" : {
"_id" : None,
"total_problems" : {"$sum" :
"$session_probs"}
}}
]
# task3
$avg
pipeline4 = [
{"$match" : {"session_start" : {"$gte" : this_morning } } },
{"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
{"$group" : {
"_id" : None,
"avg_num_probs" : {"______": "__________"}
}}
]
# task4
?
$avg
pipeline4 = [
{"$match" : {"session_start" : {"$gte" : this_morning } } },
{"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
{"$group" : {
"_id" : None,
"avg_num_probs" : {"$avg": "$session_probs"}
}}
]
# task4
$stdDevSamp
pipeline5 = [
{"$match" : {"session_start" : {"$gte" : this_morning } } },
{"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
{"$group" : {
"_id" : None,
"std_dev_num_probs" :
{"_________": "_________"}
}}
]
# task5
?
$stdDevSamp
pipeline5 = [
{"$match" : {"session_start" : {"$gte" : this_morning } } },
{"$project" : {"session_probs" : {"$size" : "$all_problems"} } },
{"$group" : {
"_id" : None,
"std_dev_num_probs" :
{"$stdDevSamp": "$session_probs"}
}}
]
# task5
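As a cross-check on what $size, $sum, $avg and $stdDevSamp compute, the sketch below does the same arithmetic in plain Python on made-up session documents. It is only an illustration of the math and does not query MongoDB; the sample data is invented.

```python
import statistics

# Toy stand-ins for documents that passed the $match stage (made-up data).
sessions = [
    {"all_problems": [{}, {}, {}]},
    {"all_problems": [{}, {}, {}, {}, {}]},
    {"all_problems": [{}, {}, {}, {}, {}, {}, {}]},
]

# $project with $size: one number per session.
session_probs = [len(doc["all_problems"]) for doc in sessions]

# $group with _id None: a single summary over every session.
summary = {
    "total_problems": sum(session_probs),                  # $sum
    "avg_num_probs": statistics.mean(session_probs),       # $avg
    "std_dev_num_probs": statistics.stdev(session_probs),  # $stdDevSamp
}
print(summary)
```

statistics.stdev is the sample standard deviation (it divides by n - 1), which is the counterpart of $stdDevSamp; the population version would be statistics.pstdev.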
Individual work time
Search for tasks in the .py file
Take a moment to write one or more pipeline stages
Check end of file comments if stuck
Multi-stage aggregation pipelines
task6: Response time by operand2 [2,3,4,5,6,9] for one user
task7: Percent accuracy (“score”) by operand2 for one user
task8: Retrieve, for one user, operand2 w/ lowest score
task9: Retrieve, for one user, operand2 w/ fastest time
task10: Retrieve operand2 which challenged the most users
# task6
all docs
-> $match: docs w/ a certain uuid
-> $unwind: now every array element in all_problems is a doc!
-> $project: add two fields to the docs (suppress others)
-> $group: group by operand2, get time spent
-> RESULTS
$unwind # task6
pipeline6 = [
{ "$match" : # $match on uuid_of_interest
{ "$unwind" : # $unwind array of problems
{ "$project": {
"operand2": # use dot notation
"time_spent": # compute time spent
},
{ "$group":{
"_id": # group on operand2
"avg_time_spent": # compute $avg
}
}
]
$unwind # task6
pipeline6 = [
    { "$match" : {"uuid" : uuid_of_interest } },
    { "$unwind" : "$all_problems" },
    { "$project": {
        "operand2": "$all_problems.operand2",
        "time_spent": {"$subtract": ["$all_problems.end_time",
                                     "$all_problems.start_time"]},
        "session_start": 1,
        "_id": 0}
    },
    { "$group": {
        "_id": {"operand2": "$operand2"},
        "avg_time_spent": {"$avg": "$time_spent"},
        }
    }
]
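To see what each stage of pipeline6 does to the documents, here is a plain-Python walk-through on made-up data. It imitates the four stages with list comprehensions and does not contact a database; the sample sessions and the uuid values are invented.

```python
from collections import defaultdict

# Toy session documents (made-up values; times in milliseconds).
sessions = [
    {"uuid": "u1", "all_problems": [
        {"operand2": 2, "start_time": 1000, "end_time": 3000},
        {"operand2": 3, "start_time": 3000, "end_time": 7000},
    ]},
    {"uuid": "u1", "all_problems": [
        {"operand2": 2, "start_time": 500, "end_time": 1500},
    ]},
    {"uuid": "u2", "all_problems": [
        {"operand2": 2, "start_time": 0, "end_time": 9000},
    ]},
]
uuid_of_interest = "u1"

# $match: keep one user's sessions.
matched = [doc for doc in sessions if doc["uuid"] == uuid_of_interest]

# $unwind: one output document per element of all_problems.
unwound = [{**doc, "all_problems": prob}
           for doc in matched for prob in doc["all_problems"]]

# $project: keep operand2 and compute time_spent with a subtraction.
projected = [{"operand2": doc["all_problems"]["operand2"],
              "time_spent": (doc["all_problems"]["end_time"]
                             - doc["all_problems"]["start_time"])}
             for doc in unwound]

# $group: average time_spent for each operand2.
times = defaultdict(list)
for doc in projected:
    times[doc["operand2"]].append(doc["time_spent"])
avg_time_spent = {op2: sum(vals) / len(vals) for op2, vals in times.items()}
print(avg_time_spent)
```

Note how $unwind turns two session documents into three problem documents; that is what lets $group work per problem rather than per session.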
$addFields # task7
pipeline7a = [
{ "$match" : ...
{ "$unwind" : ...
{ "$group":{
"_id": # $group on operand2
"total_attempted": # $sum
"total_correct":
}
},
{ "$addFields":{
"percent_accuracy":
}
}
]
$group for task7 # task7
{"$group":{
"_id": "$all_problems.operand2",
"total_attempted": {"$sum":1},
"total_correct":
{"$sum": { "$cond":
["$all_problems.correct", 1, 0]
} }
}
}
$addFields for task7 # task7
{"$addFields":{
"percent_accuracy": {"$divide":
["$total_correct",
"$total_attempted"]
}
}
}
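The $group and $addFields stages for task7 can be read as the following plain-Python bookkeeping. This is a sketch on invented data that assumes the documents have already been matched and unwound; it does not query MongoDB.

```python
from collections import defaultdict

# Unwound problem documents for one user (made-up values).
problems = [
    {"operand2": 3, "correct": True},
    {"operand2": 3, "correct": False},
    {"operand2": 3, "correct": True},
    {"operand2": 3, "correct": True},
    {"operand2": 5, "correct": True},
    {"operand2": 5, "correct": True},
]

# $group: count attempts, and count correct answers via a 1-or-0 $cond.
groups = defaultdict(lambda: {"total_attempted": 0, "total_correct": 0})
for prob in problems:
    group = groups[prob["operand2"]]
    group["total_attempted"] += 1
    group["total_correct"] += 1 if prob["correct"] else 0

# $addFields: add the ratio while keeping the existing fields.
for group in groups.values():
    group["percent_accuracy"] = group["total_correct"] / group["total_attempted"]

print(dict(groups))
```

Despite its name, percent_accuracy computed this way is a fraction between 0 and 1; multiply by 100 if a true percentage is wanted.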
task8 hints # task8
We want to find the operand2
which had the lowest score...
What stage(s) could you add to
pipeline7 in order to solve this?
$sort and $limit # task8
pipeline7.extend([
{"$sort": {"percent_accuracy": 1}},
{"$limit": 1}
])
task9 hints # task9
We want to find the operand2
which had the fastest time...
What stage(s) could you add to
pipeline6 in order to solve this?
$sort and $limit # task9
pipeline6.extend([
{"$sort":
{"avg_time_spent": 1}},
{"$limit": 1}
])
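One thing to keep in mind is that list.extend() changes the pipeline list in place, so the original pipeline is no longer available afterwards. A possible alternative, sketched below, builds a new list instead. The name pipeline9, the shortened stage bodies and the grouped results are all made up for illustration.

```python
# Stand-in for the grouped pipeline (stage bodies shortened on purpose).
pipeline6 = [{"$match": {"uuid": "u1"}}, {"$unwind": "$all_problems"}]

# list.extend() would mutate pipeline6; concatenation builds a new list
# and leaves the original available for other tasks.
pipeline9 = pipeline6 + [
    {"$sort": {"avg_time_spent": 1}},   # 1 = ascending, so fastest first
    {"$limit": 1},                      # keep only the first document
]

# The same "sort ascending, take one" idea on made-up grouped results.
grouped = [
    {"_id": 2, "avg_time_spent": 1500.0},
    {"_id": 3, "avg_time_spent": 4000.0},
    {"_id": 5, "avg_time_spent": 1200.0},
]
fastest = sorted(grouped, key=lambda doc: doc["avg_time_spent"])[:1]
print(len(pipeline6), len(pipeline9), fastest)
```

Sorting with -1 instead of 1 would reverse the order and return the slowest operand2 instead.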
Check out pipeline7a... # task10
Hint: Add a $sort stage to 7a
{"$sort" : {"percent_accuracy": 1} }
Results! # task10
{'percent_accuracy': 0.7724425887265136, 'total_correct': 740,
'total_attempted': 958, '_id': {'op2': 3}}
{'percent_accuracy': 0.8316151202749141, 'total_correct': 726,
'total_attempted': 873, '_id': {'op2': 6}}
{'percent_accuracy': 0.8428731762065096, 'total_correct': 751,
'total_attempted': 891, '_id': {'op2': 9}}
{'percent_accuracy': 0.8562564632885212, 'total_correct': 828,
'total_attempted': 967, '_id': {'op2': 4}}
{'percent_accuracy': 0.9286510590858417, 'total_correct': 833,
'total_attempted': 897, '_id': {'op2': 5}}
{'percent_accuracy': 0.9333333333333333, 'total_correct': 882,
'total_attempted': 945, '_id': {'op2': 2}}
Conclusion
PyMongo = easy to learn
You can learn PyMongo
The aggregation pipeline enables you to run data science code
efficiently on your database servers without needing to move any
data
DATA LITERACY
THANK YOU
Resources
Asya Kamsky's talk! 4:30PM WED. in Grand Ballroom
“Powerful Analysis with the Aggregation Pipeline”
MongoDB University (free!)
https://university.mongodb.com/
Aggregation Pipeline Quick Reference
https://docs.mongodb.com/manual/meta/aggregation-quick-reference/
MongoDB Day-long conferences
Operators useful for aggregation pipelines
$sort
$group
$map
$addFields
$let
$cond
$min
$max
$unwind
$limit
$project
$match
$push
$addToSet
$first
$sum
$eq
$divide
$multiply
$gt $lt $gte $lte
$cond
{"$cond":{
"if": {"$and":[ {"$eq": ["$$nth.label", "tryHarder"]},
{"$eq": ["$$nth.user_response", "yes"]},
{"$eq": ["$$nplus1.label", "tryHarder"]},
{"$eq": ["$$nplus1.user_response", "yes"]}
]},
"then": True,
"else": False
}}
{"$eq": ["$$nth.label", "tryHarder"]},
{"$eq": ["$$nth.user_response", "yes"]},
{"$eq": ["$$nplus1.label", "tryHarder"]},
{"$eq": ["$$nplus1.user_response", "yes"]}
$eq
{"$and": [ {},{},{},{} ]}
$and
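Read as ordinary Python, the $cond / $and / $eq combination above is a chain of equality checks joined by "and". The sketch below is one possible reading. The helper name both_accepted is hypothetical, and it assumes that $$nth and $$nplus1 are variables bound to two consecutive intervention sub-documents, which the slides do not show.

```python
def both_accepted(nth, nplus1):
    """Plain-Python reading of the $cond expression.

    Hypothetical helper: nth and nplus1 stand for the values bound to the
    $$nth and $$nplus1 variables, assumed here to be two consecutive
    intervention sub-documents.
    """
    return (nth["label"] == "tryHarder"
            and nth["user_response"] == "yes"
            and nplus1["label"] == "tryHarder"
            and nplus1["user_response"] == "yes")


first = {"label": "tryHarder", "user_response": "yes"}
second = {"label": "tryHarder", "user_response": "no"}
print(both_accepted(first, first), both_accepted(first, second))
```

Because the "then" and "else" branches are just True and False, the $cond wrapper returns the same value as the $and expression on its own.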
{
"_id" : ObjectId("593b42366523ec06eed182b9"),
"session_start" : NumberLong("1497055796716"),
"uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661",
"subtopic" : "divisibility superpowers",
"all_problems" : [...],
"interventions" : [
{
"deploy_time" : NumberLong("1497055817986"),
"label" : "tryHarder",
"user_response" : "no",
"dismiss_time" : 2939
}
]
}
Sample
intervention

Building Your First Data Science Applicatino in MongoDB

  • 1.
    www.centralinventions.com Building Your FirstData Science App in MongoDB Robyn Allen | @enrobyn | MDBW 2017
  • 2.
  • 3.
  • 4.
    Math app goals Createa framework for student-accessible data science Improve STEM learning outcomes Increase math literacy!!!!!
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
    schema detail { ..., { "operand1" : 2073, "user_guess": false, "correct" : false, "operand2" : 3, "start_time": NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, ... }
  • 13.
    { ..., "all_problems" : [ { "operand1": 2073, "user_guess" : false, "correct" : false, "operand2" : 3, "start_time": NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, ... schema detail
  • 14.
    { ..., "all_problems" : [ { "operand1": 2073, "user_guess" : false, "correct" : false, "operand2" : 3, "start_time": NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, { "operand1" : 77, "correct" : true, "user_guess" : false, "operand2" : 4, "start_time" : NumberLong("1497055827450"), "operator" : "%", "end_time" : NumberLong("1497055828629") },
  • 15.
    { "all_problems" : [{ "operand1": 14, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055834697"), "operator" : "%", "end_time" : NumberLong("1497055835953") }, { "operand1" : 24, "correct" : true, "user_guess" : true, "operand2" : 2, "start_time" : NumberLong("1497055828630"), "operator" : "%", "end_time" : NumberLong("1497055830491") }, { "operand1" : 69, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055824300"), "operator" : "%", "end_time" : NumberLong("1497055825997") }, { "operand1" : 26, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, { "operand1" : 67, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055814628"), "operator" : "%", "end_time" : NumberLong("1497055816652") }, { "operand1" : 31, "correct" : true, "user_guess" : false, "operand2" : 4, "start_time" : NumberLong("1497055802959"), "operator" : "%", "end_time" : NumberLong("1497055804802") } ], ...
  • 16.
    { "_id" : ObjectId("593b42366523ec06eed182b9"), "session_start": NumberLong("1497055796716"), "uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661", "subtopic" : "divisibility superpowers" "all_problems" : [{ "operand1" : 14, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055834697"), "operator" : "%", "end_time" : NumberLong("1497055835953") }, { "operand1" : 24, "correct" : true, "user_guess" : true, "operand2" : 2, "start_time" : NumberLong("1497055828630"), "operator" : "%", "end_time" : NumberLong("1497055830491") }, { "operand1" : 69, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055824300"), "operator" : "%", "end_time" : NumberLong("1497055825997") }, { "operand1" : 26, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, { "operand1" : 67, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055814628"), "operator" : "%", "end_time" : NumberLong("1497055816652") }, { "operand1" : 31, "correct" : true, "user_guess" : false, "operand2" : 4, "start_time" : NumberLong("1497055802959"), "operator" : "%", "end_time" : NumberLong("1497055804802") } ],
  • 17.
    { "_id" : ObjectId("593b42366523ec06eed182b9"), "session_start": NumberLong("1497055796716"), "uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661", "subtopic" : "divisibility superpowers" "all_problems" : [{ "operand1" : 14, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055834697"), "operator" : "%", "end_time" : NumberLong("1497055835953") }, { "operand1" : 24, "correct" : true, "user_guess" : true, "operand2" : 2, "start_time" : NumberLong("1497055828630"), "operator" : "%", "end_time" : NumberLong("1497055830491") }, { "operand1" : 69, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055824300"), "operator" : "%", "end_time" : NumberLong("1497055825997") }, { "operand1" : 26, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, { "operand1" : 67, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055814628"), "operator" : "%", "end_time" : NumberLong("1497055816652") }, { "operand1" : 31, "correct" : true, "user_guess" : false, "operand2" : 4, "start_time" : NumberLong("1497055802959"), "operator" : "%", "end_time" : NumberLong("1497055804802") } ],
  • 18.
    { "_id" : ObjectId("593b42366523ec06eed182b9"), "session_start": NumberLong("1497055796716"), "uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661", "subtopic" : "divisibility superpowers" "all_problems" : [{ "operand1" : 14, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055834697"), "operator" : "%", "end_time" : NumberLong("1497055835953") }, { "operand1" : 24, "correct" : true, "user_guess" : true, "operand2" : 2, "start_time" : NumberLong("1497055828630"), "operator" : "%", "end_time" : NumberLong("1497055830491") }, { "operand1" : 69, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055824300"), "operator" : "%", "end_time" : NumberLong("1497055825997") }, { "operand1" : 26, "correct" : true, "user_guess" : false, "operand2" : 5, "start_time" : NumberLong("1497055796831"), "operator" : "%", "end_time" : NumberLong("1497055798985") }, { "operand1" : 67, "correct" : true, "user_guess" : false, "operand2" : 2, "start_time" : NumberLong("1497055814628"), "operator" : "%", "end_time" : NumberLong("1497055816652") }, { "operand1" : 31, "correct" : true, "user_guess" : false, "operand2" : 4, "start_time" : NumberLong("1497055802959"), "operator" : "%", "end_time" : NumberLong("1497055804802") } ],
  • 19.
    MongoDB quick-look MongoDB isa NoSQL database Data is stored in documents The schema can change! (even between documents) PyMongo is the recommended Python driver
  • 20.
  • 21.
  • 22.
    from pymongo importMongoClient # SET UP THE CONNECTION client = MongoClient("localhost", 27017) db = client["aprender"] mathcards = client["mathcards"] users = client["users"] collections
  • 23.
    from pymongo importMongoClient from secure import MONGO_USERNAME, MONGO_PASSWORD # SET UP THE CONNECTION client = MongoClient("localhost", 27017) db = client["aprender"] mathcards = client["mathcards"] users = client["users"] # AUTHENTICATE THE CONNECTION client.aprender.authenticate(MONGO_USERNAME, MONGO_PASSWORD, mechanism='SCRAM-SHA-1')
  • 24.
    # find_one() returnsone mathcard --> DICTIONARY this_card = db.mathcards.find_one()
  • 25.
    # find_one() returnsone mathcard --> DICTIONARY this_card = db.mathcards.find_one() # find() returns 1+ mathcard(s) --> CURSOR all_cards = db.mathcards.find()
  • 26.
    # find_one() returnsone mathcard --> DICTIONARY this_card = db.mathcards.find_one() # find() returns 1+ mathcard(s) --> CURSOR all_cards = db.mathcards.find() for card in all_cards: for key in card.keys(): problem_data = card[key] do_some_stuff(problem_data)
  • 27.
    Queries, projections, etc.are documents A document is like a Python dictionary Example: { "uuid": some_uuid }
  • 28.
    A document islike a Python dictionary Example: { "uuid": some_uuid } Usage: .find({ "uuid": some_uuid }) Queries, projections, etc. are documents
  • 29.
    # get datafrom a certain user some_uuid = "urn:uuid:3f810ea0-3d27-43cc-87d7-0501635b3000" my_data = db.mathcards.find( { "uuid": some_uuid } )
  • 30.
    '''cards greater thanor equal to a certain timestamp ''' todays_cards = db.mathcards.find( { "session_start": { "$gte": 1493753538942 } } )
  • 31.
  • 32.
    MongoDB Aggregation Pipelines Alist of one or more stages Very similar to UNIX pipes Documents pass from one stage to the next
  • 33.
    OVERALL CONCEPT OFTHE AGGREGATION PIPELINE RESULTS OF INTEREST SOME DOCUMENTS $group ALL DOCUMENTS $match
  • 34.
    RESULTS # task1a all docsdocs w/ specified start time result: the total number of docs which entered this stage $count $match
  • 35.
    $project $match all docs docsw/ specified start time result: docs w/ new info (number of probs solved, by session) # task2a
  • 36.
    $match example pipeline1 =[ {"$match": { "session_start": { "$gte": this_morning } } }, ] name of key criteria
  • 37.
    .aggregate() syntax cursor1 =db.mathcards.aggregate(pipeline1) for doc in cursor1: print(doc)
  • 38.
  • 39.
    $count example pipeline1a =[ {"$match": { "session_start": { "$gte": this_morning } } }, {"$count": "total_sessions_today" } ] # task1a
  • 40.
    $project example pipeline1b =[ {"$match": { "session_start": { "$gte": this_morning } } }, {"$project": {"session_start": 1}} ] # task1b name of key display flag
  • 41.
    $project pipeline2a = [ {"$match":{ "session_start": { "$gte": this_morning } } }, {"$project": {"session_probs" : {"$size": "$___________"} } } ] # task2a
  • 42.
    $project pipeline2a = [ {"$match":{ "session_start": { "$gte": this_morning } } }, {"$project": {"session_probs" : {"$size": "$___________"} } } ] # task2a name of the list of problems
  • 43.
    $project pipeline2a = [ {"$match":{ "session_start": { "$gte": this_morning } } }, {"$project": {"session_probs" : {"$size": "$all_problems"} } } ] # task2a $all_problems
  • 44.
    $group pipeline2b = [ {"$match": {"session_start" : {"$gte" : this_morning } } }, {"$project" : {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "total_problems" : {"$sum" : "_____________"} }} ] # task2b
  • 45.
    $group pipeline2b = [ {"$match": {"session_start" : {"$gte" : this_morning } } }, {"$project" : {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "total_problems" : {"$sum" : "$session_probs"} }} ] # task2b
  • 46.
    $group pipeline2b = [ {"$project": {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "total_problems" : {"$sum" : "$session_probs"} }} ] # task3
  • 47.
    $avg pipeline4 = [ {"$match": {"session_start" : {"$gte" : this_morning } } }, {"$project" : {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "avg_num_probs" : {"______": "__________"} }} ] # task4 ?
  • 48.
    $avg pipeline4 = [ {"$match": {"session_start" : {"$gte" : this_morning } } }, {"$project" : {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "avg_num_probs" : {"$avg": "$session_probs"} }} ] # task4
  • 49.
    $stdDevSamp pipeline4 = [ {"$match": {"session_start" : {"$gte" : this_morning } } }, {"$project" : {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "std_dev_num_probs" : {"_________": "_________"} }} ] # task5 ?
  • 50.
    $stdDevSamp pipeline4 = [ {"$match": {"session_start" : {"$gte" : this_morning } } }, {"$project" : {"session_probs" : {"$size" : "$all_problems"} } }, {"$group" : { "_id" : None, "std_dev_num_probs" : {"$stdDevSamp": "$session_probs"} }} ] # task5
  • 51.
    Individual work time Searchfor tasks in the .py file Take a moment to write one or more pipeline stages Check end of file comments if stuck
  • 52.
    Multi-stage aggregation pipelines task6:Response time by operand2 [2,3,4,5,6,9] for one user task7: Percent accuracy (“score”) by operand2 for one user task8: Retrieve, for one user, operand2 w/ lowest score task9: Retrieve, for one user, operand2 w/ fastest time task10: Retrieve operand2 which challenged the most users
  • 53.
    $match # task6 all docsdocs w/ a certain uuid
  • 54.
    $match # task6 all docsdocs w/ a certain uuid
  • 55.
    $unwind $match # task6 all docsdocs w/ a certain uuid now every array element in all_problems is a doc!
  • 56.
    $unwind $match # task6 all docsdocs w/ a certain uuid now every array element in all_problems is a doc!
  • 57.
    $project $unwind $match # task6 all docsdocs w/ a certain uuid now every array element in all_problems is a doc! add two fields to the docs (suppress others)
  • 58.
    $group $project $unwind $match # task6 all docsdocs w/ a certain uuid now every array element in all_problems is a doc! add two fields to the docs (suppress others)
  • 59.
    RESULTS $group $project $unwind $match # task6 all docsdocs w/ a certain uuid now every array element in all_problems is a doc! add two fields to the docs (suppress others) group by operand2, get time spent
  • 60.
    $unwind # task6 pipeline6= [ { "$match" : # $match on uuid_of_interest { "$unwind" : # $unwind array of problems { "$project": { "operand2": # use dot notation "time_spent": # compute time spent }, { "$group":{ "_id": # group on operand2 "avg_time_spent": # compute $avg } } ]
  • 61.
    $unwind # task6pipeline6= [ { "$match" : {"uuid" : uuid_of_interest } }, { "$unwind" : "$all_problems" }, { "$project": { "operand2": "$all_problems.operand2", "time_spent": {"$subtract": ["$all_problems.end_time", "$all_problems.start_time"]}, "session_start":1, "_id":0} }, { "$group":{ "_id": {"operand2": "$operand2"}, "avg_time_spent": {"$avg": "$time_spent"}, } } ]
  • 62.
    $addFields # task7 (fill in the blanks)
      pipeline7a = [
          { "$match" : ...
          { "$unwind" : ...
          { "$group": {
              "_id":              # $group on operand2
              "total_attempted":  # $sum
              "total_correct":
            } },
          { "$addFields": {
              "percent_accuracy":
            } }
      ]
  • 63.
    $group for task7 # task7
      {"$group": {
          "_id": "$all_problems.operand2",
          "total_attempted": {"$sum": 1},
          "total_correct": {"$sum": {"$cond": ["$all_problems.correct", 1, 0]}}
      } }
  • 64.
    $addFields for task7 # task7
      {"$addFields": {
          "percent_accuracy": {"$divide": ["$total_correct", "$total_attempted"]}
      } }
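The task7 stages above can be assembled into one pipeline. The following is a sketch only: it assumes the elided `$match` and `$unwind` stages of pipeline7a are the same as in task6, and `build_pipeline7a` is a hypothetical helper name, not something from the workshop file.

```python
def build_pipeline7a(uuid_of_interest):
    """Return the task7 pipeline as a plain list of stage documents."""
    return [
        # Assumed to match task6: restrict to one user, then flatten the array.
        {"$match": {"uuid": uuid_of_interest}},
        {"$unwind": "$all_problems"},
        # Count attempts and correct answers per operand2.
        {"$group": {
            "_id": "$all_problems.operand2",
            "total_attempted": {"$sum": 1},
            "total_correct": {
                "$sum": {"$cond": ["$all_problems.correct", 1, 0]}
            },
        }},
        # Derive the score from the two counters.
        {"$addFields": {
            "percent_accuracy": {
                "$divide": ["$total_correct", "$total_attempted"]
            }
        }},
    ]


pipeline7a = build_pipeline7a("urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661")
# With a PyMongo collection in hand, this would be run as
# collection.aggregate(pipeline7a); no database is contacted here.
```

Two details worth keeping in mind: `$addFields` requires MongoDB 3.4 or later, and `percent_accuracy` as computed here is a fraction between 0 and 1 rather than a value out of 100. Every group contains at least one problem, so `total_attempted` is never zero in the `$divide`.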
  • 65.
    task8 hints # task8
      We want to find the operand2 which had the lowest score...
      What stage(s) could you add to pipeline7 in order to solve this?
  • 66.
    $sort and $limit # task8
      pipeline7.extend([
          {"$sort": {"percent_accuracy": 1}},
          {"$limit": 1}
      ])
  • 67.
    task9 hints # task9
      We want to find the operand2 which had the fastest time...
      What stage(s) could you add to pipeline6 in order to solve this?
  • 68.
    $sort and $limit # task9
      pipeline6.extend([
          {"$sort": {"avg_time_spent": 1}},
          {"$limit": 1}
      ])
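One thing to watch with `.extend()` is that it modifies the list in place, so `pipeline6` itself becomes the task9 pipeline afterwards. A small sketch of a non-mutating alternative follows; the names `fastest_stages`, `pipeline9`, and the placeholder uuid are assumptions made for illustration.

```python
uuid_of_interest = "urn:uuid:example"  # placeholder value, not a real user

pipeline6 = [
    {"$match": {"uuid": uuid_of_interest}},
    {"$unwind": "$all_problems"},
    {"$project": {
        "operand2": "$all_problems.operand2",
        "time_spent": {"$subtract": ["$all_problems.end_time",
                                     "$all_problems.start_time"]},
        "session_start": 1,
        "_id": 0,
    }},
    {"$group": {
        "_id": {"operand2": "$operand2"},
        "avg_time_spent": {"$avg": "$time_spent"},
    }},
]

# Ascending sort puts the smallest average time first; $limit keeps only it.
fastest_stages = [
    {"$sort": {"avg_time_spent": 1}},
    {"$limit": 1},
]

# List concatenation builds a new list and leaves pipeline6 reusable as-is.
pipeline9 = pipeline6 + fastest_stages
```

Sorting with `-1` instead of `1` would return the slowest operand2 under the same pattern.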
  • 69.
    Check out pipeline7a... # task10
      Hint: Add a $sort stage to 7a
  • 70.
    Check out pipeline7a... # task10
      Hint: Add a $sort stage to 7a
      {"$sort" : {"percent_accuracy": 1} }
  • 71.
    Results! # task10
      {'percent_accuracy': 0.7724425887265136, 'total_correct': 740, 'total_attempted': 958, '_id': {'op2': 3}}
      {'percent_accuracy': 0.8316151202749141, 'total_correct': 726, 'total_attempted': 873, '_id': {'op2': 6}}
      {'percent_accuracy': 0.8428731762065096, 'total_correct': 751, 'total_attempted': 891, '_id': {'op2': 9}}
      {'percent_accuracy': 0.8562564632885212, 'total_correct': 828, 'total_attempted': 967, '_id': {'op2': 4}}
      {'percent_accuracy': 0.9286510590858417, 'total_correct': 833, 'total_attempted': 897, '_id': {'op2': 5}}
      {'percent_accuracy': 0.9333333333333333, 'total_correct': 882, 'total_attempted': 945, '_id': {'op2': 2}}
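Once the results are back in Python they are ordinary dictionaries, so the ranking can be double-checked locally. This sketch copies the totals from the results slide, recomputes the score from them, and picks out the hardest and easiest divisor; the variable names are illustrative.

```python
# Totals copied from the task10 results slide.
results = [
    {"_id": {"op2": 3}, "total_correct": 740, "total_attempted": 958},
    {"_id": {"op2": 6}, "total_correct": 726, "total_attempted": 873},
    {"_id": {"op2": 9}, "total_correct": 751, "total_attempted": 891},
    {"_id": {"op2": 4}, "total_correct": 828, "total_attempted": 967},
    {"_id": {"op2": 5}, "total_correct": 833, "total_attempted": 897},
    {"_id": {"op2": 2}, "total_correct": 882, "total_attempted": 945},
]

# Same arithmetic as the $addFields / $divide stage.
for row in results:
    row["percent_accuracy"] = row["total_correct"] / row["total_attempted"]

# Same ordering as {"$sort": {"percent_accuracy": 1}}.
ranked = sorted(results, key=lambda row: row["percent_accuracy"])
hardest = ranked[0]["_id"]["op2"]
easiest = ranked[-1]["_id"]["op2"]
```

On these numbers, divisibility by 3 has the lowest accuracy and divisibility by 2 the highest.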
  • 72.
  • 73.
  • 74.
    Conclusion
      PyMongo = easy to learn
      You can learn PyMongo
      The aggregation pipeline enables you to run data science code efficiently on your database servers without needing to move any data
  • 75.
  • 76.
  • 77.
    Resources
      Asya Kamsky's talk! 4:30PM WED. in Grand Ballroom: "Powerful Analysis with the Aggregation Pipeline"
      MongoDB University (free!): https://university.mongodb.com/
      Aggregation Pipeline Quick Reference: https://docs.mongodb.com/manual/meta/aggregation-quick-reference/
      MongoDB Day-long conferences
  • 78.
    Operators useful for aggregation pipelines
      $sort $group $map $addFields $let $cond $min $max $unwind $limit $project $match $push $addToSet $first $sum $eq $divide $multiply $gt $lt $gte $lte
  • 79.
    $cond
      {"$cond": {
          "if": {"$and": [
              {"$eq": ["$$nth.label", "tryHarder"]},
              {"$eq": ["$$nth.user_response", "yes"]},
              {"$eq": ["$$nplus1.label", "tryHarder"]},
              {"$eq": ["$$nplus1.user_response", "yes"]}
          ]},
          "then": True,
          "else": False
      }}
  • 80.
  • 81.
    $eq
      {"$eq": ["$$nth.label", "tryHarder"]},
      {"$eq": ["$$nth.user_response", "yes"]},
      {"$eq": ["$$nplus1.label", "tryHarder"]},
      {"$eq": ["$$nplus1.user_response", "yes"]}
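To read the `$cond` / `$eq` expression, it can help to see the same test written as plain Python. This is a sketch of the logic only: the slides do not show where `$$nth` and `$$nplus1` are bound, so it is assumed here that they are two consecutive elements of the `interventions` array (for example bound via `$let` or `$map`), and `both_said_yes` is a hypothetical name.

```python
def both_said_yes(nth, nplus1):
    """Mirror the $and of four $eq checks: two interventions in a row are
    labelled "tryHarder" and both got the user response "yes"."""
    return all(
        item.get("label") == "tryHarder" and item.get("user_response") == "yes"
        for item in (nth, nplus1)
    )


# Example values are made up, apart from the field names taken from the schema.
yes_item = {"label": "tryHarder", "user_response": "yes"}
no_item = {"label": "tryHarder", "user_response": "no"}

two_in_a_row = both_said_yes(yes_item, yes_item)
broken_streak = both_said_yes(yes_item, no_item)
```

Since `$and` already evaluates to a boolean, the surrounding `$cond` with `"then": True, "else": False` could probably be dropped and the `$and` used directly, unless something downstream relies on the wrapper.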
  • 82.
  • 83.
    {
      "_id" : ObjectId("593b42366523ec06eed182b9"),
      "session_start": NumberLong("1497055796716"),
      "uuid" : "urn:uuid:55e72720-c0b1-4e81-89d6-ac1896b06661",
      "subtopic" : "divisibility superpowers",
      "all_problems" : [...],
      "interventions" : [
        {
          "deploy_time" : NumberLong("1497055817986"),
          "label" : "tryHarder",
          "user_response" : "no",
          "dismiss_time" : 2939
        }
      ]
    }
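The `NumberLong` timestamps in this document can be turned into readable dates on the Python side. The sketch below assumes they are milliseconds since the Unix epoch, which the 13-digit values suggest but the slides do not state; `ms_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert epoch milliseconds (assumed unit) to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Values copied from the sample session document.
session_start_ms = 1497055796716
deploy_time_ms = 1497055817986

session_start = ms_to_datetime(session_start_ms)

# Milliseconds between the start of the session and the intervention.
ms_until_intervention = deploy_time_ms - session_start_ms
```

Keeping the raw integers in the database and converting only for display keeps the `$subtract` arithmetic in the pipelines simple.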
  • 84.

Editor's Notes

  • #48 From docs: “you can specify an _id value of null to calculate accumulated values for all the input documents as a whole”
  • #49 From docs: “you can specify an _id value of null to calculate accumulated values for all the input documents as a whole”
  • #56 From docs: “Deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.”