Fall 2024
Assignment 4
Deadline: November 14 - 11:59 PM PST
2. Requirements
2.1 Programming Requirements
a. For Task 1, you can use the Spark DataFrame and GraphFrames library (see the sketch below). For Task 2, you can ONLY use
Spark RDD and standard Python or Scala libraries. There will be a 10% bonus for each task if you also
submit a Scala implementation and both your Python and Scala implementations are correct.
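As a reference, here is a minimal sketch (not a required structure) of wiring a vertex/edge list into GraphFrames for Task 1. The names nodes and edges are hypothetical placeholders for the graph built as described in Section 4.1, and the graphframes jar from $ASNLIB/public/ must be on the classpath:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("task1").getOrCreate()

    # GraphFrames expects an "id" column for vertices and "src"/"dst" for edges.
    # `nodes` (a set of user_ids) and `edges` (a list of (user, user) tuples)
    # are hypothetical, built per Section 4.1.
    vertices = spark.createDataFrame([(u,) for u in nodes], ["id"])
    edge_df = spark.createDataFrame(edges, ["src", "dst"])
    g = GraphFrame(vertices, edge_df)

Note that GraphFrames edges are directed; depending on the algorithm you run, you may need to add each edge in both directions to model the undirected graph.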
3. Datasets
We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset containing user_id
and business_id. You can find the data on Vocareum under resource/asnlib/publicdata/.
4. Tasks
4.1 Graph Construction
To construct the social network graph, assume that each node is uniquely labeled and that links are
undirected and unweighted.
Each node represents a user. There should be an edge between two nodes if the number of common
businesses reviewed by the two users is greater than or equal to the filter threshold. For example,
suppose user1 reviewed {business1, business2, business3} and user2 reviewed {business2,
business3, business4, business5}. If the threshold is 2, there will be an edge between user1 and user2,
since they share two businesses.
If a user node has no edges, do not include that node in the graph.
The filter threshold will be given as an input parameter when running your code (a sketch of this construction follows below).
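A minimal sketch of this edge rule with Spark RDDs (the variable names and the brute-force pairwise comparison are illustrative, not a performance recommendation):

    import sys
    from itertools import combinations
    from pyspark import SparkContext

    sc = SparkContext(appName="graph_construction")
    threshold = int(sys.argv[1])   # filter threshold from the command line
    input_path = sys.argv[2]       # e.g. ub_sample_data.csv

    lines = sc.textFile(input_path)
    header = lines.first()
    user_biz = (lines.filter(lambda l: l != header)
                     .map(lambda l: l.split(","))
                     .map(lambda t: (t[0], t[1]))   # (user_id, business_id)
                     .groupByKey()
                     .mapValues(set)
                     .collectAsMap())

    # Keep a pair only if the users share at least `threshold` businesses.
    edges = [(u, v) for u, v in combinations(sorted(user_biz), 2)
             if len(user_biz[u] & user_biz[v]) >= threshold]

    # Users with no edges are dropped from the graph.
    nodes = {u for edge in edges for u in edge}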
For output, you should use the Python built-in round() function to round the betweenness value to five
digits after the decimal point. (Rounding is for output only; do not use the rounded numbers for
further calculation.)
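For example (the dictionary contents and line layout here are hypothetical; follow the exact output format specified for the task):

    # Keep full precision in memory; round only the copy that is written out.
    betweenness = {("user_a", "user_b"): 0.123456789}

    with open("betweenness_output.txt", "w") as f:
        for (u1, u2), value in sorted(betweenness.items()):
            f.write(f"('{u1}', '{u2}'),{round(value, 5)}\n")
    # `betweenness` itself still holds 0.123456789 for any further calculation.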
IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We
will not regrade because of formatting issues.
If a community has only one user node, we still regard it as a valid community.
You need to save your result in a txt file. The format is the same as the output file from task 1.
Hints:
1. For task 2.2, take floating-point precision into account. For example, stop the modularity
calculation only on a significant reduction in the new modularity, not on a tiny fluctuation
caused by rounding error.
2. A_ij = 1 only when BOTH i is in j's adjacency list AND j is in i's, not just one or the other.
3. For task 2.2, the stopping criterion plays an important role. Again, avoid the temptation to stop
your search at the first decrease in modularity. Instead, continue exploring all potential
partitions to find the global maximum. This comprehensive approach ensures that you don't
miss the optimal solution.
4. If you want to check your answer thoroughly, you can always calculate the modularity for
all possible communities (continuing until no edges remain).
5. In the modularity calculation, take A_ij from the current graph and k_i*k_j from the original graph (see the sketch after this list).
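To make hints 2-5 concrete, here is a pure-Python sketch under the usual Girvan-Newman definition Q = (1/(2m)) * sum over same-community pairs (i, j) of [A_ij - k_i*k_j/(2m)]. The callables cut_step and components are hypothetical placeholders for your own edge-removal and community-detection code, and treating m as the original edge count is our reading of hint 5:

    def modularity(communities, current_adj, orig_degree, m):
        # current_adj: adjacency sets of the CURRENT graph (hint 5)
        # orig_degree: node degrees in the ORIGINAL graph (hint 5)
        # m:           edge count of the ORIGINAL graph (assumption)
        q = 0.0
        for comm in communities:
            for i in comm:
                for j in comm:
                    # Hint 2: A_ij = 1 only when BOTH directions are present.
                    a_ij = 1 if (j in current_adj[i] and i in current_adj[j]) else 0
                    q += a_ij - orig_degree[i] * orig_degree[j] / (2 * m)
        return q / (2 * m)

    def best_partition(current_adj, orig_degree, m, cut_step, components):
        # Hints 3-4: keep cutting until no edges remain and track the
        # GLOBAL maximum Q; never stop at the first decrease.
        best_q, best = float("-inf"), None
        while any(current_adj.values()):
            cut_step(current_adj)                # remove highest-betweenness edge(s)
            comms = components(current_adj)      # current connected components
            q = modularity(comms, current_adj, orig_degree, m)
            if q > best_q:
                best_q, best = q, comms
        return best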
IMPORTANT: Please strictly follow the hints, as your code will be graded on a
different dataset. Passing the submission dataset does not guarantee passing
the grading dataset unless you strictly follow all the hints above; otherwise
you may lose points. We will not regrade for any points lost due to this.
PLEASE DO FOLLOW ALL THE HINTS ABOVE.
If your runtime exceeds the above limit, you will receive no points for this task.
5. About Vocareum
a. Dataset is under the directory $ASNLIB/publicdata/, jar package is under $ASNLIB/public/
b. You should upload the required files under your workspace: work/, and click submit
c. You should test your scripts on both your local machine and the Vocareum terminal before
submission.
d. During the submission period, Vocareum will automatically test task1 and task2.
e. During the grading period, Vocareum will use another dataset with the same format for
testing.
f. We do not test the Scala implementation during the submission period.
g. Vocareum will automatically run both Python and Scala implementations during the grading period.
h. Please start your assignment early! You can resubmit any script on Vocareum; we will only grade
your last submission.
6. Grading Criteria
(A % penalty means a percentage of the points you would otherwise earn.)
1. You can use your five free late days separately or together.
a. Late Day Form
b. This form records the number of late days you use for each assignment. We will not
count late days if no request is submitted. Remember to submit the request BEFORE
the deadline.
2. There will be a 10% bonus if you use both Scala and Python.
3. We will compare your code against all the code we can find on the web (e.g., GitHub) as well as other students' code
from this and other (previous) sections for plagiarism detection.
4. All submissions will be graded on Vocareum. Please strictly follow the format provided; otherwise
you will not get points even if the answer is correct.
5. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
6. We can regrade your assignments within seven days once the scores are released. No regrade
requests will be accepted after one week.
7. There will be a 20% penalty for late submissions within a week and no points after a week.
8. The Scala bonus is awarded only if your Python results are correct. There is no
partial credit for Scala.
"export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64"
2. Check the input command line formats.
3. Check the output formats, for example the headers, tags, and typos.
4. Check the requirements for sorting the results.
5. Your program scripts should be named task1.py, task2.py, etc.
6. Check whether your local environment fits the assignment description, i.e. version, configuration.
7. If you implement the core part in plain Python instead of Spark, or implement it with high time
complexity (e.g., searching for an element in a list instead of a set; see the example after this
list), your program may be killed on Vocareum because it runs too slowly.
8. You are required to use only Spark RDD in order to understand Spark operations more deeply. You
will not get any points if you use Spark DataFrame or DataSet. Do not import pyspark.sql.
9. Do not use Vocareum for debugging purposes, please debug on your local machine. Vocareum can
be very slow if you use it for debugging.
10. Vocareum is reliable in helping you to check the input and output formats, but its function on
checking the code correctness is limited. It can not guarantee the correctness of the code even with
a full score in the submission report.
11. Some students encounter an error like: "the output rate … has exceeded the allowed
value … bytes/s; attempting to kill the process."
To resolve this, remove all print statements and set the Spark logging level so that it
limits the logs generated; this can be done with sc.setLogLevel(). Preferably, set the log level to
either WARN or ERROR when submitting your code, as in the snippets below.
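On item 7, the list-versus-set difference alone can change the runtime class of an inner loop: membership tests are O(n) on a list but O(1) on average for a set.

    ids = [f"u{i}" for i in range(1_000_000)]
    id_set = set(ids)

    "u999999" in ids      # linear scan: slow inside a hot loop
    "u999999" in id_set   # hash lookup: effectively constant time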
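On item 11, a minimal example of raising the log level right after creating the SparkContext (the app name is illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="task2")
    sc.setLogLevel("WARN")   # or "ERROR"; suppresses the INFO logs that count
                             # toward Vocareum's output-rate limit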