Using Text Mining to Infer the Purpose of
Permission Use in Mobile Apps
Haoyu Wang, Peking University
Jason Hong, Carnegie Mellon University
Yao Guo, Peking University
UbiComp 2015, Osaka, Japan
Mobile apps can access a wealth of
sensitive data and sensors
Android Permission
 Android currently requires developers to declare
what permissions an app uses
 e.g., an app that accesses precise location must request the
“ACCESS_FINE_LOCATION” permission
 Mobile users can also set permission preferences
for each app
 AppOps: Android 4.3
 Third-party permission managers
What are your apps really doing?
 An app could use a permission for multiple
purposes
e.g., POCO food camera uses location for:
advertising, map, nearby searching, and geotagging
What are your apps really doing?
 Many apps have unusual permissions
Brightest Flashlight: Location Data, Unique Device ID
Holy Bible: Location Data, Unique Device ID, Network State
However, Android offers no mechanism
to specify the purpose for which a
permission (sensitive data) will be used
It’s important to know the purpose
 Knowing the purpose of permission use could:
 Help with respect to privacy: offer end users more
insight into why an app is using a specific piece of
sensitive data
 Enable fine-grained access control
How to infer the purpose of permission use?
Related Work (1)
 Inferring permission use from app descriptions
 WHYPER [USENIX Security ’13] and AutoCog [CCS ’14]
use NLP techniques to infer permission use from app
descriptions
 Limitation
 However, for more than 90% of apps, it’s impossible to
understand why permissions are used based solely on
app descriptions
Related Work (2)
 Inferring purpose by analyzing third-party
libraries
 Lin et al. [UbiComp ’12] proposed inferring the
purpose of a permission request by analyzing the
third-party libraries used in an app
 Limitation
 Most permission uses occur in custom code
 e.g., among apps that use the contacts permission, more
than 71.2% use it in their custom code
(privacygrade.org)
How to infer the purpose of a permission
use in custom-written code?
Our Approach
 Focus on inferring the purpose of permission
uses in custom code
 Location and contacts permission
 Key idea:
 Compiled Java code retains the text of many
identifiers, which offer hints as to what the code is
doing (if not obfuscated)
void PhotoTag() {
    …
    loc = getLocation()
    …
    exif = readExif()
    …
}
-> inferred purpose: geotagging
How does the system work?
Pipeline: App -> Preprocessing -> Feature Extraction -> Classification
 Preprocessing: decompiled code, e.g., loc = getLocation(…)
 Feature Extraction: keywords (weather, temperature, wind, …)
and other features
 Classification: classifier
Pre-processing
 Decompiling app
 Identify permission-related code
 Leverage the permission mapping provided by
PScout [CCS ’12]
Decompilation paths:
DEX -> (ApkTool) -> Smali: identify permission-related code
DEX -> (Dex2Jar) -> JAR -> (JD-GUI) -> Java: extract features
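A minimal sketch of the "identify permission-related code" step, assuming the Smali output is available as text. The two-entry `API_PERMISSIONS` table is a hypothetical stand-in for a real API-to-permission mapping such as PScout's, and `find_permission_calls` is an illustrative helper name, not part of any tool named on the slide.

```python
import re

# Hypothetical excerpt of an API-to-permission mapping (assumption);
# a real mapping such as PScout's has far more entries.
API_PERMISSIONS = {
    "Landroid/location/LocationManager;->getLastKnownLocation": "ACCESS_FINE_LOCATION",
    "Landroid/telephony/SmsManager;->sendTextMessage": "SEND_SMS",
}

# Smali call instructions have the form:
#   invoke-virtual {v0, v1}, Lpkg/Class;->method(args)ret
INVOKE_RE = re.compile(r"invoke-[\w/]+ \{[^}]*\}, (L[^;]+;->[^(]+)\(")


def find_permission_calls(smali_text):
    """Return (line_number, api, permission) for each call found in the mapping."""
    hits = []
    for number, line in enumerate(smali_text.splitlines(), start=1):
        match = INVOKE_RE.search(line)
        if match and match.group(1) in API_PERMISSIONS:
            api = match.group(1)
            hits.append((number, api, API_PERMISSIONS[api]))
    return hits


# Hand-written Smali fragment used only for illustration.
SAMPLE = """\
.method public tagPhoto()V
    invoke-virtual {v0, v1}, Landroid/location/LocationManager;->getLastKnownLocation(Ljava/lang/String;)Landroid/location/Location;
    invoke-virtual {p0}, Lcom/example/Photo;->readExif()V
.end method
"""

hits = find_permission_calls(SAMPLE)
```

Only the first call is reported, because `readExif` does not appear in the mapping; the enclosing method or class can then be handed to feature extraction.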
Feature Extraction (1)
Type | Features | Method
App-specific features | Call frequency of each permission-related API call, Intent, and Content Provider URI | Static analysis
Text-based features | Keywords extracted from identifiers | Text mining
Feature Extraction (2)
 App-specific features
 Highly related to app behaviors
• e.g., the API “sendTextMessage()” is often used for the “Call
and SMS” purpose
• e.g., APIs in the package “com.android.email.activity”
are often used for the “Email” purpose
 We use a list of 680 documented permission-related
APIs, 97 Intents, and 78 Content Provider URI strings.
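A rough sketch of how call-frequency features could be assembled, assuming the permission-related calls have already been collected as strings. The three-entry `FEATURE_NAMES` vocabulary is hypothetical; the actual list of 680 APIs, 97 Intents, and 78 URI strings is not reproduced in the slides.

```python
from collections import Counter

# Hypothetical, tiny feature vocabulary (assumption); the real one is the
# list of documented APIs, Intents, and Content Provider URI strings.
FEATURE_NAMES = [
    "SmsManager.sendTextMessage",
    "LocationManager.getLastKnownLocation",
    "content://com.android.contacts",
]


def call_frequency_vector(observed, feature_names=FEATURE_NAMES):
    """Count how often each known feature occurs; unknown items are ignored."""
    counts = Counter(observed)
    return [counts[name] for name in feature_names]


# Items observed in one (made-up) piece of permission-related code.
observed = [
    "SmsManager.sendTextMessage",
    "content://com.android.contacts",
    "SmsManager.sendTextMessage",
    "Logger.debug",
]
vector = call_frequency_vector(observed)
```

Here `vector` holds one count per entry of `FEATURE_NAMES`, in the same order, so every instance yields a vector of the same length.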
Feature Extraction (3)
 Text-based features
 Extract keywords from various identifiers as features
 Package/class/method/field names are preserved
when compiling
Pipeline: Splitting Identifiers -> Filtering -> Stemming -> TF-IDF
 Splitting identifiers
(1) Explicit patterns: e.g., getHotel -> get, hotel
(2) Dictionary-based: e.g., nearbyrestaurant -> nearby, restaurant
 Filtering: filter stop words, e.g., get, set, public, string
 Stemming: e.g., hotels -> hotel, restaurant -> restaur
 TF-IDF: e.g., weather: 0.81236, wind: 0.76523, …
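The four text-mining steps can be sketched roughly as follows. Everything beyond the slide's own examples is an assumption: the `DICTIONARY` word list is hypothetical, `naive_stem` is a deliberately crude stand-in for a real stemmer (it will not turn "restaurant" into "restaur"), and the TF-IDF formula is one common variant, so the scores are not expected to match the values shown on the slide.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"get", "set", "public", "string"}  # examples from the slide
DICTIONARY = {"nearby", "restaurant", "hotel", "weather", "wind"}  # hypothetical


def split_explicit(identifier):
    """Split on explicit patterns: camelCase boundaries, underscores, digits."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [part.lower() for part in parts]


def split_dictionary(token, dictionary=DICTIONARY):
    """Greedy longest-prefix split; returns [token] if no full split exists."""
    words, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in dictionary:
                words.append(token[start:end])
                start = end
                break
        else:
            return [token]
    return words


def naive_stem(word):
    """Crude stand-in for a real stemmer: strips a single plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def keywords(identifier):
    """Split an identifier, drop stop words, and stem what remains."""
    tokens = []
    for part in split_explicit(identifier):
        tokens.extend(split_dictionary(part))
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf(doc_tokens, corpus):
    """tf = count / len(doc), idf = log(N / df); assumes doc is in corpus."""
    counts = Counter(doc_tokens)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc) or 1
        scores[term] = (count / len(doc_tokens)) * math.log(len(corpus) / df)
    return scores


# Made-up keyword lists standing in for three apps.
corpus = [["weather", "wind", "weather"], ["hotel", "nearby"], ["hotel", "weather"]]
scores = tf_idf(corpus[0], corpus)
```

With these toy inputs, `keywords("getHotels")` yields `["hotel"]`, and "wind" scores higher than "weather" in the first document because it appears in fewer documents of the corpus.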
Classification
 Feature normalization
 Normalize features by scaling them to [0, 1]
 Supervised learning
 Manually label instances for each purpose
 A taxonomy of purposes
 10 purposes for the location permission and 10
purposes for the contacts permission
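Scaling features to [0, 1] is commonly done with per-feature min-max normalization; the sketch below assumes that variant, since the slides do not spell out the formula. The `features` rows are made-up values, and mapping constant columns to 0.0 is a choice made here for illustration.

```python
def min_max_scale(rows):
    """Scale each feature column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled


# Three made-up instances: a call count, a TF-IDF score, a constant feature.
features = [[2, 0.10, 5], [0, 0.30, 5], [4, 0.20, 5]]
scaled = min_max_scale(features)
```

Normalizing this way keeps raw call counts from dominating TF-IDF scores, which are already small fractions.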
A taxonomy of purposes
Location permission
Search nearby places
Location-based customization
Transportation information
Recording
Map and navigation
Geosocial networking
Geotagging
Location spoofing
Alert and remind
Location-based game
Contacts permission
Backup and Synchronization
Contact management
Blacklist
Call and SMS
Contact-based customization
Email
Find friends
Record
Fake calls and SMS
Remind
Evaluation
 Dataset
 7,923 apps from Google Play
• IDF is calculated based on a corpus of these apps
 Labeling purposes
 1,020 manually labeled instances: 560 for contacts,
460 for location
 10-fold cross-validation
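A minimal sketch of how 10-fold cross-validation partitions the labeled instances, shown for the 460 location instances. The shuffling, the fixed seed, and the absence of stratification are assumptions; the slides do not describe how the folds were drawn.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(460, k=10))
```

Each of the ten rounds trains on 414 instances and tests on the remaining 46, and the reported metric is typically the average over the rounds.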
Experiment Result
 Result of inferring purposes
Permission | Classifier | Accuracy | Precision | Recall
Location | SVM | 81.74% | 85.51% | 83.20%
Location | Maximum Entropy | 85.00% | 87.07% | 85.88%
Location | C4.5 | 79.57% | 83.26% | 77%
Contacts | SVM | 93.94% | 94.38% | 92.94%
Contacts | Maximum Entropy | 94.64% | 94.42% | 93.96%
Contacts | C4.5 | 92.86% | 91.36% | 89.59%
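For reference, the three metrics can be computed for a multi-class task as sketched below. Macro-averaging precision and recall over purposes is an assumption, since the slides do not say how the per-purpose values were combined; the labels are made-up examples.

```python
def evaluate(true_labels, predicted_labels):
    """Return accuracy, macro-averaged precision, and macro-averaged recall."""
    pairs = list(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for _, p in pairs)
        actual = sum(t == label for t, _ in pairs)
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Five made-up instances with their true and predicted purposes.
truth = ["geotagging", "geotagging", "map", "map", "nearby"]
guess = ["geotagging", "map", "map", "map", "nearby"]
accuracy, precision, recall = evaluate(truth, guess)
```

In this toy case one geotagging instance is mistaken for map, which lowers recall for geotagging and precision for map while leaving the nearby purpose unaffected.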
Qualitative Analysis
 Determining factors:
 Distinctiveness of features and the number of features
 Findings:
 Categories with high precision and recall tend to
have distinctive features
 Most misclassified instances have fewer features
Feature comparison
 Text-based features achieve good accuracy alone
 App-specific features offer marginal
improvements
Permission | Classifier | Accuracy (text features) | Accuracy (all features) | Diff
Location | Maximum Entropy | 81.97% | 85.00% | 3.03%
Contacts | Maximum Entropy | 93.57% | 94.64% | 1.07%
Summary
 Developed a method to infer the purpose of
permission use in custom code
 Based on text mining and machine learning
 Evaluated our approach on two frequently used
permissions, achieving high accuracy
Future work
 Infer the purpose and monitor app behaviors
at runtime
 Enforce access control based on purposes
 Automatically infer the purpose of third-party
libraries
 Previous work only uses a manually labeled whitelist
Take-away Message
 Knowing the purpose of
permission use is important
 An app could use a permission
for different purposes
 People have different levels of concern about sensitive
data use for different purposes
 Text mining on decompiled code is useful for
inferring the purpose of permission use
Thanks! Questions?
howiepku@pku.edu.cn
sei.pku.edu.cn/~wanghy11
