Measuring Project Exposure Using Types Declared in a Java Project

Michael D. Feist, Ian Watts, Abram Hindle
Department of Computing Science
University of Alberta
Edmonton, Canada
[email protected], [email protected], [email protected]

ABSTRACT
A Java project contains many different types which are defined by the language, the developers and included libraries. This paper examines how many of those types are being declared by developers in Java projects. A tool for computing the difference between Abstract Syntax Trees (ASTs) is used to compare revisions in GitHub repositories to find what libraries and types are contributed and changed by different authors. From these differences, the number of type declarations by an individual developer is calculated. A developer’s exposure to different parts of a project is measured by how many of the total number of types they have used. A comparison between developers is used to see if programmers specialize their type usage in their contributions. We find that most projects use a large number of types and most developers only declared a small portion of the total types. We also propose the use of this metric to encourage developers to increase their exposure to more parts of a project.

CCS Concepts
•Software and its engineering → Software organization and properties;

Keywords
Software Engineering; Mining Software Repositories; Programming Languages; Abstract Syntax Trees

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MSR’16 May 14–15, 2016, Austin, TX, USA
© 2016 ACM. ISBN X-XXXXX-XX-X/XX/XX. . . $15.00
DOI: 10.1145/1235

1. INTRODUCTION
In Java, a type can be a primitive type or an object reference. Primitive types are those that are built into the language such as an int or char. Object references point to instances of classes in memory which are defined in the project or through an included library. The types in a project represent the functionality available for a developer to work with. A large or complex project may have many different types, whereas a small project might only use a few. Different parts of a program are going to use different types and libraries. When writing or contributing to a project, the author’s choice of which part of the program to edit will in turn affect what types they end up working with.

There is a lot of information on how types are used as a language feature, but there is little on how types are used by developers. There are many factors that could affect how many types developers use in a project. In particular, collaborating with multiple developers could have quite an effect. Work is partitioned among developers, which could limit exposure to some types. Developers typically have a role in the development process that is defined by the project, their collaborators and their experience level [10]. A programmer may choose to stay within their comfort areas and not use certain types or leave their use to other developers. As a developer’s experience with a project increases they may be inclined to use more types. Some types fulfill a specific task, so they may only apply to a small part of a project.

The biggest factor in an author’s type usage is likely their activity level and involvement in a project. There are many more people watching repositories than actually contributing to them. Increasing the activity level and encouraging aspiring developers would be beneficial to the overall health of the repository [10].

In this paper we study the types developers are using by examining each revision in a repository and finding the types which are being declared by the developer making the revision. We looked at Java repositories to answer the following research questions:

RQ1: Do developers declare a small subset of the total types in a project?

RQ2: How many developers in each project had large type coverage?

2. RELATED WORKS
The usage of language features in Java and the behaviour of developers have been studied before. Robert Dyer et al. [2] mined AST nodes to study the use of new Java language features over time. They found that all new Java features do get used, but not nearly as often as they could have been. New features varied greatly in popularity and adoption rate. Developers would also modify old code to use new features as they were released. The authors also found the most popular features and how teams adopted new features. They did not check to see how much a developer used a feature but instead
only checked to see if they used it at all.

Grechanik et al. [4] examined the structure of Java code mined from 2080 programs. Their paper examined the breakdown of syntactic structures in open source repositories; however, it does not consider per-author statistics.

Meyerovich and Rabkin [7] surveyed developers and examined repositories to learn about language adoption and usage. They found that developers feel that certain language features are more important than others.

Parnin et al. [8] mined repositories to see how Java generics have been used in open source projects and found that generic usages were often introduced by a single developer in a project. Generics usage was also fairly narrow, with the primary use being collecting and traversing lists of objects.

Patrick Wagstrom et al. [10] explored the roles that developers take in networked, social development environments like GitHub. These roles were defined by their level of contribution as well as the types of issues they dealt with. Developers will take on multiple roles in a project and will sometimes fulfill the same role across different projects.

Lämmel et al. [6] used ASTs to examine the usage of APIs in Java projects. They identify which APIs are popular and whether they are used in a framework-like manner.

These studies examined the usage of language features, APIs and object-oriented structures but did not look at the types used in Java projects.

Figure 1: Shows how many different types are present in each project.

Figure 2: The distribution of authors by their coverage level.

3. METHODOLOGY
In this section, we define our measure of project exposure, type coverage, and list the additional tools used. Next, we explain the data collection and methodology used to mine the repositories.

3.1 Metrics
In order to measure the number of different areas that a developer has contributed to, or covered, we use the number of types they have declared out of the total number present in the repository. Therefore, a developer’s type coverage of a project is the percentage of the total types that the developer added or changed at least once. This includes the Java static types, other Java objects in the repository and all objects added from libraries. If a developer has declared every type involved in the project, they have likely seen every part of it. Even if they are not editing every file, a developer with full type coverage would have knowledge of all types existing in the untouched files.

An Abstract Syntax Tree (AST) is a tree structure which represents a piece of code by breaking it into its syntactic constructs. Each node in the tree represents a syntactically valid chunk of code which can be contained in a parent node or broken up into child nodes. This captures the structure of the code in an abstract manner. When a change is made to a file, the difference will also appear in the AST. This means that the ASTs of revisions in code repositories can be compared to see what structures a developer is touching.

3.2 Data Set
Our data set consists of 216 GitHub repositories. To ensure that we only looked at Java repositories of reasonable size, we first queried Boa [1] for eligible repositories. We ran our query on the September 2015 GitHub dataset. The criteria for a repository to be accepted were that it included at least 10 Java files, 3 different committers, 30 commits, and at least one commit from after 2014 [5]. A total of 22475 repositories fit the criteria. This allowed us to narrow our search down to sizable Java projects that were recently active. Out of the possible repositories returned from Boa we randomly sampled 216 repositories and pulled them from GitHub.

3.3 Tools
The ASTs were generated using Spoon [9]. Spoon is a tool for transforming and analysing Java source code. It breaks code up into a meta-model such as an AST. The model consists of three parts: structural elements, code elements and
references to program elements. The Spoon model is convenient because it provides the structure for performing an AST diff while retaining the code elements for analysis afterwards. Spoon also preserves all type and library information, which is what we were interested in.

GumTree [3] is a library which we used to compute the difference between two ASTs generated by Spoon. GumTree does not handle empty files by default, so in order to compare new files we had to modify GumTree so that a newly added file is compared against an empty AST.

3.4 Approach
To determine what types an author declared in a project we looked at all revisions in the master branch in each of the 216 GitHub repositories. Using the GumTree algorithm with Spoon, we were able to generate AST differences for each Java file changed between revisions. We only looked at additions and modifications since deletions do not necessarily show that an author has used the type they are deleting. We then counted the number of times a type is declared in the ASTs. By combining the unique types declared by each author in a given project, we were able to determine the total number of types declared in each project. This then allowed us to calculate the percentage of types an author has modified or added compared to the total number of types in a project. Computing the ASTs of every revision was very computationally expensive, which limited the number of repositories which could be analysed.

Figure 3: Shows how common developers with above X% coverage levels are in a project.

Figure 4: Shows that most projects have at least one developer with a high coverage level.

4. ANALYSIS AND FINDINGS
We found that more than half of the projects used fewer than 250 different types but there were a few projects that used thousands of different types. The number of types per project was distributed in a power-law-like fashion.

As an example, we identified one developer with high type coverage who worked on a large project with over 4,000 total commits. We found that the developer had contributed to the project for over 5 years since the time of its inception. Working on the project from the start gave them the opportunity to create many of the project-specific types and use them over their 5 years of development. Another developer in the same project with very low type coverage had more recent commits. According to the commit logs this developer appeared to be cleaning up code and fixing bugs.

4.1 RQ1: Do developers declare a small subset of the total types in a project?
We found that among the total of 3334 developers the majority cover less than 30% of the types in the project they contribute to, with a small group covering more than 80% of the types. Interestingly, there were many developers who did not make any type declarations and only made changes to other parts of Java files. There were also very few developers who had a type coverage level between 30% and 80%, suggesting that there were not many developers with only moderate involvement in a repository. They either stuck to a small portion of types or utilized nearly all of them. This is shown in Figure 2.

4.2 RQ2: How many developers in each project had large type coverage?
We found that on average a project has about 7 developers. On average, these projects only had two developers who modified or added over 50% of the types in the project. Furthermore, on average only one developer modified or added over 80% of the types. This is shown in Figure 3.

Looking at Figure 4 we can see that approximately 70% of the projects have at least one developer who has modified or added over 90% of the types used in a project. We found that all projects have at least one developer that modified or added over 50% of the types. The highest number of developers on a project was 347 but the maximum number of developers with 50% or more type coverage was only 8. This means no large projects had many developers who had covered all types.
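As an illustration of the type coverage measure from Section 3.1 and the threshold counts reported in this section, the following is a minimal sketch in Python. It is not the tooling used in this study, which relied on Spoon and GumTree. It assumes the AST differencing step has already been reduced to (author, type name) pairs, one per added or modified type declaration. The function names, the input format and the strict "over X%" comparison are assumptions made for this example.

```python
from collections import defaultdict


def type_coverage(declarations, authors=()):
    """Return {author: fraction of the project's types that author declared}.

    declarations: iterable of (author, type_name) pairs, one per added or
        modified type declaration (hypothetical input format).
    authors: optional extra author names, so that contributors who made no
        type declarations still appear with a coverage of 0.0.
    """
    types_by_author = defaultdict(set)
    for author in authors:
        types_by_author[author]  # create an empty set for this author
    for author, type_name in declarations:
        types_by_author[author].add(type_name)

    # The project's total types are the union of every author's unique types.
    project_types = set().union(*types_by_author.values())
    if not project_types:
        return {author: 0.0 for author in types_by_author}
    return {
        author: len(declared) / len(project_types)
        for author, declared in types_by_author.items()
    }


def developers_above(coverage, threshold):
    """Count developers whose coverage is strictly greater than threshold."""
    return sum(1 for value in coverage.values() if value > threshold)


# Made-up example data: three authors and four distinct types.
example = [
    ("alice", "int"), ("alice", "String"),
    ("alice", "java.util.List"), ("alice", "Parser"),
    ("bob", "int"), ("bob", "String"),
    ("carol", "Parser"),
]
coverage = type_coverage(example)
print(coverage)                         # alice 1.0, bob 0.5, carol 0.25
print(developers_above(coverage, 0.5))  # 1 (only alice is over 50%)
```

Sets are used so that repeated declarations of the same type by one author count once, which matches the "at least once" wording of the metric.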
4.3 Discussion
If developers were able to see their type coverage when contributing to a repository, it might make them more aware of the parts of the project that they had not touched. This could encourage them to fix bugs, add features, or make changes to new areas in order to increase their type coverage. This could also give less active developers an incentive to contribute more to a project to distinguish themselves. When a developer’s type coverage rate drops, it could also make them more aware of new types and changes in the repository that may have gone unnoticed without such a metric.

4.4 Threats to Validity
Threats to the validity of this project include threats to construct and external validity. Discriminant threats to construct validity include projects that had many files that were not Java source files, such as HTML or Ruby files. These projects might not accurately represent an average Java project or the behaviour of Java developers. There were also some authors who committed a large number of Java files all at once. This suggests that they may have uploaded a library and did not write the code themselves. Git settings allow a user to set a name which is used in the revision info. This means that multiple users in the stats could be the same author under different names. This usually happens when the author uses multiple computers.

External threats to validity include sampling exclusively from the September GitHub dataset and the possibility that there could be large projects that only use a few types, however uncommon they may be.

5. CONCLUSIONS AND FUTURE WORK
In this paper we investigated the number of types used in Java projects and the number of types covered by each developer. In the data provided for RQ1, we found that most developers declare a small percentage of the total number of types in a project. Developers declare a small subset of types, and many types in a project go unused by individual developers. For RQ2 we found that most repositories had at least a few developers with high type coverage. Having a large number of developers with high type coverage was very uncommon.

There are many more questions that can be answered by comparing differences in ASTs as well as looking at type coverage. AST diffs could be used to look at more structural differences between code revisions. The number of other language features used by the developers could be counted to see how much of the Java language they use and what they do not know.

The type coverage of a developer could be weighted against the number of bugs they commit, which could indicate that a developer has gone beyond their expertise level. ASTs could be used to find what structure the bugs were committed in to help the developer identify their weak areas. A study could also be done to see if developers react positively to knowing their type coverage rate in a repository.

6. REFERENCES
[1] R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In 35th International Conference on Software Engineering, ICSE 2013, pages 422–431, May 2013.
[2] R. Dyer, H. Rajan, H. A. Nguyen, and T. N. Nguyen. Mining billions of AST nodes to study actual and potential usage of Java language features. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 779–790, New York, NY, USA, 2014. ACM.
[3] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus. Fine-grained and accurate source code differencing. In ASE 2014, page 11 p., France, 2014.
[4] M. Grechanik, C. McMillan, L. DeFerrari, M. Comi, S. Crespi, D. Poshyvanyk, C. Fu, Q. Xie, and C. Ghezzi. An empirical investigation into a large-scale Java open source code repository. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’10, pages 11:1–11:10, New York, NY, USA, 2010. ACM.
[5] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian. The promises and perils of mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, pages 92–101, New York, NY, USA, 2014. ACM.
[6] R. Lämmel, E. Pek, and J. Starek. Large-scale, AST-based API-usage analysis of open-source Java projects. In Proceedings of the 2011 ACM Symposium on Applied Computing, SAC ’11, pages 1317–1324, New York, NY, USA, 2011. ACM.
[7] L. A. Meyerovich and A. S. Rabkin. Empirical analysis of programming language adoption. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 1–18, New York, NY, USA, 2013. ACM.
[8] C. Parnin, C. Bird, and E. Murphy-Hill. Java generics adoption: How new features are introduced, championed, or ignored. In Proceedings of the 8th Working Conference on Mining Software Repositories, MSR ’11, pages 3–12, New York, NY, USA, 2011. ACM.
[9] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier. Spoon: A library for implementing analyses and transformations of Java source code. Software: Practice and Experience, page na, 2015.
[10] P. Wagstrom, C. Jergensen, and A. Sarma. Roles in a networked software development ecosystem: A case study in GitHub, 2012.