Award Abstract # 1948244
CRII: SaTC: Automatic Generation of API to Natural Language Data Type Mappings for Developer and End User Privacy Risk Mitigation

NSF Org: CNS (Division Of Computer and Network Systems)
Recipient: THE UNIVERSITY OF TEXAS AT SAN ANTONIO
Initial Amendment Date: March 25, 2020
Latest Amendment Date: May 19, 2021
Award Number: 1948244
Award Instrument: Standard Grant
Program Manager: Sol Greenspan
sgreensp@nsf.gov
 (703)292-7841
CNS (Division Of Computer and Network Systems)
CSE (Direct For Computer & Info Scie & Enginr)
Start Date: March 15, 2020
End Date: February 29, 2024 (Estimated)
Total Intended Award Amount: $175,000.00
Total Awarded Amount to Date: $183,000.00
Funds Obligated to Date: FY 2020 = $175,000.00
FY 2021 = $8,000.00
History of Investigator:
  • Rocky Slavin (Principal Investigator)
    rocky.slavin@utsa.edu
Recipient Sponsored Research Office: University of Texas at San Antonio
1 UTSA CIR
SAN ANTONIO
TX  US  78249-1644
(210)458-4340
Sponsor Congressional District: 20
Primary Place of Performance: University of Texas at San Antonio
One UTSA Circle
San Antonio
TX  US  78249-1644
Primary Place of Performance Congressional District: 20
Unique Entity Identifier (UEI): U44ZMVYU52U6
Parent UEI: X5NKD2NFF2V3
NSF Program(s): Special Projects - CNS, Secure & Trustworthy Cyberspace
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 025Z, 8228, 9178, 9251
Program Element Code(s): 171400, 806000
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

Since the advent of the smartphone, an increasing share of the population has gained access to Internet-accessible software applications (apps). This, coupled with the various sensors available on mobile devices, makes the general public highly susceptible to privacy risks, as sensitive information (e.g., location, camera images, biometrics) may be leaked to the Internet. To help users make informed decisions about the potential privacy risks of using apps, regulators increasingly require app developers to include privacy policies communicating what information is collected or shared and how that information is used. However, even when such privacy policies are present, trust must be placed in the app developers to adhere to the promises therein. Furthermore, developers are accountable for adherence to their policies and must be confident that their privacy policies accurately represent their practices. This project aims to assist both developers and general app users in verifying the alignment of privacy policies and the apps they represent by producing an automated process for linking the semantics of language used in privacy policies with the code used to produce the apps themselves. Furthermore, the project will use this framework to generate tools that allow end users and developers to directly benefit from this work.

The research project aims to produce an automated process for generating mappings between code-level APIs and natural language data types using machine learning. The resulting mappings will be utilized in developer and end user tools to identify and help mitigate potential privacy leakage during development and app usage. The current state of misalignment detection between privacy policies and app code requires the manual generation of mappings from code-level Application Programming Interface (API) methods to privacy-oriented natural language data types. Even for small app categories, this process can require a human to review thousands of methods and hundreds of annotations, resulting in potential inaccuracies due to fatigue and incomplete domain knowledge. APIs also change as methods are introduced and deprecated, resulting in outdated mappings. These problems make it difficult to apply the framework practically as the environment continually evolves. This project will address these challenges through two contributions. First, machine learning will be applied to the mapping generation process to produce an automated, scalable method for generating code-phrase mappings for APIs as needed. This will allow for misalignment detection for API levels, methods, and app categories beyond those built in previous contributions. The automated approach will make use of state-of-the-art pre-trained language models to detect semantic similarity between API documentation and the natural language data types used in privacy policies. Second, the resulting mappings from the automated model will be applied to practical developer and end user tools to enable informed decisions for privacy risk mitigation. The PoliDroid tool suite will be developed, including a developer-oriented integrated development environment (IDE) plugin that detects potential unintended privacy leaks based on a privacy policy, and a real-time misalignment detection tool for end users.
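To illustrate the kind of semantic-similarity step described above, the following is a minimal sketch using a generic sentence-embedding model; the model name, the example API descriptions, and the 0.5 threshold are illustrative assumptions rather than the project's actual configuration.

```python
# Minimal sketch: mapping Android API methods to privacy-policy data types
# by embedding similarity. Model name, example descriptions, and the 0.5
# threshold are illustrative assumptions, not the project's configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# API documentation snippets (e.g., drawn from Android reference docs).
api_docs = {
    "LocationManager.getLastKnownLocation": "Returns the last known location from the given provider.",
    "TelephonyManager.getDeviceId": "Returns the unique device ID, for example the IMEI.",
}

# Natural-language data types as they might appear in privacy policies.
policy_types = ["precise location", "device identifier", "contact list"]

doc_emb = model.encode(list(api_docs.values()), convert_to_tensor=True)
type_emb = model.encode(policy_types, convert_to_tensor=True)
scores = util.cos_sim(doc_emb, type_emb)  # rows: API methods, columns: data types

for i, method in enumerate(api_docs):
    for j, data_type in enumerate(policy_types):
        score = float(scores[i][j])
        if score > 0.5:  # illustrative threshold
            print(f"{method} -> {data_type} ({score:.2f})")
```

In practice, such similarity scores would be computed over entire API levels and validated against manually curated mappings before being used for misalignment detection.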

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Note:  When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).


Hosseini, Mitra Bokaei and Breaux, Travis D. and Slavin, Rocky and Niu, Jianwei and Wang, Xiaoyin. "Analyzing privacy policies through syntax-driven semantic analysis of information types." Information and Software Technology, v.138, 2021. https://doi.org/10.1016/j.infsof.2021.106608
Hosseini, Mitra Bokaei and Heaps, John and Slavin, Rocky and Niu, Jianwei and Breaux, Travis. "Ambiguity and Generality in Natural Language Privacy Policies." 2021 IEEE 29th International Requirements Engineering Conference (RE), 2021. https://doi.org/10.1109/RE51729.2021.00014
Zhang, Xueling and Heaps, John and Slavin, Rocky and Niu, Jianwei and Breaux, Travis D. and Wang, Xiaoyin. "DAISY: Dynamic-Analysis-Induced Source Discovery for Sensitive Data." ACM Transactions on Software Engineering and Methodology, 2022. https://doi.org/10.1145/3569936
Zhang, Xueling and Wang, Xiaoyin and Slavin, Rocky and Niu, Jianwei. "ConDySTA: Context-Aware Dynamic Supplement to Static Taint Analysis." Proceedings of the IEEE Symposium on Security and Privacy, 2021. https://doi.org/10.1109/SP40001.2021.00040

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

This project aimed to close the gap between what mobile apps say in their privacy policies and what they actually do with user data, especially focusing on Android apps. The overarching aim was to develop automated methods for mapping domain-specific natural language data types used in privacy policies to Android API methods used in application source code. This endeavor was motivated by the need to verify the consistency and truthfulness of privacy policies regarding the data practices of mobile applications.

Year 1: Laying the Foundation

Initially, the project focused on identifying specific parts of app code known as API sources and sinks. These are essentially points where data can enter (source) or leave (sink) the app. Identifying these points helps pinpoint where data might be collected or shared without clear disclosure in the app's privacy policy. A major activity was the creation of two sets of source/sink annotations, aimed at training a machine learning classifier for automatic mapping generation. This effort resulted in the development of tools and models that significantly improved the accuracy of source and sink classification over existing state-of-the-art solutions. The success of these developments promised to simplify the application of natural-language-based privacy analysis tools across various API versions, potentially increasing their adoption outside research contexts.
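As an illustrative sketch of a source/sink classifier of the kind described, one might train a text classifier over annotated method descriptions; the tiny in-line dataset and the TF-IDF plus logistic regression pipeline below are assumptions for demonstration, not the project's actual model or data.

```python
# Illustrative sketch of a source/sink classifier trained on annotated API
# methods. The in-line examples and the TF-IDF + logistic regression pipeline
# are assumptions for demonstration, not the project's actual model or data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Annotated examples: method description -> SOURCE (data enters the app),
# SINK (data leaves the app), or NEITHER.
train_texts = [
    "Location getLastKnownLocation(String provider) returns the last known location",
    "void sendTextMessage(...) sends a text-based SMS to a destination address",
    "int getStreamVolume(int streamType) returns the current volume for a stream",
    "byte[] getHardwareAddress() returns the hardware (MAC) address of the interface",
    "void write(byte[] buffer) writes bytes to the underlying output stream",
]
train_labels = ["SOURCE", "SINK", "NEITHER", "SOURCE", "SINK"]

classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)

# Classify a previously unseen method, e.g., from a newer API level.
print(classifier.predict(
    ["String getSimSerialNumber() returns the serial number of the SIM"]
))
```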

Subsequent Years: Refinement and Expansion

Over the next two years, the project evolved to include the development of a refined approach for identifying app API leaks and their alignment with privacy policies, leveraging large language models (LLMs). A notable advancement was the creation of a policy-app dataset consisting of 200 Android apps and their corresponding policies. This dataset facilitated several experiments aimed at evaluating the effectiveness of LLMs in extracting data practices from privacy policies. The project's focus on leveraging LLMs for mapping API methods to privacy policy language represented a novel approach to bridging the semantic gap between the technical functionalities of apps and their described data practices.
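A minimal sketch of using an LLM to extract data practices from a policy excerpt is shown below; the prompt wording, model name, and example excerpt are illustrative assumptions and do not reproduce the project's prompts or its 200-app dataset.

```python
# Illustrative sketch of prompting an LLM to extract data practices from a
# privacy policy excerpt. The prompt, model name, and excerpt are assumptions
# for demonstration; the project's prompts and dataset are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

policy_excerpt = (
    "We collect your device's precise location and phone number to provide "
    "nearby search results, and we may share this data with advertising partners."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": "List the personal data types that the following privacy "
                       "policy text says are collected or shared, one per line.",
        },
        {"role": "user", "content": policy_excerpt},
    ],
)
print(response.choices[0].message.content)
```

Extracted data types from such a step could then be compared against the API-to-data-type mappings found in an app's code to flag potential misalignments.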

Final Year: Breakthroughs and Achievements

In the final year, the project achieved significant milestones, including the development of an automated method for identifying consistency between policy and code, which was not the initial specific objective but a highly desired outcome. This success underscored the project's ability to adapt and leverage recent advancements in LLMs and other methodologies to directly address the problem of policy misalignment. 

Intellectual Merit and Broader Impacts

The intellectual merit of this project lies in its innovative approach to addressing the complex challenge of aligning privacy policies with the actual data practices of mobile applications. By leveraging LLMs and developing automated methods for mapping domain-specific natural language data types to Android API methods, the project has significantly advanced the state of the art in privacy analysis tools. This work not only enhances the accuracy and efficiency of identifying privacy policy and app code misalignments but also contributes to the generalizability of privacy analysis across various API versions. Such advancements deepen our understanding of the semantic gap between natural language and application code, paving the way for future research in privacy protection technologies. 

The project's impact extends beyond academic contributions, aiming to facilitate the adoption of privacy risk mitigation frameworks from research into industry. The tools and methods developed for privacy policy and code alignment have the potential to reduce privacy risks and increase trust among end users of mobile applications. Furthermore, the project's emphasis on training and professional development has fostered a new generation of researchers equipped with the skills and knowledge to tackle privacy-related challenges. The dissemination of research findings through conferences and publications, combined with the direct application of project outcomes in industry practices, exemplifies the project's commitment to advancing both science and society through innovation in privacy protection.

This comprehensive four-year project not only advanced the technical understanding and methodologies for assessing privacy policy and app code alignment but also contributed to the training and development of new researchers in the field. Its outcomes are poised to influence both academic research and industry practices, underlining the project's significance in the broader context of privacy protection in the digital age. 

Last Modified: 03/29/2024
Modified by: Rocky Slavin
