The collection and analysis of user-level library data was a central theme at the 2014 ARL Library Assessment Conference (LAC). This is perhaps not surprising; the recent ACRL Value of Academic Libraries Report emphasized the need for libraries to systematically collect and incorporate large-scale data into their assessment activities, and many libraries (including my own) are actively seeking ways to use data about library services and collections in institutional efforts to better understand and measure student learning and engagement. Each of the three LAC keynote presentations discussed and demonstrated how library data might be utilized in various approaches to educational analytics in ways that were at once exciting and frightening (unfortunately at the time of writing these presentations are not yet online). These presentations provoked numerous Twitter discussions, particularly surrounding issues of privacy and the ethics of collecting “big data” about our users.

My fortune cookie on the way home from the ARL Library Assessment Conference underscores the importance of data ethics.
During this discussion, I commented that there seemed to be a somewhat laissez-faire attitude towards user privacy in the LAC presentations, especially with regard to the creation of large systematic datasets. Some of my colleagues suggested that I was being uncharitable, pointing out that just because privacy protections weren't discussed during a presentation didn't necessarily mean that they weren't in place. Some of this tension is certainly a result of what Barbara Fister identifies as the two models of research presented to librarians, "the scholar's way and the corporate way," and their very different approaches to human subjects research (this tension was also readily apparent in presenters' choices of models and metaphors throughout the conference). Nevertheless, it seemed clear to me that our libraries' technical capabilities have outpaced the sophistication of our ethical conversations about this type of data collection.
In the final keynote, David Kay, a consultant from UK-based Sero Consulting, went so far as to recommend that libraries collect and indefinitely retain transaction-level usage data from throughout their various systems at the level of individual users. Just to be clear on what this might look like: a dataset containing every search, every resource used, every item record viewed, every article browsed or downloaded, and every book checked out by every library user. This data would also be collected in such a way that it could be linked to other institutional data such as demographic information, financial aid, measures of student and faculty success (engagement, retention, GPA, grant funding, publication records), and really anything else that a university might track at an individual level. The creation of this type of dataset is already within our libraries’ technical capacities, and quite a bit of it probably already exists within our server logs. In fact, I’ve had serious discussions about how we might collect this kind of data at my own library (although we have not yet done so).
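To be concrete about how little technical effort this requires: the sketch below (Python, with entirely hypothetical file and field names) joins a library transaction log to an institutional student record export on a shared campus ID. This is not any particular system's method, just an illustration that once the two files share an identifier, the linkage Kay described is a few lines of code.

```python
import csv
from collections import defaultdict

# Hypothetical inputs: a transaction log exported from library systems and
# an institutional student record export, both keyed on the same campus ID.
transactions = defaultdict(list)
with open("library_transactions.csv", newline="") as f:
    for row in csv.DictReader(f):  # e.g. campus_id, timestamp, action, item
        transactions[row["campus_id"]].append(row)

with open("student_records.csv", newline="") as f:
    for student in csv.DictReader(f):  # e.g. campus_id, major, gpa, aid_status
        # Every search, download, and checkout, now linked to demographics,
        # financial aid, and GPA: the dataset described above.
        linked = {**student, "transactions": transactions[student["campus_id"]]}
        print(linked["campus_id"], len(linked["transactions"]), "events")
```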
Before creating these types of datasets, we have an ethical obligation to evaluate their potential risks and benefits to the people whose information they contain.
There are numerous potential benefits to mining large library datasets. To name a few: providing better services to our students, faculty and public users, making efficient use of our funds, improving management of our collections, demonstrating the role of libraries in student learning and faculty research, and identifying students at risk of dropping out or in need of additional help. All of these are important and laudable goals, and I do think that much of the push towards collecting more and more fine-grained educational data is motivated by a desire to help students (although we should also be very aware of the financial and ideological interests at play).
Despite these potential benefits, difficulties soon arise when we begin to evaluate the potential risks of these types of data. While they might initially be created for benign purposes, as I have argued elsewhere, one problem with creating and storing these data, especially if they are retained in perpetuity, is that it is extremely difficult to anticipate what they might be used for, or more critically, what they might be misused for. At least initially, much of the analysis and insight listed above would require collecting identifiers that allow the data to be associated with individuals. Imagine a dataset that contains every topic you ever searched as an undergraduate or graduate student. Would you want this dataset to be available for data mining and analysis? Would you trust your university never to sell it or otherwise disclose it? What if the university decided to provide it to potential employers as evidence of a student's "preparedness"? What if it became part of political proceedings? Or a civil or criminal suit?
Once a dataset exists it is subject to subpoena by law enforcement (IRB consent forms usually warn about legally required disclosure). There is no researcher-subject privilege and we should not assume that our universities will be willing or able to resist a subpoena. Recent events surrounding the Boston College Belfast Project, in which oral history interviews that were intended to remain secret until after a research participant’s death were subpoenaed as part of a British Government murder investigation, have amply demonstrated the potential risks of archived datasets. Given the FBI and other investigative agencies’ previous interest in library data, it is not difficult to imagine myriad scenarios in which transaction data might be requested. In my opinion, these types of large datasets therefore represent a significant risk for unintended use and disclosure.
Removing identifying information from stored datasets is often used as a strategy to mitigate or eliminate the risks of unintended disclosure. Unfortunately, even de-identifying a dataset by destroying the links between individually identifying information and other data may be insufficient to protect research subjects' privacy. Re-identifying research subjects from the information that remains in a dataset is becoming steadily easier, and it would be easier still with a library user dataset, since the source population is known (i.e. the students enrolled at a university during particular dates). For this reason, a dataset containing even rudimentary demographic data about students (such as major, graduation year, sex, ethnicity, etc.) might be impossible to de-identify (I've discussed this further in a report on data stewardship practices; see p. 94).
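To illustrate why, here is a minimal sketch (Python, hypothetical file and field names) that measures how identifiable a "de-identified" dataset really is by counting how many records share each combination of quasi-identifiers, the standard k-anonymity check. Any combination shared by only one record is effectively a fingerprint, and a known source population makes matching that fingerprint to a name straightforward.

```python
import csv
from collections import Counter

# Quasi-identifiers: none of these fields names an individual on its own,
# but their combination frequently does. (Field names are hypothetical.)
QUASI_IDENTIFIERS = ("major", "graduation_year", "sex", "ethnicity")

with open("deidentified_usage_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

combos = Counter(tuple(row[k] for k in QUASI_IDENTIFIERS) for row in rows)

# k-anonymity: the size of the smallest group. k == 1 means at least one
# record is unique on these fields alone and is trivially re-identifiable
# by anyone who knows the source population (e.g. the registrar's roster).
k = min(combos.values())
unique = sum(1 for count in combos.values() if count == 1)
print(f"k-anonymity: {k}; {unique} of {len(rows)} records are unique")
```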
Because of these unknown risks, obtaining consent is another major problem with collecting these types of data. Like most libraries, my library obtains passive consent for our routine user data collection via our library privacy policy. This covers data collection for the mission of the library, but doesn't specify whether linking to other institutional data falls under this mission (incidentally, our policy also explicitly forbids linking user accounts with specific items and records). There is also no effective opt-out procedure for this policy, other than refraining from using the library systems that require a log-in. While this is the form of consent most web services use for their data collection (via their Terms of Service agreements), it would almost certainly be viewed as coercive by an IRB.
I don’t presume to have immediate answers to all these issues, but in the interest of moving the conversation forward, I will suggest a few possible guidelines for consideration:
- Data should be aggregated at a level that balances analytical specificity with user privacy. For example, electronic usage data might be collected at the resource level rather than the item level, or circulation data at the LC classification level (see the first sketch after this list).
- Transaction-level data that identifies both user and item should be avoided unless required for a specific and limited purpose, and should never be collected systematically. If this data is required, additional measures such as local encryption of the files should be used to protect individuals' privacy (see the second sketch after this list).
- Datasets containing user demographic data should not be retained indefinitely and should be destroyed after a reasonable period following the completion of data analysis.
- Consent procedures should be reviewed before data collection, and procedures to provide opt-out and/or explicit consent should be developed when necessary.
- We should hold our vendors to the same data ethics standards that we adhere to ourselves. Moreover, we should not purchase or otherwise use data that does not meet our ethical standards.
- The ALA should consider adding statements on research ethics and data ethics to its Code of Ethics.
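As a concrete illustration of the first guideline, here is a minimal sketch (Python, hypothetical file and field names) that rolls circulation transactions up to the LC class at the point of collection, so that only classification-level counts are ever stored and no row links a user to an item.

```python
import csv
from collections import Counter

def lc_class(call_number: str) -> str:
    """Reduce a full LC call number (e.g. 'QA76.9.D343') to its class ('QA')."""
    letters = []
    for ch in call_number:
        if ch.isalpha():
            letters.append(ch)
        else:
            break
    return "".join(letters) or "UNKNOWN"

# Aggregate at write time: only classification-level counts are retained,
# never a row linking a specific user to a specific item.
counts = Counter()
with open("circulation_log.csv", newline="") as f:  # hypothetical export
    for row in csv.DictReader(f):  # e.g. call_number, timestamp (no user ID)
        counts[lc_class(row["call_number"])] += 1

for cls, n in counts.most_common():
    print(cls, n)
```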
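And for the rare cases where transaction-level data is genuinely required, the "local encryption" in the second guideline might look something like the sketch below, which uses the third-party cryptography package's Fernet interface. File names are hypothetical, and key management (keeping the key separate from the data and its backups) is deliberately left as the hard problem it is.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Generate a key once and store it apart from the data (and from backups
# of the data); whoever holds the key effectively holds the dataset.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt the raw transaction export at rest.
with open("transactions_raw.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())
with open("transactions_raw.csv.enc", "wb") as f:
    f.write(ciphertext)

# Decrypt only for the specific, limited analysis, then discard the plaintext.
plaintext = fernet.decrypt(ciphertext)
```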
Unlike Google and other web services, our business model does not require us to turn our users into commodities. We should continue to hold ourselves to a higher standard.