Background

In early 2018, the Library began negotiating with ITS for use of aggregated log data from the wifi access points in the Libraries. The Library was given access to a pilot version of this data in March 2019. We’ve been working with StatLab fellows this Fall to wrangle and visualize the data.

Currently, this pilot analysis is for internal library use only. Please do not distribute further.

The Data

We received 2018 data in the form of half-hour snapshots of devices connected to wifi access ports in Alderman Library, Clemons Library, and Harrison-Small Library; where possible, ITS merged the device data with information about the device owner, removing names and personally identifiable information.

Initial Variables

Name (type): description

  • _time (chr): time stamp in half hour increments
  • src_ip (chr): ip address of wifi port
  • enc_mac (chr): hashed mac id of connected device (further de-identified and removed)
  • enc_user (chr): hashed user id of device owner (further de-identified and removed)
  • MBU (chr): major business unit (present primarily for employees - fac, staff, student workers - 76 response categories)
  • Department (chr): department/unit (present primarily for employees - fac, staff, student workers - 453 response categories)
  • Registrar_School (chr): school affiliation (present primarily for students, 66 response categories)
  • Display_Department (chr): department (present for students, staff, faculty, 1,234 response categories)
  • uvaPersonalAMAffiliation (chr): user relationship to uva, can contain multiple responses (e.g., student, student_worker, student_applicant, grace_student, former_student, staff, former_employee, retired, faculty, emeritus, alumni, sponsored) See here for complete list

We combined the data files (~0.6 GB, 14,343,883 records/connections, representing 38,120 distinct users) into a single file, replaced the hashed user and device ids with newly generated unique ids to prevent backward identification, and added the location of the connection (Alderman, Clemons, Harrison-Small) as a field.
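The combining and re-keying step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the analysis: the helper names `combine` and `rekey`, the `location` field name, the id prefixes, and the toy record values are all assumptions made for the example.

```python
import itertools


def combine(files_by_location):
    """Stack per-library record lists and tag each record with its location."""
    combined = []
    for location, records in files_by_location.items():
        for rec in records:
            combined.append({**rec, "location": location})
    return combined


def rekey(records, field, prefix):
    """Replace each distinct value of `field` with a newly generated id.

    The old-to-new mapping lives only inside this function and is thrown
    away on return, so the new ids cannot be traced back to the hashes.
    """
    mapping = {}
    counter = itertools.count(1)
    rekeyed = []
    for rec in records:
        old = rec[field]
        if old not in mapping:
            mapping[old] = f"{prefix}{next(counter):06d}"
        rekeyed.append({**rec, field: mapping[old]})
    return rekeyed


# Toy records standing in for the per-library files (all values are made up).
files_by_location = {
    "Alderman": [
        {"_time": "2018-04-02 10:00", "enc_user": "a9f3", "enc_mac": "77c1"},
    ],
    "Clemons": [
        {"_time": "2018-04-02 10:30", "enc_user": "a9f3", "enc_mac": "77c1"},
        {"_time": "2018-04-02 10:30", "enc_user": "b210", "enc_mac": "0e55"},
    ],
}

combined = combine(files_by_location)
combined = rekey(combined, "enc_user", "user_")
combined = rekey(combined, "enc_mac", "dev_")
```

Re-keying after combining means the same user or device keeps one id across all three libraries, which is what allows distinct users to be counted across locations.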

We supplemented these data with gate count data and with a measure for whether the relevant library was open or closed for each half hour for specific analyses.

Key Questions

  1. How many unique users are in each library across time (e.g., by half hour, day, week, month, semester, year)? And how does this compare to gate counts?
  2. How many unique users by type of user?
  3. How long are users in the library for a given visit? And how does this vary across time (by day, week, month, semester)? How many visits do users make to a given library?
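To make questions 1 and 3 concrete, here is a minimal sketch of how they could be computed from half-hour snapshots. It is an illustration under stated assumptions, not the method used for the results: the field names `user`, `location`, and `time` are hypothetical, and a "visit" is assumed here to be a run of consecutive half-hour snapshots for the same user at the same library.

```python
from collections import defaultdict
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)


def unique_users_by_day(records):
    """Count distinct user ids per (location, calendar day)."""
    seen = defaultdict(set)
    for rec in records:
        seen[(rec["location"], rec["time"].date())].add(rec["user"])
    return {key: len(users) for key, users in seen.items()}


def visits(records):
    """Group each user's snapshots at a location into consecutive runs.

    Returns (user, location, start_time, n_snapshots) tuples. A gap of more
    than one half hour starts a new visit (an assumption of this sketch).
    """
    by_user = defaultdict(set)
    for rec in records:
        by_user[(rec["user"], rec["location"])].add(rec["time"])

    result = []
    for (user, location), times in by_user.items():
        ordered = sorted(times)
        start, count = ordered[0], 1
        for prev, cur in zip(ordered, ordered[1:]):
            if cur - prev == HALF_HOUR:
                count += 1
            else:
                result.append((user, location, start, count))
                start, count = cur, 1
        result.append((user, location, start, count))
    return result


def t(stamp):
    """Parse a 'YYYY-MM-DD HH:MM' time stamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M")


# Toy snapshots (all values are made up).
records = [
    {"user": "user_1", "location": "Clemons", "time": t("2018-04-02 10:00")},
    {"user": "user_1", "location": "Clemons", "time": t("2018-04-02 10:30")},
    {"user": "user_1", "location": "Clemons", "time": t("2018-04-02 15:00")},
    {"user": "user_2", "location": "Clemons", "time": t("2018-04-02 10:30")},
]

daily = unique_users_by_day(records)
visit_list = visits(records)
```

Grouping by week, month, or semester would only change the key built in `unique_users_by_day`. Visit length in this sketch is the snapshot count times 30 minutes, so it is accurate only to the half hour.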

Relevant Assumptions and Caveats

User Affiliation

Users could have multiple affiliations, but these appeared to be designated without a set order (e.g., some users were student^student_worker, others were student_worker^student). We used these in combination with Registrar_School, assigned for students, and Display_Department and MBU (major business unit), assigned for employees, to generate a primary affiliation as follows:

  1. users with an assigned school in Registrar_School are categorized as students (based on the LDAP definitions);
  2. among students, those with a school designation ending in U or UN are assigned as undergraduates; the remaining students are assigned as graduate students;
  3. users with a Display_Department beginning with E are categorized as (non-student) employees (based on the LDAP definitions);
  4. among employees, those with “faculty” listed as an affiliation are assigned as faculty; the remaining employees are assigned as staff;
  5. among employees, those assigned to “LB” in MBU are assigned “library staff”, overriding a prior assignment;
  6. users who are neither students nor employees, but with “alumni” or “former_student” listed as an affiliation, are assigned “alumni”;
  7. users with missing user information are assigned “unknown”;
  8. all remaining users are assigned “other”.
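The rules above can be written as a single function. The following is a sketch rather than the code used for the analysis. It assumes that multiple affiliations are separated by "^", that "missing user information" means all four fields are empty, and that a student who also has an employee department stays a student (per the "non-student" wording in rule 3). The category labels and the sample field values in the usage note are made up for illustration.

```python
def primary_affiliation(user):
    """Assign one primary affiliation to a user record (a dict of strings)."""
    school = (user.get("Registrar_School") or "").strip()
    dept = (user.get("Display_Department") or "").strip()
    mbu = (user.get("MBU") or "").strip()
    raw = (user.get("uvaPersonalAMAffiliation") or "").strip()
    # Order of the "^"-separated values is not meaningful, so use a set.
    affiliations = {part for part in raw.split("^") if part}

    # Rules 1-2: any assigned school makes the user a student.
    if school:
        if school.endswith(("U", "UN")):
            return "undergraduate"
        return "graduate"

    # Rules 3-5: non-student employees; the library MBU overrides
    # the faculty/staff split.
    if dept.startswith("E"):
        if mbu == "LB":
            return "library staff"
        return "faculty" if "faculty" in affiliations else "staff"

    # Rule 6: alumni among users who are neither students nor employees.
    if affiliations & {"alumni", "former_student"}:
        return "alumni"

    # Rule 7: no user information at all (assumed: every field is empty).
    if not (dept or mbu or affiliations):
        return "unknown"

    # Rule 8: everyone else.
    return "other"
```

For example, `primary_affiliation({"Registrar_School": "XXU", "uvaPersonalAMAffiliation": "student_worker^student"})` falls under rules 1 and 2 regardless of the order of the affiliation values, because the school designation alone decides student status.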

Walk-bys

We believe connections are overcounted because of walk-bys: users whose devices connect briefly as they pass by the building or its connection points. This becomes especially apparent in the continuation of (usually) half-hour connections during periods of building closure. We did some initial investigation into whether we could estimate and filter out these connections, but ultimately did not pursue this.
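As a rough illustration of how the scale of this problem might be gauged (this was not part of the analysis), the sketch below computes the share of connections recorded while each library was closed. The field names `location` and `is_open` are hypothetical, and the toy values are made up.

```python
from collections import Counter


def closed_share(records):
    """Return, per location, the fraction of connections made while closed."""
    total = Counter()
    closed = Counter()
    for rec in records:
        total[rec["location"]] += 1
        if not rec["is_open"]:
            closed[rec["location"]] += 1
    return {location: closed[location] / total[location] for location in total}


# Toy connections (all values are made up).
records = [
    {"location": "Alderman", "is_open": True},
    {"location": "Alderman", "is_open": True},
    {"location": "Alderman", "is_open": True},
    {"location": "Alderman", "is_open": False},
    {"location": "Clemons", "is_open": True},
]

share = closed_share(records)
```

Connections during closure are only a lower-bound signal: walk-bys presumably also occur while the building is open, where this measure cannot distinguish them from short visits.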

Missing

The data for January through March is missing for Clemons.

Results

Next steps

Should the library decide to pursue this information further:

In the meantime, if this provokes ideas for you and you’d like to see if we could examine the data in another way, let us know.