Statistics

Events

The usage statistics of the Zenodo records are generated from two different types of events:

  • record-view: this event is created when a record is viewed. It is bound to the record_viewed signal, which is emitted when one of the following endpoints is accessed:

    • /record/<recid>
    • /record/<recid>/export/<format>
  • file-download: this event is created when a file is downloaded from a record. It is bound to the file_downloaded signal, which is emitted when one of the following endpoints is accessed:

    • /record/<recid>/files/<filename>
    • /api/record/<recid>/files/<filename>

Every time a user views a record or downloads a file, the corresponding signal is emitted and the corresponding event is created. If an event cannot be emitted, either because an event builder raises an exception or because the message cannot be sent to RabbitMQ, the request is not affected and a warning message is logged.
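
The failure handling described above can be sketched as follows. This is a minimal illustration, not Zenodo’s actual code: emit_event_safely, build_event and publish are hypothetical names standing in for the signal receiver, the event builder and the RabbitMQ publisher.

```python
import logging

logger = logging.getLogger("stats.events")


def emit_event_safely(build_event, publish, **signal_kwargs):
    """Build and publish one statistics event without ever raising.

    `build_event` and `publish` are hypothetical callables: the first turns
    the signal arguments into an event dict, the second sends it to the queue.
    """
    try:
        event = build_event(**signal_kwargs)
        publish(event)
        return True
    except Exception:
        # The request that triggered the signal must not fail because of
        # statistics, so the error is only logged as a warning.
        logger.warning("Could not emit statistics event", exc_info=True)
        return False
```

Catching every exception is deliberate here: losing a single event is considered cheaper than failing the user’s request.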

At creation time, each event records the following fields:

  • record_id
  • recid
  • conceptrecid
  • doi
  • conceptdoi
  • access_right
  • resource_type
  • communities
  • owners
  • timestamp
  • referrer

We also capture the following user-related fields for a limited period of time, i.e. until the event is processed and the data is anonymized:

  • ip_address
  • user_agent
  • user_id
  • session_id

There are also some fields which are event specific. For the file-download event we track:

  • bucket_id
  • file_id
  • file_key
  • size

For the record-view event we track:

  • pid_type
  • pid_value

After an event is created, it is sent to a dedicated RabbitMQ queue. Each type of event has its own queue, so different types of events can be processed separately. The events in the queue are then consumed and processed by a Celery task. In this step the flags is_machine and is_robot are added, the user is anonymized and double-clicks are removed. During anonymization the user-related fields are deleted and replaced by the visitor_id and the unique_session_id. If an event fails to be processed (e.g. because of a malformed IP address), the event is dropped and a warning message is logged. After processing, robot events are discarded to save space, since they are not relevant for the statistics. All other processed events are saved in Elasticsearch.
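
The anonymization step can be illustrated with the sketch below. It is an assumption-laden example rather than the real preprocessor: the hashing scheme, the salt argument, the choice of identity (user, then session, then IP and user agent) and the use of the clock hour as the session bucket are all hypothetical. Only the field names come from the lists above.

```python
import hashlib
from datetime import datetime

# User-related fields that must not survive processing.
USER_FIELDS = ("ip_address", "user_agent", "user_id", "session_id")


def _digest(*parts):
    """Hash the given strings into one opaque identifier."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def anonymize(event, salt):
    """Return a copy of `event` with user fields replaced by hashed ids."""
    # Assumption: prefer the most specific identity that is available.
    if event.get("user_id"):
        identity = "user:" + str(event["user_id"])
    elif event.get("session_id"):
        identity = "session:" + str(event["session_id"])
    else:
        identity = "anon:{}|{}".format(
            event.get("ip_address", ""), event.get("user_agent", "")
        )
    # Assumption: a "one-hour session" is approximated by the clock hour.
    ts = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S")
    hour_bucket = ts.strftime("%Y-%m-%dT%H")

    clean = {k: v for k, v in event.items() if k not in USER_FIELDS}
    clean["visitor_id"] = _digest(salt, identity)
    clean["unique_session_id"] = _digest(salt, identity, hour_bucket)
    return clean
```

With this scheme two views by the same visitor at 17:30 and 17:45 share a unique_session_id, while a view at 18:20 gets a new one but keeps the same visitor_id.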

Aggregations

All the events generated by record views or downloads are aggregated in several ways to produce daily statistics. Zenodo uses the following aggregations:

  • record-view-agg: this aggregation is applied to the record-view events and it calculates the daily views and unique views of a specific version of a record;
  • record-view-all-versions-agg: this aggregation is applied to the record-view events and it calculates the daily views and unique views of all versions of a record;
  • record-download-agg: this aggregation is applied to the file-download events and it calculates the daily downloads and unique downloads of a specific version of a record;
  • record-download-all-versions-agg: this aggregation is applied to the file-download events and it calculates the daily downloads and unique downloads of all versions of a record.

Both the record-view-agg and the record-view-all-versions-agg are applied to the same record-view events and the aggregate documents they produce are stored in the same indices/alias (stats-record-view). The difference between these two aggregations is that, while the first one aggregates the events by recid, the second one does the aggregation by conceptrecid. This leads to two different results: in the first case we have the statistics for a single version of a record, while in the second case we have the statistics for all the versions of a record.

For example, let’s say that we have the following record-view events:

{
    "timestamp": "2018-07-20 17:30:00",
    "recid": "123456",
    "conceptrecid": "78910",
    ...
}

{
    "timestamp": "2018-07-20 17:45:00",
    "recid": "123456",
    "conceptrecid": "78910",
    ...
}

{
    "timestamp": "2018-07-20 18:20:00",
    "recid": "26245",
    "conceptrecid": "78910",
    ...
}

The result of the record-view-agg will be two documents, one for each version of the record:

{
    "timestamp": "2018-07-23 10:30:00",
    "count": 2,
    "unique_count": 1,
    "recid": "123456",
    "conceptrecid": "78910",
    "is_parent": False,
    ...
}

{
    "timestamp": "2018-07-23 10:30:00",
    "count": 1,
    "unique_count": 1,
    "recid": "26245",
    "conceptrecid": "78910",
    "is_parent": False,
    ...
}

The result of the record-view-all-versions-agg will be one document which summarizes the statistics of both versions of the record:

{
    "timestamp": "2018-07-23 10:30:00",
    "count": 3,
    "unique_count": 2,
    "recid": "78910",
    "conceptrecid": "78910",
    "is_parent": True,
    ...
}

The same happens for the record-download-agg and the record-download-all-versions-agg, which are applied to the file-download events and end up in the stats-file-download indices/alias.

In order to count the total number of unique views (and unique downloads) of a record, it’s necessary to identify each one-hour user session with a unique id, called unique_session_id. All the views (and all the downloads) made by the same user within the same one-hour session have the same unique_session_id. In this way we can count the unique views (or unique downloads) of a record as the cardinality of the unique_session_id values present in the events related to the record.
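
The difference between grouping by recid and by conceptrecid, and the role of unique_session_id, can be reproduced in a few lines. This is a simplified sketch: the aggregate function is hypothetical, the grouping by day is omitted, and the session ids are made up so that the two 17:xx views share a session while the 18:20 view does not.

```python
from collections import defaultdict


def aggregate(events, key):
    """Group events by `key`; return count and unique_count per group."""
    counts = defaultdict(int)
    sessions = defaultdict(set)
    for event in events:
        counts[event[key]] += 1
        # unique_count is the cardinality of the session ids in the group.
        sessions[event[key]].add(event["unique_session_id"])
    return {
        group: {"count": counts[group], "unique_count": len(sessions[group])}
        for group in counts
    }


# The three record-view events from the example, with assumed session ids.
events = [
    {"recid": "123456", "conceptrecid": "78910", "unique_session_id": "session-a"},
    {"recid": "123456", "conceptrecid": "78910", "unique_session_id": "session-a"},
    {"recid": "26245", "conceptrecid": "78910", "unique_session_id": "session-b"},
]

per_version = aggregate(events, "recid")
all_versions = aggregate(events, "conceptrecid")
```

Here per_version holds a count of 2 and a unique_count of 1 for recid 123456, and 1 and 1 for recid 26245, while all_versions holds a count of 3 and a unique_count of 2 for conceptrecid 78910, matching the documents shown above.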

All the new aggregations are registered via the register_aggregations method. The aggregation task runs every hour and takes the events from Elasticsearch.

Queries

Metrics for each recid and conceptrecid are aggregated and stored in “daily” documents. For example, a record with recid 12345 will have documents like:

[
  {
    "_id": "12345-2018-01-01",
    "_index": "stats-record-view-2018-01",
    "_source": {
      "timestamp": "2018-01-01T00:00:00",
      "recid": "12345", "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
      "communities": ["biosyslit"],
      "count": 20, "unique_count": 15
    }
  },
  {
    "_id": "12345-2018-01-02",
    "_index": "stats-record-view-2018-01",
    "_source": {
      "timestamp": "2018-01-02T00:00:00",
      "recid": "12345", "record_id": "5bad6b11-84ed-4946-86a9-2b614a63d2b4",
      "communities": ["biosyslit"],
      "count": 40, "unique_count": 30
    }
  }
]

Although that representation is useful for displaying a histogram, it is not convenient for generating yearly or all-time statistics for a record. Invenio-Stats solves this by providing preconfigured Elasticsearch queries, which further aggregate the metrics over a period of time by filtering the daily documents and applying Metrics Aggregations.
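
As an illustration, a query of this kind could filter the daily documents of one record and sum their metrics. The sketch below only builds an Elasticsearch request body; build_totals_query is a hypothetical helper and not the query class used by Invenio-Stats. The field names are taken from the daily documents above.

```python
def build_totals_query(recid, start_date=None, end_date=None):
    """Build a request body that sums the daily metrics of one record."""
    filters = [{"term": {"recid": recid}}]

    # Optional time window on the daily "timestamp" field.
    time_range = {}
    if start_date:
        time_range["gte"] = start_date
    if end_date:
        time_range["lte"] = end_date
    if time_range:
        filters.append({"range": {"timestamp": time_range}})

    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "count": {"sum": {"field": "count"}},
            "unique_count": {"sum": {"field": "unique_count"}},
        },
    }
```

Run against the two daily documents above, such a body would yield a count of 60 and a unique_count of 45. Note that summing unique_count over days counts a visitor who returns on a different day once per day.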

The configured queries that are defined in zenodo.modules.stats.registrations are:

  • record-view: View statistics for specific record versions.
  • record-view-all-versions: View statistics for all versions of a record.
  • record-download: Download statistics for specific record versions.
  • record-download-all-versions: Download statistics for all versions of a record.

These queries are exposed via a REST API at /api/stats, which is accessible only to users with the admin-access permission.
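
A request to this endpoint might be prepared as in the following sketch. The request shape (one labelled entry per query, each with a stat name and its params) and the parameter names recid and conceptrecid are assumptions that this page does not confirm; build_stats_request is a hypothetical helper. The query names are the ones listed above.

```python
import json


def build_stats_request(recid, conceptrecid):
    """Build a JSON-serializable body asking for two view statistics."""
    return {
        # The labels ("version-views", "all-views") are arbitrary keys
        # chosen by the caller to tell the results apart.
        "version-views": {
            "stat": "record-view",
            "params": {"recid": recid},
        },
        "all-views": {
            "stat": "record-view-all-versions",
            "params": {"conceptrecid": conceptrecid},
        },
    }


body = json.dumps(build_stats_request("123456", "78910"), sort_keys=True)
```

The resulting string would be sent as the JSON body of a request to /api/stats by a client authenticated as a user with the admin-access permission.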

Records integration

While using Queries is enough to fetch the statistics of an individual record, it is not an optimal solution for the most common use-cases. Making an Elasticsearch query every time we want to display the total views, downloads, etc. of a record and all of its versions puts a lot of strain on Elasticsearch.

Another use-case is sorting records by views in search results. Elasticsearch has no SQL-like JOIN that would let us join the records with some aggregation of the stats-record-view indices (and even if it had one, it would probably not be very efficient). There’s only one solution left: to include the statistics inside the record’s indexed document.

Because of the above use-cases, we introduced a _stats field in the records Elasticsearch mapping. Every time a record is indexed (either through normal or bulk indexing), this field is built by performing the necessary sub-queries to Elasticsearch in order to fetch the all-time statistics of the record. These are:

  • views & unique_views
  • downloads & unique_downloads
  • volume & version_volume
  • version_views & version_unique_views
  • version_downloads & version_unique_downloads
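
The shape of the resulting field can be sketched as below. This is an illustration under stated assumptions: build_stats_field is a hypothetical helper, each argument stands for the totals returned by one of the four queries, the result keys count, unique_count and volume are assumed, and the mapping of the all-versions queries to the unprefixed fields and of the per-version queries to the version_ fields is a reading of the field names, not something this page states.

```python
def build_stats_field(version_views, all_views, version_downloads, all_downloads):
    """Combine the totals of the four queries into a `_stats` dict."""

    def total(result, key):
        # A missing query result or a missing key counts as zero.
        return (result or {}).get(key, 0)

    return {
        "views": total(all_views, "count"),
        "unique_views": total(all_views, "unique_count"),
        "downloads": total(all_downloads, "count"),
        "unique_downloads": total(all_downloads, "unique_count"),
        "volume": total(all_downloads, "volume"),
        "version_views": total(version_views, "count"),
        "version_unique_views": total(version_views, "unique_count"),
        "version_downloads": total(version_downloads, "count"),
        "version_unique_downloads": total(version_downloads, "unique_count"),
        "version_volume": total(version_downloads, "volume"),
    }
```

Defaulting to zero keeps the indexed document well-formed for records that have never been viewed or downloaded.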

Now that this pre-calculated information is part of the record index, we can use it in the following places:

  • For sorting search results (e.g. sort: '_stats.version_views')
  • At the record’s page, i.e. in the statistics box in the sidebar
  • At the record’s REST API responses and other serialization formats

Note

This means that rendering a record’s page or serializing a single record now depends on both the database and Elasticsearch being up and running to get a complete representation. Since statistics are not as critical as the actual record’s metadata, a failure to fetch a record from Elasticsearch will not raise an exception.

Now that we know how to make the statistics of a record available, we have one final problem to solve: we need to keep the statistics updated. Although records are indexed from time to time because of user- or system-initiated editing or publishing, there has to be a regular updating mechanism that indexes records that have not necessarily been “touched”, but just “viewed” or “downloaded”. The zenodo.modules.stats.tasks.update_record_statistics Celery task is responsible for this job. It determines which records’ statistics have been affected by Aggregations by checking the last two bookmarks created by each aggregation. Since the granularity of these bookmarks is daily, at most one to two days’ worth of affected records are sent for bulk indexing every time the task runs. The task is kicked off multiple times a day by Celery Beat.
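
The selection of affected records can be sketched as follows. This is a simplified model, not the actual task: affected_recids is a hypothetical function, bookmarks are represented as plain dates, and the aggregation documents are passed in as a list instead of being queried from Elasticsearch.

```python
from datetime import date, datetime


def affected_recids(daily_docs, bookmarks):
    """Return the recids whose daily documents fall in the bookmark window.

    `bookmarks` holds the bookmark dates of one aggregation, in any order.
    The window starts at the earlier of the last two bookmarks.
    """
    last_two = sorted(bookmarks)[-2:]
    if not last_two:
        return set()
    window_start = last_two[0]

    affected = set()
    for doc in daily_docs:
        day = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%S").date()
        if day >= window_start:
            affected.add(doc["recid"])
    return affected


docs = [
    {"recid": "1", "timestamp": "2018-01-01T00:00:00"},
    {"recid": "2", "timestamp": "2018-01-02T00:00:00"},
    {"recid": "3", "timestamp": "2018-01-03T00:00:00"},
]
to_reindex = affected_recids(
    docs, [date(2018, 1, 1), date(2018, 1, 2), date(2018, 1, 3)]
)
```

In this example only records 2 and 3 are selected, because the window starts at the second-to-last bookmark (2 January 2018). The returned recids would then be handed to bulk indexing.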