Rewrite hit counting to work with nginx caching

Description

Background

We're going to start caching works for logged out users, which means whenever we serve up a cached work, we'll be skipping the code that triggers an increase in the work's hit count. That's not good!

To get around this, we will use JS to count hits, whether or not the page is cached, and whether or not the user is logged in. And since we are already here, we might as well rewrite how we count hits, to address AO3-4751.

How hits should work:

You go to any chapter of a work. One hit is added. From there, you navigate to a different chapter of the work. No hits are added. You close the browser. The next day, you come back to the work. One hit is added. You move through the work and again no hits are added.

Implementation

  • Add JS on individual work/chapter pages to make an AJAX request to a new endpoint for logging hits. This JS code will be the same for logged out and logged in users. Hit counts will no longer be incremented as part of the work/chapter rendering.

  • Log hits into Redis, then have background tasks periodically move new hit data into the database (the stat_counters table). This approach for database updates hasn't changed.

  • De-duplicate hits on the same work (regardless of chapters), from the same visitor, within the same period of time (e.g. 24h). When logging a hit, save to Redis: the work ID, the visitor IP, and the time.

  • Use Redis sets (tentative plan, see also the actual implementation which combines some of these sets).

    • Sets (A) with (work ID + timestamp) key, containing IPs of all visitors on that work within the time period identified by the timestamp.

    • Sets (B) with (work ID) key, containing the names of all the existing (A) sets. We use them to clean up outdated (A) sets.

    • Counters (C) with (work ID) key, each holding the number of new IPs added since the last time we moved hits to the database. When moving hits to the database, transfer only this count of unique IPs. The actual IPs remain ephemeral in Redis.

    • A single set (D) of all work IDs with changed hits.

  • For the timestamp in the key of (A), use a configurable value that stays the same for 24h. For example, if we use "the most recent 3am", 3am is the fixed rollover time. After 3am, the same visitor on the same work can generate a new hit.

  • Move hit counts to the database more frequently than the de-duplicating time period (24h). Whenever a new, unseen IP is added to (A) by SADD, we increment the counter in (C) by SADD's return value. This way users don't have to wait 24h before seeing hits change.

    • If (A) is a new set, we need to add it to (B). The work ID should also be added to (D). We should use Redis transactions to make sure all writes succeed.

  • Have the background task (run more frequently than 24h) do 2 things:

    • Get works with changed hits from (D), add their counts from (C) to the database. Updating StatCounter models will automatically trigger reindexing on works. Empty (D) and reset (C) to 0.

    • Iterate through (B) and delete outdated (A) sets (e.g. more than 24h old).
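
The plan above can be sketched roughly as follows. This is an in-memory model only: plain Python dicts and sets stand in for the Redis structures (A) to (D), and every name here (HitStore, log_hit, flush, rollover_stamp, ROLLOVER_HOUR) is hypothetical and not taken from the actual implementation. It assumes a 3am UTC rollover and treats any (A) set from an earlier rollover period as outdated.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

ROLLOVER_HOUR = 3  # assumption: configurable rollover hour, interpreted in UTC


def rollover_stamp(now: datetime) -> str:
    """Return the most recent rollover time at or before `now` as a key fragment."""
    boundary = now.replace(hour=ROLLOVER_HOUR, minute=0, second=0, microsecond=0)
    if now < boundary:
        boundary -= timedelta(days=1)
    return boundary.strftime("%Y%m%d%H")


class HitStore:
    """In-memory stand-in for the Redis structures (A) to (D)."""

    def __init__(self):
        self.visitors = defaultdict(set)  # (A): (work ID, stamp) -> visitor IPs
        self.periods = defaultdict(set)   # (B): work ID -> stamps of existing (A) sets
        self.pending = defaultdict(int)   # (C): work ID -> hits not yet in the database
        self.changed = set()              # (D): work IDs with changed hits

    def log_hit(self, work_id: int, ip: str, now: datetime) -> int:
        """Record a visit; return 1 if it counts as a new hit, else 0."""
        stamp = rollover_stamp(now)
        key = (work_id, stamp)
        added = 0 if ip in self.visitors[key] else 1  # mirrors SADD's return value
        self.visitors[key].add(ip)
        self.periods[work_id].add(stamp)
        if added:
            self.pending[work_id] += added
            self.changed.add(work_id)
        return added

    def flush(self, database: dict, now: datetime) -> None:
        """Background task: move pending counts to `database`, drop outdated (A) sets."""
        for work_id in self.changed:
            database[work_id] = database.get(work_id, 0) + self.pending.pop(work_id, 0)
        self.changed.clear()

        current = rollover_stamp(now)
        for work_id in list(self.periods):
            for stamp in list(self.periods[work_id]):
                if stamp != current:
                    self.visitors.pop((work_id, stamp), None)
                    self.periods[work_id].discard(stamp)
            if not self.periods[work_id]:
                del self.periods[work_id]
```

In this model, two visits from the same IP within one rollover period add a single hit, and a visit after the next 3am boundary adds another. A real implementation would issue the equivalent Redis commands inside a transaction, as noted above.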

Notes

Things we can remove

  • The unused LogfileReader module for parsing nginx logs, and the module's tests.

  • The last_visitor column on the stat_counters table.

  • The last_visitor_old and hit_count_old columns on the works table.

Cons of JS-based hit counters

Counting hits from nginx (alternative)

We considered looking at the nginx logs to determine how many hits to add to the work. We would increase the hit count by one every time we serve the actual work (not the adult content warning) to a particular remote address, provided the previous hit was not also from that address.
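
For illustration, the counting rule described above might look like the sketch below. It assumes the log has already been reduced to (work ID, remote address) pairs in the order the work pages were served; the function name count_hits and that input shape are hypothetical, and no real nginx log format is parsed here.

```python
def count_hits(requests):
    """Count hits per work from (work_id, remote_addr) pairs in served order.

    A request adds a hit unless the previous hit on the same work
    came from the same remote address.
    """
    hits = {}
    last_visitor = {}  # work_id -> address of the most recent counted hit
    for work_id, remote_addr in requests:
        if last_visitor.get(work_id) != remote_addr:
            hits[work_id] = hits.get(work_id, 0) + 1
            last_visitor[work_id] = remote_addr
    return hits
```

Under this rule only the most recent visitor is remembered, so two addresses alternating on the same work add a hit on every switch. The per-period de-duplication in the Implementation section would count each of them once.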

Reference: Keeping Hitcounts accurate when using an NGinx Caching Proxy.

We decided against this because making our application code and our nginx setup highly dependent on each other will be difficult to maintain in the long term, especially as the nginx configuration is not currently tracked in the public repository.

Testing

1. Make sure that the owner's hits don't count:

  • Find a work that you created, but haven't visited recently.

  • View it multiple times while logged in as the creator. Note the hit count in the header.

  • Wait at least 30 minutes for the stats to update.

  • View the work and verify that the hit count hasn't increased.

2. Make sure that logged-in hits count:

  • Find a work that you haven't visited recently.

  • Log in as a user unrelated to the work.

  • View the work multiple times. Note the hit count in the header.

  • Wait at least 30 minutes for the stats to update.

  • View the work and verify the hit count has increased by 1.

3. Make sure that logged-out hits count:

  • Find a work that you haven't visited recently.

  • View it multiple times while logged out. Note the hit count in the header.

  • Wait at least 30 minutes for the stats to update.

  • View the work and verify the hit count has increased by 1.

4. Make sure that cached hits count:

  • Find a work that you haven't visited recently.

  • Get two devices with different IP addresses ready (or one device with a proxy).

  • While logged out, view the work on the first device (to generate the cache). Note the hit count in the header.

  • While logged out, view the work on the second device.

  • Wait at least 30 minutes for the stats to update.

  • View the work again and verify that the hit count has increased by 2.

5. Make sure that interleaving multiple views only counts twice:

  • Find a work that you haven't visited recently.

  • Get two devices with different IP addresses ready (or one device with a proxy).

  • Make sure you're logged out (or logged in as a user other than the owner).

  • View the work on the first device.

  • View the work on the second device.

  • View the work on the first device again.

  • Wait at least 30 minutes for the stats to update.

  • View the work again and verify that the hit count has increased by 2.

6. Make sure that the rollover happens properly:

  • Find a work that you haven't visited recently.

  • Make sure you're logged out (or logged in as a user other than the owner).

  • View the work slightly before 3 AM UTC. Note the hit count in the header.

  • View the work again slightly after 3 AM UTC, with the same IP address.

  • Wait at least 30 minutes for the stats to update.

  • View the work again and verify that the hit count has increased by 2.

7. Make sure that viewing later chapters increments the hit count:

  • Find a multichapter work that you haven't visited recently.

  • Make sure you're logged out (or logged in as a user other than the owner).

  • Copy the work URL and add /navigate to the end of the path to view the chapter index.

  • Click on one of the later chapters. Note the hit count in the header.

  • Wait at least 30 minutes for the stats to update.

  • View the chapter again and verify that the hit count has increased by 1.

8. Re-test

Post-deploy

Staging

  • Remove cron jobs that run rake statistics tasks.

  • Run bundle exec rake statistics:update_stat_counters one last time using an old version of the application code.

  • Run bundle exec rake After:remove_old_redis_hit_count_data to clean up all traces of old hit counting in Redis.

  • Run all 3 migrations at leisure.

Beta

  • Remove cron jobs that run rake statistics tasks.

  • Run bundle exec rake statistics:update_stat_counters one last time using an old version of the application code.

  • Run bundle exec rake After:remove_old_redis_hit_count_data to clean up all traces of old hit counting in Redis.

  • Run all 3 migrations at leisure.

Activity

teyla
April 19, 2020, 9:23 PM

1: https://test.archiveofourown.org/works/9437 - visited my own work while logged in. Hits: 38. On check: 38

2: https://test.archiveofourown.org/works/1051437/chapters/2103852 - visited someone else’s work while logged in. Hits: 18. On check: 19

3: https://test.archiveofourown.org/works/1050060 - visited a work while logged out. Hits: 34. On check: 35

4: https://test.archiveofourown.org/works/1025857 - while logged out, visited a work normally, then with VPN. Hits: 23. On check: 25

5: https://test.archiveofourown.org/works/1026667 - while logged out, visited a work normally, then with VPN, then again normally. Hits: 27. On check: 29

6: Wrong time to test.

7: https://test.archiveofourown.org/works/191488/chapters/282055?view_adult=true - while logged out, visited a late chapter via chapter index. Hits: 868. On check: 869

8: https://test.archiveofourown.org/collections/testcoll/works/1071143 - created a work, posted it to an unrevealed collection, and created a hit using VPN. Hits: 0. On check: 0

Sarken
April 20, 2020, 3:37 AM

6. Viewed https://test.archiveofourown.org/works/673979 at 02:44 UTC, logged in. It had 112 hits. I went to it again at 03:03 UTC and it had 113 hits. Went back to the work at 03:35 and it had 114 hits. Seems okay!

Sammie Louise
April 20, 2020, 4:01 AM

6. Also tested this by visiting http://test.archiveofourown.org/works/1058482 at about 02:46 UTC whilst logged out. It had 9 hits. Went back again a few minutes after 03:00 UTC and it still had 9 hits. Half an hour later, the work blurb was showing 11 hits and by 04:00 UTC the work listing was also showing 11 hits. Looks good.

james_
April 21, 2020, 7:56 PM

redsummernight
April 22, 2020, 9:11 PM

Notes from james_, during the production deploy:

Migration (admin_settings):

Migration (stat_counters):

Migration (works):

Assignee

ticking instant

Reporter

Sarken

Roadmap

Visitors
Works

Priority

High

Affects versions

Fix versions

Components

BackEnd

Difficulty

Hard

Milestone

Internal 0.9