I've found Google Takeouts to silently remove old data

Crosspost from /r/DataHoarder

This is a crosspost from Reddit.

TLDR: keep (and back up) your old Google Takeout archives; it turns out older data is not carried over into newer ones.

So I am working on a personal project for which I am collecting all the URLs I have ever visited. I update them via cron from multiple sources, in particular my latest Google Takeout archive, which I always store on my desktop. I’ve been improving the resilience of the project, making sure I don’t break URL extraction, so I wrote a script to diff the extracted URLs and check whether any of them disappear. What I found was that URLs from Takeout were in fact mysteriously disappearing.
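A check like this can be very small. Here is a minimal sketch in Python, not the actual script: the function name `disappeared_urls` and the example.com URLs are made up for illustration, and each extraction run is assumed to be available as a plain list of URL strings.

```python
def disappeared_urls(previous, current):
    """Return URLs that were in the previous extraction but are missing now."""
    return sorted(set(previous) - set(current))


# Hypothetical extraction results from two consecutive runs.
old_run = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
new_run = ["https://example.com/b", "https://example.com/c", "https://example.com/d"]

for url in disappeared_urls(old_run, new_run):
    print("disappeared:", url)
```

New URLs showing up are expected, so only the "previous minus current" direction is reported.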

After a bit of WTF and investigation, it turned out that Takeout data is not cumulative (at least for some sources) and seems to have some sort of retention period.

Here are some of my findings:

  • archive from 20181227

      grep time_usec BrowserHistory.json | sort | head -n 3
                "time_usec": 1513604826000563
                "time_usec": 1513606272469876
                "time_usec": 1513606362996796
      oldest entry is 18 Dec 2017
    
      tidy -q -i Chrome/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
                Dec 18, 2017, 2:12:42 PM UTC
                Dec 18, 2017, 2:11:12 PM UTC
                Dec 18, 2017, 1:47:06 PM UTC
    
      tidy -q -i Search/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
                Jan 23, 2015, 7:56:09 PM UTC
                Jan 23, 2015, 7:42:42 PM UTC
                Jan 23, 2015, 7:42:41 PM UTC
  • archive from 20180623

      grep time_usec BrowserHistory.json | sort | head -n 3
                "time_usec": 1496659157550587
                "time_usec": 1496660371451340
                "time_usec": 1496661577902967
      oldest entry is 05 Jun 2017
    
      tidy -q -i Chrome/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3 
                Jun 5, 2017, 2:43:58 PM
                Jun 5, 2017, 2:43:55 PM
                Jun 5, 2017, 2:40:28 PM
    
      tidy -q -i Search/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
                Aug 5, 2014, 6:19:32 PM
                Aug 5, 2014, 5:25:34 PM
                Aug 5, 2014, 5:25:32 PM
  • archive from 20170410

      grep time_usec BrowserHistory.json | sort | head -n 3
                "time_usec": 1465298229733388
                "time_usec": 1465298231949965
                "time_usec": 1465298248753114
      oldest entry is 07 Jun 2016

    This takeout doesn’t have any MyActivity.html (Google added it later in 2017). There is a Searches directory, which contains some JSON files going back to October 2010.
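The "oldest entry" dates above come from converting `time_usec` (microseconds since the Unix epoch) by hand. A rough Python equivalent of the `grep | sort | head` step, as a sketch: it scans the text for `"time_usec"` values the same way grep does rather than assuming anything about the surrounding JSON layout, and the helper names are made up.

```python
import re
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TIME_USEC = re.compile(r'"time_usec":\s*(\d+)')


def usec_to_utc(time_usec):
    """Convert microseconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(microseconds=time_usec)


def oldest_entry(text):
    """Return the earliest time_usec found in the text as a datetime, or None."""
    stamps = [int(m) for m in TIME_USEC.findall(text)]
    return usec_to_utc(min(stamps)) if stamps else None


# The three values from the 20181227 archive above, deliberately out of order.
sample = '''
    "time_usec": 1513606272469876
    "time_usec": 1513604826000563
    "time_usec": 1513606362996796
'''
print(oldest_entry(sample))  # 2017-12-18 13:47:06.000563+00:00
```

For a real archive you would pass in the contents of BrowserHistory.json instead of `sample`.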

In summary: it looks like BrowserHistory.json has a retention of about 1 year, and the same goes for Chrome/MyActivity.html. Search/MyActivity.html goes back further, between 3 and 4 years (just under 4 in both archives above). It’s a mess.
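The spans behind these estimates can be recomputed from the dates listed in the findings. A small sketch, assuming the archive label (e.g. 20181227) is the export date in YYYYMMDD form; `span_days` is a made-up helper name.

```python
from datetime import date, datetime

# Dates copied from the findings above: (archive label, source, oldest entry).
FINDINGS = [
    ("20181227", "BrowserHistory.json", date(2017, 12, 18)),
    ("20181227", "Search/MyActivity.html", date(2015, 1, 23)),
    ("20180623", "BrowserHistory.json", date(2017, 6, 5)),
    ("20180623", "Search/MyActivity.html", date(2014, 8, 5)),
    ("20170410", "BrowserHistory.json", date(2016, 6, 7)),
]


def span_days(archive_label, oldest):
    """Days between the export date and the oldest entry in that export."""
    # Assumption: the archive label is the export date, formatted YYYYMMDD.
    exported = datetime.strptime(archive_label, "%Y%m%d").date()
    return (exported - oldest).days


for label, source, oldest in FINDINGS:
    days = span_days(label, oldest)
    print(f"{label}  {source:<24} {days:>5} days  (~{days / 365.25:.1f} years)")
```

This only measures how far back each archive reaches; it says nothing about why the cutoff is where it is.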

Not sure if there is a similar issue with other Takeout data (e.g. YouTube watch history, shopping, location, etc.), so be careful if you rely on it!

I guess it was a gut feeling: I was paranoid about exactly this, so I kept some of the older archives.
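With several archives kept around, the history can be stitched back together. A minimal sketch, not a tested tool: it assumes each archive has already been parsed into a list of dicts, that every entry has a `time_usec` key (as in the grep output above), and that it has a `url` key (an assumption about the entry layout). The example.com URLs are placeholders.

```python
def merge_entries(*archives):
    """Union entries from several archives, de-duplicated by (time_usec, url)."""
    merged = {}
    for entries in archives:
        for entry in entries:
            # Assumption: time_usec plus url identifies a visit uniquely.
            key = (entry["time_usec"], entry.get("url"))
            merged.setdefault(key, entry)
    return sorted(merged.values(), key=lambda e: e["time_usec"])


# Hypothetical parsed entries; the overlapping visit appears in both archives.
older = [
    {"time_usec": 1496659157550587, "url": "https://example.com/a"},
    {"time_usec": 1513604826000563, "url": "https://example.com/b"},
]
newer = [
    {"time_usec": 1513604826000563, "url": "https://example.com/b"},
    {"time_usec": 1513606272469876, "url": "https://example.com/c"},
]

history = merge_entries(older, newer)
print(len(history), "entries after merging")
```

Keeping the first copy of a duplicate is an arbitrary choice here; if newer archives carry corrected metadata, you may prefer the last one instead.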

Also, I haven’t found anything about this retention in the Google Takeout FAQ. Does anyone know about it? Is it some sort of legal requirement, a bug, or something else?


Discussion: