I've found Google Takeouts to silently remove old data
This is a crosspost from Reddit.
TLDR: keep and back up your old Google Takeout archives; it turns out the data is not persisted across them.
So I am working on a personal project for which I am collecting all the URLs I have ever visited. I update them via cron from multiple sources, in particular my latest Google Takeout archive, which I always store on my desktop. I’ve been improving the resilience of the project, making sure I don’t break URL extraction, so I wrote a script to diff the extracted URLs and check whether any of them disappear. What I found was that URLs from Takeout were in fact mysteriously disappearing.
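A check like this can be sketched as a set difference between two extraction runs. This is only an illustration, not the actual script: the function name `disappeared_urls` and the example URLs are hypothetical.

```python
def disappeared_urls(previous, current):
    """Return URLs that were in the previous extraction but are missing now."""
    return sorted(set(previous) - set(current))


# Hypothetical example data, not taken from the real archives.
old_run = ["https://example.com/a", "https://example.com/b"]
new_run = ["https://example.com/b", "https://example.com/c"]

print(disappeared_urls(old_run, new_run))  # ['https://example.com/a']
```

Anything this prints for a source that is supposed to be append-only is worth a closer look.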
After a bit of WTF and investigation, it turned out that Takeout data is not cumulative (at least for some sources) and seems to have some sort of retention period.
Here are some of my findings:
archive from 20181227
    grep time_usec BrowserHistory.json | sort | head -n 3
    "time_usec": 1513604826000563
    "time_usec": 1513606272469876
    "time_usec": 1513606362996796
oldest entry is 18 Dec 2017
    tidy -q -i Chrome/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
    Dec 18, 2017, 2:12:42 PM UTC
    Dec 18, 2017, 2:11:12 PM UTC
    Dec 18, 2017, 1:47:06 PM UTC
    tidy -q -i Search/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
    Jan 23, 2015, 7:56:09 PM UTC
    Jan 23, 2015, 7:42:42 PM UTC
    Jan 23, 2015, 7:42:41 PM UTC
archive from 20180623
    grep time_usec BrowserHistory.json | sort | head -n 3
    "time_usec": 1496659157550587
    "time_usec": 1496660371451340
    "time_usec": 1496661577902967
oldest entry is 05 Jun 2017
    tidy -q -i Chrome/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
    Jun 5, 2017, 2:43:58 PM
    Jun 5, 2017, 2:43:55 PM
    Jun 5, 2017, 2:40:28 PM
    tidy -q -i Search/MyActivity.html | grep -e 'PM' -e 'AM' | tail -n 3
    Aug 5, 2014, 6:19:32 PM
    Aug 5, 2014, 5:25:34 PM
    Aug 5, 2014, 5:25:32 PM
archive from 20170410
    grep time_usec BrowserHistory.json | sort | head -n 3
    "time_usec": 1465298229733388
    "time_usec": 1465298231949965
    "time_usec": 1465298248753114
oldest entry is 07 Jun 2016
That takeout doesn’t have any MyActivity.html files (Google added them later in 2017). Instead there is a Searches directory, which contains some JSON files going back to October 2010.
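The time_usec values above look like microseconds since the Unix epoch. That is an assumption, but it agrees with the dates in Chrome/MyActivity.html. A small sketch (the helper name `usec_to_utc` is made up) that decodes the oldest entry of each archive:

```python
from datetime import datetime, timezone


def usec_to_utc(time_usec):
    """Convert microseconds since the Unix epoch to an aware UTC datetime.

    Assumption: time_usec counts microseconds since 1970-01-01 UTC.
    """
    seconds, micros = divmod(time_usec, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


# Oldest time_usec value from each of the three archives above.
for oldest in (1513604826000563, 1496659157550587, 1465298229733388):
    print(usec_to_utc(oldest).date())
# 2017-12-18
# 2017-06-05
# 2016-06-07
```

The first value decodes to 13:47:06 UTC on 18 Dec 2017, which matches the last Chrome/MyActivity.html line of the 20181227 archive.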
In summary: it looks like BrowserHistory.json has a retention of about 1 year, and the same goes for Chrome/MyActivity.html. Search/MyActivity.html has a longer retention, just under 4 years going by the dates above. It’s a mess.
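To put rough numbers on this, one can subtract the date of the oldest entry from the archive date. The sketch below assumes the archive labels (e.g. 20181227) are the export dates and hard-codes the oldest dates from the findings above; `retention_days` is a hypothetical helper.

```python
from datetime import date, datetime


def retention_days(archive_label, oldest_entry):
    """Days between an archive's date (YYYYMMDD label) and its oldest entry."""
    archive_date = datetime.strptime(archive_label, "%Y%m%d").date()
    return (archive_date - oldest_entry).days


# (source, archive label, date of oldest entry), copied from the findings above.
observations = [
    ("BrowserHistory.json", "20181227", date(2017, 12, 18)),
    ("BrowserHistory.json", "20180623", date(2017, 6, 5)),
    ("BrowserHistory.json", "20170410", date(2016, 6, 7)),
    ("Search/MyActivity.html", "20181227", date(2015, 1, 23)),
    ("Search/MyActivity.html", "20180623", date(2014, 8, 5)),
]

for source, label, oldest in observations:
    days = retention_days(label, oldest)
    print(f"{source:24} {label}  {days:5d} days  (~{days / 365.25:.1f} years)")
```

By this measure BrowserHistory.json reaches back roughly 307 to 383 days, and Search/MyActivity.html 1418 and 1434 days, a bit under four years in both archives.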
Not sure if there is a similar issue with other Takeout data (e.g. YouTube watch history, shopping, location, etc.), so be careful if you rely on it!
I guess it was a gut feeling that made me paranoid enough to keep some of the older archives.
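Since older archives hold entries that newer ones have dropped, one option is to merge them rather than rely on the latest export. A minimal sketch, assuming each BrowserHistory.json has a top-level "Browser History" list whose entries carry time_usec and url fields (check this against your own files); `merge_histories`, `load_history` and the example entries are hypothetical.

```python
import json
from pathlib import Path


def load_history(path):
    """Parse one BrowserHistory.json file from an unpacked archive."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def merge_histories(histories):
    """Union the entries of several parsed BrowserHistory.json documents.

    Entries are deduplicated by (time_usec, url); the first occurrence wins.
    """
    merged = {}
    for history in histories:
        # Assumption: entries live under a top-level "Browser History" key.
        for entry in history.get("Browser History", []):
            key = (entry["time_usec"], entry.get("url"))
            merged.setdefault(key, entry)
    return sorted(merged.values(), key=lambda e: e["time_usec"])


# Tiny in-memory stand-ins for two archives (made-up URLs).
older = {"Browser History": [
    {"time_usec": 1465298229733388, "url": "https://example.com/old"},
    {"time_usec": 1513604826000563, "url": "https://example.com/shared"},
]}
newer = {"Browser History": [
    {"time_usec": 1513604826000563, "url": "https://example.com/shared"},
    {"time_usec": 1513606272469876, "url": "https://example.com/new"},
]}

merged = merge_histories([older, newer])
print(len(merged))  # 3
```

With real files, the list passed to `merge_histories` would come from calling `load_history` on each archive's BrowserHistory.json.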
Also, I haven’t found anything about this retention anywhere in the Google Takeout FAQ. Does anyone know about it? Is it some sort of legal requirement, a bug, or something else?