cannon

⭐ Cannon

Cannon is an idea for a project attempting to compute canonical/normalised URLs and extract some information from them ('entities'), merely by looking at the URL, and ideally without using the "rel=canonical" metadata.
I describe the problem it tries to solve here: "urls are broken", also see "motivation".

At the moment it's a subproject of promnesia: see cannon.py
and tests/cannon.py.

If anyone knows of similar efforts/prior art, please let me know! I'd really like to avoid reinventing the wheel here.


related cannon

promnesia as a primary application (for me) cannonpromnesia

. cannonlinkrot

[A] * motivation cannon

Once you are sold on the motivation in this section and are wondering why this would require a separate library/database, check out the "testcases" section.

[A] [2020-04-04] I want urls that represent information, regardless of the way it's presented cannon

let alone all the tracking/etc crap

[B] [2020-05-23] "document equivalence" is a good term: How to establish (or avoid) document equivalence in the Hypothesis system : Hypothesis cannon

[B] why not use "rel=canonical" metadata field? cannon

[B] [2019-10-11] mobile versions of sites sometimes have different "canonical", e.g. mobile.twitter.com cannon

No one would dispute that a tweet is the same regardless of where it's presented, yet there is no easy way to unify this

[C] [2020-05-28] archive.org is messing with canonical cannon

[D] [2019-11-02] e.g. this link doesn't have 'canonical' even though it's a mirror: https://solar.lowtechmagazine.com/2016/11/the-curse-of-the-modern-office.html cannon

[D] [2019-11-08] no canonical on gist https://gist.github.com/dneto/2258454 cannon

same as https://gist.github.com/2258454 – hmm, this thing redirects now..

[B] [2019-08-19] parent and sibling relations can be determined from the URL cannonpromnesia

e.g. subreddit-post/user-comment/user-tweet, etc.

[B] [2019-11-01] if the original page is gone I can still easily link my saved annotations (Instapaper/Pocket/Hypothesis) to archived page cannon

[B] [2019-09-07] urls are a good candidate to determine 'entities' because they are at least somewhat curated cannon

[C] [2019-02-24] normalization is tricky.. for some urls, stuff after # is important https://en.wikipedia.org/wiki/Tendon#cite_note-14 . for some, it's utter garbage cannon

however we can sort of get away with normalizing on server only?

DONE [C] [2019-08-07] The Problem With URLs https://blog.codinghorror.com/the-problem-with-urls/ cannon

  • [2019-08-27] not very insightful, example of msdn with weird characters in urls

[C] [2020-01-02] motivation: siloing: instapaper 'imports' pages and assigns an id: https://www.instapaper.com/read/1265139707 cannon

so you can't connect your annotations on instapaper to notes etc

[C] [2021-03-07] could normalize historic URLs which are already down? cannonlinkrot

perhaps not super useful if we can't access them, but still

[A] * projects that could benefit from it cannon

Apart from Promnesia, I believe it could be quite useful for other projects.

STRT [B] [2019-06-27] Hmm could be helpful for hypothesis? cannonhypothesis

  • [2020-04-29] write about it? the future?

NEXT [B] [2021-01-16] discuss cannon (maybe on Slack)? cannonhypothesis

[C] [2019-05-24] Annotation of content on sites like Facebook or Twitter? - Google Groups cannonhypothesis

kinda related since they basically want canonical urls

TODO [B] [2021-01-30] Ignore URL parameters - Feature Requests - Memex Community cannonworldbrain

TODO [C] [2021-01-22] wonder if we could cooperate? cannonagora

TODO [C] [2021-01-24] would be useful to use the same normalising engine for #archivebox for example? cannonwebarchive

  • [2021-03-10] although I guess it needs to fetch the page anyway so "rel=canonical" works well enough

TODO [C] [2021-02-07] could be useful for surfingkey/nyxt browser to hint 'interesting' urls? cannon

STRT [C] [2019-12-26] archive.org cannonlinkrot

e.g. if the link is not present in archive.org, it doesn't mean it's not archived under a different canonical

TODO [C] if it's implemented as a helper extension/library, it could be useful for many other extensions cannon

CREATED: [2020-11-17]

e.g. blockers, various highlighters, hypothesis, etc

TODO [D] [2020-11-20] could reuse URL underlying etc with ampie? cannonampie

[A] * prior art cannon

URL normalization algorithm should be shared with other projects to the maximum extent possible.
If not the exact algorithm, at least the 'curated' parts of it like regexes, testcases, etc should be shared.
It's crap, boring work that should only be done once (e.g. like the timezone database).

TODO [A] [2020-06-30] ClearURLs / Addon: looks super super promising cannon

Once ClearURLs has cleaned the address, it will look like this: https://www.amazon.com/dp/exampleProduct

[2021-03-10] https://github.com/ClearURLs/Addon/wiki/Rules: Not super convinced JSON would work well in general, but anyway it's already pretty good. cannon

TODO [B] [2019-07-09] h/uri.py at 0fc8a0d345741d43b4f80856a7cbb8f5afa70f80 · hypothesis/h https://github.com/hypothesis/h/blob/0fc8a0d345741d43b4f80856a7cbb8f5afa70f80/h/util/uri.py cannonhypothesis

[2019-07-09] excluded query params! cannonhypothesis

[2019-07-09] right, I could probably reuse hypothesis's canonify and contribute back. looks very similar to mine cannonhypothesis

TODO [B] [2020-05-12] coleifer/micawber: a small library for extracting rich content from urls cannon

[2021-03-10] ok, pretty interesting. it probably uses network, but could at least use it for testing (or maybe even 'enriching'?) cannon

TODO [C] [2019-03-27] sindresorhus/compare-urls: Compare URLs by first normalizing them cannon

compareUrls('HTTP://sindresorhus.com/?b=b&a=a', 'sindresorhus.com/?a=a&b=b');
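A rough Python analogue of the comparison above, just to show how little is needed for the generic part. This is only a sketch: `rough_normalise` and `same_url` are made-up names, and the rules (lowercase host, drop the protocol, sort the query, strip the trailing slash) are my assumptions, not what compare-urls actually does internally.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def rough_normalise(url: str) -> str:
    # Assumption: input without a scheme is a plain host/path, so add a dummy one
    if '://' not in url:
        url = 'http://' + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip('/')
    # Sort query parameters so their order doesn't matter
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    result = host + path  # the protocol is dropped on purpose
    if query:
        result += '?' + query
    return result


def same_url(a: str, b: str) -> bool:
    return rough_normalise(a) == rough_normalise(b)


print(same_url('HTTP://sindresorhus.com/?b=b&a=a', 'sindresorhus.com/?a=a&b=b'))
```

Site-specific knowledge (amp, mobile subdomains, etc.) is exactly what this doesn't cover.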

[C] [2019-12-25] sindresorhus/normalize-url cannon

stripWWW can't handle amp etc

TODO [C] [2019-07-09] hypothesis: h/normalize_uris_test.py cannon

[B] * ideas cannon

[B] [2021-03-07] maybe we can achieve 95% accuracy with generic rules and by handling the most popular websites cannon

for the rest

  • allow user to customize
  • allow user to submit normalization errors (where?)

TODO [B] if 'children' relations can't be determined by substring matching, perhaps cannon should generate 'virtual' urls? cannonpromnesia

CREATED: [2019-10-13]

TODO [B] a special service to resolve siloed links like t.co ? cannonlinkrot

CREATED: [2020-04-29]

Could also be useful for Archive.org/archivebox/etc. But a bit out of scope for this project..

STRT [B] just specify admissible regexes for urls so it's easier to unify? cannon

CREATED: [2019-11-08]

e.g. twitter.com/user/status/statusid

maybe normalise to this?
twitter.com/i/web/status/1053151870791835649

reddit.com/comments/5ombk8 – huh, normalise to this?
TODO m.reddit/old.reddit

en.m.wikipedia/ru.m.wikipedia
maybe strip off the subdomain completely?

youtube.com/watch?v=xAy—wpDQ&list=PL0kyDgrqAiUEF5d7krLIds1ebhTxCjm&shuffle=221
youtube.com/watch?v=Woa3MPijE3s&list=PL0kyDgrqAiXKspaa1GIS0jbbLrsAa3sk&spfreload=10
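A minimal sketch of the 'admissible regexes' idea for the twitter/reddit/wikipedia examples above. The patterns, the `PATTERNS` table and the `unify` name are all hypothetical and only cover these few cases; for wikipedia I keep the language subdomain and only drop `m.`, which is an assumption (the question of stripping the subdomain completely is still open).

```python
import re
from typing import Optional

# Hypothetical (pattern, template) pairs; first match wins
PATTERNS = [
    (re.compile(r'^(?:mobile\.|www\.)?twitter\.com/(?:\w+|i/web)/status/(?P<id>\d+)'),
     'twitter.com/i/web/status/{id}'),
    (re.compile(r'^(?:www\.|old\.|m\.)?reddit\.com/(?:r/\w+/)?comments/(?P<id>\w+)'),
     'reddit.com/comments/{id}'),
    (re.compile(r'^(?P<lang>[a-z]+)\.(?:m\.)?wikipedia\.org/wiki/(?P<page>[^?#]+)'),
     '{lang}.wikipedia.org/wiki/{page}'),
]


def unify(url: str) -> Optional[str]:
    url = re.sub(r'^https?://', '', url)  # drop the protocol
    for rx, template in PATTERNS:
        m = rx.match(url)
        if m is not None:
            return template.format(**m.groupdict())
    return None  # no admissible pattern, leave it to generic rules


print(unify('https://mobile.twitter.com/someuser/status/1053151870791835649'))
print(unify('https://en.m.wikipedia.org/wiki/Tendon'))
```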

[2019-11-09] also this to summarize cannon

sqlite3 promnesia.sqlite 'select domain, count(domain) from (select substr(normurl, 0, instr(normurl, "/")) as domain from visits) group by domain order by count(domain)'

STRT [B] rethinking the whole approach… cannon

CREATED: [2020-11-15]

consider https://www.youtube.com/watch?v=wHrCkyoe72U&list=WL
basically

  • cut off protocol merely for simplicity? I guess it makes everything much easier
  • the result is always 'composed of' inputs. e.g. maps to youtube/wHrCkyoe72U, both parts are in the original link
    might not be the case if domain names are remapped though.. e.g. youtu.be
  • sort query parts alphabetically
    (although might make sense to make it hierarchy aware?)
  • treat parts & query the same way, parts are query with None keys
  • to handle domain names better, replace dots before first / with / e.g. www.youtube.com -> www/youtube/com
    then can treat the same way as subpaths
    i.e. we get
    None www | drop
    None youtube | keep
    None com | drop
    None watch | drop
    list WL | keep? – actually this could be considered a 'tag'? unclear
    v wHrCkyoe72U | keep

ok so how do we generalize from two examples?
e.g. say we also have
youtube.ru/watch?v=abacaba -> youtube/abacaba
we get
youtube | keep
ru | drop
watch | drop
v abacaba | keep
I suppose it could guess that if we keep a query parameter once, we'll keep it always?
and if we extracted a certain substring without a query parameter, we'll also always keep it as is?
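The decomposition above can be sketched roughly like this. `decompose`, `compose` and the two keep-sets are hypothetical names; the keep-sets are hardcoded from the two youtube examples rather than learned, so this only illustrates the data shape, not the generalisation step.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical keep-lists, filled in by hand from the two examples above
KEEP_KEYS = {'v'}          # query parameters we decided to keep
KEEP_PARTS = {'youtube'}   # domain/path parts we decided to keep


def decompose(url):
    """Split a URL into (key, value) items; domain and path parts get a None key."""
    if '://' in url:
        url = url.split('://', 1)[1]  # cut off the protocol
    parts = urlsplit('//' + url)
    items = [(None, p) for p in parts.netloc.split('.')]
    items += [(None, p) for p in parts.path.split('/') if p]
    items += sorted(parse_qsl(parts.query))  # query parts, alphabetically
    return items


def compose(items):
    kept = []
    for key, value in items:
        keep = (value in KEEP_PARTS) if key is None else (key in KEEP_KEYS)
        if keep:
            kept.append(value)
    return '/'.join(kept)


print(decompose('https://www.youtube.com/watch?v=wHrCkyoe72U&list=WL'))
print(compose(decompose('youtube.ru/watch?v=abacaba')))
```

The result is always composed of pieces of the input, which is the property wanted above; remapped domains like youtu.be would need an extra step.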

TODO how about this?
https://news.ycombinator.com/reply?id=25100810&goto=item%3Fid%3D25099862%2325100810
it's a reply to https://news.ycombinator.com/item?id=25100035
which is a comment to https://news.ycombinator.com/item?id=25099862
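At least the thread can be recovered from the reply link itself, since the `goto` parameter is just a percent-encoded relative URL. A small sketch (the `hn_reply_target` name is made up; the intermediate parent comment is not encoded in the link, so this doesn't try to recover it):

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs


def hn_reply_target(url: str) -> Optional[str]:
    # parse_qs already percent-decodes the values
    query = parse_qs(urlsplit(url).query)
    goto = query.get('goto')
    if not goto:
        return None
    return 'https://news.ycombinator.com/' + goto[0]


print(hn_reply_target(
    'https://news.ycombinator.com/reply?id=25100810&goto=item%3Fid%3D25099862%2325100810'
))
```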

TODO [C] use shared JS/python tests for canonifying? cannonffipromnesia

CREATED: [2020-11-12]

TODO [C] [2019-09-03] should be idempotent? cannon

TODO [C] hmm, maybe the extension can learn normalisation rules over time? by looking at canonical and refining the rules? cannon

CREATED: [2020-12-20]

TODO [C] sample random links and their canonicals for testing cannon

CREATED: [2020-12-20]

TODO [C] background thing that sucks in canonical urls and provides data for testing? cannonpromnesia

CREATED: [2020-05-12]

TODO [C] how do we prune links that are potentially not secure to store? like certain URL parameters cannon

CREATED: [2020-05-20]

TODO [D] need checks that urls don't contain stupid shit like trailing colons etc cannon

CREATED: [2019-02-24]

TODO [C] hmm could use this api for checking normalization? cannon

CREATED: [2021-03-21]
http get 'http://archive.org/wayback/available?url=https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories'
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "status": "200",
            "timestamp": "20210219235548",
            "url": "http://web.archive.org/web/20210219235548/https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories"
        }
    },
    "url": "https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories"
}
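A sketch of how a response like the one above could be used in a test: pull out the closest snapshot and recover the original URL from it. It only parses JSON that was already fetched (no network here); the helper names are hypothetical, and the `/web/<timestamp>/<original>` layout is assumed from the sample response.

```python
import json
from typing import Optional


def closest_snapshot(response_text: str) -> Optional[str]:
    data = json.loads(response_text)
    closest = data.get('archived_snapshots', {}).get('closest')
    if closest is None or not closest.get('available'):
        return None
    return closest['url']


def original_from_snapshot(snapshot_url: str) -> Optional[str]:
    # Assumed layout: http://web.archive.org/web/<timestamp>/<original url>
    marker = '/web/'
    idx = snapshot_url.find(marker)
    if idx == -1:
        return None
    rest = snapshot_url[idx + len(marker):]
    _timestamp, _, original = rest.partition('/')
    return original or None


sample = '''{"archived_snapshots": {"closest": {"available": true, "status": "200",
"timestamp": "20210219235548",
"url": "http://web.archive.org/web/20210219235548/https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories"}},
"url": "https://stackoverflow.com/questions/1425892/how-do-you-merge-two-git-repositories"}'''
print(original_from_snapshot(closest_snapshot(sample)))
```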

[C] * testcases cannon

Some tricky cases which would be nice to get right

[B] [2020-11-15] Wendover Productions - YouTube cannon

[B] [2020-04-19] roam links cannon

STRT [B] [2019-06-23] A Brief Intro to Topological Quantum Field Theories. - YouTube https://www.youtube.com/watch?v=59uLGIrkMxM&list=WL&index=61&t=0s cannon

eh, rules might be a bit complicated. E.g. if both v and list are present, we wanna ditch list, otherwise keep list
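The v/list rule could look something like this. Just a sketch: `youtube_key` and the output format are assumptions, not what cannon.py currently does.

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs


def youtube_key(url: str) -> Optional[str]:
    query = parse_qs(urlsplit(url).query)
    if 'v' in query:
        # a video wins over the playlist it happened to be opened from
        return 'youtube.com/watch?v=' + query['v'][0]
    if 'list' in query:
        # no video, so the playlist itself is the entity
        return 'youtube.com/playlist?list=' + query['list'][0]
    return None


print(youtube_key('https://www.youtube.com/watch?v=59uLGIrkMxM&list=WL&index=61&t=0s'))
```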

TODO [B] [2020-11-16] normalise DOI cannon

Ah sure: This DOI: https://doi.org/10.1073/pnas.1211902109  should lead to this paper: https://pnas.org/content/109/48/E3324 .

TODO [C] m.wikipedia normalisation could also be useful for hypothesis? cannonhypothesis

CREATED: [2019-07-23]

[2019-07-23] X.m.wikipedia.org cannonhypothesis

[2019-07-23] mm, it's got canonical though.. cannonhypothesis

TODO [2019-07-23] perhaps promnesia should respond both to canonical and its own idea of normalised (preferring canonical) cannonhypothesis

STRT [C] [2019-04-20] fragments: Aharonov-Bohm Experiment https://physicstravelguide.com/experiments/aharonov-bohm#tab__concrete cannon

url normalising… this is an example where fragments are important

[2019-08-26] here I guess it could yield url with hash + parent url? cannon

TODO [2019-08-26] always assume that parents in uri hierarchy are actual parents? I guess that's fairly reasonable cannon
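A tiny sketch of the 'url with hash + parent url' idea, assuming the fragment-less URL is always the parent (`with_parent` is a made-up name):

```python
from urllib.parse import urldefrag


def with_parent(url: str) -> list:
    parent, fragment = urldefrag(url)
    # yield the full url first, then its parent (if the fragment is non-empty)
    return [url, parent] if fragment else [url]


print(with_parent('https://physicstravelguide.com/experiments/aharonov-bohm#tab__concrete'))
```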

[C] [2019-08-25] stuff like this: youtu.be/1TKSfAkWWN0 cannon

[2019-08-25] this is also motivation for canonifying. this is a redirect link in tweet, and there is no way to associate it with canonical cannon

[C] [2020-05-02] https://hubs.mozilla.com/#/ cannon

[C] [2020-04-30] Writing well | defmacro cannon

support for archive.org and test on this page

[C] [2019-11-15] maybe https://youtu.be/zRxI0DaQrag?t=1380 ? cannon

[C] [2019-11-09] github: https://twitter.com/i/web/status/928602151286386688 this ends up trimmed with … :( cannon

[C] [2021-01-24] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=941827 cannon

https://wiki.debian.org/SecureBoot#MOK_-_Machine_Owner_Key
canonical: wiki.debian.org/SecureBoot
sources : notes
[[https://wiki.debian.org/SecureBoot][SecureBoot - Debian Wiki]]

[C] [2021-02-28] https://undeadly.org/cgi?action=article;sid=20170930133438 cannon

'sid' matters here
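This one uses `;` instead of `&` as the query separator, so a parser has to split on both before it can even see 'sid'. A minimal sketch (the `query_pairs` name is hypothetical):

```python
import re
from urllib.parse import urlsplit, unquote


def query_pairs(url: str):
    query = urlsplit(url).query
    pairs = []
    # accept both '&' and ';' as separators
    for chunk in re.split(r'[&;]', query):
        if not chunk:
            continue
        key, _, value = chunk.partition('=')
        pairs.append((unquote(key), unquote(value)))
    return pairs


print(query_pairs('https://undeadly.org/cgi?action=article;sid=20170930133438'))
```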

TODO [C] hmm, server doesn't normalise properly?? (url escaping) cannon

CREATED: [2019-06-02]
ru.wikipedia.org/wiki/Грамматикализация
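For reference, the escaped and unescaped spellings of such a URL only compare equal after decoding. A sketch, assuming we normalise towards the decoded form (`normalise_escaping` is a made-up name; blindly unquoting is lossy for things like `%2F`, so this is not a complete answer):

```python
from urllib.parse import quote, unquote


def normalise_escaping(url: str) -> str:
    # decode percent-escapes so both spellings map to the same string
    return unquote(url)


plain = 'ru.wikipedia.org/wiki/Грамматикализация'
escaped = 'ru.wikipedia.org/wiki/' + quote('Грамматикализация')
print(escaped != plain)
print(normalise_escaping(escaped) == normalise_escaping(plain))
```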

TODO [C] semiconductors video should be unified properly. well, or again hierarchical thing? might be too spammy for 'watch later' cannon

CREATED: [2019-07-15]

[C] [2020-06-16] https://news.ycombinator.com/item?id=23537243#23540421 hmm, both id and # ? cannon

[C] [2020-02-08] https://bugzilla.mozilla.org/show_bug.cgi?id=1411873 : ugh need to keep id cannon

TODO [C] [2020-01-12] old.reddit and new reddit cannon

[D] [2019-06-02] handle google.com/search cannon

[D] [2020-11-30] https://www.c-span.org/video/?c4808083/rust-language-chosen the ? is sneaky cannon

[D] [2020-11-22] https://melpa.org/#/async # is just redundant? cannon

[D] [2019-08-25] Lisp Language http://wiki.c2.com/?LispLanguage ? is sneaky cannon

[D] better regex for url extraction cannon

eh, urls can have commas… e.g. http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html
so, for csv need a separate extractor.
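A sketch of what the two extractors might look like: a permissive regex for free text that allows commas inside the URL and only strips trailing punctuation, and a separate CSV path that splits cells first. The regex and function names are assumptions for illustration, and the CSV variant only works when cells containing commas are properly quoted.

```python
import csv
import io
import re

# Permissive: anything up to whitespace or quotes/angle brackets
URL_RE = re.compile(r'https?://[^\s<>"\']+')


def urls_from_text(text: str):
    # commas inside the URL are kept, trailing punctuation is stripped
    return [u.rstrip('.,;:!?)') for u in URL_RE.findall(text)]


def urls_from_csv(text: str):
    urls = []
    for row in csv.reader(io.StringIO(text)):
        for cell in row:
            urls.extend(urls_from_text(cell))
    return urls


link = 'http://adit.io/posts/2013-04-17-functors,_applicatives,_and_monads_in_pictures.html'
print(urls_from_text('see ' + link + ', which is great'))
print(urls_from_csv('title,"' + link + '"\n'))
```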

STRT [D] should be more defensive cannon

CREATED: [2019-06-05]
ValueError: netloc ' +79869929087, mak34@gmail.com' contains invalid characters under NFKC normalization
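A minimal sketch of being more defensive here: `urlsplit` raises `ValueError` on some malformed input (as in the traceback above), so wrap it and let the caller skip bad candidates. `try_urlsplit` is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import SplitResult, urlsplit


def try_urlsplit(url: str) -> Optional[SplitResult]:
    try:
        return urlsplit(url.strip())
    except ValueError:
        # malformed input (bad netloc, broken IPv6 brackets, etc.): not a URL we can handle
        return None


print(try_urlsplit('http://[broken'))
print(try_urlsplit('https://example.com/a'))
```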

[2019-08-26] did I do it?

[2020-12-09] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=955208 'bug' parameter cannon

DONE [B] [2019-02-18] make sure ? extracted correctly https://play.google.com/store/apps/details?id=com.faultexception.reader cannon

DONE [C] [2019-05-04] https://news.ycombinator.com/item?id=12973788 cannon

id here is important

[2021-03-15] wiki.c2.com pages don't even have canonical? cannon

[D] * misc cannon

STRT [B] would be convenient to normalise reddit annotations so annotations from all comments would be collected cannon

CREATED: [2019-05-06]

TODO [C] [2019-09-03] potential pypi project? https://pypi.org/project/cannon cannon

TODO [C] hypothesis: wonder how it works on timestamped archive.org stuff? cannon

CREATED: [2019-11-01]

TODO [C] hmm some local and remote pages may overlap cannon

CREATED: [2019-07-13]

e.g. this is very likely to be mapped to normal py docs
file:///usr/share/doc/python3/html/library/contextlib.html

[C] [2020-05-11] Vision, Mission & Values — 2020 Update - WorldBrain.io - Medium cannon

fragments are often random and useless
even default org-mode is guilty

[C] [2019-07-09] Changed how threading works. by JakeHartnell · Pull Request 952 · hypothesis/h https://github.com/hypothesis/h/pull/952 cannonhypothesisreddit

TODO [C] reddit: tested on https://www.reddit.com/r/explainlikeimfive/comments/1vavyq/eli5_godels_ontological_proof/ceqlupx/ cannonhypothesis

CREATED: [2019-07-09]

huh, so reddit seems to normalise to the main page, and displays annotations as 'orphaned' for comment views?

[2019-07-09] so looks like reddit refers to the 'post' page as canonical. Right. cannonhypothesis

-------------------------------------------------- cannon

[C] [2021-03-26] URLTeam - Archiveteam cannon

[C] [2021-03-25] seomoz/url-py: URL Transformation, Sanitization cannon

[C] [2021-03-03] Jon Borichevskiy (@jondotbo) / Twitter cannonpromnesia

hmm how to resolve twitter renames?…

TODO [B] [2021-05-05] ClearURLs – automatically remove tracking elements from URLs | Hacker News cannon

Related, if you're looking to clean urls on the backend, here's my current pattern:

startswith: 'utm_', 'ga_', 'hmb_', 'ic_', 'fb_', 'pd_rd', 'ref_', 'share_', 'client_', 'service_'

or has: '$/ref@amazon.', '.tsrc', 'ICID', '_xtd', '_encoding@amazon.', '_hsenc', '_openstat', 'ab', 'action_object_map', 'action_ref_map', 'action_type_map', 'amp', 'arc404', 'affil', 'affiliate', 'app_id', 'awc', 'bfsplash', 'bftwuk', 'campaign', 'camp', 'cip', 'cmp', 'CMP', 'cmpid', 'curator', 'cvid@bing.com', 'efg', 'ei@google.', 'fbclid', 'fbplay', 'feature@youtube.com', 'feedName', 'feedType', 'form@bing.com', 'forYou', 'fsrc', 'ftcamp', 'ga_campaign', 'ga_content', 'ga_medium', 'ga_place', 'ga_source', 'ga_term', 'gi', 'gclid@youtube.com', 'gs_l', 'gws_rd@google.', 'igshid', 'instanceId', 'instanceid', 'kw@youtube.com', 'maca', 'mbid', 'mkt_tok', 'mod', 'ncid', 'ocid', 'offer', 'origin', 'partner','pq@bing.com', 'print', 'printable', 'psc@amazon.', 'qs@bing.com', 'rebelltitem', 'ref', 'referer', 'referrer', 'rss', 'ru', 'sc@bing.com', 'scrolla', 'sei@google.', 'sh', 'share', 'sk@bing.com', 'source', 'sp@bing.com', 'sref', 'srnd', 'supported_service_name', 'tag', 'taid', 'time_continue', 'tsrc', 'twsrc', 'twcamp', 'twclid', 'tweetembed', 'twterm', 'twgr', 'utm', 'ved@google.', 'via', 'xid', 'yclid', 'yptr' 
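A sketch of how a pattern like the one quoted above could be applied with the standard library. Only a small subset of the quoted prefixes and exact names is used, the domain-scoped entries (the `name@domain` ones) are ignored, and `strip_tracking` is a made-up name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Subset of the quoted list, for illustration only
TRACKING_PREFIXES = ('utm_', 'ga_', 'fb_', 'ref_')
TRACKING_EXACT = {'fbclid', 'igshid', 'yclid', 'mkt_tok'}


def strip_tracking(url: str) -> str:
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES) and key not in TRACKING_EXACT
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking('https://example.com/post?id=42&utm_source=hn&fbclid=abc'))
```

Short generic names from the quoted list ('ab', 'tag', 'source', 'origin') look risky to strip blindly, which is one more reason to keep such lists curated and tested in one place.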