Anchor link types and behaviours
Generic
cmark, github
A translated version of the Ruby algorithm is used in md-toc. The original one is reported here:
I could not find the code directly responsible for the anchor link generation. See also:
https://githubengineering.com/a-formal-spec-for-github-markdown/
https://github.com/github/cmark/issues/65#issuecomment-343433978
Apparently GitHub (and possibly others) filters HTML tags in the anchor links.
This is an undocumented feature (?) so the remove_html_tags function was
added to address this problem. Instead of designing an algorithm to detect HTML tags,
regular expressions are used. All the rules
present in https://spec.commonmark.org/0.28/#raw-html have been followed to the
letter. Regular expressions are divided by type and are composed at the end
by concatenating all the strings. For example:
# Comment start.
COS = '<!--'
# Comment text.
COT = '((?!>|->)(?:(?!--).))+(?!-).?'
# Comment end.
COE = '-->'
# Comment.
CO = COS + COT + COE
HTML tags are stripped using the re.sub replace function, for example:
line = re.sub(CO, str(), line, flags=re.DOTALL)
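As a minimal, self-contained sketch of how the two snippets above fit together, the comment expression can be composed and applied to a sample line. The helper name strip_html_comments and the sample strings are assumptions made for illustration; they are not part of md-toc, whose remove_html_tags function handles more tag types than comments.

```python
import re

# Comment start.
COS = '<!--'
# Comment text.
COT = '((?!>|->)(?:(?!--).))+(?!-).?'
# Comment end.
COE = '-->'
# Comment: the three parts concatenated into one expression.
CO = COS + COT + COE


def strip_html_comments(line: str) -> str:
    """Remove HTML comments matched by CO (hypothetical helper name)."""
    # re.DOTALL lets '.' match newlines, so multi-line comments are covered.
    return re.sub(CO, '', line, flags=re.DOTALL)


# Hypothetical sample inputs: a single-line and a multi-line comment.
single = strip_html_comments('Title <!-- hidden note --> here')
multi = strip_html_comments('Title <!-- spans\ntwo lines --> here')
print(repr(single))
print(repr(multi))
```

Only the comment itself is removed, so the whitespace on both sides of it is left in place.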
GitHub added an extension in GFM to ignore certain HTML tags, valid at least from versions 0.27.1.gfm.3 to 0.29.0.gfm.0:
gitlab
New rules have been written:
redcarpet
Treats consecutive dash characters by transforming them into a single dash character. A translated version of the C algorithm is used in md-toc. The original version is here:
See also:
Emphasis
To be able to have working anchor links, emphasis must also be removed from the link destination.
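To show why this matters, here is a deliberately naive sketch that strips only the simplest asterisk emphasis from a heading before an anchor is built from it. The pattern and the function name strip_simple_emphasis are hypothetical and are not the md-toc or cmark implementation: nested emphasis, underscores (whose intraword rules differ) and the CommonMark delimiter-run rules are all ignored.

```python
import re

# Matches *text* or **text** where the text starts and ends with a
# non-whitespace character. Hypothetical pattern, for illustration only.
SIMPLE_ASTERISK_EMPHASIS = re.compile(r'(\*\*|\*)(?=\S)(.+?)(?<=\S)\1')


def strip_simple_emphasis(text: str) -> str:
    """Drop the delimiters of simple, non-nested asterisk emphasis."""
    # Keep only the emphasized text (group 2), discarding the delimiters.
    return SIMPLE_ASTERISK_EMPHASIS.sub(r'\2', text)


print(strip_simple_emphasis('Some *emphasized* title'))
print(strip_simple_emphasis('**Bold** heading'))
# Asterisks surrounded by spaces are not emphasis and are left alone.
print(strip_simple_emphasis('2 * 3 * 4'))
```

Without this step the asterisks would end up in the text the anchor is generated from, and the resulting link would not match the one produced by the renderer.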
cmark, github, gitlab
At the moment the implementation of emphasis removal is incomplete because of its complexity. See:
The core functions for this feature have been ported directly from the original cmark source with some differences:
things such as string manipulation, mallocs, etc. are different in Python
the cmark_utf8proc_charlen function uses length = 1 instead of length = utf8proc_utf8class[ord(line[0])] (which causes a list overflow). The cmark_utf8proc_charlen function is related to the cmark_utf8proc_encode_char function: have a look at that function to know character lengths in cmark. In Python 3, strings are sequences of Unicode code points, so every character has length 1 regardless of how many bytes its UTF-8 encoding takes. See:
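The length difference noted in the list above can be checked directly in Python: cmark works on UTF-8 bytes, where one character takes between one and four bytes, while a Python 3 str is indexed by code point. The helper utf8_byte_length below is a hypothetical name used only for this illustration.

```python
def utf8_byte_length(char: str) -> int:
    """Number of bytes a single character takes once encoded as UTF-8."""
    return len(char.encode('utf-8'))


# One sample character for each possible UTF-8 encoded length.
samples = ['a', 'é', '€', '😀']

# Byte lengths, as a C implementation working on raw UTF-8 would see them.
byte_lengths = [utf8_byte_length(c) for c in samples]

# String lengths, as Python 3 sees them: always one code point each.
str_lengths = [len(c) for c in samples]

print(byte_lengths)
print(str_lengths)
```

This is why a constant length of 1 is enough when iterating over a Python string one character at a time.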
As of md-toc release 8.1.2, cmark-gfm is still at version 0.29. Moreover, certain code sections used in the emphasis processing are not the same as in cmark 0.29. See this one for example:
https://github.com/github/cmark-gfm/blob/0.29.0.gfm.3/src/inlines.c#L639-L654
https://github.com/commonmark/cmark/blob/0.29.0/src/inlines.c#L615-L621
For the moment md-toc uses the original cmark source only as reference for emphasis processing.