
Markdown

I need to be able to write posts with simple formatting from anywhere. Basically, I want to write plain text that is displayed neatly, with bold and italic text, some inline code, and syntax-highlighted code blocks.

There is already a well-known format for that: Markdown. I will not implement the whole spec, just a very small subset.

Of course, there are many libraries that convert Markdown to HTML. There are PHP libraries that do it server side, and there are JavaScript ones that convert it client side, in the browser.

Having looked at a few libraries, I decided to do the work myself. It will be good practice for writing regex code and doing some string operations.

The subset of Markdown that I want to start with is:

1. Headings

2. Bold and italic

3. Inline code

4. Fenced code blocks

5. Links

Here is a very simple Markdown-to-HTML function in PHP. It uses str_starts_with, so it needs PHP 8.0 or later:


function md2html(string $md): string {
    $lines = preg_split("/\r\n|\r|\n/", $md);
    $parsed = [];

    $inCode = false;

    foreach ($lines as $line) {
        // echo ($line) . PHP_EOL;
        if ($inCode) {
            if (str_starts_with($line, "```")) {
                $inCode = false;
                $parsed[] = '</code></pre>' . PHP_EOL;
            } else {
                $parsed[] = htmlspecialchars($line) . PHP_EOL;
            }
        } else {
            if (str_starts_with($line, "```")) {
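                // language hint after the fence; trimmed and escaped because it goes into an attribute
                $lang = htmlspecialchars(trim(substr($line, 3)));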
                $inCode = true;
                $parsed[] = "<pre><code class=\"language-$lang\">"; // note we will NOT add extra new line
            } else {
                // first replace any word or sub line patterns
                // ** strong emphasis (bold)
                $line = preg_replace("'\*\*([^*]+)\*\*'", "<strong>$1</strong>", $line);
                // __ emphasis (italic)
                $line = preg_replace("'__([^_]+)__'", "<em>$1</em>", $line);
                // links
                $line = preg_replace("/\[([^]]+)\]\(([^)]+)\)/", '<a href="$2">$1</a>', $line);
                // inline code
                $line = preg_replace("/`([^`]+)`/", "<code>$1</code>", $line);

                if (str_starts_with($line, "### ")) {
                    $hdr = substr($line, 4);
                    $parsed[] = "<h3>$hdr</h3>" . PHP_EOL;
                } elseif (str_starts_with($line, "## ")) {
                    $hdr = substr($line, 3);
                    $parsed[] = "<h2>$hdr</h2>" . PHP_EOL;
                } elseif (str_starts_with($line, "# ")) {
                    $hdr = substr($line, 2);
                    $parsed[] = "<h1>$hdr</h1>" . PHP_EOL;
                } elseif ($line === "---") {
                    $parsed[] = "<hr/>" . PHP_EOL;
                } elseif (trim($line) === "") {
                    // skip blank lines (empty() would also drop a line containing only "0")
                } else {
                    $parsed[] = "<p>$line</p>" . PHP_EOL;
                }
            }
        }
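    }

    // close a fenced block that was never terminated in the source
    if ($inCode) {
        $parsed[] = '</code></pre>' . PHP_EOL;
    }

    return implode($parsed);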
}
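
To make the inline pass easier to play with, here is a minimal sketch that pulls the four replacements out into a helper of their own. The name inlineMd and the sample sentence are my own inventions for illustration and are not part of the function above. The patterns are the same ones, written with / delimiters and single-quoted strings.

```php
<?php
// Hypothetical helper: applies only the inline replacements to a single line.
function inlineMd(string $line): string {
    // **text** becomes <strong>
    $line = preg_replace('/\*\*([^*]+)\*\*/', '<strong>$1</strong>', $line);
    // __text__ becomes <em>
    $line = preg_replace('/__([^_]+)__/', '<em>$1</em>', $line);
    // [label](url) becomes a link
    $line = preg_replace('/\[([^]]+)\]\(([^)]+)\)/', '<a href="$2">$1</a>', $line);
    // `text` becomes inline code
    $line = preg_replace('/`([^`]+)`/', '<code>$1</code>', $line);
    return $line;
}

echo inlineMd('Use **bold**, __italic__, `code` and [a link](https://example.com).'), PHP_EOL;
// prints: Use <strong>bold</strong>, <em>italic</em>, <code>code</code> and <a href="https://example.com">a link</a>.
```

One thing this makes easy to try: because the inline code pass runs last, emphasis markers inside backticks are still converted, so `**x**` written as inline code comes out as bold text inside a code tag. That is probably fine for a small subset, but it is worth knowing about.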

Note that I'm only escaping HTML special characters inside fenced code blocks. I don't expect to need to escape them in the body of the post. In fact, I may want to put HTML tags directly in a post and have them passed through as is. So it's a "feature, not a bug" kind of thing.
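
As a small illustration of that difference, the sketch below runs the same line through both paths. The sample line and the variable names are made up for the example. The first path is what happens inside a fenced block, the second is what happens to an ordinary line.

```php
<?php
// Made-up sample line containing raw HTML.
$line = '<div class="note">Hi & bye</div>';

// Inside a fenced code block the line is escaped, so the browser shows the tags as text.
$escaped = htmlspecialchars($line);
echo $escaped, PHP_EOL;
// prints: &lt;div class=&quot;note&quot;&gt;Hi &amp; bye&lt;/div&gt;

// Outside a fenced block the line is wrapped as is, so the tags reach the browser untouched.
$raw = "<p>$line</p>";
echo $raw, PHP_EOL;
// prints: <p><div class="note">Hi & bye</div></p>
```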
