Automating Markdown Management: Scripts for Consolidating Documentation on GitHub

The scripts discussed in this blog aim to automate the process of retrieving, combining, and updating Markdown files in a GitHub repository. Markdown is a lightweight markup language with plain text formatting syntax, and it’s commonly used for creating formatted text on the web. These scripts are particularly useful for documentation or projects that require a compilation of various Markdown documents into a single, cohesive file.

Here is a breakdown of the overarching goals of the scripts:

Retrieve Markdown Files from GitHub: The first part of the scripts involves connecting to the GitHub repository using the GitHub API. The objective is to fetch a list of all the Markdown (.md) files available in the repository. This step takes into account the structure and naming conventions of the files, retrieving them in a sorted order, with README.md often being the initial file as it usually serves as the entry point or introduction to the repository.

Combine Markdown Files: Once the list of Markdown files is retrieved, the scripts download the content of each file. These contents are then combined into a single Markdown document. This combination process may involve cleaning up or reformatting headings and other elements to ensure that the single document maintains readability and a logical structure after the merge.

Push Combined File Back to GitHub: After creating a single, combined Markdown document, the scripts then push this new document back to the original GitHub repository. This step may include creating a new file or updating an existing one with the combined content. The operation involves committing the changes to the repository, which keeps a record of the update and allows for version control.

Automation and Efficiency: The entire process is automated using Python or PowerShell scripts. This automation is designed to save time and reduce the risk of human error that can occur with manual combining and updating of documentation files. It is particularly useful for projects that regularly update their documentation or have multiple contributors, as it ensures that the latest information is always compiled and available in a single, updated document.

These scripts are flexible and can be customized to suit specific project needs, such as sorting files in a particular order, handling different file hierarchies, or dealing with complex document structures. The use of these scripts exemplifies how programming can be utilized to streamline workflow processes, enhance collaboration, and maintain organized and up-to-date documentation in software development projects.

Join Markdown

This is a script that concatenates multiple Markdown files into a single file. It takes a few extra steps to adjust the headings and other elements so that the combined document keeps a sensible structure.

Below is a Python script that does the following:

  • Takes a list of Markdown filenames.
  • Adjusts their heading levels to maintain structure.
  • Concatenates them into a single Markdown file.
import re

def adjust_headings(text, level_increase=1):
    """
    Adjust the heading levels in the given markdown text.
    """
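    def replace_func(match):
        # Cap at six '#' characters, the deepest heading level Markdown supports
        return '#' * min(6, len(match.group(1)) + level_increase)

    # Match 1-6 leading '#' characters followed by whitespace (ATX-style headings)
    return re.sub(r'^(#{1,6})(?=\s)', replace_func, text, flags=re.MULTILINE)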

def concatenate_markdown_files(filenames, output_filename='combined.md'):
    """
    Concatenate a list of markdown files into a single file with adjusted headings.
    """
    with open(output_filename, 'w', encoding='utf-8') as outfile:
        for filename in filenames:
            with open(filename, 'r', encoding='utf-8') as infile:
                text = infile.read()
                # Increase heading levels by 1 (or desired amount)
                adjusted_text = adjust_headings(text, 1)
                outfile.write(adjusted_text + '\n\n')

# List of markdown files to concatenate
markdown_files = ['file1.md', 'file2.md', 'file3.md']

# Output file name
output_file = 'combined.md'

# Concatenate files
concatenate_markdown_files(markdown_files, output_file)

print(f'Concatenated Markdown written to {output_file}')
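
One limitation of a purely regex-based approach is that a line starting with # inside a fenced code block (a shell comment, for example) looks like a heading. The following is a minimal sketch of one way around that, assuming fences are opened and closed with three backticks or three tildes at the start of a line. The function name shift_headings_outside_code is hypothetical and not part of the script above, and the fence tracking is deliberately simplistic.

```python
import re

HEADING_RE = re.compile(r'^(#{1,6})(?=\s)')
FENCE_RE = re.compile(r'^\s*(`{3}|~{3})')


def shift_headings_outside_code(text, level_increase=1):
    """Shift ATX headings, leaving lines inside fenced code blocks untouched."""
    out = []
    in_fence = False
    for line in text.split('\n'):
        if FENCE_RE.match(line):
            # Toggle on every fence line (simplification: fence types are not paired)
            in_fence = not in_fence
        elif not in_fence:
            line = HEADING_RE.sub(
                lambda m: '#' * min(6, len(m.group(1)) + level_increase), line)
        out.append(line)
    return '\n'.join(out)


fence = '`' * 3
sample = f"# Title\n\n{fence}bash\n# a shell comment\n{fence}\n\n## Section"
print(shift_headings_outside_code(sample))
```

In the sample, the two real headings move down one level while the shell comment inside the fence stays as it was.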

Using the GitHub API – Python

The GitHub API can handle the whole workflow, from listing the Markdown files in a repository to committing the combined result. The Python script below uses the requests library to call the API and does the following:

  • Retrieves the list of Markdown files from a specified GitHub repository.
  • Downloads the contents of these files.
  • Concatenates them into a single Markdown file, making sure README.md (if present) is first.
  • Commits and pushes the single Markdown file back to the GitHub repository.

If you’re planning on using this script frequently or with private repositories, you should authenticate your requests using a personal access token. You can add the token to your request like this:

headers = {'Authorization': 'token YOUR_TOKEN'}
response = requests.get(api_url, headers=headers)

To do this, you’ll need a GitHub Personal Access Token with the appropriate permissions to access repositories, read their contents, and push changes. See GitHub’s documentation on managing your personal access tokens.

You will also need to install the requests library:

pip install requests

Here’s an outline of the script:

import requests
import base64

# Constants for GitHub API headers, including the authorization token.
# Note: The token should be kept secret and not hardcoded in the code. Use environment variables for production.
headers = {
    'Accept': 'application/vnd.github.v3+json',
    'Authorization': 'token <YOUR_GITHUB_TOKEN>'
}

def get_repo_contents(user, repo, path=''):
    """
    Get the contents of a repository at a specified path.

    :param user: GitHub username
    :param repo: GitHub repository name
    :param path: path inside the repository (optional, default is root)
    :return: JSON response with repository contents
    """
    api_url = f"https://api.github.com/repos/{user}/{repo}/contents/{path}"
    response = requests.get(api_url, headers=headers)
    response.raise_for_status()
    return response.json()

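def get_markdown_files(repo_contents, exclude=('combined.md',)):
    """
    Filter and sort the list of files in the repository to get Markdown files.

    :param repo_contents: JSON response with repository contents
    :param exclude: File names to skip, such as the combined output of a previous run
    :return: List of Markdown files, README.md first, then the rest sorted by name
    """
    md_files = [
        file for file in repo_contents
        if file['type'] == 'file' and file['name'].endswith('.md') and file['name'] not in exclude
    ]
    return sorted(md_files, key=lambda x: (x['name'] != 'README.md', x['name']))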

def download_files(files_info):
    """
    Download the content of each file in the list of files.

    :param files_info: List of file information, which includes the download URL
    :return: List of contents of each Markdown file
    """
    md_contents = []
    for file_info in files_info:
        download_url = file_info['download_url']
        response = requests.get(download_url)
        response.raise_for_status()
        md_contents.append(response.text)
    return md_contents

def combine_markdown(md_files_contents):
    """
    Combine the content of all Markdown files into a single string.

    :param md_files_contents: List of contents of each Markdown file
    :return: A single string containing all combined Markdown content
    """
    combined_md = '\n\n'.join(md_files_contents)
    return combined_md

def push_to_github(user, repo, path, content, commit_message):
    """
    Push a file's content to GitHub repository.

    :param user: GitHub username
    :param repo: GitHub repository name
    :param path: Path where the file will be pushed
    :param content: Content to be pushed
    :param commit_message: Commit message
    :return: JSON response from the GitHub API
    """
    api_url = f"https://api.github.com/repos/{user}/{repo}/contents/{path}"
    get_response = requests.get(api_url, headers=headers)

    # If file exists, use its SHA to update, else create a new file
    sha = get_response.json().get('sha') if get_response.status_code == 200 else None

    # Encode content to base64 as required by GitHub API
    base64content = base64.b64encode(content.encode('utf-8')).decode('utf-8')

    # Prepare data payload for the PUT request
    data = {
        "message": commit_message,
        "committer": {
            "name": "Your Name",
            "email": "you@example.com"
        },
        "content": base64content,
        "sha": sha
    }

    # If creating a new file, the 'sha' field should not be included
    if not sha:
        del data["sha"]

    # Make the PUT request to GitHub API
    response = requests.put(api_url, headers=headers, json=data)
    response.raise_for_status()
    return response.json()

# Main process
github_user = 'mygithubusername'
github_repo = 'mygithubreponame'
github_path = ''
output_file_path = 'combined.md'
commit_message = 'Update combined markdown file'

try:
    # Step 1: Get the list of Markdown files from the repository
    contents = get_repo_contents(github_user, github_repo, github_path)
    markdown_files_info = get_markdown_files(contents)
    
    # Step 2: Download the content of Markdown files
    markdown_files_contents = download_files(markdown_files_info)
    
    # Step 3: Combine the downloaded Markdown content into a single document
    combined_md = combine_markdown(markdown_files_contents)
    
    # Step 4: Push the combined Markdown content back to GitHub
    push_result = push_to_github(github_user, github_repo, output_file_path, combined_md, commit_message)
    print(f"Successfully pushed to {push_result['content']['html_url']}")
except requests.HTTPError as http_err:
    # If an HTTP error occurs, print it so the failing request can be diagnosed
    print(f"HTTP error occurred: {http_err}")

Replace <YOUR_GITHUB_TOKEN> with your actual GitHub token, mygithubusername with the GitHub username or organization name, and mygithubreponame with the repository name, then set the committer name and email to your own details.

Note that this script is quite basic and assumes:

  • All the Markdown files are in the root of the repository.
  • The README.md is in the root and will be the first file.
  • You have the necessary permissions to push to the repository.
  • The repository is small enough that API rate limits are not a concern. Repositories with many files need extra handling, because the contents endpoint lists at most 1,000 files per directory.
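
If the first two assumptions do not hold, the ordering logic can be applied to full repository paths instead of root-level file names. Here is a minimal sketch, assuming you already have a list of repository-relative paths (for example, the path values returned by the Git Trees API). The function name order_markdown_paths and the sample paths are hypothetical.

```python
def order_markdown_paths(paths, output_name='combined.md'):
    """Return .md paths with the root README.md first, then the rest alphabetically."""
    md_paths = [
        p for p in paths
        # Skip non-Markdown files and the combined file from a previous run
        if p.lower().endswith('.md') and p != output_name
    ]
    # False sorts before True, so the root README.md comes first
    return sorted(md_paths, key=lambda p: (p != 'README.md', p.lower()))


paths = ['docs/setup.md', 'combined.md', 'README.md', 'src/main.py', 'CHANGELOG.md']
print(order_markdown_paths(paths))
# ['README.md', 'CHANGELOG.md', 'docs/setup.md']
```

Excluding the output file matters on the second and later runs, since otherwise the previous combined document would be merged into the new one.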

Please ensure you understand the implications of using your Personal Access Token in scripts, and secure it appropriately.

In a production environment, you would want to use environment variables or a configuration file to store sensitive information like API tokens.
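
As a rough sketch of the environment-variable approach, the headers could be built at runtime instead of being hardcoded. The variable name GITHUB_TOKEN and the helper name build_github_headers are assumptions chosen for illustration.

```python
import os


def build_github_headers(env_var='GITHUB_TOKEN'):
    """Build GitHub API headers from a token stored in an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a clear message instead of sending an unauthenticated request
        raise RuntimeError(f'Set the {env_var} environment variable before running the script.')
    return {
        'Accept': 'application/vnd.github.v3+json',
        'Authorization': f'token {token}',
    }
```

With this in place, the script would call headers = build_github_headers() once at startup, and the token never appears in the source file or in version control.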

Using the GitHub API – PowerShell

Here is an example of how you could achieve the same task using PowerShell. Please ensure you have the correct permissions and your GitHub personal access token ready to use.

Do not share your token in your scripts or store it in a public place.


# Set your GitHub username and repository
$user = "yourusername"
$repo = "yourrepo"

# Store the GitHub API token in an environment variable (ideally set it outside the script rather than hardcoding it here)
$env:GITHUB_TOKEN = "<YOUR_GITHUB_TOKEN>"

# Base64 encode the GitHub token for authorization
$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $user,$env:GITHUB_TOKEN)))

# Function to retrieve the list of markdown files from GitHub repository
function Get-MarkdownFilesFromRepo {
    param (
        [string]$User,
        [string]$Repository
    )

    $headers = @{
        Authorization=("Basic {0}" -f $base64AuthInfo)
        Accept="application/vnd.github.v3.raw"
    }

    $apiUrl = "https://api.github.com/repos/$User/$Repository/git/trees/main?recursive=1"
    $response = Invoke-RestMethod -Uri $apiUrl -Method Get -Headers $headers

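    # Keep only markdown files (skipping the combined output of a previous run) and list README.md first
    return $response.tree |
        Where-Object { $_.type -eq 'blob' -and $_.path -like '*.md' -and $_.path -ne 'combined.md' } |
        Sort-Object @{ Expression = { $_.path -ne 'README.md' } }, path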
}

# Function to download the content of markdown files
function Get-ContentFromMarkdownFiles {
    param (
        [object[]]$MarkdownFiles
    )

    $headers = @{
        Authorization=("Basic {0}" -f $base64AuthInfo)
        Accept="application/vnd.github.v3.raw"
    }

    $contentList = @()

    foreach ($file in $MarkdownFiles) {
        $fileResponse = Invoke-RestMethod -Uri $file.url -Method Get -Headers $headers
        $contentList += $fileResponse
    }

    return $contentList
}

# Function to update or create a markdown file in the repository
function Update-GithubMarkdownFile {
    param (
        [string]$User,
        [string]$Repository,
        [string]$FilePath,
        [string]$Content,
        [string]$Message
    )

    $headers = @{
        Authorization=("Basic {0}" -f $base64AuthInfo)
        Accept="application/vnd.github.v3+json"
    }

    $body = @{
        message = $Message
        content = [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($Content))
        # If updating an existing file, 'sha' of the file should be included in the body
        # sha = <SHA_OF_THE_FILE_TO_UPDATE>
    } | ConvertTo-Json

    $apiUrl = "https://api.github.com/repos/$User/$Repository/contents/$FilePath"
    $response = Invoke-RestMethod -Uri $apiUrl -Method Put -Body $body -Headers $headers -ContentType "application/json"

    return $response
}

# Main process
try {
    $markdownFiles = Get-MarkdownFilesFromRepo -User $user -Repository $repo
    $markdownContent = Get-ContentFromMarkdownFiles -MarkdownFiles $markdownFiles
    $combinedContent = $markdownContent -join "`n`n"
    $updateResponse = Update-GithubMarkdownFile -User $user -Repository $repo -FilePath "combined.md" -Content $combinedContent -Message "Combine markdown files"
    Write-Host "Successfully updated file: $($updateResponse.content.html_url)"
}
catch {
    Write-Error "An error occurred: $_"
}

Make sure to replace <YOUR_GITHUB_TOKEN> with your actual GitHub token.

This script follows a similar structure to the Python script but adapted to PowerShell:

  • Get-MarkdownFilesFromRepo: Retrieves a list of markdown files from the specified GitHub repository.
  • Get-ContentFromMarkdownFiles: Downloads the content of each markdown file.
  • Update-GithubMarkdownFile: Pushes the combined markdown content back to GitHub. If updating an existing file, you will need to retrieve the file’s SHA and include it in the request body.
  • The main process then executes these functions, combines the content of markdown files, and pushes the combined content to the GitHub repository.

Handling 404 Errors

A 404 Not Found error when trying to access the GitHub API usually means that the URL is incorrect or the resource doesn’t exist. Here are some possible reasons and solutions:

Incorrect Repository Name/User: Ensure that the user (yourusername) and repository (yourrepo) names are spelled correctly, and that the repository actually exists and is public. If it’s a private repository, make sure your token has the right permissions.

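API Rate Limiting: Unauthenticated requests to the GitHub API are limited to 60 per hour, far fewer than authenticated requests are allowed. An exhausted rate limit normally produces a 403 or 429 response rather than a 404, so check the status code and whether you’ve hit the rate limit before assuming the URL is wrong.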
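
GitHub reports the current limit in response headers, so a script can check them before retrying. Below is a minimal sketch that reads the X-RateLimit-Remaining and X-RateLimit-Reset headers from any mapping of response headers. The function name rate_limit_status is hypothetical, and note that a plain dict is case-sensitive whereas the headers object from requests is not.

```python
from datetime import datetime, timezone


def rate_limit_status(response_headers):
    """Summarize GitHub's rate-limit headers, or return None if they are absent."""
    remaining = response_headers.get('X-RateLimit-Remaining')
    reset = response_headers.get('X-RateLimit-Reset')
    if remaining is None or reset is None:
        return None
    return {
        'remaining': int(remaining),
        'exhausted': int(remaining) == 0,
        # The reset header is a Unix timestamp in seconds (UTC)
        'reset_at': datetime.fromtimestamp(int(reset), tz=timezone.utc),
    }


status = rate_limit_status({'X-RateLimit-Remaining': '0', 'X-RateLimit-Reset': '1700000000'})
print(status['exhausted'], status['reset_at'].isoformat())
```

In the earlier Python script, this could be called with response.headers after a failed request to tell a rate-limit problem apart from a wrong URL.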

Branch Name: By default, GitHub repositories now name their primary branch main instead of master. If you have specified the branch name in the API call and the repository’s primary branch has a different name, it will lead to a 404 error.

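Access Token Permissions: GitHub returns a 404 rather than a 403 for private repositories that the request is not authorized to see. If the repository is private, make sure that your GitHub token has the repo scope to access private repositories.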

Before executing the main process, check if the repository exists by visiting https://github.com/yourusername/yourrepo. If the repository exists, ensure the path you are trying to access (contents/) is correct.

If you have confirmed that the repository and user names are correct, and the repository is public, the next step is to make sure that your access token is correct and has the necessary permissions. Double-check the token, and if it’s a private repository, make sure you’ve given the token the appropriate scope.

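Finally, if you are sure the repository exists and your token is correctly set up, check the branch name. The PowerShell script hardcodes main in its git/trees/main URL, while the Python function get_repo_contents reads from the repository’s default branch because it passes no branch at all. If the repository uses a different branch name, you’ll need to specify that name, in the tree URL for PowerShell or through the ref query parameter of the contents endpoint for Python.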
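
As an illustration of the ref approach, here is a small sketch of a URL builder for the contents endpoint. The helper name build_contents_url is hypothetical; the ref query parameter itself is part of the contents endpoint and accepts a branch, tag, or commit.

```python
from urllib.parse import quote, urlencode


def build_contents_url(user, repo, path='', ref=None):
    """Build a contents-endpoint URL, optionally pinned to a branch, tag, or commit."""
    # quote() keeps '/' by default, so nested paths such as docs/setup.md survive
    url = f'https://api.github.com/repos/{quote(user)}/{quote(repo)}/contents/{quote(path)}'
    if ref:
        url += '?' + urlencode({'ref': ref})
    return url


print(build_contents_url('yourusername', 'yourrepo', ref='master'))
# https://api.github.com/repos/yourusername/yourrepo/contents/?ref=master
```

Leaving ref as None keeps the current behavior of reading from the default branch.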

Once you’ve checked all the above, try to run the script again. If you’re still encountering issues, you may want to run a curl command or use Postman to manually check the API response before executing it in the script. Here’s a curl example to test access to the repository:

curl -H "Authorization: token YOUR_GITHUB_TOKEN" \
     -H "Accept: application/vnd.github.v3+json" \
     "https://api.github.com/repos/yourusername/yourrepo/contents/"

Make sure to replace YOUR_GITHUB_TOKEN with your actual token. If the curl command works but your script does not, you’ll need to troubleshoot the script further. If the curl command also fails, then the issue may lie with the repository access settings or the token permissions.
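
Whichever tool you use for the manual check, the status code narrows down the cause. The mapping below is a heuristic sketch based on the cases discussed in this section, not an exhaustive list. The function name explain_status is hypothetical.

```python
def explain_status(status_code, authenticated):
    """Map a GitHub API status code to a likely cause (heuristic only)."""
    if status_code == 200:
        return 'OK: the URL and credentials work.'
    if status_code == 401:
        return 'Bad credentials: the token is invalid or expired.'
    if status_code in (403, 429):
        return 'Forbidden or rate limited: check the rate limit and token permissions.'
    if status_code == 404:
        if authenticated:
            return ('Not found: check the user, repository, path, and branch, '
                    'and whether the token can see a private repository.')
        return ('Not found: check the spelling, and note that private repositories '
                'also return 404 to unauthenticated requests.')
    return f'Unexpected status {status_code}: inspect the response body for details.'


print(explain_status(404, authenticated=False))
```

The Python script could pass response.status_code to a helper like this when a request fails, so the error message points to the most likely fix.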

In the following check script:

  • The script sends an HTTP GET request to the GitHub API.
  • If successful, it will list the file paths in the repository’s root directory.
  • If there’s an error (like a 404), it will display the status code, status description, and error message.
  • The headers are passed as a hashtable to the -Headers parameter.
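  • No explicit User-Agent header is set in the hashtable; Invoke-WebRequest sends a default one, which satisfies the GitHub API’s requirement for a User-Agent.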
  • The personal access token should replace YOUR_GITHUB_TOKEN in the Authorization field.

$Headers = @{
    Authorization = "token YOUR_GITHUB_TOKEN"
    Accept = "application/vnd.github.v3+json"
}

$Uri = "https://api.github.com/repos/yourusername/yourrepo/contents/"

try {
    $Response = Invoke-WebRequest -Uri $Uri -Headers $Headers -Method Get
    $Content = $Response.Content
    $RepositoryContent = $Content | ConvertFrom-Json
    foreach ($File in $RepositoryContent) {
        Write-Host "File Path: $($File.path)"
    }
} catch {
    Write-Error $_.Exception.Response.StatusCode.Value__
    Write-Error $_.Exception.Response.StatusDescription
    Write-Error $_.Exception.Message
}

Run this script in your PowerShell console after replacing YOUR_GITHUB_TOKEN with the actual token value. If it is successful, it will print out the file paths of the contents in the repository. If there’s an error, it will print out more detailed error information, which can help in further troubleshooting.

If you are still encountering the 404 error, you should:

  • Check that the GitHub token is correct and has the proper scopes enabled.
  • Confirm whether the repository yourusername/yourrepo is public. If it is private, ensure your GitHub token has the repo scope to access private repositories.