Document Metadata Extraction
Documents published online often contain embedded properties detailing their authors, internal file paths, and editing history. This showcase displays the scripts used to gather and parse public media documents, and presents the resulting findings.
To retrieve attachments in bulk from sequential pathways, we use a Python script. It automatically requests each incremental ID and determines the appropriate file extension by inspecting HTTP Content-Type headers.
import os
import requests
import mimetypes
from urllib.parse import urljoin
import time
# Configured for local testing
BASE_URL = "https://www.cafcass.gov.uk/media/"
OUTPUT_DIR = "./downloaded_assets"
START_ID = 1
END_ID = 5000
os.makedirs(OUTPUT_DIR, exist_ok=True)
def download_file(resource_id):
file_url = urljoin(BASE_URL, str(resource_id))
try:
# Use a GET request to retrieve the headers and content
response = requests.get(file_url, timeout=10, verify=False)
if response.status_code == 200:
# 1. Read the Content-Type header (e.g., "application/pdf")
content_type = response.headers.get('Content-Type', '').split(';')[0].strip()
# 2. Look up the standard extension for this MIME type
ext = mimetypes.guess_extension(content_type)
# Fall back to a generic extension if the type is unknown
if not ext:
ext = '.dat'
# 3. Construct the filename using the correct extension
filename = f"asset_{resource_id}{ext}"
filepath = os.path.join(OUTPUT_DIR, filename)
with open(filepath, 'wb') as f:
f.write(response.content)
print(f"[+] Downloaded asset {resource_id} as {filename} (MIME: {content_type})")
else:
print(f"[-] Asset {resource_id} returned status: {response.status_code}")
except requests.exceptions.RequestException as e:
print(f"[!] Error fetching asset {resource_id}: {e}")
if __name__ == "__main__":
print(f"Starting generic download from {BASE_URL}...")
for rid in range(START_ID, END_ID + 1):
download_file(rid)
time.sleep(0.1)
Once files are downloaded locally, we use a Windows PowerShell script utilizing the COM Shell.Application object. This extracts extended shell properties such as System.Author and System.Document.LastAuthor.
$folderPath = "C:\Users\rtrav\govuk\downloaded_assets"
$shell = New-Object -com shell.application
$folder = $shell.Namespace($folderPath)
Get-ChildItem $folderPath | ForEach-Object {
$file = $folder.ParseName($_.Name)
[PSCustomObject]@{
FileName = $_.Name
Author = ($file.ExtendedProperty("System.Author") -join "; ")
LastSavedBy = $file.ExtendedProperty("System.Document.LastAuthor")
}
} | Export-Csv -Path "metadata_export.csv" -NoTypeInformation Sanitizing Before Publishing
Organizations should mandate automated document scrubbing/sanitization pipelines before assets are uploaded to public CMS repositories. Modern document management suites can strip custom properties, author identities, revision trackers, and internal folder path strings automatically.
By scanning metadata on Cafcass documents, the following names, department tags, and creator signatures were extracted. This demonstrates how easily public files leak team directories and individual contributors.