DOCUMENT INTELLIGENCE ANALYSIS

Document Metadata Extraction

Documents published online often contain embedded properties detailing their authors, internal file paths, and editing history. This showcase displays the scripts used to gather and parse public media documents, and presents the resulting findings.

Step 01 // Bulk Media Fetcher (Python)

To retrieve attachments in bulk from sequentially numbered media URLs, we use a Python script. It requests each ID in turn and chooses the file extension from the Content-Type header of the HTTP response.

import os
import requests
import mimetypes
from urllib.parse import urljoin
import time

# Configured for local testing
BASE_URL = "https://www.cafcass.gov.uk/media/"
OUTPUT_DIR = "./downloaded_assets"
START_ID = 1
END_ID = 5000

os.makedirs(OUTPUT_DIR, exist_ok=True)

def download_file(resource_id):
    file_url = urljoin(BASE_URL, str(resource_id))
    
    try:
        # Use a GET request to retrieve the headers and content.
        # TLS certificate verification stays at its default (enabled).
        response = requests.get(file_url, timeout=10)
        
        if response.status_code == 200:
            # 1. Read the Content-Type header (e.g., "application/pdf")
            content_type = response.headers.get('Content-Type', '').split(';')[0].strip()
            
            # 2. Look up the standard extension for this MIME type
            ext = mimetypes.guess_extension(content_type)
            
            # Fall back to a generic extension if the type is unknown
            if not ext:
                ext = '.dat'
                
            # 3. Construct the filename using the correct extension
            filename = f"asset_{resource_id}{ext}"
            filepath = os.path.join(OUTPUT_DIR, filename)
            
            with open(filepath, 'wb') as f:
                f.write(response.content)
            print(f"[+] Downloaded asset {resource_id} as {filename} (MIME: {content_type})")
        else:
            print(f"[-] Asset {resource_id} returned status: {response.status_code}")
            
    except requests.exceptions.RequestException as e:
        print(f"[!] Error fetching asset {resource_id}: {e}")

if __name__ == "__main__":
    print(f"Starting generic download from {BASE_URL}...")
    for rid in range(START_ID, END_ID + 1):
        download_file(rid)
        time.sleep(0.1)
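
The extension lookup from the script above can be exercised on its own, without any network access. The following is a minimal sketch; the helper name extension_for and its fallback parameter are illustrative assumptions and do not appear in the original script.

```python
import mimetypes


def extension_for(content_type_header, fallback=".dat"):
    """Map a raw Content-Type header value to a file extension."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = content_type_header.split(";")[0].strip().lower()
    # guess_extension returns None for unknown or empty types.
    return mimetypes.guess_extension(mime_type) or fallback


if __name__ == "__main__":
    for header in ("application/pdf", "application/pdf; charset=binary", ""):
        print(repr(header), "->", extension_for(header))
```

Keeping the lookup in a small pure function makes the fallback behaviour easy to check before running a long download job.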

Step 02 // Metadata Property Parser (PowerShell)

Once the files are downloaded locally, a Windows PowerShell script reads their metadata through the Shell.Application COM object, which exposes extended shell properties such as System.Author and System.Document.LastAuthor.

$folderPath = "C:\Users\rtrav\govuk\downloaded_assets"
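# The Shell.Application COM object exposes the Windows property system
$shell = New-Object -ComObject Shell.Application
$folder = $shell.Namespace($folderPath)

# -File skips subdirectories, which carry no document properties
Get-ChildItem $folderPath -File | ForEach-Object {
    $file = $folder.ParseName($_.Name)
    [PSCustomObject]@{
        FileName      = $_.Name
        # System.Author can hold several values, so join them
        Author        = ($file.ExtendedProperty("System.Author") -join "; ")
        LastSavedBy   = $file.ExtendedProperty("System.Document.LastAuthor")
    }
    # UTF8 keeps non-ASCII characters in names intact in Windows PowerShell
} | Export-Csv -Path "metadata_export.csv" -NoTypeInformation -Encoding UTF8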

Security Recommendation

Sanitizing Before Publishing

Organizations should mandate an automated document sanitization step before assets are uploaded to public CMS repositories. Modern document management suites can automatically strip custom properties, author identities, tracked revisions, and internal folder paths.
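
As an illustration of what such a sanitization step might do, the sketch below copies an Office Open XML package (.docx, .xlsx, .pptx) and blanks the author identity fields stored in docProps/core.xml. It is a minimal example under stated assumptions, not a complete scrubber: the function name scrub_core_properties is hypothetical, only dc:creator and cp:lastModifiedBy are cleared, and custom properties, tracked revisions, comments, and embedded images are left untouched.

```python
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
# Fields that carry personal identity (assumption: only these two are cleared).
IDENTITY_FIELDS = ("dc:creator", "cp:lastModifiedBy")

# Keep the conventional prefixes when the XML is written back out.
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)


def scrub_core_properties(source, destination):
    """Copy an OOXML package, blanking author identity fields in core.xml."""
    with zipfile.ZipFile(source) as src, \
         zipfile.ZipFile(destination, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for field in IDENTITY_FIELDS:
                    for element in root.findall(field, NAMESPACES):
                        element.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other part of the package is copied unchanged.
            dst.writestr(item, data)
```

Both arguments may be file paths or file-like objects, so the function can sit in an upload hook that writes the cleaned copy to a staging location. A production pipeline would also need to handle PDFs and image EXIF data, which this sketch does not cover.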

Parsed Findings // Extracted Author Names

By scanning metadata on Cafcass documents, the following names, department tags, and creator signatures were extracted. This demonstrates how easily public files leak team directories and individual contributors.

Official / Staff Identifiers (17)

Lynch, Jennifer - Cafcass

Evans, Claire - Cafcass

Cheema, Sandeep - Cafcass

Weetch, Emma - Cafcass

john, Rebecca - Cafcass

Nelmes, Linda - Cafcass

Pitcher, David - Cafcass

Hyde, Andy - Cafcass

Halliday, Emily - Cafcass

john, Rebecca - Cafcass

Baldwin, Carol - Cafcass

Marrinan, Maria - Cafcass

Grammatica, Karen - Cafcass

Marsh, Jane - Cafcass

Rodger, Holly - Cafcass

Blakebrough, Nicola - Cafcass

Egbewole-Adereti, Grace - Cafcass

External Authors & Usernames (81)

Peter Bates

Natasha Graves

Alex Jones

Stuart Robinson Sussex University

Jigna Patel

Jennifer Okoro-Thompson

fgood

Ria Carrogan

Gemma Gratton

rcafagafar

Terry Phillips

sadam

Sandeep Cheema

Maria Marrinan

Daniel Kelly (he/him)

Nicola Rodgers

rcafRjohn

Daniel Kelly

Saskia Pemberton

Natalie Wyatt

Dani Spadavecchia (she/her)

Penfold, Hannah

Sarah Rothera

John McGagh

Charlotte Cooklin

dlionetti

David Pitcher

rcafRjohn

rcafmmarrinan

Chris MacDonald

Sheena Webb

Gemma Banks

Sarah Parsons

Jacob Lund

gpointstudio

Emese

fizkes

Monkey Business Images

bbernard

Anna Kadulina

Pixel-Shot

Amanda Flower

Ria Carrogan

Thomas, Liz

Billy Marsh

Dawn Hodson

maria.calver

Maria Marrinan

Shaelyn Stout

Andrew Lamberti

Kitty Clark

Andrew Hyde

Dani Spadavecchia (she/her)

Jennifer Gibbon-Lynch

Lee Dales

fgood

John McGagh (he/him)

Vickie Clare (she/her)

HappyKids

SOL STOCK LTD

spb2015

SDI PRODUCTIONS

Fly View Productions

drazen_zigic

bymuratdeniz

Julia Dark

Alex Muntoni

James Jackson-Ellis

Hannah Lamb

James Jackson-Ellis

Ngoc Khanh Ha

Julie Brown

Jacky Tiotto

Nicola Blakebrough

Gardner, Matthew

Orme, Liam [NOMS]

rcafssheikh