🤖
hacktricks
  • 👾Welcome!
    • HackTricks
    • HackTricks Values & FAQ
    • About the author
  • 🤩Generic Methodologies & Resources
    • Pentesting Methodology
    • External Recon Methodology
      • Wide Source Code Search
      • Github Dorks & Leaks
    • Pentesting Network
      • DHCPv6
      • EIGRP Attacks
      • GLBP & HSRP Attacks
      • IDS and IPS Evasion
      • Lateral VLAN Segmentation Bypass
      • Network Protocols Explained (ESP)
      • Nmap Summary (ESP)
      • Pentesting IPv6
      • WebRTC DoS
      • Spoofing LLMNR, NBT-NS, mDNS/DNS and WPAD and Relay Attacks
      • Spoofing SSDP and UPnP Devices with EvilSSDP
    • Pentesting Wifi
      • Evil Twin EAP-TLS
    • Phishing Methodology
      • Clone a Website
      • Detecting Phishing
      • Phishing Files & Documents
    • Basic Forensic Methodology
      • Baseline Monitoring
      • Anti-Forensic Techniques
      • Docker Forensics
      • Image Acquisition & Mount
      • Linux Forensics
      • Malware Analysis
      • Memory dump analysis
        • Volatility - CheatSheet
      • Partitions/File Systems/Carving
        • File/Data Carving & Recovery Tools
      • Pcap Inspection
        • DNSCat pcap analysis
        • Suricata & Iptables cheatsheet
        • USB Keystrokes
        • Wifi Pcap Analysis
        • Wireshark tricks
      • Specific Software/File-Type Tricks
        • Decompile compiled python binaries (exe, elf) - Retreive from .pyc
        • Browser Artifacts
        • Deofuscation vbs (cscript.exe)
        • Local Cloud Storage
        • Office file analysis
        • PDF File analysis
        • PNG tricks
        • Video and Audio file analysis
        • ZIPs tricks
      • Windows Artifacts
        • Interesting Windows Registry Keys
    • Brute Force - CheatSheet
    • Python Sandbox Escape & Pyscript
      • Bypass Python sandboxes
        • LOAD_NAME / LOAD_CONST opcode OOB Read
      • Class Pollution (Python's Prototype Pollution)
      • Python Internal Read Gadgets
      • Pyscript
      • venv
      • Web Requests
      • Bruteforce hash (few chars)
      • Basic Python
    • Exfiltration
    • Tunneling and Port Forwarding
    • Threat Modeling
    • Search Exploits
    • Reverse Shells (Linux, Windows, MSFVenom)
      • MSFVenom - CheatSheet
      • Reverse Shells - Windows
      • Reverse Shells - Linux
      • Full TTYs
  • 🐧Linux Hardening
    • Checklist - Linux Privilege Escalation
    • Linux Privilege Escalation
      • Arbitrary File Write to Root
      • Cisco - vmanage
      • Containerd (ctr) Privilege Escalation
      • D-Bus Enumeration & Command Injection Privilege Escalation
      • Docker Security
        • Abusing Docker Socket for Privilege Escalation
        • AppArmor
        • AuthZ& AuthN - Docker Access Authorization Plugin
        • CGroups
        • Docker --privileged
        • Docker Breakout / Privilege Escalation
          • release_agent exploit - Relative Paths to PIDs
          • Docker release_agent cgroups escape
          • Sensitive Mounts
        • Namespaces
          • CGroup Namespace
          • IPC Namespace
          • PID Namespace
          • Mount Namespace
          • Network Namespace
          • Time Namespace
          • User Namespace
          • UTS Namespace
        • Seccomp
        • Weaponizing Distroless
      • Escaping from Jails
      • euid, ruid, suid
      • Interesting Groups - Linux Privesc
        • lxd/lxc Group - Privilege escalation
      • Logstash
      • ld.so privesc exploit example
      • Linux Active Directory
      • Linux Capabilities
      • NFS no_root_squash/no_all_squash misconfiguration PE
      • Node inspector/CEF debug abuse
      • Payloads to execute
      • RunC Privilege Escalation
      • SELinux
      • Socket Command Injection
      • Splunk LPE and Persistence
      • SSH Forward Agent exploitation
      • Wildcards Spare tricks
    • Useful Linux Commands
    • Bypass Linux Restrictions
      • Bypass FS protections: read-only / no-exec / Distroless
        • DDexec / EverythingExec
    • Linux Environment Variables
    • Linux Post-Exploitation
      • PAM - Pluggable Authentication Modules
    • FreeIPA Pentesting
  • 🍏MacOS Hardening
    • macOS Security & Privilege Escalation
      • macOS Apps - Inspecting, debugging and Fuzzing
        • Objects in memory
        • Introduction to x64
        • Introduction to ARM64v8
      • macOS AppleFS
      • macOS Bypassing Firewalls
      • macOS Defensive Apps
      • macOS GCD - Grand Central Dispatch
      • macOS Kernel & System Extensions
        • macOS IOKit
        • macOS Kernel Extensions & Debugging
        • macOS Kernel Vulnerabilities
        • macOS System Extensions
      • macOS Network Services & Protocols
      • macOS File Extension & URL scheme app handlers
      • macOS Files, Folders, Binaries & Memory
        • macOS Bundles
        • macOS Installers Abuse
        • macOS Memory Dumping
        • macOS Sensitive Locations & Interesting Daemons
        • macOS Universal binaries & Mach-O Format
      • macOS Objective-C
      • macOS Privilege Escalation
      • macOS Process Abuse
        • macOS Dirty NIB
        • macOS Chromium Injection
        • macOS Electron Applications Injection
        • macOS Function Hooking
        • macOS IPC - Inter Process Communication
          • macOS MIG - Mach Interface Generator
          • macOS XPC
            • macOS XPC Authorization
            • macOS XPC Connecting Process Check
              • macOS PID Reuse
              • macOS xpc_connection_get_audit_token Attack
          • macOS Thread Injection via Task port
        • macOS Java Applications Injection
        • macOS Library Injection
          • macOS Dyld Hijacking & DYLD_INSERT_LIBRARIES
          • macOS Dyld Process
        • macOS Perl Applications Injection
        • macOS Python Applications Injection
        • macOS Ruby Applications Injection
        • macOS .Net Applications Injection
      • macOS Security Protections
        • macOS Gatekeeper / Quarantine / XProtect
        • macOS Launch/Environment Constraints & Trust Cache
        • macOS Sandbox
          • macOS Default Sandbox Debug
          • macOS Sandbox Debug & Bypass
            • macOS Office Sandbox Bypasses
        • macOS Authorizations DB & Authd
        • macOS SIP
        • macOS TCC
          • macOS Apple Events
          • macOS TCC Bypasses
            • macOS Apple Scripts
          • macOS TCC Payloads
        • macOS Dangerous Entitlements & TCC perms
        • macOS - AMFI - AppleMobileFileIntegrity
        • macOS MACF - Mandatory Access Control Framework
        • macOS Code Signing
        • macOS FS Tricks
          • macOS xattr-acls extra stuff
      • macOS Users & External Accounts
    • macOS Red Teaming
      • macOS MDM
        • Enrolling Devices in Other Organisations
        • macOS Serial Number
      • macOS Keychain
    • macOS Useful Commands
    • macOS Auto Start
  • 🪟Windows Hardening
    • Checklist - Local Windows Privilege Escalation
    • Windows Local Privilege Escalation
      • Abusing Tokens
      • Access Tokens
      • ACLs - DACLs/SACLs/ACEs
      • AppendData/AddSubdirectory permission over service registry
      • Create MSI with WIX
      • COM Hijacking
      • Dll Hijacking
        • Writable Sys Path +Dll Hijacking Privesc
      • DPAPI - Extracting Passwords
      • From High Integrity to SYSTEM with Name Pipes
      • Integrity Levels
      • JuicyPotato
      • Leaked Handle Exploitation
      • MSI Wrapper
      • Named Pipe Client Impersonation
      • Privilege Escalation with Autoruns
      • RoguePotato, PrintSpoofer, SharpEfsPotato, GodPotato
      • SeDebug + SeImpersonate copy token
      • SeImpersonate from High To System
      • Windows C Payloads
    • Active Directory Methodology
      • Abusing Active Directory ACLs/ACEs
        • Shadow Credentials
      • AD Certificates
        • AD CS Account Persistence
        • AD CS Domain Escalation
        • AD CS Domain Persistence
        • AD CS Certificate Theft
      • AD information in printers
      • AD DNS Records
      • ASREPRoast
      • BloodHound & Other AD Enum Tools
      • Constrained Delegation
      • Custom SSP
      • DCShadow
      • DCSync
      • Diamond Ticket
      • DSRM Credentials
      • External Forest Domain - OneWay (Inbound) or bidirectional
      • External Forest Domain - One-Way (Outbound)
      • Golden Ticket
      • Kerberoast
      • Kerberos Authentication
      • Kerberos Double Hop Problem
      • LAPS
      • MSSQL AD Abuse
      • Over Pass the Hash/Pass the Key
      • Pass the Ticket
      • Password Spraying / Brute Force
      • PrintNightmare
      • Force NTLM Privileged Authentication
      • Privileged Groups
      • RDP Sessions Abuse
      • Resource-based Constrained Delegation
      • Security Descriptors
      • SID-History Injection
      • Silver Ticket
      • Skeleton Key
      • Unconstrained Delegation
    • Windows Security Controls
      • UAC - User Account Control
    • NTLM
      • Places to steal NTLM creds
    • Lateral Movement
      • AtExec / SchtasksExec
      • DCOM Exec
      • PsExec/Winexec/ScExec
      • SmbExec/ScExec
      • WinRM
      • WmiExec
    • Pivoting to the Cloud
    • Stealing Windows Credentials
      • Windows Credentials Protections
      • Mimikatz
      • WTS Impersonator
    • Basic Win CMD for Pentesters
    • Basic PowerShell for Pentesters
      • PowerView/SharpView
    • Antivirus (AV) Bypass
  • 📱Mobile Pentesting
    • Android APK Checklist
    • Android Applications Pentesting
      • Android Applications Basics
      • Android Task Hijacking
      • ADB Commands
      • APK decompilers
      • AVD - Android Virtual Device
      • Bypass Biometric Authentication (Android)
      • content:// protocol
      • Drozer Tutorial
        • Exploiting Content Providers
      • Exploiting a debuggeable application
      • Frida Tutorial
        • Frida Tutorial 1
        • Frida Tutorial 2
        • Frida Tutorial 3
        • Objection Tutorial
      • Google CTF 2018 - Shall We Play a Game?
      • Install Burp Certificate
      • Intent Injection
      • Make APK Accept CA Certificate
      • Manual DeObfuscation
      • React Native Application
      • Reversing Native Libraries
      • Smali - Decompiling/[Modifying]/Compiling
      • Spoofing your location in Play Store
      • Tapjacking
      • Webview Attacks
    • iOS Pentesting Checklist
    • iOS Pentesting
      • iOS App Extensions
      • iOS Basics
      • iOS Basic Testing Operations
      • iOS Burp Suite Configuration
      • iOS Custom URI Handlers / Deeplinks / Custom Schemes
      • iOS Extracting Entitlements From Compiled Application
      • iOS Frida Configuration
      • iOS Hooking With Objection
      • iOS Protocol Handlers
      • iOS Serialisation and Encoding
      • iOS Testing Environment
      • iOS UIActivity Sharing
      • iOS Universal Links
      • iOS UIPasteboard
      • iOS WebViews
    • Cordova Apps
    • Xamarin Apps
  • 👽Network Services Pentesting
    • Pentesting JDWP - Java Debug Wire Protocol
    • Pentesting Printers
    • Pentesting SAP
    • Pentesting VoIP
      • Basic VoIP Protocols
        • SIP (Session Initiation Protocol)
    • Pentesting Remote GdbServer
    • 7/tcp/udp - Pentesting Echo
    • 21 - Pentesting FTP
      • FTP Bounce attack - Scan
      • FTP Bounce - Download 2ºFTP file
    • 22 - Pentesting SSH/SFTP
    • 23 - Pentesting Telnet
    • 25,465,587 - Pentesting SMTP/s
      • SMTP Smuggling
      • SMTP - Commands
    • 43 - Pentesting WHOIS
    • 49 - Pentesting TACACS+
    • 53 - Pentesting DNS
    • 69/UDP TFTP/Bittorrent-tracker
    • 79 - Pentesting Finger
    • 80,443 - Pentesting Web Methodology
      • 403 & 401 Bypasses
      • AEM - Adobe Experience Cloud
      • Angular
      • Apache
      • Artifactory Hacking guide
      • Bolt CMS
      • Buckets
        • Firebase Database
      • CGI
      • DotNetNuke (DNN)
      • Drupal
        • Drupal RCE
      • Electron Desktop Apps
        • Electron contextIsolation RCE via preload code
        • Electron contextIsolation RCE via Electron internal code
        • Electron contextIsolation RCE via IPC
      • Flask
      • NodeJS Express
      • Git
      • Golang
      • GWT - Google Web Toolkit
      • Grafana
      • GraphQL
      • H2 - Java SQL database
      • IIS - Internet Information Services
      • ImageMagick Security
      • JBOSS
      • Jira & Confluence
      • Joomla
      • JSP
      • Laravel
      • Moodle
      • Nginx
      • NextJS
      • PHP Tricks
        • PHP - Useful Functions & disable_functions/open_basedir bypass
          • disable_functions bypass - php-fpm/FastCGI
          • disable_functions bypass - dl function
          • disable_functions bypass - PHP 7.0-7.4 (*nix only)
          • disable_functions bypass - Imagick <= 3.3.0 PHP >= 5.4 Exploit
          • disable_functions - PHP 5.x Shellshock Exploit
          • disable_functions - PHP 5.2.4 ionCube extension Exploit
          • disable_functions bypass - PHP <= 5.2.9 on windows
          • disable_functions bypass - PHP 5.2.4 and 5.2.5 PHP cURL
          • disable_functions bypass - PHP safe_mode bypass via proc_open() and custom environment Exploit
          • disable_functions bypass - PHP Perl Extension Safe_mode Bypass Exploit
          • disable_functions bypass - PHP 5.2.3 - Win32std ext Protections Bypass
          • disable_functions bypass - PHP 5.2 - FOpen Exploit
          • disable_functions bypass - via mem
          • disable_functions bypass - mod_cgi
          • disable_functions bypass - PHP 4 >= 4.2.0, PHP 5 pcntl_exec
        • PHP - RCE abusing object creation: new $_GET["a"]($_GET["b"])
        • PHP SSRF
      • PrestaShop
      • Python
      • Rocket Chat
      • Special HTTP headers
      • Source code Review / SAST Tools
      • Spring Actuators
      • Symfony
      • Tomcat
        • Basic Tomcat Info
      • Uncovering CloudFlare
      • VMWare (ESX, VCenter...)
      • Web API Pentesting
      • WebDav
      • Werkzeug / Flask Debug
      • Wordpress
    • 88tcp/udp - Pentesting Kerberos
      • Harvesting tickets from Windows
      • Harvesting tickets from Linux
    • 110,995 - Pentesting POP
    • 111/TCP/UDP - Pentesting Portmapper
    • 113 - Pentesting Ident
    • 123/udp - Pentesting NTP
    • 135, 593 - Pentesting MSRPC
    • 137,138,139 - Pentesting NetBios
    • 139,445 - Pentesting SMB
      • rpcclient enumeration
    • 143,993 - Pentesting IMAP
    • 161,162,10161,10162/udp - Pentesting SNMP
      • Cisco SNMP
      • SNMP RCE
    • 194,6667,6660-7000 - Pentesting IRC
    • 264 - Pentesting Check Point FireWall-1
    • 389, 636, 3268, 3269 - Pentesting LDAP
    • 500/udp - Pentesting IPsec/IKE VPN
    • 502 - Pentesting Modbus
    • 512 - Pentesting Rexec
    • 513 - Pentesting Rlogin
    • 514 - Pentesting Rsh
    • 515 - Pentesting Line Printer Daemon (LPD)
    • 548 - Pentesting Apple Filing Protocol (AFP)
    • 554,8554 - Pentesting RTSP
    • 623/UDP/TCP - IPMI
    • 631 - Internet Printing Protocol(IPP)
    • 700 - Pentesting EPP
    • 873 - Pentesting Rsync
    • 1026 - Pentesting Rusersd
    • 1080 - Pentesting Socks
    • 1098/1099/1050 - Pentesting Java RMI - RMI-IIOP
    • 1414 - Pentesting IBM MQ
    • 1433 - Pentesting MSSQL - Microsoft SQL Server
      • Types of MSSQL Users
    • 1521,1522-1529 - Pentesting Oracle TNS Listener
    • 1723 - Pentesting PPTP
    • 1883 - Pentesting MQTT (Mosquitto)
    • 2049 - Pentesting NFS Service
    • 2301,2381 - Pentesting Compaq/HP Insight Manager
    • 2375, 2376 Pentesting Docker
    • 3128 - Pentesting Squid
    • 3260 - Pentesting ISCSI
    • 3299 - Pentesting SAPRouter
    • 3306 - Pentesting Mysql
    • 3389 - Pentesting RDP
    • 3632 - Pentesting distcc
    • 3690 - Pentesting Subversion (svn server)
    • 3702/UDP - Pentesting WS-Discovery
    • 4369 - Pentesting Erlang Port Mapper Daemon (epmd)
    • 4786 - Cisco Smart Install
    • 4840 - OPC Unified Architecture
    • 5000 - Pentesting Docker Registry
    • 5353/UDP Multicast DNS (mDNS) and DNS-SD
    • 5432,5433 - Pentesting Postgresql
    • 5439 - Pentesting Redshift
    • 5555 - Android Debug Bridge
    • 5601 - Pentesting Kibana
    • 5671,5672 - Pentesting AMQP
    • 5800,5801,5900,5901 - Pentesting VNC
    • 5984,6984 - Pentesting CouchDB
    • 5985,5986 - Pentesting WinRM
    • 5985,5986 - Pentesting OMI
    • 6000 - Pentesting X11
    • 6379 - Pentesting Redis
    • 8009 - Pentesting Apache JServ Protocol (AJP)
    • 8086 - Pentesting InfluxDB
    • 8089 - Pentesting Splunkd
    • 8333,18333,38333,18444 - Pentesting Bitcoin
    • 9000 - Pentesting FastCGI
    • 9001 - Pentesting HSQLDB
    • 9042/9160 - Pentesting Cassandra
    • 9100 - Pentesting Raw Printing (JetDirect, AppSocket, PDL-datastream)
    • 9200 - Pentesting Elasticsearch
    • 10000 - Pentesting Network Data Management Protocol (ndmp)
    • 11211 - Pentesting Memcache
      • Memcache Commands
    • 15672 - Pentesting RabbitMQ Management
    • 24007,24008,24009,49152 - Pentesting GlusterFS
    • 27017,27018 - Pentesting MongoDB
    • 44134 - Pentesting Tiller (Helm)
    • 44818/UDP/TCP - Pentesting EthernetIP
    • 47808/udp - Pentesting BACNet
    • 50030,50060,50070,50075,50090 - Pentesting Hadoop
  • 🕸️Pentesting Web
    • Web Vulnerabilities Methodology
    • Reflecting Techniques - PoCs and Polygloths CheatSheet
      • Web Vulns List
    • 2FA/MFA/OTP Bypass
    • Account Takeover
    • Browser Extension Pentesting Methodology
      • BrowExt - ClickJacking
      • BrowExt - permissions & host_permissions
      • BrowExt - XSS Example
    • Bypass Payment Process
    • Captcha Bypass
    • Cache Poisoning and Cache Deception
      • Cache Poisoning via URL discrepancies
      • Cache Poisoning to DoS
    • Clickjacking
    • Client Side Template Injection (CSTI)
    • Client Side Path Traversal
    • Command Injection
    • Content Security Policy (CSP) Bypass
      • CSP bypass: self + 'unsafe-inline' with Iframes
    • Cookies Hacking
      • Cookie Tossing
      • Cookie Jar Overflow
      • Cookie Bomb
    • CORS - Misconfigurations & Bypass
    • CRLF (%0D%0A) Injection
    • CSRF (Cross Site Request Forgery)
    • Dangling Markup - HTML scriptless injection
      • SS-Leaks
    • Dependency Confusion
    • Deserialization
      • NodeJS - __proto__ & prototype Pollution
        • Client Side Prototype Pollution
        • Express Prototype Pollution Gadgets
        • Prototype Pollution to RCE
      • Java JSF ViewState (.faces) Deserialization
      • Java DNS Deserialization, GadgetProbe and Java Deserialization Scanner
      • Basic Java Deserialization (ObjectInputStream, readObject)
      • PHP - Deserialization + Autoload Classes
      • CommonsCollection1 Payload - Java Transformers to Rutime exec() and Thread Sleep
      • Basic .Net deserialization (ObjectDataProvider gadget, ExpandedWrapper, and Json.Net)
      • Exploiting __VIEWSTATE knowing the secrets
      • Exploiting __VIEWSTATE without knowing the secrets
      • Python Yaml Deserialization
      • JNDI - Java Naming and Directory Interface & Log4Shell
      • Ruby Class Pollution
    • Domain/Subdomain takeover
    • Email Injections
    • File Inclusion/Path traversal
      • phar:// deserialization
      • LFI2RCE via PHP Filters
      • LFI2RCE via Nginx temp files
      • LFI2RCE via PHP_SESSION_UPLOAD_PROGRESS
      • LFI2RCE via Segmentation Fault
      • LFI2RCE via phpinfo()
      • LFI2RCE Via temp file uploads
      • LFI2RCE via Eternal waiting
      • LFI2RCE Via compress.zlib + PHP_STREAM_PREFER_STUDIO + Path Disclosure
    • File Upload
      • PDF Upload - XXE and CORS bypass
    • Formula/CSV/Doc/LaTeX/GhostScript Injection
    • gRPC-Web Pentest
    • HTTP Connection Contamination
    • HTTP Connection Request Smuggling
    • HTTP Request Smuggling / HTTP Desync Attack
      • Browser HTTP Request Smuggling
      • Request Smuggling in HTTP/2 Downgrades
    • HTTP Response Smuggling / Desync
    • Upgrade Header Smuggling
    • hop-by-hop headers
    • IDOR
    • JWT Vulnerabilities (Json Web Tokens)
    • LDAP Injection
    • Login Bypass
      • Login bypass List
    • NoSQL injection
    • OAuth to Account takeover
    • Open Redirect
    • ORM Injection
    • Parameter Pollution
    • Phone Number Injections
    • PostMessage Vulnerabilities
      • Blocking main page to steal postmessage
      • Bypassing SOP with Iframes - 1
      • Bypassing SOP with Iframes - 2
      • Steal postmessage modifying iframe location
    • Proxy / WAF Protections Bypass
    • Race Condition
    • Rate Limit Bypass
    • Registration & Takeover Vulnerabilities
    • Regular expression Denial of Service - ReDoS
    • Reset/Forgotten Password Bypass
    • Reverse Tab Nabbing
    • SAML Attacks
      • SAML Basics
    • Server Side Inclusion/Edge Side Inclusion Injection
    • SQL Injection
      • MS Access SQL Injection
      • MSSQL Injection
      • MySQL injection
        • MySQL File priv to SSRF/RCE
      • Oracle injection
      • Cypher Injection (neo4j)
      • PostgreSQL injection
        • dblink/lo_import data exfiltration
        • PL/pgSQL Password Bruteforce
        • Network - Privesc, Port Scanner and NTLM chanllenge response disclosure
        • Big Binary Files Upload (PostgreSQL)
        • RCE with PostgreSQL Languages
        • RCE with PostgreSQL Extensions
      • SQLMap - CheatSheet
        • Second Order Injection - SQLMap
    • SSRF (Server Side Request Forgery)
      • URL Format Bypass
      • SSRF Vulnerable Platforms
      • Cloud SSRF
    • SSTI (Server Side Template Injection)
      • EL - Expression Language
      • Jinja2 SSTI
    • Timing Attacks
    • Unicode Injection
      • Unicode Normalization
    • UUID Insecurities
    • WebSocket Attacks
    • Web Tool - WFuzz
    • XPATH injection
    • XSLT Server Side Injection (Extensible Stylesheet Language Transformations)
    • XXE - XEE - XML External Entity
    • XSS (Cross Site Scripting)
      • Abusing Service Workers
      • Chrome Cache to XSS
      • Debugging Client Side JS
      • Dom Clobbering
      • DOM Invader
      • DOM XSS
      • Iframes in XSS, CSP and SOP
      • Integer Overflow
      • JS Hoisting
      • Misc JS Tricks & Relevant Info
      • PDF Injection
      • Server Side XSS (Dynamic PDF)
      • Shadow DOM
      • SOME - Same Origin Method Execution
      • Sniff Leak
      • Steal Info JS
      • XSS in Markdown
    • XSSI (Cross-Site Script Inclusion)
    • XS-Search/XS-Leaks
      • Connection Pool Examples
      • Connection Pool by Destination Example
      • Cookie Bomb + Onerror XS Leak
      • URL Max Length - Client Side
      • performance.now example
      • performance.now + Force heavy task
      • Event Loop Blocking + Lazy images
      • JavaScript Execution XS Leak
      • CSS Injection
        • CSS Injection Code
    • Iframe Traps
  • ⛈️Cloud Security
    • Pentesting Kubernetes
    • Pentesting Cloud (AWS, GCP, Az...)
    • Pentesting CI/CD (Github, Jenkins, Terraform...)
  • 😎Hardware/Physical Access
    • Physical Attacks
    • Escaping from KIOSKs
    • Firmware Analysis
      • Bootloader testing
      • Firmware Integrity
  • 🎯Binary Exploitation
    • Basic Stack Binary Exploitation Methodology
      • ELF Basic Information
      • Exploiting Tools
        • PwnTools
    • Stack Overflow
      • Pointer Redirecting
      • Ret2win
        • Ret2win - arm64
      • Stack Shellcode
        • Stack Shellcode - arm64
      • Stack Pivoting - EBP2Ret - EBP chaining
      • Uninitialized Variables
    • ROP - Return Oriented Programing
      • BROP - Blind Return Oriented Programming
      • Ret2csu
      • Ret2dlresolve
      • Ret2esp / Ret2reg
      • Ret2lib
        • Leaking libc address with ROP
          • Leaking libc - template
        • One Gadget
        • Ret2lib + Printf leak - arm64
      • Ret2syscall
        • Ret2syscall - ARM64
      • Ret2vDSO
      • SROP - Sigreturn-Oriented Programming
        • SROP - ARM64
    • Array Indexing
    • Integer Overflow
    • Format Strings
      • Format Strings - Arbitrary Read Example
      • Format Strings Template
    • Libc Heap
      • Bins & Memory Allocations
      • Heap Memory Functions
        • free
        • malloc & sysmalloc
        • unlink
        • Heap Functions Security Checks
      • Use After Free
        • First Fit
      • Double Free
      • Overwriting a freed chunk
      • Heap Overflow
      • Unlink Attack
      • Fast Bin Attack
      • Unsorted Bin Attack
      • Large Bin Attack
      • Tcache Bin Attack
      • Off by one overflow
      • House of Spirit
      • House of Lore | Small bin Attack
      • House of Einherjar
      • House of Force
      • House of Orange
      • House of Rabbit
      • House of Roman
    • Common Binary Exploitation Protections & Bypasses
      • ASLR
        • Ret2plt
        • Ret2ret & Reo2pop
      • CET & Shadow Stack
      • Libc Protections
      • Memory Tagging Extension (MTE)
      • No-exec / NX
      • PIE
        • BF Addresses in the Stack
      • Relro
      • Stack Canaries
        • BF Forked & Threaded Stack Canaries
        • Print Stack Canary
    • Write What Where 2 Exec
      • WWW2Exec - atexit()
      • WWW2Exec - .dtors & .fini_array
      • WWW2Exec - GOT/PLT
      • WWW2Exec - __malloc_hook & __free_hook
    • Common Exploiting Problems
    • Windows Exploiting (Basic Guide - OSCP lvl)
    • iOS Exploiting
  • 🔩Reversing
    • Reversing Tools & Basic Methods
      • Angr
        • Angr - Examples
      • Z3 - Satisfiability Modulo Theories (SMT)
      • Cheat Engine
      • Blobrunner
    • Common API used in Malware
    • Word Macros
  • 🔮Crypto & Stego
    • Cryptographic/Compression Algorithms
      • Unpacking binaries
    • Certificates
    • Cipher Block Chaining CBC-MAC
    • Crypto CTFs Tricks
    • Electronic Code Book (ECB)
    • Hash Length Extension Attack
    • Padding Oracle
    • RC4 - Encrypt&Decrypt
    • Stego Tricks
    • Esoteric languages
    • Blockchain & Crypto Currencies
  • 🦂C2
    • Salseo
    • ICMPsh
    • Cobalt Strike
  • ✍️TODO
    • Other Big References
    • Rust Basics
    • More Tools
    • MISC
    • Pentesting DNS
    • Hardware Hacking
      • I2C
      • UART
      • Radio
      • JTAG
      • SPI
    • Industrial Control Systems Hacking
      • Modbus Protocol
    • Radio Hacking
      • Pentesting RFID
      • Infrared
      • Sub-GHz RF
      • iButton
      • Flipper Zero
        • FZ - NFC
        • FZ - Sub-GHz
        • FZ - Infrared
        • FZ - iButton
        • FZ - 125kHz RFID
      • Proxmark 3
      • FISSURE - The RF Framework
      • Low-Power Wide Area Network
      • Pentesting BLE - Bluetooth Low Energy
    • Industrial Control Systems Hacking
    • Test LLMs
    • LLM Training
      • 0. Basic LLM Concepts
      • 1. Tokenizing
      • 2. Data Sampling
      • 3. Token Embeddings
      • 4. Attention Mechanisms
      • 5. LLM Architecture
      • 6. Pre-training & Loading models
      • 7.0. LoRA Improvements in fine-tuning
      • 7.1. Fine-Tuning for Classification
      • 7.2. Fine-Tuning to follow instructions
    • Burp Suite
    • Other Web Tricks
    • Interesting HTTP
    • Android Forensics
    • TR-069
    • 6881/udp - Pentesting BitTorrent
    • Online Platforms with API
    • Stealing Sensitive Information Disclosure from a Web
    • Post Exploitation
    • Investment Terms
    • Cookies Policy
Powered by GitBook
On this page
  • LLM Architecture
  • Code representation
  • GELU Activation Function
  • FeedForward Neural Network
  • Multi-Head Attention Mechanism
  • Layer Normalization
  • Transformer Block
  • GPTModel
  • Number of Parameters to train
  • Step-by-Step Calculation
  • Generate Text
  • References
Edit on GitHub
  1. TODO
  2. LLM Training

5. LLM Architecture

Previous4. Attention MechanismsNext6. Pre-training & Loading models

Last updated 7 months ago

LLM Architecture

The goal of this fifth phase is very simple: Develop the architecture of the full LLM. Put everything together, apply all the layers and create all the functions to generate text or transform text to IDs and backwards.

This architecture will be used for both, training and predicting text after it was trained.

LLM architecture example from :

A high level representation can be observed in:

  1. Input (Tokenized Text): The process begins with tokenized text, which is converted into numerical representations.

  2. Token Embedding and Positional Embedding Layer: The tokenized text is passed through a token embedding layer and a positional embedding layer, which captures the position of tokens in a sequence, critical for understanding word order.

  3. Transformer Blocks: The model contains 12 transformer blocks, each with multiple layers. These blocks repeat the following sequence:

    • Masked Multi-Head Attention: Allows the model to focus on different parts of the input text at once.

    • Layer Normalization: A normalization step to stabilize and improve training.

    • Feed Forward Layer: Responsible for processing the information from the attention layer and making predictions about the next token.

    • Dropout Layers: These layers prevent overfitting by randomly dropping units during training.

  4. Final Output Layer: The model outputs a 4x50,257-dimensional tensor, where 50,257 represents the size of the vocabulary. Each row in this tensor corresponds to a vector that the model uses to predict the next word in the sequence.

  5. Goal: The objective is to take these embeddings and convert them back into text. Specifically, the last row of the output is used to generate the next word, represented as "forward" in this diagram.

Code representation

import torch
import torch.nn as nn
import tiktoken

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
        
        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"], 
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Shortcut connection for attention block
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        # Shortcut connection for feed forward block
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut  # Add the original input back

        return x

    
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Shape [batch_size, num_tokens, emb_size]
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
out = model(batch)
print("Input batch:\n", batch)
print("\nOutput shape:", out.shape)
print(out)

Let's explain it step by step:

GELU Activation Function

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))

Purpose and Functionality

  • GELU (Gaussian Error Linear Unit): An activation function that introduces non-linearity into the model.

  • Smooth Activation: Unlike ReLU, which zeroes out negative inputs, GELU smoothly maps inputs to outputs, allowing for small, non-zero values for negative inputs.

  • Mathematical Definition:

The goal of the use of this function after linear layers inside the FeedForward layer is to change the linear data to be none linear to allow the model to learn complex, non-linear relationships.

FeedForward Neural Network

Shapes have been added as comments to understand better the shapes of matrices:

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        # x shape: (batch_size, seq_len, emb_dim)
        
        x = self.layers[0](x)# x shape: (batch_size, seq_len, 4 * emb_dim)
        x = self.layers[1](x) # x shape remains: (batch_size, seq_len, 4 * emb_dim)
        x = self.layers[2](x) # x shape: (batch_size, seq_len, emb_dim)
        return x  # Output shape: (batch_size, seq_len, emb_dim)

Purpose and Functionality

  • Position-wise FeedForward Network: Applies a two-layer fully connected network to each position separately and identically.

  • Layer Details:

    • First Linear Layer: Expands the dimensionality from emb_dim to 4 * emb_dim.

    • GELU Activation: Applies non-linearity.

    • Second Linear Layer: Reduces the dimensionality back to emb_dim.

As you can see, the Feed Forward network uses 3 layers. The first one is a linear layer that will multiply the dimensions by 4 using linear weights (parameters to train inside the model). Then, the GELU function is used in all those dimensions to apply none-linear variations to capture richer representations and finally another linear layer is used to get back to the original size of dimensions.

Multi-Head Attention Mechanism

This was already explained in an earlier section.

Purpose and Functionality

  • Multi-Head Self-Attention: Allows the model to focus on different positions within the input sequence when encoding a token.

  • Key Components:

    • Queries, Keys, Values: Linear projections of the input, used to compute attention scores.

    • Heads: Multiple attention mechanisms running in parallel (num_heads), each with a reduced dimension (head_dim).

    • Attention Scores: Computed as the dot product of queries and keys, scaled and masked.

    • Masking: A causal mask is applied to prevent the model from attending to future tokens (important for autoregressive models like GPT).

    • Attention Weights: Softmax of the masked and scaled attention scores.

    • Context Vector: Weighted sum of the values, according to attention weights.

    • Output Projection: Linear layer to combine the outputs of all heads.

The goal of this network is to find the relations between tokens in the same context. Moreover, the tokens are divided in different heads in order to prevent overfitting although the final relations found per head are combined at the end of this network.

Moreover, during training a causal mask is applied so later tokens are not taken into account when looking the specific relations to a token and some dropout is also applied to prevent overfitting.

Layer Normalization

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5 # Prevent division by zero during normalization.
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

Purpose and Functionality

  • Layer Normalization: A technique used to normalize the inputs across the features (embedding dimensions) for each individual example in a batch.

  • Components:

    • eps: A small constant (1e-5) added to the variance to prevent division by zero during normalization.

    • scale and shift: Learnable parameters (nn.Parameter) that allow the model to scale and shift the normalized output. They are initialized to ones and zeros, respectively.

  • Normalization Process:

    • Compute Mean (mean): Calculates the mean of the input x across the embedding dimension (dim=-1), keeping the dimension for broadcasting (keepdim=True).

    • Compute Variance (var): Calculates the variance of x across the embedding dimension, also keeping the dimension. The unbiased=False parameter ensures that the variance is calculated using the biased estimator (dividing by N instead of N-1), which is appropriate when normalizing over features rather than samples.

    • Normalize (norm_x): Subtracts the mean from x and divides by the square root of the variance plus eps.

    • Scale and Shift: Applies the learnable scale and shift parameters to the normalized output.

The goal is to ensure a mean of 0 with a variance of 1 across all dimensions of the same token . The goal of this is to stabilize the training of deep neural networks by reducing the internal covariate shift, which refers to the change in the distribution of network activations due to the updating of parameters during training.

Transformer Block

Shapes have been added as comments to understand better the shapes of matrices:

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # x shape: (batch_size, seq_len, emb_dim)

        # Shortcut connection for attention block
        shortcut = x  # shape: (batch_size, seq_len, emb_dim)
        x = self.norm1(x)  # shape remains (batch_size, seq_len, emb_dim)
        x = self.att(x)    # shape: (batch_size, seq_len, emb_dim)
        x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
        x = x + shortcut   # shape: (batch_size, seq_len, emb_dim)

        # Shortcut connection for feedforward block
        shortcut = x       # shape: (batch_size, seq_len, emb_dim)
        x = self.norm2(x)  # shape remains (batch_size, seq_len, emb_dim)
        x = self.ff(x)     # shape: (batch_size, seq_len, emb_dim)
        x = self.drop_shortcut(x)  # shape remains (batch_size, seq_len, emb_dim)
        x = x + shortcut   # shape: (batch_size, seq_len, emb_dim)

        return x  # Output shape: (batch_size, seq_len, emb_dim)

Purpose and Functionality

  • Composition of Layers: Combines multi-head attention, feedforward network, layer normalization, and residual connections.

  • Layer Normalization: Applied before the attention and feedforward layers for stable training.

  • Residual Connections (Shortcuts): Add the input of a layer to its output to improve gradient flow and enable training of deep networks.

  • Dropout: Applied after attention and feedforward layers for regularization.

Step-by-Step Functionality

  1. First Residual Path (Self-Attention):

    • Input (shortcut): Save the original input for the residual connection.

    • Layer Norm (norm1): Normalize the input.

    • Multi-Head Attention (att): Apply self-attention.

    • Dropout (drop_shortcut): Apply dropout for regularization.

    • Add Residual (x + shortcut): Combine with the original input.

  2. Second Residual Path (FeedForward):

    • Input (shortcut): Save the updated input for the next residual connection.

    • Layer Norm (norm2): Normalize the input.

    • FeedForward Network (ff): Apply the feedforward transformation.

    • Dropout (drop_shortcut): Apply dropout.

    • Add Residual (x + shortcut): Combine with the input from the first residual path.

The transformer block groups all the networks together and applies some normalization and dropouts to improve the training stability and results. Note how dropouts are done after the use of each network while normalization is applied before.

Moreover, it also uses shortcuts which consists on adding the output of a network with its input. This helps to prevent the vanishing gradient problem by making sure that initial layers contribute "as much" as the last ones.

GPTModel

Shapes have been added as comments to understand better the shapes of matrices:

# From https://github.com/rasbt/LLMs-from-scratch/tree/main/ch04
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        # shape: (vocab_size, emb_dim)

        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # shape: (context_length, emb_dim)

        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        # Stack of TransformerBlocks

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)
        # shape: (emb_dim, vocab_size)

    def forward(self, in_idx):
        # in_idx shape: (batch_size, seq_len)
        batch_size, seq_len = in_idx.shape

        # Token embeddings
        tok_embeds = self.tok_emb(in_idx)
        # shape: (batch_size, seq_len, emb_dim)

        # Positional embeddings
        pos_indices = torch.arange(seq_len, device=in_idx.device)
        # shape: (seq_len,)
        pos_embeds = self.pos_emb(pos_indices)
        # shape: (seq_len, emb_dim)

        # Add token and positional embeddings
        x = tok_embeds + pos_embeds  # Broadcasting over batch dimension
        # x shape: (batch_size, seq_len, emb_dim)

        x = self.drop_emb(x)  # Dropout applied
        # x shape remains: (batch_size, seq_len, emb_dim)

        x = self.trf_blocks(x)  # Pass through Transformer blocks
        # x shape remains: (batch_size, seq_len, emb_dim)

        x = self.final_norm(x)  # Final LayerNorm
        # x shape remains: (batch_size, seq_len, emb_dim)

        logits = self.out_head(x)  # Project to vocabulary size
        # logits shape: (batch_size, seq_len, vocab_size)

        return logits  # Output shape: (batch_size, seq_len, vocab_size)

Purpose and Functionality

  • Embedding Layers:

    • Token Embeddings (tok_emb): Converts token indices into embeddings. As reminder, these are the weights given to each dimension of each token in the vocabulary.

    • Positional Embeddings (pos_emb): Adds positional information to the embeddings to capture the order of tokens. As reminder, these are the weights given to token according to it's position in the text.

  • Dropout (drop_emb): Applied to embeddings for regularisation.

  • Transformer Blocks (trf_blocks): Stack of n_layers transformer blocks to process embeddings.

  • Final Normalization (final_norm): Layer normalization before the output layer.

  • Output Layer (out_head): Projects the final hidden states to the vocabulary size to produce logits for prediction.

The goal of this class is to use all the other mentioned networks to predict the next token in a sequence, which is fundamental for tasks like text generation.

Note how it will use as many transformer blocks as indicated and that each transformer block is using one multi-head attestation net, one feed forward net and several normalizations. So if 12 transformer blocks are used, multiply this by 12.

Moreover, a normalization layer is added before the output and a final linear layer is applied a the end to get the results with the proper dimensions. Note how each final vector has the size of the used vocabulary. This is because it's trying to get a probability per possible token inside the vocabulary.

Number of Parameters to train

Having the GPT structure defined it's possible to find out the number of parameters to train:

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

model = GPTModel(GPT_CONFIG_124M)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
# Total number of parameters: 163,009,536

Step-by-Step Calculation

1. Embedding Layers: Token Embedding & Position Embedding

  • Layer: nn.Embedding(vocab_size, emb_dim)

  • Parameters: vocab_size * emb_dim

token_embedding_params = 50257 * 768 = 38,597,376
  • Layer: nn.Embedding(context_length, emb_dim)

  • Parameters: context_length * emb_dim

position_embedding_params = 1024 * 768 = 786,432

Total Embedding Parameters

embedding_params = token_embedding_params + position_embedding_params
embedding_params = 38,597,376 + 786,432 = 39,383,808

2. Transformer Blocks

There are 12 transformer blocks, so we'll calculate the parameters for one block and then multiply by 12.

Parameters per Transformer Block

a. Multi-Head Attention

  • Components:

    • Query Linear Layer (W_query): nn.Linear(emb_dim, emb_dim, bias=False)

    • Key Linear Layer (W_key): nn.Linear(emb_dim, emb_dim, bias=False)

    • Value Linear Layer (W_value): nn.Linear(emb_dim, emb_dim, bias=False)

    • Output Projection (out_proj): nn.Linear(emb_dim, emb_dim)

  • Calculations:

    • Each of W_query, W_key, W_value:

      qkv_params = emb_dim * emb_dim = 768 * 768 = 589,824

      Since there are three such layers:

      total_qkv_params = 3 * qkv_params = 3 * 589,824 = 1,769,472
    • Output Projection (out_proj):

      out_proj_params = (emb_dim * emb_dim) + emb_dim = (768 * 768) + 768 = 589,824 + 768 = 590,592
    • Total Multi-Head Attention Parameters:

      mha_params = total_qkv_params + out_proj_params
      mha_params = 1,769,472 + 590,592 = 2,360,064

b. FeedForward Network

  • Components:

    • First Linear Layer: nn.Linear(emb_dim, 4 * emb_dim)

    • Second Linear Layer: nn.Linear(4 * emb_dim, emb_dim)

  • Calculations:

    • First Linear Layer:

      ff_first_layer_params = (emb_dim * 4 * emb_dim) + (4 * emb_dim)
      ff_first_layer_params = (768 * 3072) + 3072 = 2,359,296 + 3,072 = 2,362,368
    • Second Linear Layer:

      ff_second_layer_params = (4 * emb_dim * emb_dim) + emb_dim
      ff_second_layer_params = (3072 * 768) + 768 = 2,359,296 + 768 = 2,360,064
    • Total FeedForward Parameters:

      ff_params = ff_first_layer_params + ff_second_layer_params
      ff_params = 2,362,368 + 2,360,064 = 4,722,432

c. Layer Normalizations

  • Components:

    • Two LayerNorm instances per block.

    • Each LayerNorm has 2 * emb_dim parameters (scale and shift).

  • Calculations:

    pythonCopy codelayer_norm_params_per_block = 2 * (2 * emb_dim) = 2 * 768 * 2 = 3,072

d. Total Parameters per Transformer Block

pythonCopy codeparams_per_block = mha_params + ff_params + layer_norm_params_per_block
params_per_block = 2,360,064 + 4,722,432 + 3,072 = 7,085,568

Total Parameters for All Transformer Blocks

pythonCopy codetotal_transformer_blocks_params = params_per_block * n_layers
total_transformer_blocks_params = 7,085,568 * 12 = 85,026,816

3. Final Layers

a. Final Layer Normalization

  • Parameters: 2 * emb_dim (scale and shift)

pythonCopy codefinal_layer_norm_params = 2 * 768 = 1,536

b. Output Projection Layer (out_head)

  • Layer: nn.Linear(emb_dim, vocab_size, bias=False)

  • Parameters: emb_dim * vocab_size

pythonCopy codeoutput_projection_params = 768 * 50257 = 38,597,376

4. Summing Up All Parameters

pythonCopy codetotal_params = (
    embedding_params +
    total_transformer_blocks_params +
    final_layer_norm_params +
    output_projection_params
)
total_params = (
    39,383,808 +
    85,026,816 +
    1,536 +
    38,597,376
)
total_params = 163,009,536

Generate Text

Having a model that predicts the next token like the one before, it's just needed to take the last token values from the output (as they will be the ones of the predicted token), which will be a value per entry in the vocabulary and then use the softmax function to normalize the dimensions into probabilities that sums 1 and then get the index of the of the biggest entry, which will be the index of the word inside the vocabulary.

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx is (batch, n_tokens) array of indices in the current context
    for _ in range(max_new_tokens):
        
        # Crop current context if it exceeds the supported context size
        # E.g., if LLM supports only 5 tokens, and the context size is 10
        # then only the last 5 tokens are used as context
        idx_cond = idx[:, -context_size:]
        
        # Get the predictions
        with torch.no_grad():
            logits = model(idx_cond)
        
        # Focus only on the last time step
        # (batch, n_tokens, vocab_size) becomes (batch, vocab_size)
        logits = logits[:, -1, :]  

        # Apply softmax to get probabilities
        probas = torch.softmax(logits, dim=-1)  # (batch, vocab_size)

        # Get the idx of the vocab entry with the highest probability value
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # (batch, 1)

        # Append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch, n_tokens+1)

    return idx


start_context = "Hello, I am"

encoded = tokenizer.encode(start_context)
print("encoded:", encoded)

encoded_tensor = torch.tensor(encoded).unsqueeze(0)
print("encoded_tensor.shape:", encoded_tensor.shape)

model.eval() # disable dropout

out = generate_text_simple(
    model=model,
    idx=encoded_tensor, 
    max_new_tokens=6, 
    context_size=GPT_CONFIG_124M["context_length"]
)

print("Output:", out)
print("Output length:", len(out[0]))

References

Code from :

✍️
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb
https://www.manning.com/books/build-a-large-language-model-from-scratch
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch04/01_main-chapter-code/ch04.ipynb
https://camo.githubusercontent.com/6c8c392f72d5b9e86c94aeb9470beab435b888d24135926f1746eb88e0cc18fb/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830345f636f6d707265737365642f31332e776562703f31