🤖
hacktricks
  • 👾Welcome!
    • HackTricks
    • HackTricks Values & FAQ
    • About the author
  • 🤩Generic Methodologies & Resources
    • Pentesting Methodology
    • External Recon Methodology
      • Wide Source Code Search
      • Github Dorks & Leaks
    • Pentesting Network
      • DHCPv6
      • EIGRP Attacks
      • GLBP & HSRP Attacks
      • IDS and IPS Evasion
      • Lateral VLAN Segmentation Bypass
      • Network Protocols Explained (ESP)
      • Nmap Summary (ESP)
      • Pentesting IPv6
      • WebRTC DoS
      • Spoofing LLMNR, NBT-NS, mDNS/DNS and WPAD and Relay Attacks
      • Spoofing SSDP and UPnP Devices with EvilSSDP
    • Pentesting Wifi
      • Evil Twin EAP-TLS
    • Phishing Methodology
      • Clone a Website
      • Detecting Phishing
      • Phishing Files & Documents
    • Basic Forensic Methodology
      • Baseline Monitoring
      • Anti-Forensic Techniques
      • Docker Forensics
      • Image Acquisition & Mount
      • Linux Forensics
      • Malware Analysis
      • Memory dump analysis
        • Volatility - CheatSheet
      • Partitions/File Systems/Carving
        • File/Data Carving & Recovery Tools
      • Pcap Inspection
        • DNSCat pcap analysis
        • Suricata & Iptables cheatsheet
        • USB Keystrokes
        • Wifi Pcap Analysis
        • Wireshark tricks
      • Specific Software/File-Type Tricks
        • Decompile compiled python binaries (exe, elf) - Retreive from .pyc
        • Browser Artifacts
        • Deofuscation vbs (cscript.exe)
        • Local Cloud Storage
        • Office file analysis
        • PDF File analysis
        • PNG tricks
        • Video and Audio file analysis
        • ZIPs tricks
      • Windows Artifacts
        • Interesting Windows Registry Keys
    • Brute Force - CheatSheet
    • Python Sandbox Escape & Pyscript
      • Bypass Python sandboxes
        • LOAD_NAME / LOAD_CONST opcode OOB Read
      • Class Pollution (Python's Prototype Pollution)
      • Python Internal Read Gadgets
      • Pyscript
      • venv
      • Web Requests
      • Bruteforce hash (few chars)
      • Basic Python
    • Exfiltration
    • Tunneling and Port Forwarding
    • Threat Modeling
    • Search Exploits
    • Reverse Shells (Linux, Windows, MSFVenom)
      • MSFVenom - CheatSheet
      • Reverse Shells - Windows
      • Reverse Shells - Linux
      • Full TTYs
  • 🐧Linux Hardening
    • Checklist - Linux Privilege Escalation
    • Linux Privilege Escalation
      • Arbitrary File Write to Root
      • Cisco - vmanage
      • Containerd (ctr) Privilege Escalation
      • D-Bus Enumeration & Command Injection Privilege Escalation
      • Docker Security
        • Abusing Docker Socket for Privilege Escalation
        • AppArmor
        • AuthZ& AuthN - Docker Access Authorization Plugin
        • CGroups
        • Docker --privileged
        • Docker Breakout / Privilege Escalation
          • release_agent exploit - Relative Paths to PIDs
          • Docker release_agent cgroups escape
          • Sensitive Mounts
        • Namespaces
          • CGroup Namespace
          • IPC Namespace
          • PID Namespace
          • Mount Namespace
          • Network Namespace
          • Time Namespace
          • User Namespace
          • UTS Namespace
        • Seccomp
        • Weaponizing Distroless
      • Escaping from Jails
      • euid, ruid, suid
      • Interesting Groups - Linux Privesc
        • lxd/lxc Group - Privilege escalation
      • Logstash
      • ld.so privesc exploit example
      • Linux Active Directory
      • Linux Capabilities
      • NFS no_root_squash/no_all_squash misconfiguration PE
      • Node inspector/CEF debug abuse
      • Payloads to execute
      • RunC Privilege Escalation
      • SELinux
      • Socket Command Injection
      • Splunk LPE and Persistence
      • SSH Forward Agent exploitation
      • Wildcards Spare tricks
    • Useful Linux Commands
    • Bypass Linux Restrictions
      • Bypass FS protections: read-only / no-exec / Distroless
        • DDexec / EverythingExec
    • Linux Environment Variables
    • Linux Post-Exploitation
      • PAM - Pluggable Authentication Modules
    • FreeIPA Pentesting
  • 🍏MacOS Hardening
    • macOS Security & Privilege Escalation
      • macOS Apps - Inspecting, debugging and Fuzzing
        • Objects in memory
        • Introduction to x64
        • Introduction to ARM64v8
      • macOS AppleFS
      • macOS Bypassing Firewalls
      • macOS Defensive Apps
      • macOS GCD - Grand Central Dispatch
      • macOS Kernel & System Extensions
        • macOS IOKit
        • macOS Kernel Extensions & Debugging
        • macOS Kernel Vulnerabilities
        • macOS System Extensions
      • macOS Network Services & Protocols
      • macOS File Extension & URL scheme app handlers
      • macOS Files, Folders, Binaries & Memory
        • macOS Bundles
        • macOS Installers Abuse
        • macOS Memory Dumping
        • macOS Sensitive Locations & Interesting Daemons
        • macOS Universal binaries & Mach-O Format
      • macOS Objective-C
      • macOS Privilege Escalation
      • macOS Process Abuse
        • macOS Dirty NIB
        • macOS Chromium Injection
        • macOS Electron Applications Injection
        • macOS Function Hooking
        • macOS IPC - Inter Process Communication
          • macOS MIG - Mach Interface Generator
          • macOS XPC
            • macOS XPC Authorization
            • macOS XPC Connecting Process Check
              • macOS PID Reuse
              • macOS xpc_connection_get_audit_token Attack
          • macOS Thread Injection via Task port
        • macOS Java Applications Injection
        • macOS Library Injection
          • macOS Dyld Hijacking & DYLD_INSERT_LIBRARIES
          • macOS Dyld Process
        • macOS Perl Applications Injection
        • macOS Python Applications Injection
        • macOS Ruby Applications Injection
        • macOS .Net Applications Injection
      • macOS Security Protections
        • macOS Gatekeeper / Quarantine / XProtect
        • macOS Launch/Environment Constraints & Trust Cache
        • macOS Sandbox
          • macOS Default Sandbox Debug
          • macOS Sandbox Debug & Bypass
            • macOS Office Sandbox Bypasses
        • macOS Authorizations DB & Authd
        • macOS SIP
        • macOS TCC
          • macOS Apple Events
          • macOS TCC Bypasses
            • macOS Apple Scripts
          • macOS TCC Payloads
        • macOS Dangerous Entitlements & TCC perms
        • macOS - AMFI - AppleMobileFileIntegrity
        • macOS MACF - Mandatory Access Control Framework
        • macOS Code Signing
        • macOS FS Tricks
          • macOS xattr-acls extra stuff
      • macOS Users & External Accounts
    • macOS Red Teaming
      • macOS MDM
        • Enrolling Devices in Other Organisations
        • macOS Serial Number
      • macOS Keychain
    • macOS Useful Commands
    • macOS Auto Start
  • 🪟Windows Hardening
    • Checklist - Local Windows Privilege Escalation
    • Windows Local Privilege Escalation
      • Abusing Tokens
      • Access Tokens
      • ACLs - DACLs/SACLs/ACEs
      • AppendData/AddSubdirectory permission over service registry
      • Create MSI with WIX
      • COM Hijacking
      • Dll Hijacking
        • Writable Sys Path +Dll Hijacking Privesc
      • DPAPI - Extracting Passwords
      • From High Integrity to SYSTEM with Name Pipes
      • Integrity Levels
      • JuicyPotato
      • Leaked Handle Exploitation
      • MSI Wrapper
      • Named Pipe Client Impersonation
      • Privilege Escalation with Autoruns
      • RoguePotato, PrintSpoofer, SharpEfsPotato, GodPotato
      • SeDebug + SeImpersonate copy token
      • SeImpersonate from High To System
      • Windows C Payloads
    • Active Directory Methodology
      • Abusing Active Directory ACLs/ACEs
        • Shadow Credentials
      • AD Certificates
        • AD CS Account Persistence
        • AD CS Domain Escalation
        • AD CS Domain Persistence
        • AD CS Certificate Theft
      • AD information in printers
      • AD DNS Records
      • ASREPRoast
      • BloodHound & Other AD Enum Tools
      • Constrained Delegation
      • Custom SSP
      • DCShadow
      • DCSync
      • Diamond Ticket
      • DSRM Credentials
      • External Forest Domain - OneWay (Inbound) or bidirectional
      • External Forest Domain - One-Way (Outbound)
      • Golden Ticket
      • Kerberoast
      • Kerberos Authentication
      • Kerberos Double Hop Problem
      • LAPS
      • MSSQL AD Abuse
      • Over Pass the Hash/Pass the Key
      • Pass the Ticket
      • Password Spraying / Brute Force
      • PrintNightmare
      • Force NTLM Privileged Authentication
      • Privileged Groups
      • RDP Sessions Abuse
      • Resource-based Constrained Delegation
      • Security Descriptors
      • SID-History Injection
      • Silver Ticket
      • Skeleton Key
      • Unconstrained Delegation
    • Windows Security Controls
      • UAC - User Account Control
    • NTLM
      • Places to steal NTLM creds
    • Lateral Movement
      • AtExec / SchtasksExec
      • DCOM Exec
      • PsExec/Winexec/ScExec
      • SmbExec/ScExec
      • WinRM
      • WmiExec
    • Pivoting to the Cloud
    • Stealing Windows Credentials
      • Windows Credentials Protections
      • Mimikatz
      • WTS Impersonator
    • Basic Win CMD for Pentesters
    • Basic PowerShell for Pentesters
      • PowerView/SharpView
    • Antivirus (AV) Bypass
  • 📱Mobile Pentesting
    • Android APK Checklist
    • Android Applications Pentesting
      • Android Applications Basics
      • Android Task Hijacking
      • ADB Commands
      • APK decompilers
      • AVD - Android Virtual Device
      • Bypass Biometric Authentication (Android)
      • content:// protocol
      • Drozer Tutorial
        • Exploiting Content Providers
      • Exploiting a debuggeable application
      • Frida Tutorial
        • Frida Tutorial 1
        • Frida Tutorial 2
        • Frida Tutorial 3
        • Objection Tutorial
      • Google CTF 2018 - Shall We Play a Game?
      • Install Burp Certificate
      • Intent Injection
      • Make APK Accept CA Certificate
      • Manual DeObfuscation
      • React Native Application
      • Reversing Native Libraries
      • Smali - Decompiling/[Modifying]/Compiling
      • Spoofing your location in Play Store
      • Tapjacking
      • Webview Attacks
    • iOS Pentesting Checklist
    • iOS Pentesting
      • iOS App Extensions
      • iOS Basics
      • iOS Basic Testing Operations
      • iOS Burp Suite Configuration
      • iOS Custom URI Handlers / Deeplinks / Custom Schemes
      • iOS Extracting Entitlements From Compiled Application
      • iOS Frida Configuration
      • iOS Hooking With Objection
      • iOS Protocol Handlers
      • iOS Serialisation and Encoding
      • iOS Testing Environment
      • iOS UIActivity Sharing
      • iOS Universal Links
      • iOS UIPasteboard
      • iOS WebViews
    • Cordova Apps
    • Xamarin Apps
  • 👽Network Services Pentesting
    • Pentesting JDWP - Java Debug Wire Protocol
    • Pentesting Printers
    • Pentesting SAP
    • Pentesting VoIP
      • Basic VoIP Protocols
        • SIP (Session Initiation Protocol)
    • Pentesting Remote GdbServer
    • 7/tcp/udp - Pentesting Echo
    • 21 - Pentesting FTP
      • FTP Bounce attack - Scan
      • FTP Bounce - Download 2ºFTP file
    • 22 - Pentesting SSH/SFTP
    • 23 - Pentesting Telnet
    • 25,465,587 - Pentesting SMTP/s
      • SMTP Smuggling
      • SMTP - Commands
    • 43 - Pentesting WHOIS
    • 49 - Pentesting TACACS+
    • 53 - Pentesting DNS
    • 69/UDP TFTP/Bittorrent-tracker
    • 79 - Pentesting Finger
    • 80,443 - Pentesting Web Methodology
      • 403 & 401 Bypasses
      • AEM - Adobe Experience Cloud
      • Angular
      • Apache
      • Artifactory Hacking guide
      • Bolt CMS
      • Buckets
        • Firebase Database
      • CGI
      • DotNetNuke (DNN)
      • Drupal
        • Drupal RCE
      • Electron Desktop Apps
        • Electron contextIsolation RCE via preload code
        • Electron contextIsolation RCE via Electron internal code
        • Electron contextIsolation RCE via IPC
      • Flask
      • NodeJS Express
      • Git
      • Golang
      • GWT - Google Web Toolkit
      • Grafana
      • GraphQL
      • H2 - Java SQL database
      • IIS - Internet Information Services
      • ImageMagick Security
      • JBOSS
      • Jira & Confluence
      • Joomla
      • JSP
      • Laravel
      • Moodle
      • Nginx
      • NextJS
      • PHP Tricks
        • PHP - Useful Functions & disable_functions/open_basedir bypass
          • disable_functions bypass - php-fpm/FastCGI
          • disable_functions bypass - dl function
          • disable_functions bypass - PHP 7.0-7.4 (*nix only)
          • disable_functions bypass - Imagick <= 3.3.0 PHP >= 5.4 Exploit
          • disable_functions - PHP 5.x Shellshock Exploit
          • disable_functions - PHP 5.2.4 ionCube extension Exploit
          • disable_functions bypass - PHP <= 5.2.9 on windows
          • disable_functions bypass - PHP 5.2.4 and 5.2.5 PHP cURL
          • disable_functions bypass - PHP safe_mode bypass via proc_open() and custom environment Exploit
          • disable_functions bypass - PHP Perl Extension Safe_mode Bypass Exploit
          • disable_functions bypass - PHP 5.2.3 - Win32std ext Protections Bypass
          • disable_functions bypass - PHP 5.2 - FOpen Exploit
          • disable_functions bypass - via mem
          • disable_functions bypass - mod_cgi
          • disable_functions bypass - PHP 4 >= 4.2.0, PHP 5 pcntl_exec
        • PHP - RCE abusing object creation: new $_GET["a"]($_GET["b"])
        • PHP SSRF
      • PrestaShop
      • Python
      • Rocket Chat
      • Special HTTP headers
      • Source code Review / SAST Tools
      • Spring Actuators
      • Symfony
      • Tomcat
        • Basic Tomcat Info
      • Uncovering CloudFlare
      • VMWare (ESX, VCenter...)
      • Web API Pentesting
      • WebDav
      • Werkzeug / Flask Debug
      • Wordpress
    • 88tcp/udp - Pentesting Kerberos
      • Harvesting tickets from Windows
      • Harvesting tickets from Linux
    • 110,995 - Pentesting POP
    • 111/TCP/UDP - Pentesting Portmapper
    • 113 - Pentesting Ident
    • 123/udp - Pentesting NTP
    • 135, 593 - Pentesting MSRPC
    • 137,138,139 - Pentesting NetBios
    • 139,445 - Pentesting SMB
      • rpcclient enumeration
    • 143,993 - Pentesting IMAP
    • 161,162,10161,10162/udp - Pentesting SNMP
      • Cisco SNMP
      • SNMP RCE
    • 194,6667,6660-7000 - Pentesting IRC
    • 264 - Pentesting Check Point FireWall-1
    • 389, 636, 3268, 3269 - Pentesting LDAP
    • 500/udp - Pentesting IPsec/IKE VPN
    • 502 - Pentesting Modbus
    • 512 - Pentesting Rexec
    • 513 - Pentesting Rlogin
    • 514 - Pentesting Rsh
    • 515 - Pentesting Line Printer Daemon (LPD)
    • 548 - Pentesting Apple Filing Protocol (AFP)
    • 554,8554 - Pentesting RTSP
    • 623/UDP/TCP - IPMI
    • 631 - Internet Printing Protocol(IPP)
    • 700 - Pentesting EPP
    • 873 - Pentesting Rsync
    • 1026 - Pentesting Rusersd
    • 1080 - Pentesting Socks
    • 1098/1099/1050 - Pentesting Java RMI - RMI-IIOP
    • 1414 - Pentesting IBM MQ
    • 1433 - Pentesting MSSQL - Microsoft SQL Server
      • Types of MSSQL Users
    • 1521,1522-1529 - Pentesting Oracle TNS Listener
    • 1723 - Pentesting PPTP
    • 1883 - Pentesting MQTT (Mosquitto)
    • 2049 - Pentesting NFS Service
    • 2301,2381 - Pentesting Compaq/HP Insight Manager
    • 2375, 2376 Pentesting Docker
    • 3128 - Pentesting Squid
    • 3260 - Pentesting ISCSI
    • 3299 - Pentesting SAPRouter
    • 3306 - Pentesting Mysql
    • 3389 - Pentesting RDP
    • 3632 - Pentesting distcc
    • 3690 - Pentesting Subversion (svn server)
    • 3702/UDP - Pentesting WS-Discovery
    • 4369 - Pentesting Erlang Port Mapper Daemon (epmd)
    • 4786 - Cisco Smart Install
    • 4840 - OPC Unified Architecture
    • 5000 - Pentesting Docker Registry
    • 5353/UDP Multicast DNS (mDNS) and DNS-SD
    • 5432,5433 - Pentesting Postgresql
    • 5439 - Pentesting Redshift
    • 5555 - Android Debug Bridge
    • 5601 - Pentesting Kibana
    • 5671,5672 - Pentesting AMQP
    • 5800,5801,5900,5901 - Pentesting VNC
    • 5984,6984 - Pentesting CouchDB
    • 5985,5986 - Pentesting WinRM
    • 5985,5986 - Pentesting OMI
    • 6000 - Pentesting X11
    • 6379 - Pentesting Redis
    • 8009 - Pentesting Apache JServ Protocol (AJP)
    • 8086 - Pentesting InfluxDB
    • 8089 - Pentesting Splunkd
    • 8333,18333,38333,18444 - Pentesting Bitcoin
    • 9000 - Pentesting FastCGI
    • 9001 - Pentesting HSQLDB
    • 9042/9160 - Pentesting Cassandra
    • 9100 - Pentesting Raw Printing (JetDirect, AppSocket, PDL-datastream)
    • 9200 - Pentesting Elasticsearch
    • 10000 - Pentesting Network Data Management Protocol (ndmp)
    • 11211 - Pentesting Memcache
      • Memcache Commands
    • 15672 - Pentesting RabbitMQ Management
    • 24007,24008,24009,49152 - Pentesting GlusterFS
    • 27017,27018 - Pentesting MongoDB
    • 44134 - Pentesting Tiller (Helm)
    • 44818/UDP/TCP - Pentesting EthernetIP
    • 47808/udp - Pentesting BACNet
    • 50030,50060,50070,50075,50090 - Pentesting Hadoop
  • 🕸️Pentesting Web
    • Web Vulnerabilities Methodology
    • Reflecting Techniques - PoCs and Polygloths CheatSheet
      • Web Vulns List
    • 2FA/MFA/OTP Bypass
    • Account Takeover
    • Browser Extension Pentesting Methodology
      • BrowExt - ClickJacking
      • BrowExt - permissions & host_permissions
      • BrowExt - XSS Example
    • Bypass Payment Process
    • Captcha Bypass
    • Cache Poisoning and Cache Deception
      • Cache Poisoning via URL discrepancies
      • Cache Poisoning to DoS
    • Clickjacking
    • Client Side Template Injection (CSTI)
    • Client Side Path Traversal
    • Command Injection
    • Content Security Policy (CSP) Bypass
      • CSP bypass: self + 'unsafe-inline' with Iframes
    • Cookies Hacking
      • Cookie Tossing
      • Cookie Jar Overflow
      • Cookie Bomb
    • CORS - Misconfigurations & Bypass
    • CRLF (%0D%0A) Injection
    • CSRF (Cross Site Request Forgery)
    • Dangling Markup - HTML scriptless injection
      • SS-Leaks
    • Dependency Confusion
    • Deserialization
      • NodeJS - __proto__ & prototype Pollution
        • Client Side Prototype Pollution
        • Express Prototype Pollution Gadgets
        • Prototype Pollution to RCE
      • Java JSF ViewState (.faces) Deserialization
      • Java DNS Deserialization, GadgetProbe and Java Deserialization Scanner
      • Basic Java Deserialization (ObjectInputStream, readObject)
      • PHP - Deserialization + Autoload Classes
      • CommonsCollection1 Payload - Java Transformers to Rutime exec() and Thread Sleep
      • Basic .Net deserialization (ObjectDataProvider gadget, ExpandedWrapper, and Json.Net)
      • Exploiting __VIEWSTATE knowing the secrets
      • Exploiting __VIEWSTATE without knowing the secrets
      • Python Yaml Deserialization
      • JNDI - Java Naming and Directory Interface & Log4Shell
      • Ruby Class Pollution
    • Domain/Subdomain takeover
    • Email Injections
    • File Inclusion/Path traversal
      • phar:// deserialization
      • LFI2RCE via PHP Filters
      • LFI2RCE via Nginx temp files
      • LFI2RCE via PHP_SESSION_UPLOAD_PROGRESS
      • LFI2RCE via Segmentation Fault
      • LFI2RCE via phpinfo()
      • LFI2RCE Via temp file uploads
      • LFI2RCE via Eternal waiting
      • LFI2RCE Via compress.zlib + PHP_STREAM_PREFER_STUDIO + Path Disclosure
    • File Upload
      • PDF Upload - XXE and CORS bypass
    • Formula/CSV/Doc/LaTeX/GhostScript Injection
    • gRPC-Web Pentest
    • HTTP Connection Contamination
    • HTTP Connection Request Smuggling
    • HTTP Request Smuggling / HTTP Desync Attack
      • Browser HTTP Request Smuggling
      • Request Smuggling in HTTP/2 Downgrades
    • HTTP Response Smuggling / Desync
    • Upgrade Header Smuggling
    • hop-by-hop headers
    • IDOR
    • JWT Vulnerabilities (Json Web Tokens)
    • LDAP Injection
    • Login Bypass
      • Login bypass List
    • NoSQL injection
    • OAuth to Account takeover
    • Open Redirect
    • ORM Injection
    • Parameter Pollution
    • Phone Number Injections
    • PostMessage Vulnerabilities
      • Blocking main page to steal postmessage
      • Bypassing SOP with Iframes - 1
      • Bypassing SOP with Iframes - 2
      • Steal postmessage modifying iframe location
    • Proxy / WAF Protections Bypass
    • Race Condition
    • Rate Limit Bypass
    • Registration & Takeover Vulnerabilities
    • Regular expression Denial of Service - ReDoS
    • Reset/Forgotten Password Bypass
    • Reverse Tab Nabbing
    • SAML Attacks
      • SAML Basics
    • Server Side Inclusion/Edge Side Inclusion Injection
    • SQL Injection
      • MS Access SQL Injection
      • MSSQL Injection
      • MySQL injection
        • MySQL File priv to SSRF/RCE
      • Oracle injection
      • Cypher Injection (neo4j)
      • PostgreSQL injection
        • dblink/lo_import data exfiltration
        • PL/pgSQL Password Bruteforce
        • Network - Privesc, Port Scanner and NTLM chanllenge response disclosure
        • Big Binary Files Upload (PostgreSQL)
        • RCE with PostgreSQL Languages
        • RCE with PostgreSQL Extensions
      • SQLMap - CheatSheet
        • Second Order Injection - SQLMap
    • SSRF (Server Side Request Forgery)
      • URL Format Bypass
      • SSRF Vulnerable Platforms
      • Cloud SSRF
    • SSTI (Server Side Template Injection)
      • EL - Expression Language
      • Jinja2 SSTI
    • Timing Attacks
    • Unicode Injection
      • Unicode Normalization
    • UUID Insecurities
    • WebSocket Attacks
    • Web Tool - WFuzz
    • XPATH injection
    • XSLT Server Side Injection (Extensible Stylesheet Language Transformations)
    • XXE - XEE - XML External Entity
    • XSS (Cross Site Scripting)
      • Abusing Service Workers
      • Chrome Cache to XSS
      • Debugging Client Side JS
      • Dom Clobbering
      • DOM Invader
      • DOM XSS
      • Iframes in XSS, CSP and SOP
      • Integer Overflow
      • JS Hoisting
      • Misc JS Tricks & Relevant Info
      • PDF Injection
      • Server Side XSS (Dynamic PDF)
      • Shadow DOM
      • SOME - Same Origin Method Execution
      • Sniff Leak
      • Steal Info JS
      • XSS in Markdown
    • XSSI (Cross-Site Script Inclusion)
    • XS-Search/XS-Leaks
      • Connection Pool Examples
      • Connection Pool by Destination Example
      • Cookie Bomb + Onerror XS Leak
      • URL Max Length - Client Side
      • performance.now example
      • performance.now + Force heavy task
      • Event Loop Blocking + Lazy images
      • JavaScript Execution XS Leak
      • CSS Injection
        • CSS Injection Code
    • Iframe Traps
  • ⛈️Cloud Security
    • Pentesting Kubernetes
    • Pentesting Cloud (AWS, GCP, Az...)
    • Pentesting CI/CD (Github, Jenkins, Terraform...)
  • 😎Hardware/Physical Access
    • Physical Attacks
    • Escaping from KIOSKs
    • Firmware Analysis
      • Bootloader testing
      • Firmware Integrity
  • 🎯Binary Exploitation
    • Basic Stack Binary Exploitation Methodology
      • ELF Basic Information
      • Exploiting Tools
        • PwnTools
    • Stack Overflow
      • Pointer Redirecting
      • Ret2win
        • Ret2win - arm64
      • Stack Shellcode
        • Stack Shellcode - arm64
      • Stack Pivoting - EBP2Ret - EBP chaining
      • Uninitialized Variables
    • ROP - Return Oriented Programing
      • BROP - Blind Return Oriented Programming
      • Ret2csu
      • Ret2dlresolve
      • Ret2esp / Ret2reg
      • Ret2lib
        • Leaking libc address with ROP
          • Leaking libc - template
        • One Gadget
        • Ret2lib + Printf leak - arm64
      • Ret2syscall
        • Ret2syscall - ARM64
      • Ret2vDSO
      • SROP - Sigreturn-Oriented Programming
        • SROP - ARM64
    • Array Indexing
    • Integer Overflow
    • Format Strings
      • Format Strings - Arbitrary Read Example
      • Format Strings Template
    • Libc Heap
      • Bins & Memory Allocations
      • Heap Memory Functions
        • free
        • malloc & sysmalloc
        • unlink
        • Heap Functions Security Checks
      • Use After Free
        • First Fit
      • Double Free
      • Overwriting a freed chunk
      • Heap Overflow
      • Unlink Attack
      • Fast Bin Attack
      • Unsorted Bin Attack
      • Large Bin Attack
      • Tcache Bin Attack
      • Off by one overflow
      • House of Spirit
      • House of Lore | Small bin Attack
      • House of Einherjar
      • House of Force
      • House of Orange
      • House of Rabbit
      • House of Roman
    • Common Binary Exploitation Protections & Bypasses
      • ASLR
        • Ret2plt
        • Ret2ret & Reo2pop
      • CET & Shadow Stack
      • Libc Protections
      • Memory Tagging Extension (MTE)
      • No-exec / NX
      • PIE
        • BF Addresses in the Stack
      • Relro
      • Stack Canaries
        • BF Forked & Threaded Stack Canaries
        • Print Stack Canary
    • Write What Where 2 Exec
      • WWW2Exec - atexit()
      • WWW2Exec - .dtors & .fini_array
      • WWW2Exec - GOT/PLT
      • WWW2Exec - __malloc_hook & __free_hook
    • Common Exploiting Problems
    • Windows Exploiting (Basic Guide - OSCP lvl)
    • iOS Exploiting
  • 🔩Reversing
    • Reversing Tools & Basic Methods
      • Angr
        • Angr - Examples
      • Z3 - Satisfiability Modulo Theories (SMT)
      • Cheat Engine
      • Blobrunner
    • Common API used in Malware
    • Word Macros
  • 🔮Crypto & Stego
    • Cryptographic/Compression Algorithms
      • Unpacking binaries
    • Certificates
    • Cipher Block Chaining CBC-MAC
    • Crypto CTFs Tricks
    • Electronic Code Book (ECB)
    • Hash Length Extension Attack
    • Padding Oracle
    • RC4 - Encrypt&Decrypt
    • Stego Tricks
    • Esoteric languages
    • Blockchain & Crypto Currencies
  • 🦂C2
    • Salseo
    • ICMPsh
    • Cobalt Strike
  • ✍️TODO
    • Other Big References
    • Rust Basics
    • More Tools
    • MISC
    • Pentesting DNS
    • Hardware Hacking
      • I2C
      • UART
      • Radio
      • JTAG
      • SPI
    • Industrial Control Systems Hacking
      • Modbus Protocol
    • Radio Hacking
      • Pentesting RFID
      • Infrared
      • Sub-GHz RF
      • iButton
      • Flipper Zero
        • FZ - NFC
        • FZ - Sub-GHz
        • FZ - Infrared
        • FZ - iButton
        • FZ - 125kHz RFID
      • Proxmark 3
      • FISSURE - The RF Framework
      • Low-Power Wide Area Network
      • Pentesting BLE - Bluetooth Low Energy
    • Industrial Control Systems Hacking
    • Test LLMs
    • LLM Training
      • 0. Basic LLM Concepts
      • 1. Tokenizing
      • 2. Data Sampling
      • 3. Token Embeddings
      • 4. Attention Mechanisms
      • 5. LLM Architecture
      • 6. Pre-training & Loading models
      • 7.0. LoRA Improvements in fine-tuning
      • 7.1. Fine-Tuning for Classification
      • 7.2. Fine-Tuning to follow instructions
    • Burp Suite
    • Other Web Tricks
    • Interesting HTTP
    • Android Forensics
    • TR-069
    • 6881/udp - Pentesting BitTorrent
    • Online Platforms with API
    • Stealing Sensitive Information Disclosure from a Web
    • Post Exploitation
    • Investment Terms
    • Cookies Policy
Powered by GitBook
On this page
  • Token Embeddings
  • What Are Token Embeddings?
  • Initializing Token Embeddings
  • How Token Embeddings Work During Training
  • Positional Embeddings: Adding Context to Token Embeddings
  • Why Positional Embeddings Are Needed:
  • Types of Positional Embeddings:
  • How Positional Embeddings Are Integrated:
  • Code Example
  • References
Edit on GitHub
  1. TODO
  2. LLM Training

3. Token Embeddings

Token Embeddings

After tokenizing text data, the next critical step in preparing data for training large language models (LLMs) like GPT is creating token embeddings. Token embeddings transform discrete tokens (such as words or subwords) into continuous numerical vectors that the model can process and learn from. This explanation breaks down token embeddings, their initialization, usage, and the role of positional embeddings in enhancing model understanding of token sequences.

The goal of this third phase is very simple: Assign each of the previous tokens in the vocabulary a vector of the desired dimensions to train the model. Each word in the vocabulary will a point in a space of X dimensions. Note that initially the position of each word in the space is just initialised "randomly" and these positions are trainable parameters (will be improved during the training).

Moreover, during the token embedding another layer of embeddings is created which represents (in this case) the absolute possition of the word in the training sentence. This way a word in different positions in the sentence will have a different representation (meaning).

What Are Token Embeddings?

Token Embeddings are numerical representations of tokens in a continuous vector space. Each token in the vocabulary is associated with a unique vector of fixed dimensions. These vectors capture semantic and syntactic information about the tokens, enabling the model to understand relationships and patterns in the data.

  • Vocabulary Size: The total number of unique tokens (e.g., words, subwords) in the model’s vocabulary.

  • Embedding Dimensions: The number of numerical values (dimensions) in each token’s vector. Higher dimensions can capture more nuanced information but require more computational resources.

Example:

  • Vocabulary Size: 6 tokens [1, 2, 3, 4, 5, 6]

  • Embedding Dimensions: 3 (x, y, z)

Initializing Token Embeddings

At the start of training, token embeddings are typically initialized with small random values. These initial values are adjusted (fine-tuned) during training to better represent the tokens' meanings based on the training data.

PyTorch Example:

import torch

# Set a random seed for reproducibility
torch.manual_seed(123)

# Create an embedding layer with 6 tokens and 3 dimensions
embedding_layer = torch.nn.Embedding(6, 3)

# Display the initial weights (embeddings)
print(embedding_layer.weight)

Output:

luaCopy codeParameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

Explanation:

  • Each row corresponds to a token in the vocabulary.

  • Each column represents a dimension in the embedding vector.

  • For example, the token at index 3 has an embedding vector [-0.4015, 0.9666, -1.1481].

Accessing a Token’s Embedding:

# Retrieve the embedding for the token at index 3
token_index = torch.tensor([3])
print(embedding_layer(token_index))

Output:

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

Interpretation:

  • The token at index 3 is represented by the vector [-0.4015, 0.9666, -1.1481].

  • These values are trainable parameters that the model will adjust during training to better represent the token's context and meaning.

How Token Embeddings Work During Training

During training, each token in the input data is converted into its corresponding embedding vector. These vectors are then used in various computations within the model, such as attention mechanisms and neural network layers.

Example Scenario:

  • Batch Size: 8 (number of samples processed simultaneously)

  • Max Sequence Length: 4 (number of tokens per sample)

  • Embedding Dimensions: 256

Data Structure:

  • Each batch is represented as a 3D tensor with shape (batch_size, max_length, embedding_dim).

  • For our example, the shape would be (8, 4, 256).

Visualization:

cssCopy codeBatch
┌─────────────┐
│ Sample 1    │
│ ┌─────┐     │
│ │Token│ → [x₁₁, x₁₂, ..., x₁₂₅₆]
│ │ 1   │     │
│ │...  │     │
│ │Token│     │
│ │ 4   │     │
│ └─────┘     │
│ Sample 2    │
│ ┌─────┐     │
│ │Token│ → [x₂₁, x₂₂, ..., x₂₂₅₆]
│ │ 1   │     │
│ │...  │     │
│ │Token│     │
│ │ 4   │     │
│ └─────┘     │
│ ...         │
│ Sample 8    │
│ ┌─────┐     │
│ │Token│ → [x₈₁, x₈₂, ..., x₈₂₅₆]
│ │ 1   │     │
│ │...  │     │
│ │Token│     │
│ │ 4   │     │
│ └─────┘     │
└─────────────┘

Explanation:

  • Each token in the sequence is represented by a 256-dimensional vector.

  • The model processes these embeddings to learn language patterns and generate predictions.

Positional Embeddings: Adding Context to Token Embeddings

While token embeddings capture the meaning of individual tokens, they do not inherently encode the position of tokens within a sequence. Understanding the order of tokens is crucial for language comprehension. This is where positional embeddings come into play.

Why Positional Embeddings Are Needed:

  • Token Order Matters: In sentences, the meaning often depends on the order of words. For example, "The cat sat on the mat" vs. "The mat sat on the cat."

  • Embedding Limitation: Without positional information, the model treats tokens as a "bag of words," ignoring their sequence.

Types of Positional Embeddings:

  1. Absolute Positional Embeddings:

    • Assign a unique position vector to each position in the sequence.

    • Example: The first token in any sequence has the same positional embedding, the second token has another, and so on.

    • Used By: OpenAI’s GPT models.

  2. Relative Positional Embeddings:

    • Encode the relative distance between tokens rather than their absolute positions.

    • Example: Indicate how far apart two tokens are, regardless of their absolute positions in the sequence.

    • Used By: Models like Transformer-XL and some variants of BERT.

How Positional Embeddings Are Integrated:

  • Same Dimensions: Positional embeddings have the same dimensionality as token embeddings.

  • Addition: They are added to token embeddings, combining token identity with positional information without increasing the overall dimensionality.

Example of Adding Positional Embeddings:

Suppose a token embedding vector is [0.5, -0.2, 0.1] and its positional embedding vector is [0.1, 0.3, -0.1]. The combined embedding used by the model would be:

Combined Embedding = Token Embedding + Positional Embedding
                   = [0.5 + 0.1, -0.2 + 0.3, 0.1 + (-0.1)]
                   = [0.6, 0.1, 0.0]

Benefits of Positional Embeddings:

  • Contextual Awareness: The model can differentiate between tokens based on their positions.

  • Sequence Understanding: Enables the model to understand grammar, syntax, and context-dependent meanings.

Code Example

# Use previous code...

# Create dimensional emdeddings
"""
BPE uses a vocabulary of 50257 words
Let's supose we want to use 256 dimensions (instead of the millions used by LLMs)
"""

vocab_size = 50257
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

## Generate the dataloader like before
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)

# Apply embeddings
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)
torch.Size([8, 4, 256]) # 8 x 4 x 256

# Generate absolute embeddings
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

pos_embeddings = pos_embedding_layer(torch.arange(max_length))

input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape) # torch.Size([8, 4, 256])

References

Previous2. Data SamplingNext4. Attention Mechanisms

Last updated 7 months ago

Following with the code example from :

✍️
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb
https://www.manning.com/books/build-a-large-language-model-from-scratch