عجفت الغور

linux

Tags: computers

  • Currently, Linux plumbers and hobbyists occupy two different worlds
    • Desktop Linux and DevOps middleware are built by paid employees, vs. people who use suckless software and musl-based distros

Greybeard Qualification

Process

Internal data structure struct task_struct {}:

  • pid -> integer; wraps around and reuses unused pids
  • parent -> every process has a parent, and each one can have children
  • tty -> in the old days, every process had an associated terminal called a tty
  • uid -> user the process is running as
  • gid -> group id
  • A process can change its ids (setuid)
  • From a sysadmin pov, what can you do to a process?
    • create/send signals/ps or proc/strace
    • each process has an address space, every process thinks it owns the entire theoretical memory of the computer
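
A minimal sketch of reading a few of these fields for the current process, using only Python's standard os module and assuming a Unix-like system. The helper name describe_self is made up for illustration.

```python
import os

def describe_self():
    """Return a few of the identity fields the kernel tracks for this process."""
    return {
        "pid": os.getpid(),    # process id
        "ppid": os.getppid(),  # pid of the parent process
        "uid": os.getuid(),    # real user id
        "euid": os.geteuid(),  # effective user id (differs under setuid)
        "gid": os.getgid(),    # real group id
    }

if __name__ == "__main__":
    print(describe_self())
```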

memory

  • initialized and uninitialized data

  • the brk ("break") system call is used to increase heap size

  • heap sits on top of the data segment and grows up, and the stack grows down

  • in-between space is used for shared libraries

  • 32-bit: 4 GB address space

  • environment variables and cmdline are at the very top

  • kernel puts all of its data at the top of the address space

  • pmap

    • permissions on the memory regions, like files
    • limits
      • used to be important during timeshare systems
      • rlimit or ulimit for stack size
        • common one is the number of open files
    • process priority
      • nice in shell
      • static priority
      • dynamic priority
      • real-time priority
      • all related to scheduling
      • nice or getpriority() or setpriority() system calls
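
A small sketch of inspecting the open-files limit and the nice value from Python, assuming a Unix-like system. It only reads values; the function names are illustrative.

```python
import os
import resource

def open_file_limit():
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def current_niceness():
    """Return this process's nice value.

    os.nice(0) adds 0 to the current niceness and returns the result,
    so it reads the value without changing it.
    """
    return os.nice(0)

if __name__ == "__main__":
    print("RLIMIT_NOFILE:", open_file_limit())
    print("nice:", current_niceness())
    # Same value through the getpriority() system call wrapper
    print("getpriority:", os.getpriority(os.PRIO_PROCESS, 0))
```

This is roughly what ulimit -n and nice (with no arguments) report in a shell.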

IPC

  • pipes

    • internal buffer data structure
    • two file descriptors, connected to each end of the pipe
    • similar to a channel in go
    • no temp file at all, all in memory
    • one limitation: they only exist between members of the same family within the same process tree
      • ls | wc works because both ls and wc are children of the shell
    • FIFOs/named pipes, exist in the filesystem, acts like a regular pipe
      • used for multiple processes
    • cron uses a fifo in the kernel?
  • signals

    • Killing a process can be done with kill -s <SIGNAL> pid; send SIGTERM to request a graceful quit or SIGKILL to force immediate termination
      • 32 original flavors
      • 32 new flavors
      • list with kill -l
      • process can send signals to another process, or kernel sends it to your process
      • common
        • 9 - kill
        • 15 - term
        • 1 - HUP - hangup (for when you have a connection), now it’s used to reload config files
        • 2 - INT - interrupt
        • 19 - STOP - suspend in background
          • fg command to resume it in the foreground
        • 18 - CONT
        • 14 - ALRM - timer in the kernel; when it expires, the process is sent an alarm signal
  • System V stuff

    • semget - semaphore

      • SystemV semaphores can be manipulated as sets, and provides more control than POSIX ones
    • msgget - message queue get

    • shmget - shared memory

      • most efficient when one process is setting data and another is reading it
    • RPC - literally a remote procedure call, sent over to a different process, usually on another machine

    • mmap - a way to map a file into memory, then shared with another process, BSD trick

      • you map a chunk of memory out to a file, and then it’s propagated; you can also have a set of memory that’s not tied to a particular file, and another process can still share it
        • anon memory, not tied to a particular process, going to be paged out to swap space
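
A hedged sketch of the anonymous shared mapping idea in Python, assuming a Unix-like system. Passing -1 as the file descriptor to mmap.mmap asks for memory with no backing file, and on Unix the mapping is shared by default, so a forked child and its parent see the same bytes. The function name is made up for illustration.

```python
import mmap
import os

def share_via_anonymous_mmap(message: bytes) -> bytes:
    """Have a forked child write into shared anonymous memory, then read it back."""
    if len(message) > mmap.PAGESIZE:
        raise ValueError("message must fit in one page for this sketch")
    # fileno=-1: anonymous memory, not tied to any file
    buf = mmap.mmap(-1, mmap.PAGESIZE)
    pid = os.fork()
    if pid == 0:
        # Child: write into the shared region and exit immediately
        buf[:len(message)] = message
        os._exit(0)
    # Parent: wait for the child, then read what it wrote
    os.waitpid(pid, 0)
    data = bytes(buf[:len(message)])
    buf.close()
    return data

if __name__ == "__main__":
    print(share_via_anonymous_mmap(b"written by the child"))
```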

Execution, Scheduling, Processes, and Threads

file descriptor

  • integer that represents a kernel data structure that has an inode, read position, etc
  • pipes
  • it’s possible to redirect the inputs and outputs of the program (another great idea of linux)
  • now the child can do something different, including exec() a new program
    • execve() runs a new program, usually fork and exec is the pattern (although fork is bad now)
      • apache web server for example forks
    • bash execs two programs and
    • copy on write fork()
      • since you’re just gonna exec, why bother copying the whole thing?
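
The fork, redirect, exec pattern can be sketched in Python roughly like this (Unix only). It is a simplified version of what a shell does for a pipeline, not how bash is actually implemented; run_and_capture is a hypothetical helper name.

```python
import os

def run_and_capture(argv):
    """Fork, point the child's stdout at a pipe, exec argv, return (output, exit code)."""
    r, w = os.pipe()                 # r: read end, w: write end
    pid = os.fork()
    if pid == 0:
        # Child: redirect file descriptor 1 (stdout) to the pipe, then exec
        os.close(r)
        if w != 1:
            os.dup2(w, 1)
            os.close(w)
        try:
            os.execvp(argv[0], argv)  # only returns (by raising) on failure
        finally:
            os._exit(127)
    # Parent: close our copy of the write end so we see EOF when the child exits
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    _, status = os.waitpid(pid, 0)
    return b"".join(chunks), os.WEXITSTATUS(status)

if __name__ == "__main__":
    print(run_and_capture(["echo", "hello from the child"]))
```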

daemons?

  • fork a child process, then the parent exits and leaves the child with init as its parent
  • How do you become a daemon?
    • close all open files
    • become the process group leader
      • process group: by default, the processes started from a login shell are in its group
      • signals get passed to everyone in the process group, such as shells hanging up
    • set the umask
      • what are the permissions I want to be sure that are turned off by default?
    • change the dir to a safe space
      • change your working directory so that the daemon runs from somewhere safe
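
A rough sketch of those steps in Python (Unix only). It is a simplification: it redirects stdin/stdout/stderr to /dev/null rather than closing every open descriptor, and the umask value and "/" directory are assumptions chosen for the example. demo_daemon is a hypothetical wrapper that lets the daemon report back over a pipe.

```python
import os

def daemonize():
    """Detach from the caller; returns only in the daemon process."""
    if os.fork() > 0:
        os._exit(0)               # parent exits; the child gets re-parented
    os.setsid()                   # new session: become session and process group leader
    os.umask(0o022)               # assumed mask: no group/other write by default
    os.chdir("/")                 # assumed safe working directory
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):          # detach stdio from the terminal
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)

def demo_daemon():
    """Start a short-lived daemon and return what it reports about itself."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        daemonize()
        is_leader = os.getsid(0) == os.getpid()
        os.write(w, f"{os.getcwd()} {is_leader}".encode())
        os._exit(0)
    os.close(w)
    os.waitpid(pid, 0)            # reaps the intermediate process that exited
    data = b""
    while True:
        chunk = os.read(r, 1024)  # EOF arrives once the daemon exits
        if not chunk:
            break
        data += chunk
    os.close(r)
    return data.decode()

if __name__ == "__main__":
    print(demo_daemon())
```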

scheduling

  • linux is multi-processing, but processes have to share the CPUs

    • At any time, each of the CPU’s in a system can be
      1. Not associated with any process, serving a hardware interrupt (Hard IRQs)
        • Timer ticks, network cards, keyboards
        • Handlers here have to be fast, most of the time it is an ack
      2. Not associated with any process, serving a softirq or tasklet (Software interrupt context)
        • Much of the real interrupt handling work is done here
        • Tasklets are dynamically registrable versions of software interrupt contexts, which means you can have as many as you want, and any tasklet will only run on one CPU at any time
      3. Running in kernel space, associated with a process (user context)
      4. Running a process in user space
    • Bottom two can preempt each other, but above that is a strict hierarchy: a kernel process cannot preempt a softirq, and a softirq cannot preempt a hardware interrupt
  • scheduling is how the kernel manages these running processes and “take turns”

    • “context switching”, take all the registers values out, load the register values
      • instruction pointer
      • where is the stack? this kind of stuff
      • who gets to run next?
        • who’s ready to run?
        • every process has a task state associated with it
        • when simply running, CPU-bound, the kernel will preempt a process
        • everything is in a run queue
        • process is in the TASK_RUNNING state, has a certain time quantum to run, by default 100ms
          • kernel checks the process time quantum at every tick; if the process hasn’t exited or slept by the time its quantum is used up, it gets preempted, which guarantees that a process doesn’t run forever
          • realistically it’s 1 ms now
        • preemptive multitasking
          • all TASK_RUNNING processes are part of the runqueue

            • the average length of the runqueue over time is called the “load average”
            • the uptime command shows load averages
              • last minute, 5 minutes, 15 minutes averages

                00:43:10 up 122 days, 35 min, 3 users, load average: 0.10, 0.10, 0.09

            • vmstat also does this with r
              procs -----------memory-------- -- ---swap-- -----io---- -system-- ------cpu-----
               r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
               0  0      0 65022788 1420664 60968808    0    0     0     9    0    0  3  0 97  0  0
              
          • high load average means you’re trying to do too much on the system probably

    • states
      • TASK_RUNNING
      • TASK_INTERRUPTIBLE
      • TASK_UNINTERRUPTIBLE
      • TASK_STOPPED (control z)
      • TASK_TRACED (strace)
      • EXIT_ZOMBIE (the process has ended, but remains in the tree)
        • zombies are only leftover datastructures in the kernel
        • only reason to keep it around is so that the parent can retrieve its return value
        • is not using CPU or memory
        • usually a sign that the parent hasn’t cleaned up
        • orphaned zombies become children of init, but the init will clean it up
      • EXIT_DEAD
        • process has exited, but pages still need to be cleaned up
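
A small sketch for pulling the three load averages out of an uptime line like the one shown earlier, assuming the Linux output format with a "load average:" label. The parser name is made up; os.getloadavg() is the standard-library way to get the same numbers directly.

```python
import os

def parse_uptime_load(line):
    """Extract the 1, 5 and 15 minute load averages from an uptime(1) line."""
    _, _, tail = line.partition("load average:")
    return tuple(float(value) for value in tail.split(","))

# The sample line from the notes above
sample = "00:43:10 up 122 days, 35 min, 3 users, load average: 0.10, 0.10, 0.09"

if __name__ == "__main__":
    print(parse_uptime_load(sample))
    # Ask the system directly instead of parsing text
    print(os.getloadavg())
```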

Memory Management

Virtual Memory

  • We’re not compressing images anymore
  • But still multiprocessing, so we need to maintain individual process space
  • Memory mapping is a little different for the 64 bit versions
  • Memory mapping gets changed all the time

Paging

  • Page sizes traditionally 4kb, but really if you’re a big boy we’re all huge pages now

  • There’s a page table, and we represent the address space of the program and how it’s mapped into the actual ram

  • page table entries translate logical addresses into physical addresses; how they are ordered and used varies widely

  • you don’t have to map all the pages, but if they’re not in ram, where are they?

  • kernel allocates the pages, and mapped when a program is loaded at execute time

  • shared libraries (shared among various processes, chances are .so pages are already in RAM, so they map to another process address space)

    • although now static binaries might be good enough since we have tons of ram now
  • context switching doesn’t involve copying memory

    • means switching to different address spaces and different page tables
    • address spaces can sit all in memory with enough ram, just change the register values, no freeing memory
    • pages are copied out to swap (AND HIT DISK)
    • reason why it’s called “swap”, to swap processes between context switches
    • now you just move the pages out as you need to
  • typically a disk partition or a file, and we can have multiple swap spaces

    • mkswap(8) and swapon(8)
    • swapon -s in terms of unix blocks
      • unix blocks are 512 bytes, each block is half a K, although now there are different block sizes
  • Possible to have memory on the swap space and in memory!

  • “dirty page” - means that the copy in RAM has been modified and is not the same as the one in swap anymore

    • dirty pages need to be swapped out again
  • A “page fault” happens when you hit something that is not in RAM, which triggers a copy back from swap

  • happens on demand, which means paging in and out requires a very long wait

  • page ins are synchronous! the worst case
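
A toy model of address translation, as a sketch only: real page tables are multi-level hardware structures, while this uses a plain dict from page number to frame number. PageFault and translate are hypothetical names.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # the system page size, traditionally 4096

class PageFault(Exception):
    """Raised when a virtual page has no frame in RAM (toy model)."""

def split_address(vaddr, page_size=PAGE_SIZE):
    """Split a virtual address into (page number, offset within the page)."""
    return divmod(vaddr, page_size)

def translate(vaddr, page_table, page_size=PAGE_SIZE):
    """Map a virtual address to a physical one via a dict page table."""
    page, offset = split_address(vaddr, page_size)
    if page not in page_table:
        # In a real kernel this is where the page would be brought in
        raise PageFault(f"page {page:#x} is not resident")
    return page_table[page] * page_size + offset

if __name__ == "__main__":
    table = {0x12: 0x7}  # virtual page 0x12 lives in physical frame 0x7
    print(hex(translate(0x12345, table, page_size=4096)))
```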

File Mapping and Demand Paging

  • old way: exec the process, create the VM space, copy the text, init data, then swap and copies as needed
  • now /bin file is also copied into swap
  • why not let the program on disk be its own swap space!
  • file mapping - original copy of the program is a disk copy of what’s in memory, and then we reload them back from the file
    • file mapping is swap
  • then we can just map the file into memory, you don’t even need it in memory!
    • big win: only bring in pages as needed

    • mapping is instant, no waiting for the program to load

    • overall a win, demand-paging is good for most usecases (ls for example, you don’t load all the code for all the options)

    • same thing happens for shared libraries -> libc is used

    • anon pages -> not backed by real files, so swap space is only needed for dynamically allocated memory

    • shared libraries ld.so

    • copy on write

      • when you fork a process, copy on write is a flag set on each page
    • demand paging, waiting for page faults

      • normally: yes, but the alternative is on startup loading
      • turn it off by calling mlock or mlockall, which ensures pages aren’t paged out
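
A sketch of file mapping from user space with Python's mmap module. Mapping the file returns without reading it in; the bytes are only read from the mapping when the slice is taken. The helper name is illustrative.

```python
import mmap

def read_via_mmap(path, offset, length):
    """Map a file read-only and return `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file (the file must not be empty)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            # Slicing touches only the pages that cover this range
            return mapping[offset:offset + length]

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * (2 * mmap.PAGESIZE) + b"needle")
        tmp.flush()
        print(read_via_mmap(tmp.name, 2 * mmap.PAGESIZE, 6))
```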

Monitoring VM

  • vmstat - first line is always an average (si so = page ins and page outs)
    • block IO is usually paging
  • swpd - memory in swap space
  • free - RAM that’s free

File I/O

  • Solaris also does this
    • Solaris has a page scanner that frees up pages, and each version also shows the scan rate
    • if the page scanner was actually running and desperately scanning then we know what’s the memory problem
  • All file I/O in linux is done via paging
  • Paging isn’t necessarily bad
  • /proc/pid/maps
  • /proc/meminfo
  • /proc/pid/smaps
    • Shows you the size of the memory segment
    • RSS -> how much of this segment is physically present in memory
    • how much of it is shared
    • how much of it isn’t shared
    • how much of it is dirty
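
A sketch that totals the per-segment numbers from smaps-style text. It assumes the Linux field names Size, Rss, Shared_Clean, Shared_Dirty, Private_Clean and Private_Dirty, with values in kB; summarize_smaps is a made-up helper.

```python
import os
from collections import Counter

FIELDS = ("Size", "Rss", "Shared_Clean", "Shared_Dirty",
          "Private_Clean", "Private_Dirty")

def summarize_smaps(text):
    """Sum selected smaps fields (in kB) across all memory segments."""
    totals = Counter()
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        # Segment header lines also contain ':' but never match a field name
        if sep and key in FIELDS:
            totals[key] += int(rest.split()[0])
    return dict(totals)

if __name__ == "__main__":
    path = "/proc/self/smaps"  # Linux only
    if os.path.exists(path):
        with open(path) as f:
            print(summarize_smaps(f.read()))
```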

RAM and Swap

  • How do you manage memory and free it up?
  • When you’re running low on memory
    • in the old days you used a clock-hand algorithm: one hand would mark pages and another, following behind, would free the ones not touched in between
    • now there’s fancier algorithms, but we still want to get the least recently used ones
  • kswapd -> kernel thread that frees pages
  • when things are dangerously low, the OOM killer comes in
    • Uses an algorithm that tries to pick processes that can be killed
  • if you’re too low on RAM and keep paging in and out, we’re “thrashing”
    • happens on windows more
  • in the old days
    • swap >= ram
  • since now you only need it for anon memory
  • generally VM = RAM + swap
  • You want activity in RAM, with enough ram, you should have zero swap
  • Kernel still keeps page tables in memory if you have extra space
    • I think this is the “cache”
    • Everything that gets paged into memory and kept without looking at disk io
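
A toy simulation of clock-style page replacement, to make the idea concrete. This is the simplified single-hand "second chance" variant with a reference bit per frame, not the two-handed version described above and not what the Linux kernel does today.

```python
def clock_replace(frames, refs):
    """Simulate second-chance (clock) replacement; return the number of page faults."""
    slots = [None] * frames      # which page each frame holds
    ref_bit = [False] * frames   # set when the page in that frame is used
    hand = 0
    faults = 0
    for page in refs:
        if page in slots:
            ref_bit[slots.index(page)] = True   # hit: mark as recently used
            continue
        faults += 1
        # Sweep the hand: recently used frames get a second chance
        while ref_bit[hand]:
            ref_bit[hand] = False
            hand = (hand + 1) % frames
        slots[hand] = page       # evict whatever was here and load the new page
        ref_bit[hand] = True
        hand = (hand + 1) % frames
    return faults

if __name__ == "__main__":
    print(clock_replace(3, [1, 2, 3, 1, 4]))
```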

Startup and Init

Boot

  • x86 -> 80x86 architecture
  • raise the value of the RESET pin (on the processor)
    • this goes through and initializes two important registers
    • CS and EIP register
      • CS - code segment register, plus some other flags
      • EIP - instruction pointer

BIOS

  • This is coreboot/oreboot/libreboot
  • Interrupt driven process that provides access to the hardware (DOS descended)
  • Not a major part of the linux operating system, only used to get the kernel loaded
  • Linux has its own device drivers and interrupt handlers
  • POST (power on self test)
  • ACPI (advanced config and power interfaces) - builds tables of devices with info
  • initializes hardware - no IRQ or I/O port conflicts. Tabulates PCI (Peripheral Component Interconnect) devices
  • Looks for something to boot
  • Devices
    • Have an I/O port that the CPU can do IO to via port addresses
    • When the device is ready to be talked to, it raises an interrupt and the linux kernel starts the driver
    • ISA days
      • part of the trick was picking the devices so that they didn’t conflict with the interrupts
      • PCI helps with this by automating it so that they don’t conflict with each other
    • You can choose what to boot from now, and in what order
  • BIOS looks for the boot sector
    • copy the first sector into RAM (512 bytes per sector), starting at 0x0000 7c00
    • then jumps to that address

Loaders LILO and GRUB

  • First sector of a disk is the Master Boot Record (MBR)
  • Contains a partition table and a program
  • Within Linux, this is a boot loader. earlier linuxes had a bootloader (512 bytes)
  • LILO (linux loader) and GRUB (Grand Unified Bootloader)
  • GRUB is much more general purpose, has its own shell and can read file systems
  • You can’t fit everything into 512 byte boot sector, so multistage
  • First part is 0x0000 7c00, then it jumps to 0x0009 6a00, which then sets up the Real Mode stack
  • Second part into 0x0009 6c00, which reads the OS list and lets the user choose the OS to boot
  • If it’s linux, it displays loading, then loads the 512 bytes at the kernel 0x0009 0000, which then loads setup() to 0x0009 0200, and then jumps to run setup()
  • Real and Protected Modes

    • Addressing modes in 80x86.
    • Real mode: segment + offset; the segment address is shifted left by four bits (multiplied by 0x10) before the offset is added
    • Protected mode: uses page tables
      • Now has a User/Supervisor flag
      • When set to 0, page can only be accessed in kernel mode
      • CPL (current privilege level) is the low 2 bits of the cs register
      • CPL = 3 for User mode, CPL = 0 for Kernel mode (there’s 4 rings in intel, but we only use two)
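
The real-mode address arithmetic and the CPL bits can be checked with a couple of lines of Python. This is a sketch; the function names are made up and the selector values in the usage are arbitrary examples.

```python
def real_mode_address(segment, offset):
    """Real-mode linear address: segment shifted left 4 bits, plus offset."""
    return (segment << 4) + offset

def cpl(cs):
    """Current privilege level: the low two bits of the cs register."""
    return cs & 0b11

if __name__ == "__main__":
    # Segment 0x07c0 with offset 0 is the boot sector load address
    print(hex(real_mode_address(0x07C0, 0x0000)))
    # Arbitrary example selectors: low bits 11 = user, 00 = kernel
    print(cpl(0x23), cpl(0x10))
```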

Setup

  • For ACPI systems, build a physical memory layout table in RAM
  • Sets up the keyboard and video card
  • Initializes the disk controllers
  • If EDD (enhanced disk drive) services are supported, uses the BIOS to build a table of hard disks in RAM
  • Sets up the interrupt descriptor table (IDT)
    • Different devices can raise interrupts and you want something to happen
    • The table is a set of addresses that point to code that is run when an interrupt is raised
  • Resets the FPU
  • Switches the CPU to protected mode
  • Jumps to startup_32() (or startup_64 now)
    • Initializes segmentation registers and a provisional stack
    • Zeros out the kernel’s uninitialized data
  • Runs decompress_kernel()
    • Kernel is decompressed and loaded into memory, and then jumps to 0x0010 0000
  • second startup_32()
    • initializes segmentation registers and a provisional stack
    • initializes kernel page tables
    • stores Page Global Dir address in cr3 register and starts paging, setting the bit PG in the cr0 register
    • sets up the Kernel Mode stack for process 0
      • what if there’s no task waiting for the kernel to schedule? the kernel schedules PID 0, which halts the CPU
      • every millisecond, an interrupt happens, which then rechecks the run queues
  • Nulls all the interrupt handlers
  • Copies parameters from the BIOS and the boot command line into the first page frame (in RAM)
  • Identifies the processor model
  • Initializes the interrupt descriptor table to null handlers
  • Jumps to start_kernel()

Kernel Startup

  • Nulls all the interrupt handlers
  • Calls sched_init() to initialize the schedulers
  • Initializes memory allocation
  • Sets up interrupt handlers with trap_init() and init_IRQ()
  • System date and time are set with time_init
  • CPU clock speed determined by calibrate_delay
  • Then we run kernel_thread()
  • Finally we are at /sbin/init!

Init and Run Levels

  • Uses run levels (0-6 typically)
  • Distinguishes between single user mode, GUIs, etc
  • Default values
    • 0 - halt
    • 1 - single user mode
      • Booting into root halfway
    • 6 - reboot
    • S is used when entering runlevel 1
    • Change with telinit(8)
      • Unix applications know what name they’re called by, so egrep -> grep -E
  • You can specify the run level at the command line (GRUB or LILO)
  • File names
    • In the unix world, .tab is usually table, and rc is “run command”
    • /etc/init.d/rc
  • inittab entries are what starts

RC Scripts

  • Run command
  • Takes a run level as an arg
  • Then looks for a directory in /etc/rc/\/.d
  • Runs each script with an action arg
  • Run level 0, 6 action is stop
  • Others are start
  • To run a startup script, find the right run level, and look at the different scripts, and then set it there
  • But these are init rc and not systemd
  • Stopping processes - just run pkill
  • init came out of SystemV, which was the original version
    • originally commercialized by AT&T
    • Berkeley did it with different startup scripts: rc.boot, rc, rc.local
  • upstart in ubuntu

Block Devices and Filesystems

Devices

  • Devices are special files
  • /dev is devices
  • When we need to do IO on a device, just go out and access it!
  • Block special files and character special files
    • ls -l -> c for character file, b for block file
    • major and minor numbers: major represents the type of the device, and minor represents the instance
      • 4k major numbers, 2^20 minor numbers
    • mknod is used to make a device special file
    • Nothing special about these files! You can make them anywhere
    • udev is a program that sits there and creates files in the /dev directory
      • devices can register themselves or get a number assigned to them
    • much effort has gone into making devices work together
    • traditional device operations
      • open()
      • close()
      • read()
      • write()
      • seek()
      • ioctl
        • lets you do particular things to particular device
        • usually takes some integers (like everything in unix world)
    • block devices
      • tend to do I/O in terms of blocks
      • like disk drives
      • usually more random access
      • disks for example
    • character devices
      • operate in bytes
      • usually more sequential
      • mouse/audio card/printer/tape/terminal for example
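
A sketch of looking at a device special file from Python: the file type tells you character vs block, and st_rdev packs the major and minor numbers. The helper name is illustrative and the makedev numbers are arbitrary examples; /dev/null is assumed to exist.

```python
import os
import stat

def device_info(path):
    """Return (kind, major, minor) for a path; kind is 'c', 'b' or '-'."""
    st = os.stat(path)
    if stat.S_ISCHR(st.st_mode):
        kind = "c"    # character special file
    elif stat.S_ISBLK(st.st_mode):
        kind = "b"    # block special file
    else:
        kind = "-"    # not a device
    return kind, os.major(st.st_rdev), os.minor(st.st_rdev)

if __name__ == "__main__":
    print(device_info("/dev/null"))
    # Packing and unpacking a device number by hand
    dev = os.makedev(8, 1)
    print(os.major(dev), os.minor(dev))
```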

Block I/O

  • How does block IO work?
  • We want to write some bytes to a disk, eventually we must turn this into a block
  • Top level: Virtual file system
    • Came along with System V (release 4)
    • Pre VFS, you had to mount a read/write for every single file system type
    • VFS unified all of this, generic read/write/open/close
    • Cool thing that VFS brought: /proc file system! /cgroup file system! We can have file systems that don’t even represent a storage device
  • Page cache
    • There used to be a buffer cache for blocks
    • and page caches for pages
    • now we only have one page cache
    • when you make a write call to the disk, the kernel is actually taking the data to write it at a later time
    • rapid return of writes is a performance gain but a reliability loss
    • if you’ve written a lot of data, you might need to get it again! hence why it’s held in the page cache
      • how pages stay around? We also page cache our writes
      • NOT the pages that represent the memory map
      • two different areas of memory
      • page cache for regular files, page cache for block IO
      • swapped data go into page cache, even special files
      • shared memory trick? Memory map fake files in the page cache, and forks can read it back out
    • Each page is owned by an inode
      • also has an offset for that file
    • We want to periodically scan the page cache for dirty pages and flush to disk
      • bdflush in the past
    • We also want to track how long pages have been dirty and flush before too long in the cache
      • kupdate in the past
    • Now kernel runs varying number of pdflush threads
  • Mapping layer
    • Most of the action
    • FS takes the pointer to the data, allocates appropriate pages, and then turns it into blocks to be written
    • FS knows how to talk SCSI or SATA or NVMe
    • A disk is represented as a sequential list of blocks, which then we map over
    • We also have a “segment”, we should try to find the blocks that are together, which can be sent together
    • Pages are 4kb, sectors are 512 B, blocks can range between these two sizes (although pages are huge now)
    • once blocks are known, send the request to the generic block layer via submit_bh() and ll_rw_block() for multiple blocks
    • arguments are direction (r/w) and buffer heads
  • Generic block layer
    • Gets sent data from the mapping layer

  • Generic block layer combines contiguous blocks into segments called a bio (block IO unit)
  • calls generic_make_request() (submit_bio_noacct() now maybe?)
  • Then it sends it to the device driver using the strategy() routine
  • file systems
    • Stores data on disk
    • Data and metadata
    • Metadata is permissions, ownership, timestamp, size
    • Efficient data access is important
    • Usually metadata access is the most important
    • Metadata is cached on the system
    • Journaling helps (logging)
      • Like the traditional DB two-phase commit
      • First write to the log, appending to the end
      • Then you update the actual disk with the log
    • If a disk is in an inconsistent state, run fsck -> but this is SLOW
    • Most journaling filesystems use it only for metadata, but some use it for all. This is usually configurable, like in ext3
      • Ext3 without a journal is just ext2
      • You can put journals on a separate disk (since journal IO isn’t competing with disk as much)
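
A sketch of the write-then-flush trade-off from user space: os.write returns once the data sits in the page cache, and os.fsync blocks until the kernel has pushed the dirty pages for that file out to the device. The function name is made up.

```python
import os

def durable_write(path, data):
    """Write data to path and wait until it has been flushed to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # Fast: the bytes land in the page cache as dirty pages
            written = os.write(fd, view)
            view = view[written:]
        # Slow but safe: block until the dirty pages are on disk
        os.fsync(fd)
    finally:
        os.close(fd)

if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "journal.log")
        durable_write(target, b"committed\n")
        print(open(target, "rb").read())
```

Skipping the fsync keeps the fast return but accepts that a crash can lose the write.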

Files

  • inodes
    • Every file has an inode (contains all the information returned by ls -l)
    • device
    • mode
    • links
    • uid and gid
    • size in bytes
    • size in blocks
      • Block addresses are stored too: 12 direct blocks, one indirect block address, one double-indirect block address, and one triple-indirect block address
      • 15 total slots
      • the reason there’s indirection is that the direct slots only cover the first 12 KB of the file (assuming 1 KB blocks)
        • MB files therefore have to use extra blocks
        • next address points to a block that is filled with addresses
        • second indirect block points to a block that is filled with addresses that each go to another block of addresses!
        • third, so on
        • direct: 12 KB
        • indirect: 256 blocks = 256 KB
        • double: 256^2 blocks = 64 MB
        • triple: 256^3 blocks = 16 GB
        • this goes back at least to Version 7 Unix
    • atime, mtime, ctime
      • atime -> access time, last time file was open()
      • mtime -> last time file was modified
      • ctime -> last time inode was changed (ctime is NOT creation time, most unix systems don’t track creation time)
    • Multiple directory entries that point to the same inode are hard links
    • One inode per file
  • Unix timestamps
    • Since Jan 1st 1970 in UTC
    • 32 bits (signed)
    • Overflows on Jan 19th, 2038
  • Sparse files
    • What if you write some bytes into the beginning, but then jump wayyy down and write later, where there’s no data in between.
    • Logically, the bytes in between are filled with zeros
    • Don’t allocate for stuff in between
    • If you copy a sparse file, a dumb copy will write out all the zeros and fill in the holes!
  • Superblock
    • Information about the filesystem
    • Much of what is in the df command
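
The arithmetic in this section can be checked with a short Python sketch. It assumes 1 KB blocks and 4-byte block addresses (so 256 addresses per block) for the inode math, and a signed 32-bit counter for the timestamp. All function names are made up for illustration.

```python
import os
from datetime import datetime, timezone

def inode_capacity(block_size=1024, addr_size=4, direct=12):
    """Bytes reachable through each kind of block pointer in an inode."""
    per_block = block_size // addr_size  # addresses that fit in one block
    return {
        "direct": direct * block_size,
        "single": per_block * block_size,
        "double": per_block ** 2 * block_size,
        "triple": per_block ** 3 * block_size,
    }

def last_32bit_timestamp():
    """The last moment representable in a signed 32-bit Unix timestamp."""
    return datetime.fromtimestamp(2 ** 31 - 1, tz=timezone.utc)

def make_sparse_file(path, hole=1 << 20):
    """Write a few bytes, seek far ahead, write again; return (size, bytes allocated)."""
    with open(path, "wb") as f:
        f.write(b"start")
        f.seek(hole)          # skip ahead without writing anything in between
        f.write(b"end")
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks counts 512-byte units

if __name__ == "__main__":
    import tempfile
    print(inode_capacity())
    print(last_32bit_timestamp().isoformat())
    with tempfile.TemporaryDirectory() as d:
        print(make_sparse_file(os.path.join(d, "sparse.bin")))
```

On a filesystem that supports holes, the allocated byte count should come out much smaller than the logical size, though that depends on the filesystem.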

Networking, Configuring & Building a Kernel

  • OSI model -> physical, data link, network, transport, session, presentation, application - networks
  • What’s a socket?
    • IP address and port on each end, which uniquely identifies a connection
  • Network layer implements IP to route packets, but this is machine to machine
    • Intermediate devices bridge this gap (usually a router)
  • Ethernet/data link layer is what’s next, also packet switched with ethernet address 00:11:22:33:44:55 (mac address), NIC to NIC
  • Networking - physical layer, signal level
    • CSMA/CD (carrier sense multiple access with collision detect)
      • old: 4 way stop sign, everyone tries to transmit and hope that there’s nothing there
      • old: token rings, everyone gets a turn like a traffic light
    • carrier modulates an IO signal
    • twisted pair cables
    • a network device is a special animal
    • not a character or a block device
    • no major/minor numbers
    • hardware interface initiates IO
    • not in /dev
    • when a device driver gets registered, it adds the device to the internal list
    • interface is also different!
      • open
      • stop
      • hard_start_xmit() - start transmit packets
      • rebuild_header - using ARP (address resolution protocol)
      • tx_timeout
      • net_device_status - when netstat gets called
      • set_config
    • also ioctl
      • SIOCSIFADDR
      • SIOCSIFFLAGS
      • “socket IO control set InterFace Flags”
  • Packet transmission
    • Data is placed on a socket buffer
    • call hard_start_xmit
    • OS checks to make sure you’re doing less than the Maximum Transmission Unit (MTU)
    • Places in transmission queue
    • Driver processes queue and sends to hardware
  • Packet Reception
    • Two ways
      • Interrupt driven (most drivers use this one)
      • polled
    • Receives incoming data, places it in socket buffer
    • If IO rate is too high, use polling
  • Address Resolution Protocol
    • Associates IP with a MAC address
    • Broadcasts an ARP request on the network: “who has 10.1.2.3”?
    • Saves entries into the ARP cache
    • Manipulate with arp
  • Configure Interface
    • always boils down to ifconfig(8)
  • Linux Packet Filter
    • Basically the “firewall”, now managed by iptables or ufw

      • ipchains and ipfwadm
    • Sets a list of rules, accept a match and execute an action

    • Three basic chains: input, forward, and output

      • Each packet only follows one chain!
      • there’s a default rule for every chain
    • Connection tracking?

      • ip_conntrack.o module
    • List all the rules with iptables -L -v

  • Router table
    • Kernel has a routing function?
    • Usually a machine has multiple interfaces, so it needs a routing table to pick one
    • Order matters, and search until we match a pattern
    • netstat -nr also shows you the router table (-n for hosts as IPs instead of having it do DNS)
    • route command for checking the router table
  • Kernel Build
    • go to kernel.org
    • need bunzip2 to decompress
    • can build as anyone, only need root to install
    • why?
      • add a new feature/driver, or your driver was left out
      • or to leave out specific drivers
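
Going back to what a socket is: a small sketch that opens a TCP connection over loopback and prints the address and port on each end, which together identify the connection. It assumes 127.0.0.1 is usable and lets the OS pick the ports; loopback_connection is a made-up name.

```python
import socket

def loopback_connection():
    """Open a TCP connection to ourselves and return both ends' addresses."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))   # port 0: let the kernel choose
        server.listen(1)
        with socket.create_connection(server.getsockname()) as client:
            conn, _ = server.accept()
            with conn:
                return {
                    "client_local": client.getsockname(),
                    "client_remote": client.getpeername(),
                    "server_local": conn.getsockname(),
                    "server_remote": conn.getpeername(),
                }

if __name__ == "__main__":
    for name, addr in loopback_connection().items():
        print(name, addr)
```

Each side's local (IP, port) pair is the other side's remote pair.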

Crashes

Exceptions

  • Exceptions can happen because of an invalid access or when a runtime check fails. An invalid memory access raises SIGSEGV; a failed runtime check that calls abort() raises SIGABRT
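
A sketch of how a parent can see which signal ended a child, using fork and waitpid from Python (Unix only). Here the child calls os.abort(), which raises SIGABRT in the calling process; the helper name is hypothetical.

```python
import os
import signal

def signal_that_killed_child(child_action):
    """Run child_action in a forked child; return the signal that killed it, or None."""
    pid = os.fork()
    if pid == 0:
        child_action()
        os._exit(0)               # reached only if the action did not kill us
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return signal.Signals(os.WTERMSIG(status))
    return None

if __name__ == "__main__":
    print(signal_that_killed_child(os.abort))
    print(signal_that_killed_child(lambda: None))
```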

Tracing

Debian

perf performance issues

Linux Audio

LD_PRELOAD and LD_LIBRARY_PATH

Linux Kernel Headers

Cellular Modems

io_uring

Huge Pages

  • https://www.evanjones.ca/hugepages-are-a-good-idea.html

Monitoring Tools

Guide: https://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html

Perf

Mounting

cgroups and names

Objects

  • objdump to look at ELF files