
Journey to the Center of the Linux Kernel: Traffic Control, Shaping and QoS

Julien Vehent [http://jve.linuxwall.info] - see revisions [http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:networking:traffic_control?do=revisions]

1 Introduction

This document describes the Traffic Control subsystem of the Linux Kernel in depth, algorithm by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system. Throughout the chapters, we will discuss both the theory behind Traffic Control and its usage, and demonstrate how one can gain complete control over the packets passing through one's system.

[a QoS graph]

The initial target of this paper was to gain better control over a small DSL uplink, and it grew over time to cover a lot more than that. 99% of the information provided here can be applied to any type of server, as well as routers, firewalls, etc.

The Traffic Control topic is large and in constant evolution, as is the Linux Kernel. The real credit goes to the developers behind the /net directory of the kernel, and to all of the researchers who created and improved all of these algorithms. This is merely an attempt to document some of this work for the masses. Any participation and comments are welcome, in particular if you spot an inconsistency somewhere. Please email julien [at] linuxwall.info, your messages are always most appreciated.

For the technical discussion, since the LARTC mailing list doesn't exist anymore, try those two:

* Netfilter users mailing list [http://vger.kernel.org/vger-lists.html#netfilter] for general discussions
* NetDev mailing list [http://vger.kernel.org/vger-lists.html#netdev] is where the magic happens (developers' ML)

2 Motivation

This article was initially published in the French issue of GNU/Linux Magazine France #127, in May 2010. GLMF is kind enough to provide a contract that releases the content of the article under Creative Commons after some time. I have extended the initial article quite a bit since, but you can still find the original French version here
[http://wiki.linuxwall.info/doku.php/fr:ressources:dossiers:networking:traffic_control]

My interest in the traffic shaping subsystem of Linux started around 2005, when I decided to host most of the services I use myself. I read the documentation available on the subject (essentially LARTC [http://lartc.org/]) but found it incomplete, and ended up reading the original research publications and the source code of Linux.

I host most of my Internet services myself, at home or on small end servers (dedibox and so on). This includes web hosting (this wiki, some blogs and a few websites), SMTP and IMAP servers, some FTP servers, XMPP, DNS and so on. French ISPs allow this, but only give you 1Mbps/128KBps of uplink, which corresponds to the TCP ACK rate necessary for a 20Mbps downlink.

1Mbps is enough for most usage, but without traffic control, any weekly backup to an external location fills up the DSL link and slows down everyone on the network. During that time, both the visitors of this wiki and my wife chatting on Skype will experience high latency. This is not acceptable, because the priority of the weekly backup is a lot lower than the two others. Linux gives you the flexibility to shape the network traffic and use all of your bandwidth efficiently, without penalizing real-time applications. But this comes with a price, and the learning curve to implement an efficient traffic control policy is quite steep. This document provides an accurate and comprehensive introduction to the most used QoS algorithms, and gives you the tools to implement and understand your own policy.

3 The basics of Traffic Control

In the Internet world, everything is packets. Managing a network means managing packets: how they are generated, routed, transmitted, reordered, fragmented, etc. Traffic Control works on packets leaving the system. It doesn't, initially, have as an objective to manipulate packets entering the system (although you could do that, if you really want to slow down the rate at which you receive packets). The Traffic Control code operates between the IP layer and the hardware driver that transmits data on the network. We are discussing a portion of code that works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is the very one in charge of constantly furnishing packets to send to the device driver. It means that the TC module, the packet scheduler, is permanently activated in the kernel. Even when you do not explicitly want to use it, it's there, scheduling packets for transmission. By default, this scheduler maintains a basic queue (similar to a FIFO type queue) in which the first packet arrived is the first to be transmitted.

At the core, TC is composed of queuing disciplines, or qdiscs, that represent the scheduling policies applied to a queue. Several types of qdiscs exist. If you are familiar with the way CPU schedulers work, you will find that some of the TC qdiscs are similar. We have FIFO, FIFO with multiple queues, FIFO with hash and round robin (SFQ). We also have a Token Bucket Filter (TBF) that assigns tokens to a qdisc to limit its flow rate (no token = no transmission = wait for a token). This last algorithm was then extended to a hierarchical mode called HTB (Hierarchical Token Bucket). And there are also Quick Fair Queuing (QFQ), Hierarchical Fair Service Curve (HFSC), Random Early Detection (RED), etc. For a complete list of algorithms, check out the source code at kernel.org
[http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=tree;f=net/sched;hb=HEAD].

3.1 First contact

Let's skip the theory for now and start with an easy example. We have a web server on which we would like to limit the flow rate of packets leaving the server. We want to fix that limit at 200 kilobits per second (25 KB/s). This setup is fairly simple, and we need three things:

1. a Netfilter rule to mark the packets that we want to limit
2. a Traffic Control policy
3. a filter to bind the packets to the policy

3.2 Netfilter MARK

Netfilter can be used to interact directly with the structure representing a packet in the kernel. This structure, the sk_buff [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob;f=include/linux/skbuff.h], contains a field called __u32 nfmark that we are going to modify. TC will then read that value to select the destination class of a packet. The following iptables rule will apply the mark '80' to outgoing packets (OUTPUT chain) sent by the web server (TCP source port is 80).
# iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80

We can control the application of this rule via the Netfilter statistics:

# iptables -L OUTPUT -t mangle -v
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out   source    destination
73896  109M MARK   tcp  --  any eth0  anywhere  anywhere    tcp spt:www MARK xset 0x50/0xffffffff

You probably noticed that the rule is located in the mangle table. We will go back to that a little bit later.

3.3 Two classes in a tree

To manipulate TC policies, we need the /sbin/tc binary from the **iproute** package [http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2] (aptitude install iproute). The iproute package must match your kernel version. Your distribution's package manager will normally take care of that. We are going to create a tree that represents our scheduling policy, and that uses the HTB scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one for everything else.

# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500

The two classes are attached to the root. Each class has a guaranteed bandwidth (rate value) and an opportunistic bandwidth (ceil value). If the totality of the bandwidth is not used, a class is allowed to increase its flow rate up to the ceil value; otherwise, the rate value is applied. It means that the sum of the rate values must correspond to the total bandwidth available. In the previous example, we consider the total upload bandwidth to be 1024 kbit/s, so class 10 (web server) gets 200 kbit/s and class 20 (everything else) gets 824 kbit/s. TC can use both kbit and kbps notations, but they don't have the same meaning: kbit is the rate in kilobits per second, and kbps is in kilobytes per second. In this article, I will use the kbit notation only.

3.4 Connecting the marks to the tree

We now have a traffic shaping policy on one side, and packet marking on the other. To connect the two, we need a filter. A filter is a rule that identifies packets (handle parameter) and directs them to a class (fw flowid parameter). Since several filters can work in parallel, they can also have a priority. A filter must be attached to the root of the QoS policy, otherwise it won't be applied.

# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid 1:10

We can test the policy using a simple client/server setup. Netcat is very useful for such testing. Start a listening process on the server that applies the policy using:

# nc -l -p 80 < /dev/zero

And connect to it from another machine using:

# nc 192.168.1.1 80 > /dev/null

The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will receive them and throw them away, also as fast as it can. Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right corner).

The value is 199.20 kbit/s, which is close enough to the 200 kbit/s target. The precision of the scheduler depends on a few parameters that we will discuss later on. Any other connection from the server that uses a source port different from TCP/80 will have a flow rate between 824 kbit/s and 1024 kbit/s (depending on the presence of other connections in parallel).

4 Twenty Thousand Leagues Under the Code

Now that we have enjoyed this first contact, it is time to go back to the fundamentals of the Quality of Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic control subsystem. Later on, we will use that knowledge to build our own policy.

The code of TC is located in the net/sched directory of the sources of the kernel. The kernel separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we said earlier, it is the responsibility of the TC module to manage the egress path.

The illustration below shows the path of a packet inside the kernel, where it enters (ingress) and where it leaves (egress). If we focus on the egress path, a packet arrives from layer 4 (TCP, UDP, ...) and then enters the IP layer (not represented here). The Netfilter chains OUTPUT and POSTROUTING are integrated in the IP layer and are located between the IP manipulation functions (header creation, fragmentation, ...). At the exit of the NAT table of the POSTROUTING chain, the packet is transmitted to the egress queue, and this is where TC starts its work.

Almost all devices use a queue to schedule the egress traffic. The kernel possesses algorithms that can manipulate these queues; they are the queuing disciplines (FIFO, SFQ, HTB, ...). The job of TC is to apply these queuing disciplines to the egress queue in order to select a packet for transmission.

TC works with the sk_buff [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob;f=include/linux/skbuff.h] structure that represents a packet in the kernel. It doesn't manipulate the packet data directly. sk_buff is a shared structure accessible anywhere in the kernel, thus avoiding unnecessary duplication of data. This method is a lot more flexible and a lot faster, because sk_buff contains all of the necessary information to manipulate the packet; the kernel therefore avoids copying headers and payloads between different memory areas, which would ruin performance.

On a regular basis, the packet scheduler wakes up and launches the preconfigured scheduling algorithms to select a packet to transmit. Most of the work is launched by the function dev_queue_xmit from net/core/dev.c, which only takes a sk_buff structure as input (this is enough: sk_buff contains everything needed, such as skb->dev, a pointer to the output NIC). dev_queue_xmit makes sure the packet is ready to be transmitted on the network, that the fragmentation is compatible with the capacity of the NIC, and that the checksums are calculated (if this is handled by the NIC, this step is skipped). Once those controls are done, and if the device has a queue in skb->dev->qdisc, then the sk_buff structure of the packet is added to this queue (via the enqueue function) and the qdisc_run function is called to select a packet to send. This means that the packet that has just been added to the NIC's queue might not be the one immediately transmitted, but we know that it is ready for subsequent transmission the moment it is added to the queue. To each device is attached a root queuing discipline. This is what we defined earlier when creating the root qdisc to limit the flow rate of the web server:
# tc qdisc add dev eth0 root handle 1: htb default 20

This command means: attach a root qdisc identified by id 1 to the device eth0, use htb as a scheduler, and send everything to class 20 by default. We will find this qdisc at the pointer skb->dev->qdisc. The enqueue and dequeue functions are also located there, respectively in skb->dev->qdisc->enqueue() and skb->dev->qdisc->dequeue(). This last dequeue function is in charge of forwarding the sk_buff structure of the packet selected for transmission to the NIC's driver. The root qdisc can have leaves, known as classes, that can also possess their own qdisc, thus constructing a tree.
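To make the enqueue/dequeue contract easier to picture, here is a minimal Python sketch of the interface a qdisc exposes to the rest of the stack. This is only an illustration, not kernel code; the class name and the limit value are invented for the example, and the default behaviour shown is a plain FIFO.

from collections import deque

class FifoQdisc:
    """Toy model of the default behaviour: first packet in, first packet out."""
    def __init__(self, limit=1000):
        self.queue = deque()
        self.limit = limit          # similar in spirit to txqueuelen

    def enqueue(self, skb):
        if len(self.queue) >= self.limit:
            return False            # queue full: the packet is dropped
        self.queue.append(skb)
        return True

    def dequeue(self):
        # Called when the driver can transmit: hand over the next packet, or None.
        return self.queue.popleft() if self.queue else None

# The driver side only ever sees enqueue()/dequeue(), whatever the scheduling policy.
qdisc = FifoQdisc()
for pkt in ["pkt1", "pkt2", "pkt3"]:
    qdisc.enqueue(pkt)
print(qdisc.dequeue())   # -> pkt1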

4.1 Classless Disciplines

This is a pretty long way down. For those who wish to deepen the subject, I recommend reading Understanding Linux Network Internals, by Christian Benvenuti, at O'Reilly.

We now have a dequeue function whose role is to select the packet to send to the network interface. To do so, this function calls the scheduling algorithms that we are now going to study. There are two types of algorithms: classless and classful. The classful algorithms are composed of qdiscs that can contain classes, like we did in the previous example with HTB. In opposition, classless algorithms cannot contain classes, and are (supposedly) simpler.

4.1.1 PFIFO_FAST

Let's start with a small one. pfifo_fast is the default scheduling algorithm, used when no other is explicitly defined. In other words, it's the one used on 99.9% of Linux systems. It is classless and a tiny bit more evolved than a basic FIFO queue, since it implements 3 bands working in parallel. These bands are numbered 0, 1 and 2 and emptied sequentially: while 0 is not empty, 1 will not be processed, and 2 will be the last. So we have 3 priorities: the packets in queue 0 will be sent before the packets in queue 2. The kernel uses the Type of Service field (the 8-bit field from bit 8 to bit 15 of the IP header, see below) to select the destination band of a packet.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Example Internet Datagram Header, from RFC 791

This algorithm is defined in **net/sched/sch_generic.c** [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_generic.c;hb=HEAD] and represented in the diagram below.

[dia source]

The length of a band, representing the number of packets it can contain, is set to 1000 by default and defined outside of TC. It's a parameter that can be set using ifconfig, and visualized in /sys:


# cat /sys/class/net/eth0/tx_queue_len
1000

Once the default value of 1000 is passed, TC will start dropping packets. This should very rarely happen, because TCP makes sure to adapt its sending speed to the capacity of both systems participating in the communication (that's the role of the TCP slow start). But experiments showed that increasing that limit to 10,000, or even 100,000, in some very specific cases of gigabit networks can improve performance. I wouldn't recommend touching this value unless you really know what you are doing: increasing a buffer size to a too-large value can have very negative side effects on the quality of a connection. Jim Gettys called that bufferbloat, and we will talk about it in the last chapter. Because this algorithm is classless, it is not possible to plug another scheduler after it.
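As an illustration of the three-band behaviour, here is a toy Python model. The real mapping from the ToS/priority value to a band is the prio2band table in net/sched/sch_generic.c; the sketch below skips that mapping and simply takes a band number as input.

from collections import deque

BANDS = 3

class PfifoFastSketch:
    def __init__(self):
        self.bands = [deque() for _ in range(BANDS)]

    def enqueue(self, packet, band):
        # band 0 is the most urgent, band 2 the least
        self.bands[band].append(packet)

    def dequeue(self):
        # Strict priority: band 1 is only served when band 0 is empty,
        # and band 2 only when bands 0 and 1 are both empty.
        for band in self.bands:
            if band:
                return band.popleft()
        return None

q = PfifoFastSketch()
q.enqueue("bulk", 2)
q.enqueue("interactive", 0)
print(q.dequeue())   # -> "interactive", even though "bulk" arrived first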

4.1.2 SFQ - Stochastic Fairness Queuing

Stochastic Fairness Queuing is an algorithm that shares the bandwidth without giving any privilege of any sort. The sharing is fair by being stochastic, or random if you prefer. The idea is to take a fingerprint, or hash, of the packet based on its header, and to use this hash to send the packet to one of the 1024 available buckets. The buckets are then emptied in a round-robin fashion.

The main advantage of this method is that no packet will have priority over another one. No connection can take over the others, and everybody has to share. The repartition of the bandwidth across the packets will almost always be fair, but there are some minor limitations. The main one is that the hashing algorithm might produce the same result for several packets, and thus send them to the same bucket. One bucket will then fill up faster than the others, breaking the fairness of the algorithm. To mitigate this, SFQ modifies the parameters of its hashing algorithm on a regular basis, by default every 10 seconds.

The diagram below shows how the packets are processed by SFQ, from entering the scheduler at the top, to being dequeued and transmitted at the bottom. The source code is available in net/sched/sch_sfq.c [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_sfq.c;hb=HEAD]. Some variables are hardcoded in the source code:

* SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets and defaults to 1024
* SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets
#define SFQ_DEPTH                  128 /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR  1024

These two values determine the maximum number of packets that can be queued at the same time: 1024 * 128 = 131072 packets can be waiting in the SFQ scheduler before it starts dropping packets. But this theoretical number is very unlikely to ever be reached, because it would require the algorithm to fill up every single bucket evenly before starting to dequeue or drop packets. The tc command line lists the options that can be fed to SFQ:


# tc qdisc add sfq help
Usage: ... sfq [ limit NUMBER ] [ perturb SECS ] [ quantum BYTES ]

* limit: the size of the buckets. It can be reduced below SFQ_DEPTH but not increased past it. If you try to put a value above 128, TC will simply ignore it.
* perturb: the frequency, in seconds, at which the hashing algorithm is updated.
* quantum: the maximum amount of bytes that the round-robin dequeue process is allowed to dequeue from a bucket. At a bare minimum, this should be equal to the MTU of the connection (1500 bytes on Ethernet). Imagine that we set this value to 500 bytes: all packets bigger than 500 bytes would never be dequeued by SFQ and would stay in their bucket. We would, very quickly, arrive at a point where all buckets are blocked and no packets are transmitted anymore.

4.1.2.1 SFQ hashing algorithm

[DIA source]

The hash used by SFQ is computed by the sfq_hash function (see the source code), and takes into account the source and destination IP addresses, the layer 4 protocol number (also in the IP header), and the TCP ports (if the IP packet is not fragmented). Those parameters are mixed with a random number regenerated every 10 seconds (the perturb value defines that). Let's walk through a simplified version of the algorithm's steps. The connection has the following parameters:

* source IP: 126.255.154.140
* source port: 8080
* destination IP: 175.112.129.215
* destination port: 2146


/* IP source address in hexadecimal */
h1 = 7efffef0

/* IP destination address in hexadecimal */
h2 = af7081d7

/* 06 is the protocol number for TCP (bits 72 to 80 of the IP header).
   We perform a XOR between the variable h2 obtained in the previous step
   and the TCP protocol number */
h2 = h2 XOR 06

/* if the IP packet is not fragmented, we include the TCP ports in the hash */
/* 1f900862 is the hexadecimal representation of the source and destination ports.
   We perform another XOR with this value and the h2 variable */
h2 = h2 XOR 1f900862

/* And finally, we use the Jenkins algorithm with some additional "golden numbers".
   This jhash function is defined somewhere else in the kernel source code */
h = jhash(h1, h2, perturbation)

The result obtained is a 32-bit hash value that SFQ uses to select the destination bucket of the packet. Because the perturb value is regenerated every 10 seconds, the packets from a reasonably long connection will be directed to different buckets over time.

But this also means that SFQ might break the sequencing of the sending process: if two packets from the same connection are placed in two different buckets, it is possible that the second bucket will be processed before the first one, therefore sending the second packet before the first one. This is not a problem for TCP, which uses sequence numbers to reorder the packets when they reach their destination, but for UDP it might be. For example, imagine that you have a syslog daemon sending logs to a central syslog server. With SFQ, it might happen that a log message arrives before the one that precedes it. If you don't like it, use TCP.

TC gives us some more flexibility on the hashing algorithm. We can, for example, modify the fields considered by the hashing process. This can be done using TC filters, as follows:

# tc qdisc add dev eth0 root handle 1: sfq perturb 10 quantum 3000 limit 64
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 flow hash keys src,dst divisor 1024

The filter above simplifies the hash to keep only the source and destination IP addresses as input parameters. The divisor value is the number of buckets, as seen before. We could, then, create a SFQ scheduler that works with 10 buckets only and considers only the IP addresses of the packets in the hash. This discipline is classless as well, which means we cannot direct packets to another scheduler when they leave SFQ. Packets are transmitted to the network interface only.
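To visualize how the perturbation spreads a flow across buckets over time, here is a toy Python model. The real implementation uses the Jenkins hash on the fields listed above; this sketch uses hashlib purely for illustration, and the flow values are the ones from the walkthrough.

import hashlib

DIVISOR = 1024   # number of buckets, like SFQ_DEFAULT_HASH_DIVISOR

def bucket(src_ip, dst_ip, proto, sport, dport, perturbation):
    # Mix the flow identifiers with the perturbation value, then reduce
    # the digest to a bucket index, mimicking sfq_hash() modulo the divisor.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}|{perturbation}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % DIVISOR

flow = ("126.255.154.140", "175.112.129.215", 6, 8080, 2146)
print(bucket(*flow, perturbation=1))   # one bucket...
print(bucket(*flow, perturbation=2))   # ...and, most likely, a different one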

4.2 Classful Disciplines

4.2.1 TBF - Token Bucket Filter

Until now, we looked at algorithms that do not allow controlling the amount of bandwidth. SFQ and PFIFO_FAST give the ability to smooth the traffic, and even to prioritize it a bit, but not to control its throughput. In fact, the main problem when controlling the bandwidth is to find an efficient accounting method. Because counting in memory is extremely difficult and costly to do in real time, computer scientists took a different approach here. Instead of counting the packets (or the bits transmitted by the packets, it's the same thing), the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. This is disconnected from the actual packet transmission, but when a packet enters the scheduler, it will consume a certain number of tokens. If there are not enough tokens for it to be transmitted, the packet waits.

Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now have to look at the bits contained in the packets. Let's take an example: a packet carrying 8000 bits (1 KB) wishes to be transmitted. It enters the TBF scheduler, and TBF checks the content of its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass; otherwise, the packet waits until the bucket has enough tokens. The frequency at which tokens are added into the bucket determines the transmission speed, or rate. This is the parameter at the core of the TBF algorithm, shown in the diagram below.

[DIA source]

Another particularity of TBF is to allow bursts. This is a natural side effect of the algorithm: the bucket fills up at a continuous rate, but if no packets are being transmitted for some time, the bucket will get completely full. Then, the next packets to enter TBF will be transmitted right away, without having to wait and without having any limit applied to them, until the bucket is empty. This is called a burst, and in TBF the burst parameter defines the size of the bucket. So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully loaded packets (roughly 124 KBytes if they all carry their maximum MTU) traverse the scheduler without applying any sort of limit to them.

To overcome this problem, and provide better control over the bursts, TBF implements a second bucket, smaller and generally the same size as the MTU. This second bucket cannot store a large amount of tokens, but its replenishing rate is a lot faster than the one of the big bucket. This second rate is called peakrate, and it determines the maximum speed of a burst.

Let's take a step back and look at those parameters again. We have:

* peakrate > rate: the second bucket fills up faster than the main one, to allow and control bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't exist: packets are dequeued according to the main bucket, at the speed of rate.
* burst > MTU: the size of the first bucket is a lot larger than the size of the second bucket. If burst is equal to the MTU, then peakrate is equal to rate and there is no burst.

So, to summarize: when everything works smoothly, packets are enqueued and dequeued at the speed of rate. If tokens are available when packets enter TBF, those packets are transmitted at the speed of peakrate until the first bucket is empty. This is represented in the diagram above, and in the source code at net/sched/sch_tbf.c [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_tbf.c;hb=HEAD], the interesting function being tbf_dequeue.
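Here is a minimal Python sketch of the single-bucket logic described above (rate only, no peakrate or second bucket; units are bytes and bytes per second, and the values are purely illustrative):

import time

class TokenBucket:
    """Toy single-bucket TBF: tokens accumulate at `rate` bytes/s, up to `burst` bytes."""
    def __init__(self, rate, burst):
        self.rate = rate            # replenishing speed, in bytes per second
        self.burst = burst          # bucket size, in bytes
        self.tokens = burst         # start with a full bucket
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_send(self, packet_len):
        # Return True if the packet may pass now, False if it has to wait.
        self._refill()
        if self.tokens >= packet_len:
            self.tokens -= packet_len
            return True
        return False

# 200 kbit/s is 25000 bytes/s; allow bursts of up to 10 full-size packets.
tbf = TokenBucket(rate=25000, burst=15000)
print(tbf.try_send(1500))   # True: passes immediately thanks to the initial burst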

The configurable options of the TBF scheduler are listed in TC:

# tc qdisc add tbf help
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
              [ peakrate KBPS ] [ latency TIME ] [ overhead BYTES ] [ linklayer TYPE ]

We recognize burst, rate, mtu and peakrate that we discussed above. The limit parameter is the size of the packet queue (see diagram above). latency represents another way of calculating the limit parameter, by setting the maximum time a packet can wait in the queue (the size of the queue is then derived from it; the calculation includes all of the values of burst, rate, peakrate and mtu). overhead and linklayer are two other parameters whose story is quite interesting. Let's take a look at those now.

4.2.1.1 DSL and ATM, the Ox that believed to be a Frog

If you have ever read Jean de La Fontaine [http://en.wikipedia.org/wiki/Jean_de_La_Fontaine], you probably know the story of the frog that wanted to be as big as an ox. Well, in our case, it's the opposite, and your packets might not be as small as they think they are. While most local networks use Ethernet, up until recently a lot of communications (at least in Europe) were done over the ATM protocol. Nowadays, ISPs are moving towards all-over-IP, but ATM is still around. The particularity of ATM is to split large Ethernet packets into much smaller ones, called cells. A 1500-byte Ethernet packet is split into ~30 smaller ATM cells of just 53 bytes each. And of those 53 bytes, only 48 are from the original packet; the rest is occupied by the ATM headers. So where is the problem? Consider the following network topology.

The QoS box is in charge of performing the packet scheduling before transmitting to the modem. The packets are then split by the modem into ATM cells. So our initial 1.5 KB Ethernet packet is split into 32 ATM cells, for a total size of 32 * 5 bytes of header per cell + 1500 bytes of data = (32*5) + 1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500: when ATM is used, we lose about 10% of bandwidth compared to an Ethernet network (this is an estimate that depends on the average packet size, etc).

If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the Ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem will start queuing, and eventually dropping, packets. The TCP stacks will have to adjust their speed, traffic gets erratic, and we lose the benefit of TC as a traffic shaper.

Jesper Dangaard Brouer did his Master's thesis on this topic, and wrote a few patches for the kernel and TC. These patches implement the overhead and linklayer parameters, and can be used to inform the TBF scheduler of the type of link to account for.

* overhead represents the quantity of bytes added by the ATM headers, 5 by default
* linklayer defines the type of link, either ethernet or {atm, adsl}. atm and adsl are the same thing and represent a 5-byte header overhead per cell.

We can use these parameters to fine-tune the creation of a TBF scheduling discipline:
# tc qdisc add dev eth0 root tbf rate 1mbit burst 10k latency 30ms linklayer atm
# tc -s qdisc show dev eth0
qdisc tbf 8005: root rate 1000Kbit burst 10Kb lat 30.0ms
 Sent 738 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
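As a side note, the ~10% overhead estimate used earlier can be reproduced with a few lines of Python (same simplification as in the text: only the 5 bytes of header per cell are counted, and the padding of the last, partially filled cell is ignored):

import math

MTU = 1500          # Ethernet payload handed to the modem
CELL_PAYLOAD = 48   # bytes of data carried by each 53-byte ATM cell
CELL_HEADER = 5     # bytes of ATM header per cell

cells = math.ceil(MTU / CELL_PAYLOAD)   # 32 cells for a full-size packet
on_wire = MTU + cells * CELL_HEADER     # (32 * 5) + 1500 = 1660 bytes
print(cells, on_wire, (on_wire / MTU - 1) * 100)   # about 10% more than the MTU alone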

And by setting the linklayer to atm, we take into account an overhead of 5 bytes per cell later in the transmission, therefore preventing the modem from buffering the packets.

4.2.1.2 The limits of TBF

TBF gives a pretty accurate control over the bandwidth assigned to a qdisc. But it also imposes that all packets pass through a single queue. If a big packet is blocked because there are not enough tokens to send it, smaller packets that could potentially be sent instead are blocked behind it as well. This is the case represented in the diagram above, where packet #2 is stuck behind packet #1. We could optimize the bandwidth usage by allowing the smaller packet to be sent instead of the bigger one. We would, however, fall into the same problem of reordering packets that we discussed with the SFQ algorithm. The other solution would be to give more flexibility to the traffic shaper: declare several TBF queues in parallel, and route the packets to one or the other using filters. We could also allow those parallel queues to borrow tokens from each other, in case one is idle and the other one is not. We just prepared the ground for classful qdiscs, and the Hierarchical Token Bucket.

4.2.2 HTB - Hierarchical Token Bucket

The Hierarchical Token Bucket (HTB) is an improved version of TBF that introduces the notion of classes. Each class is, in fact, a TBF-like qdisc, and classes are linked together as a tree, with a root and leaves. HTB introduces a number of features to improve the management of bandwidth, such as priorities between classes, a way to borrow bandwidth from another class, or the possibility to plug another qdisc as an exit point (a SFQ, for example). Let's take a simple example, represented in the diagram below.

[diagram source: htb_en.dia.zip]

The tree is created with the commands below:

# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500

HTB uses a terminology similar to TBF and SFQ:

* burst is identical to the burst of TBF: it's the size of the token bucket
* rate is the speed at which tokens are generated and put in the bucket, the speed of the leaf, like in TBF
* quantum is similar to the quantum discussed in SFQ: it's the amount of bytes to serve from the leaf at once

The new parameters are ceil and cburst. Let us walk through the tree to see how they work. In the example above, we have a root qdisc, handle 1, and two leaves, qdisc handle 10 and qdisc handle 20. The root will apply filters to decide where a packet should be directed (we will discuss those later), and by default packets are sent to leaf #20 (default 20). Leaf #10 has a rate value of 200kbit/s, a ceil value of 400kbit/s (which means it can borrow 200kbit/s more than its rate) and a priority (prio) of 1. Leaf #20 has a rate value of 200kbit/s, a ceil value of 200kbit/s (which means it cannot borrow anything, rate == ceil) and a priority of 2.

Each HTB leaf will, at any point, have one of the three following statuses:

* HTB_CANT_SEND: the class can neither send nor borrow, no packets are allowed to leave the class
* HTB_MAY_BORROW: the class cannot send using its own tokens, but can try to borrow from another class
* HTB_CAN_SEND: the class can send using its own tokens

Imagine a group of packets that enter TC and are marked with the flag #10, and are therefore directed to leaf #10. The bucket for leaf #10 does not contain enough tokens to let the first packets pass, so it will try to borrow some from its neighbor, leaf #20. The quantum value of leaf #10 is set to the MTU (1500 bytes), which means the maximal amount of data that leaf #10 will try to send is 1500 bytes. If packet #1 is 1400 bytes large, and the bucket in leaf #10 has enough tokens for 1000 bytes, then the leaf will try to borrow the remaining 400 bytes from its neighbor, leaf #20.

The quantum is the maximal amount of bytes that a leaf will try to send at once. The closer the value is to the MTU, the more accurate the scheduling will be, because we reschedule after every 1500 bytes. And the larger the value of quantum is, the more a leaf will be privileged: it will be allowed to borrow more tokens from its neighbor. But of course, since the total amount of tokens in the tree is not unlimited, if a token is borrowed from a leaf, another leaf cannot use it anymore. Therefore, the bigger the value of quantum is, the more a leaf is able to steal from its neighbors. This is tricky, because those neighbors might very well have packets to send as well.

When configuring TC, we do not manipulate the value of quantum directly. There is an intermediary parameter called r2q that calculates the quantum automatically based on the rate: quantum = rate / r2q. By default, r2q is set to 10, so for a rate of 200kbit, quantum will have a value of 20kbit. For very small or very large bandwidths, it is important to tune r2q properly: if r2q is too large, too many packets will leave a queue at once; if r2q is too small, not enough packets are sent. One important detail is that r2q is set on the root qdisc once and for all. It cannot be configured for each leaf separately.

TC offers the following configuration options for HTB:

Usage: ... qdisc add ... htb [default N] [r2q N]
 default  minor id of class to which unclassified packets are sent {0}
 r2q      DRR quantums are computed as rate in Bps/r2q {10}
 debug    string of 16 numbers each 0-3 {0}

... class add ... htb rate R1 [burst B1] [mpu B] [overhead O]
                      [prio P] [slot S] [pslot PS]
                      [ceil R2] [cburst B2] [mtu MTU] [quantum Q]
 rate     rate allocated to this class (class can still borrow)
 burst    max bytes burst which can be accumulated during idle period {computed}
 mpu      minimum packet size used in rate computations
 overhead per-packet size overhead used in rate computations
 linklay  adapting to a linklayer e.g. atm
 ceil     definite upper class rate (no borrows) {rate}
 cburst   burst but for ceil {computed}
 mtu      max packet size we create rate map for {1600}
 prio     priority of leaf; lower are served first {0}
 quantum  how much bytes to serve from leaf at once {use r2q}
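As a quick check of the quantum = rate / r2q relationship described above: as the help text says, the kernel derives the quantum from the rate expressed in bytes per second. The numbers below are illustrative.

# quantum is derived from the class rate and the root qdisc's r2q parameter
def quantum(rate_Bps, r2q=10):
    return rate_Bps / r2q

# 200 kbit/s is 25000 bytes per second, so with the default r2q of 10
# the quantum is 2500 bytes (the "20kbit" mentioned above), roughly 2 full packets.
print(quantum(25000))         # -> 2500.0
print(quantum(25000, r2q=2))  # a smaller r2q gives a larger, more privileged quantum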

As you can see, we are now familiar with all of those parameters. If you just jumped to this section without reading about SFQ and TBF, please read those chapters for a detailed explanation of what those parameters do. Remember that, when configuring leaves in HTB, the sum of the rates of the leaves cannot be higher than the rate of the root. It makes sense, right?

4.2.2.1 Hysteresis and HTB

Hysteresis. If this barbarian word is not familiar to you, as it wasn't to me, here is how Wikipedia defines it: "Hysteresis is the dependence of a system not just on its current environment but also on its past." Hysteresis is a side effect introduced by an optimization of HTB. In order to reduce the load on the CPU, HTB initially did not recalculate the content of the buckets often enough, therefore allowing some classes to consume more tokens than they actually held, without borrowing. The problem was corrected, and a parameter was introduced to allow or block the usage of estimates in HTB calculations. The kernel developers kept the optimization feature simply because it can prove useful in high-traffic networks, where recalculating the content of the bucket each time is simply not doable. But in most cases, this optimization is simply deactivated, as shown below:
# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

4.2.3 HFSC - Hierarchical Fair Service Curve

http://nbd.name/gitweb.cgi?p=openwrt.git;a=tree;f=package/qos-scripts/files;h=71d89f8ad63b0dda0585172ef01f77c81970c8cc;hb=HEAD

4.2.4 QFQ - Quick Fair Queueing

http://www.youtube.com/watch?v=r8vBmybeKlE

4.2.5 RED - Random Early Detection

4.2.6 CHOKe - CHOose and {Keep, Kill}

5 Shaping the traffic on the Home Network

Home networks are tricky to shape, because everybody wants the priority and it's difficult to predetermine a usage pattern. In this chapter, we will build a TC policy that answers general needs. Those are:

* Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the qdiscs to ensure that our packets will not stay in a large queue for 500ms waiting to be processed.
* High UDP responsiveness, for applications like Skype and DNS queries.
* Guaranteed HTTP/S bandwidth: half of the uplink is dedicated to the HTTP traffic (although other classes can borrow from it) to ensure that web browsing, probably 80% of a home network's usage, is smooth and responsive.
* Higher priority for TCP ACKs and SSH traffic. In the age of Netflix and HD VoD, it's necessary to ensure fast download speeds, and for that you need to be able to send TCP ACKs as fast as possible. This is why those packets get a higher priority than the HTTP traffic.
* A general class for everything else.

This policy is represented in the diagram below. We will use PFIFO_FAST and SFQ termination qdiscs once we exit HTB to perform some additional scheduling (and prevent a single HTTP connection from eating all of the bandwidth, for example).

[DIA source]

The script that generates this policy is available on GitHub via the icon below, with comments to help you follow through.

[get the bash script from github]

Below is one of the sections, in charge of the creation of the class for SSH. I have replaced the variables with their values for readability.
# SSH class: for outgoing connections to
# avoid lag when somebody else is downloading
# however, an SSH connection cannot fill up
# the connection to more than 70%
echo "# ssh - id 300 - rate 160 kbit ceil 1120 kbit"
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb \
    rate 160kbit ceil 1120kbit burst 15k prio 3

# SFQ will mix the packets if there are several
# SSH connections in parallel
# and ensure that none has the priority
echo "# ~sub ssh: sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: \
    sfq perturb 10 limit 32

echo "# ~ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip \
    prio 3 handle 300 fw flowid 1:300

echo "# ~netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp \
    --tcp-flags SYN SYN --dport 22 -j CONNMARK \
    --set-mark 300

The first rule is the definition of the HTB class, the leaf. It connects back to its parent 1:1, defines a rate of 160kbit/s and can use up to 1120kbit/s by borrowing the difference from other leaves. The burst value is set to 15k, which is 10 full packets with an MTU of 1500 bytes.

The second rule defines a SFQ qdisc connected to the HTB one above. That means that once packets have passed the HTB leaf, they will pass through a SFQ leaf before being transmitted. The SFQ will ensure that multiple parallel connections are mixed before being transmitted, and that one connection cannot eat the whole bandwidth. We limit the size of the SFQ queue to 32 packets, instead of the default of 128.

Then comes the TC filter in the third rule. This filter will check the handle of each packet, or, to be more accurate, the value of nf_mark in the sk_buff representation of the packet in the kernel. Using this mark, the filter will direct SSH packets to the HTB leaf above. Even though this rule is located in the SSH class block for clarity, you might have noticed that the filter has the root qdisc for parent (parent 1:0). Filters are always attached to the root qdisc, and not to the leaves. That makes sense, because the filtering must be done at the entrance of the traffic control layer.

And finally, the fourth rule is the iptables rule that applies a mark to SYN packets leaving the gateway (connection establishments). Why SYN packets only? To avoid performing complex matching on all the packets of all the connections. We will rely on Netfilter's capability to maintain stateful information to propagate a mark placed on the first packet of a connection to all of the other packets. This is done by the following rule at the end of the script:
echo"#~propagatingmarksonconnections" iptablestmangleAPOSTROUTINGjCONNMARKrestoremark

Let us now load the script on our gateway, and visualize the qdiscs created.
# /etc/network/if-up.d/lnw_gateway_tc.sh start
~~~~ LOADING eth0 TRAFFIC CONTROL RULES FOR ramiel ~~~~
# cleanup
RTNETLINK answers: No such file or directory
# define a HTB root qdisc
# uplink - rate 1600 kbit ceil 1600 kbit
# interactive - id 100 - rate 160 kbit ceil 1600 kbit
# ~sub interactive: pfifo
# ~interactive filter
# ~netfilter rule - all UDP traffic at 100
# tcp acks - id 200 - rate 320 kbit ceil 1600 kbit
# ~sub tcp acks: pfifo
# ~filtre tcp acks
# ~netfilter rule for TCP ACKs will be loaded at the end
# ssh - id 300 - rate 160 kbit ceil 1120 kbit
# ~sub ssh: sfq
# ~ssh filter
# ~netfilter rule - SSH at 300
# http branch - id 400 - rate 800 kbit ceil 1600 kbit
# ~sub http branch: sfq
# ~http branch filter
# ~netfilter rule http/s
# default - id 999 - rate 160 kbit ceil 1600 kbit
# ~sub default: sfq
# ~filtre default
# ~propagating marks on connections
# ~Mark TCP ACKs flags at 200
Traffic Control is up and running

# /etc/network/if-up.d/lnw_gateway_tc.sh show
qdiscs details
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 32/1024 perturb 10sec

qdiscs statistics
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 125321 pkt (dropped 4813, overlimits 28190 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 4813, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec
 Sent 1198387 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

The output above shows just two of the types of output tc can generate. You might also find the class statistics helpful to diagnose the consumption of the leaves:
# tc -s class show dev eth0
[...truncated...]
class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit burst 30Kb cburst 1600b
 Sent 10290035 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0)
 rate 23624bit 5pps backlog 0b 0p requeues 0
 lended: 16424 borrowed: 2 giants: 0
 tokens: 4791250 ctokens: 120625

Above are shown the detailed statistics for the HTTP leaf: you can see the accumulated rate, statistics of packets per second, but also the tokens accumulated, lended, borrowed, etc. This is the most helpful output to diagnose your policy in depth.

6 A word about "Buffer Bloat"

We mentioned that too-large buffers can have a negative impact on the performance of a connection. But how bad is it exactly? The answer to that question was investigated by Jim Gettys [http://gettys.wordpress.com/bufferbloat-faq/] when he found his home network to be inexplicably slow. He found that, while we were increasing the bandwidth of network connections, we didn't worry about the latency at all. Those two factors are quite different, and both are critical to the good quality of a network. Allow me to quote Gettys's FAQ here:
"A 100 gigabit network is always faster than a 1 megabit network, isn't it? More bandwidth is always better! I want a faster network!

No, such a network can easily be much slower. Bandwidth is a measure of capacity, not a measure of how fast the network can respond. You pick up the phone to send a message to Shanghai immediately, but dispatching a cargo ship full of blu-ray disks will be amazingly slower than the telephone call, even though the bandwidth of the ship is billions and billions of times larger than the telephone line. So more bandwidth is better only if its latency (speed) meets your needs. More of what you don't need is useless. Bufferbloat destroys the speed we really need."

More information is available on Gettys's page, and in this paper from 1996: It's the Latency, Stupid [http://rescomp.stanford.edu/~cheshire/rants/Latency.html]. Long story short: if you have bad latency but large bandwidth, you will be able to transfer very large files efficiently, but a simple DNS query will take a lot longer than it should. And since those DNS queries, and other small messages such as VoIP, are very often time sensitive, bad latency impacts them a lot (a single packet takes several hundreds of milliseconds to traverse the network). So how does that relate to buffers? Gettys proposes a simple experiment [http://gettys.wordpress.com/2010/11/29/home-router-puzzle-piece-one-fun-with-your-switch/] to illustrate the problem. We said earlier that Linux ships with a default txqueuelen of 1000. This value was increased in the kernel when gigabit Ethernet cards became the standard. But not all links are gigabit, far from it. Consider the following two computers:

We will call them desktop and laptop. They are both gigabit, and the switch is gigabit. If we verify the configuration of their network interfaces, we will confirm that:

* the interfaces are configured in gigabit, via ethtool
* the txqueuelen is set to 1000


# ifconfig eth0 | grep txqueuelen
          collisions:0 txqueuelen:1000
# ethtool eth0 | grep Speed
        Speed: 1000Mb/s

On one machine, launch nttcp with the '-i' switch to make it wait for connections:

# nttcp -i

On the laptop, launch nttcp -t -D -n 2048000 <server IP>, where:

* -t means this machine is transmitting, or sending data
* -D disables TCP_NODELAY (see the Red Hat doc [http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.1/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-Application_Tuning_and_Deployment-TCP_NODELAY_and_Small_Buffer_Writes.html])
* -n is the number of buffers of 4096 bytes given to the socket

# nttcp -t -D -n 2048000 192.168.1.220

And at the same time, on the laptop, launch a ping of the desktop.

64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.386 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.362 ms

The first two pings are launched before nttcp starts. When nttcp starts, the latency increases, but this is still acceptable. Now, reduce the speed of each network card on the desktop and the laptop to 100Mbps. The command is:

# ethtool -s eth0 speed 100 duplex full
# ethtool eth0 | grep Speed
        Speed: 100Mb/s

And run the same test again. After 60 seconds, here are the latencies I get:

64 bytes from 192.168.1.220: icmp_req=75 ttl=64 time=183 ms
64 bytes from 192.168.1.220: icmp_req=76 ttl=64 time=179 ms
64 bytes from 192.168.1.220: icmp_req=77 ttl=64 time=181 ms

And one last time, with an Ethernet speed of 10Mbps:

64 bytes from 192.168.1.220: icmp_req=187 ttl=64 time=940 ms
64 bytes from 192.168.1.220: icmp_req=188 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=189 ttl=64 time=934 ms

Almost one second of latency between two machines next to each other. Every time we divide the speed of the interface by an order of magnitude, we augment the latency by an order of magnitude as well. Under that load, opening an SSH connection from the laptop to the desktop takes more than 10 seconds, because we have a latency of almost 1 second per packet. Now, while this last test is running, and while you are enjoying the ridiculous latency of your SSH session to the desktop, we will get rid of the transmit and Ethernet buffers.


# ifconfig eth0 | grep txqueuelen
          collisions:0 txqueuelen:1000
# ethtool -g eth0
Ring parameters for eth0:
[...]
Current hardware settings:
[...]
TX:             511

We start by changing the txqueuelen value on the laptop machine from 1000 to zero. The latency does not change.

# ifconfig eth0 txqueuelen 0
64 bytes from 192.168.1.220: icmp_req=1460 ttl=64 time=970 ms
64 bytes from 192.168.1.220: icmp_req=1461 ttl=64 time=967 ms

Then we reduce the size of the TX ring of the Ethernet card. Now that we don't have any buffer anymore, let's see what happens:

# ethtool -G eth0 tx 32
64 bytes from 192.168.1.220: icmp_req=1495 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=1499 ttl=64 time=0.865 ms
64 bytes from 192.168.1.220: icmp_req=1500 ttl=64 time=60.3 ms
64 bytes from 192.168.1.220: icmp_req=1501 ttl=64 time=53.1 ms
64 bytes from 192.168.1.220: icmp_req=1502 ttl=64 time=49.2 ms
64 bytes from 192.168.1.220: icmp_req=1503 ttl=64 time=45.7 ms

The latency just got divided by 20! We dropped from almost one second to barely 50 ms. This is the effect of excessive buffering in a network, and this is what happens, today, in most Internet routers.

6.1 What happens in the buffer?

If we take a look at the Linux networking stack, we see that the TCP stack sits well above the transmit queue and the Ethernet buffer. During a normal TCP connection, the TCP stack starts sending and receiving packets at a normal rate, and accelerates its sending speed at an exponential rate: send 2 packets, receive ACKs, send 4 packets, receive ACKs, send 8 packets, receive ACKs, send 16 packets, receive ACKs, etc. This is known as the TCP Slow Start [http://tools.ietf.org/html/rfc5681]. This mechanism works fine in practice, but the presence of large buffers will break it.

A buffer of 1MB on a 1Gbit/s link will empty in ~8 milliseconds. But the same buffer on a 1Mbit/s link will take 8 seconds to empty. During those 8 seconds, the TCP stack thinks that all of the packets it sent have been transmitted, and will probably continue to increase its sending speed. The subsequent packets will get dropped, the TCP stack will panic, drop its sending rate, and restart the slow start procedure from 0: 2 packets, get ACK, 4 packets, get ACK, etc.

But while the TCP stack was filling up the TX buffers, all the other packets that our system wanted to send got either stuck somewhere in the queue, with several hundreds of milliseconds of delay before being transmitted, or purely dropped. The problem happens on the TX queue of the sending machine, but also on all the buffers of the intermediary network devices. And this is why Gettys went to war against the home router vendors.
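The drain-time figures above are easy to verify with a two-line calculation (taking 1 MB of queued data as 8,000,000 bits):

def drain_time(buffer_bytes, link_bps):
    """Time, in seconds, needed to empty a full buffer at a given link speed."""
    return buffer_bytes * 8 / link_bps

buffer_size = 1_000_000                         # 1 MB of queued data
print(drain_time(buffer_size, 1_000_000_000))   # 1 Gbit/s -> 0.008 s, ~8 ms
print(drain_time(buffer_size, 1_000_000))       # 1 Mbit/s -> 8 s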

Discussion

Paul Bixel, 2011/12/08 00:46

This is an interesting article and I am especially interested in your discussion of the linklayer atm option now supported by TC. This kind of explanation is needed because there is so little written about the linklayer option and the proper settings for mtu/mpu/tsize & overhead. In your discussion you mention that the overhead parameter defaults to 5, and it is implied that it is therefore not necessary to specify it when atm is used. But according to http://ace-host.stuart.id.au/russell/files/tc/tc-atm/ the overhead parameter is variable and much larger than 5. Someone is not correct.

Jussi Kivilinna, 2011/12/24 12:57

Just a note on the overhead/linklayer options: since 2.6.27 there has been a generic tc stab option that adds overhead/linklayer support to all qdiscs. Some documentation at: http://www.linuxhowtos.org/manpages/8/tc-stab.htm


en/ressources/dossiers/networking/traffic_control.txt · Last modified: 2011/12/24 09:53 by julien
