Monday 15 March 2010

c++ - sendmsg + raw ethernet + several frames -




I am using Linux 3.x and a modern glibc (2.19).

I want to send several Ethernet frames without switching back and forth between kernel and user space.

I have MTU = 1500 and want to send 800 KB. I initialize the receiver address like this:

struct sockaddr_ll socket_address;
socket_address.sll_ifindex = if_idx.ifr_ifindex;
socket_address.sll_halen = ETH_ALEN;
socket_address.sll_addr[0] = MY_DEST_MAC0;
//...

After that I can call sendto/sendmsg 800 KB / 1500 ≈ 500 times and it works fine, but it requires user space <-> kernel transitions ~500 * 25 times per second. I want to avoid that.

I tried to initialize struct msghdr::msg_iov with the appropriate data, but I got the error "Message too long"; it looks like msghdr::msg_iov cannot describe anything with a size greater than the MTU.

So the question: is it possible to send many raw Ethernet frames from Linux user space at once?

PS

The data (800 KB) comes from a file and is read into memory. struct iovec looks helpful to me: I could create a suitable number of Ethernet headers and have two iovecs per 1500-byte packet, one pointing at the data and one pointing at the Ethernet header.

Whoa.

My last company made realtime hi-def video encoding hardware. In the lab, we had to blast 200 MB/sec across a bonded link, so I have some experience with this. What follows is based upon that.

Before you can tune, you must measure. You don't want multiple syscalls, but can you prove with a timing measurement that the overhead is significant?

I use a wrapper routine around clock_gettime that gives back the time of day with nanosecond precision (e.g. (tv_sec * 1000000000) + tv_nsec). Call this [herein] "nanotime".

So, for any given syscall, you need a measurement:

tstart = nanotime();
syscall();
tdif = nanotime() - tstart;

For send/sendto/sendmsg/write, use data small enough that you're sure you're not blocking [or use O_NONBLOCK, if applicable]. That gives you the syscall overhead.

Why go straight to Ethernet frames? TCP [or UDP] is fast enough, and modern NIC cards can do the envelope wrap/strip in hardware. I'd like to know if there is a specific situation that requires Ethernet frames, or whether you weren't getting the performance you wanted and came up with this as a solution. Remember, you're doing 800 KB/s (~1 MB/s), and my project was doing 100x-200x more over TCP.

What about using two plain write calls to the socket? One for the header, one for the data [all 800 KB]. write can be used on a socket and doesn't have the EMSGSIZE error or restriction.

Further, why do you even need the header in a separate buffer? When you allocate your buffer, just do:

datamax = 800 * 1024;                       // or whatever
buflen = sizeof(struct header) + datamax;
buf = malloc(buflen);

while (1) {
    datalen = read(fdfile, &buf[sizeof(struct header)], datamax);
    if (datalen <= 0)
        break;                              // EOF or error
    // fill in header ...
    write(fdsock, buf, sizeof(struct header) + datalen);
}

This works for the Ethernet frame case, too.

One of the things you can also do is use setsockopt to increase the size of the kernel's buffer for your socket. Otherwise, you can send the data, but it will be dropped in the kernel before the receiver can drain it. More on this below.

To measure the performance of the wire, add some fields to your header:

u64 send_departure_time;   // set by sender via nanotime
u64 recv_arrival_time;     // set by receiver when the packet arrives

So, the sender sets the departure time and does the write [just the header for this test]. Call this packet Xs. The receiver stamps it when it arrives. The receiver then sends a message back to the sender [call it Xr] with its own departure stamp and the contents of Xs. When the sender gets this, it stamps its arrival time.

With the above you now have:

t1 -- time packet Xs departed the sender
t2 -- time packet Xs arrived at the receiver
t3 -- time packet Xr departed the receiver
t4 -- time packet Xr arrived at the sender

Assuming this is done on a relatively quiet connection with little to no other traffic, and you know the link speed (e.g. 1 Gb/s), from t1/t2/t3/t4 you can calculate the overhead.

You can repeat the measurement for TCP/UDP vs ETH. You may find it doesn't buy you as much as you think. Once again, can you prove it with a precise measurement?

I "invented" this algorithm while working at the aforementioned company, only to find out it was already part of a video standard for sending raw video across a 100 Gb Ethernet NIC card, with the NIC doing the timestamping in hardware.

One of the other things you may have to do is add throttle control. It's similar to what bittorrent does or what the PCIe bus does.

When PCIe bus nodes first start up, they communicate how much free buffer space they have available for "blind write". That is, the sender is free to blast up to this much, without any ACK message. As the receiver drains its input buffer, it sends periodic ACK messages to the sender with the number of bytes it was able to drain. The sender can add this value back to the blind write limit and keep going.

For your purposes, the blind write limit is the size of the receiver's kernel socket buffer.

Update

Based upon the additional information from your comments [the actual system configuration should, in more complete form, go as an edit at the bottom of your question].

You do have a need for using a raw socket and sending Ethernet frames. You can reduce the overhead by setting a larger MTU via ifconfig or an ioctl call with SIOCSIFMTU. I recommend the ioctl. You may not need to set the MTU to 800 KB; your CPU's NIC card has a practical limit. You can probably increase the MTU from 1500 to 15000 easily enough. That would reduce the syscall overhead by 10x, and that may be "good enough".

You will still have to use sendto/sendmsg; the two write calls idea was based on converting to TCP/UDP. But, I suspect sendmsg with msg_iov will have more overhead than sendto. If you search, you'll find that most example code for what you want uses sendto. sendmsg seems like less overhead for you, but it may cause more overhead for the kernel. Here's an example that uses sendto: http://hacked10bits.blogspot.com/2011/12/sending-raw-ethernet-frames-in-6-easy.html

In addition to improving the syscall overhead, a larger MTU might improve the efficiency of the "wire", even though that doesn't seem to be a problem in your use case. I have experience with CPU + FPGA systems and communicating between them, but I'm still puzzled by one of your comments about "not using a wire". FPGA connected to the Ethernet pins of the CPU, I get--sort of. More precisely, do you mean the FPGA pins are connected to the Ethernet pins of the NIC card/chip of the CPU?

Are the CPU/NIC on the same PC board, with the FPGA pins connected via PC board traces? Otherwise, I don't understand "not using a wire".

However, once again, I must insist that you must be able to measure your performance before you blindly try to improve it.

Have you run the test case I suggested for determining the syscall overhead? If it is small enough, trying to optimize it away may not be worth it, and doing so may actually hurt performance more severely in other areas you didn't realize when you started.

As an example, I once worked on a system that had a severe performance problem, such that the system didn't work. I suspected the serial port driver was slow, so I recoded it from a high level language (e.g. C) into assembler.

I increased the driver performance by 2x, but it contributed less than 5% performance improvement to the system. It turned out the real problem was other code that was accessing non-existent memory, which caused a bus timeout, slowing the system down measurably [it did not generate an interrupt, which would have made it easy to find on modern systems].

That's when I learned the importance of measurement. I had done my optimization based on an educated guess, rather than hard data. After that: lesson learned!

Nowadays, I never attempt a big optimization until I can measure first. In some cases I add an optimization I'm "sure" will make things better (e.g. inlining a function). When I measure it [and because I can measure it], I find out the new code is actually slower and I have to revert the change. But, that's the point: I can prove/disprove this with hard performance data.

What CPU are you using: x86, ARM, MIPS, etc.? At what clock frequency? How much DRAM? How many cores?

What FPGA are you using (e.g. Xilinx, Altera)? What specific type/part number? What is the maximum clock rate? Is the FPGA devoted exclusively to logic, or does it also have a CPU inside such as MicroBlaze, NIOS, ARM? Does the FPGA have access to DRAM of its own [and how much DRAM]?

If you increase the MTU, can the FPGA handle it, from either a buffer/space standpoint or a clock speed standpoint? If you increase the MTU, you may need to add the ACK/sync protocol I suggested in the original post.

Currently, the CPU is doing a blind write of the data, hoping the FPGA can handle it. This means you have an open race condition between the CPU and the FPGA.

This may be mitigated, purely as a side effect of sending small packets. If you increase the MTU too much, you might overwhelm the FPGA. In other words, it may be the very overhead you're trying to optimize away that allows the FPGA to keep up with the data rate.

This is what I meant by the unintended consequences of blind optimization. It can have unintended and worse side effects.

What is the nature of the data being sent to the FPGA? You're sending 800 KB, but how often?

I am assuming it is not the FPGA firmware itself, for a few reasons. You said the firmware was already full [and it is the thing receiving the Ethernet data]. Also, firmware is usually loaded via the I2C bus, a ROM, or an FPGA programmer. Is that correct?

You're sending the data to the FPGA from a file. This implies it is only being sent once, at the startup of the CPU's application. Is that correct? If so, optimization is not needed, because it's an init/startup cost that has little impact on the running system.

So, I have to assume the file gets loaded many times, perhaps a different file each time. Is that correct? If so, you may need to consider the impact of the read syscall. Not just the syscall overhead, but the optimal read length. For example, IIRC, the optimal transfer size for a disk-to-disk or file-to-file copy/transfer is 64 KB, depending upon the filesystem or underlying disk characteristics.

So, if you're looking to reduce overhead, reading the data from a file may have considerably more of it than having the application generate the data [if that's possible].

The kernel syscall interface is designed to be very low overhead. Kernel programmers [I happen to be one] spend a great deal of time ensuring the overhead is low.

Your system is probably using a lot of CPU time for other things. Can you measure those other things? How is your application structured? How many processes? How many threads? How do they communicate? What is the latency/throughput? You may be able to find [and quite possibly will find] larger bottlenecks, and if you recode those, you'll get an overall reduction in CPU usage that far exceeds the maximum benefit you'll get from the MTU tweak.

Trying to optimize the syscall overhead may be like my serial port optimization. A lot of effort, and yet the overall results are/were disappointing.

When considering performance, it is important to consider it from an overall system standpoint. In your case, this means the CPU, the FPGA, and anything else in it.

You say the CPU is doing a lot of things. Could/should some of those algorithms go into the FPGA? Is the reason they're not that the FPGA is almost out of space, but otherwise you would? Is the FPGA firmware 100% done? Or is there more RTL to be written? If you're at 90% space utilization in the FPGA, and you'll need more RTL, you may wish to consider going to an FPGA part that has more space for logic, and perhaps a higher clock rate.

In my video company, we used FPGAs. We used the largest/fastest state-of-the-art part the FPGA vendor had. We used virtually 100% of the space for logic and required the part's maximum clock rate. We were told by the vendor that we were the largest consumer of FPGA resources of any of their client companies worldwide. Because of this, we were straining the vendor's development tools. Place-and-route would frequently fail and have to be rerun to get correct placement and meet timing.

So, when an FPGA is nearly full of logic, place-and-route can be hard to achieve. It might be a reason to consider a larger part [if possible].

c++ c linux
