You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
University of Washington, Allen Institute for AI, Facebook AI Research
http://pjreddie.com/yolo/

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
7、obotic systems.Current detection systems repurpose classifiers to per-form detection. To detect an object, these systems take aclassifier for that object and evaluate it at various locationsand scales in a test image. Systems like deformable partsmodels (DPM) use a sliding window approach where thec

8、lassifier is run at evenly spaced locations over the entireimage 10.More recent approaches like R-CNN use region proposal1. Resize image.2. Run convolutional network.3. Non-max suppression.Dog: 0.30Person: 0.64Horse: 0.28Figure 1: The YOLO Detection System. Processing imageswith YOLO is simple and s

9、traightforward. Our system (1) resizesthe input image to 448448, (2) runs a single convolutional net-work on the image, and (3) thresholds the resulting detections bythe models confidence.methods to first generate potential bounding boxes in an im-age and then run a classifier on these proposed boxe

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
Our system divides the input image into an S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) · IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
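To make the confidence target concrete, here is a small Python sketch of the IOU between two boxes. The corner-coordinate box format and the function name are our choices for illustration, not the paper's.

import_needed = None  # pure-Python sketch, no imports required

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Zero area if the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# The confidence target is Pr(Object) * IOU: 0 for cells with no object,
# the IOU with the ground truth otherwise.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 0.1428...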
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}    (1)

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S×S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S×S×(B·5+C) tensor. [Panels: S×S grid on input; bounding boxes + confidence; class probability map; final detections.]

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7×7×30 tensor.
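To make the encoding concrete, the NumPy sketch below decodes such a tensor into boxes and the class-specific scores of Eq. (1). The channel layout (boxes first, then class probabilities) and every name here are our assumptions for illustration; the paper's own implementation is in Darknet.

import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

# Assumed layout per cell: B boxes of (x, y, w, h, confidence),
# followed by C conditional class probabilities.
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for a network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]          # Pr(Class_i | Object), per cell

# Eq. (1): the box confidence plays the role of Pr(Object) * IOU, so
# class-specific score = confidence * conditional class probability.
conf = boxes[..., 4]                                         # (S, S, B)
class_scores = conf[..., None] * class_probs[:, :, None, :]  # (S, S, B, C)

# Recover absolute box centers: (x, y) are offsets within a grid cell.
col, row = np.meshgrid(np.arange(S), np.arange(S))
x_abs = (boxes[..., 0] + col[..., None]) / S   # fraction of image width
y_abs = (boxes[..., 1] + row[..., None]) / S   # fraction of image height
w_abs, h_abs = boxes[..., 2], boxes[..., 3]    # already relative to image
# (Section 2.2 notes the network actually emits sqrt(w) and sqrt(h);
#  a real decoder would square these.)

print(class_scores.shape)  # (7, 7, 2, 20): 98 boxes, 20 scores each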
2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [33]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1×1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224×224 input image) and then double the resolution for detection. [Layer diagram: the per-layer sizes did not survive extraction intact.]

The final output of our network is the 7×7×30 tensor of predictions.
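The alternating 1×1 reduction / 3×3 convolution pattern can be sketched as below. This is a PyTorch rendering for illustration, not the paper's Darknet implementation; the channel sizes are one instance taken from Figure 3, and the 0.1 leaky slope follows the activation defined later in Section 2.2.

import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """One 1x1-reduction-then-3x3-convolution pair, as used in place of
    inception modules. Name and factoring are ours, not the paper's."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),              # 1x1 reduction
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),  # 3x3 convolution
        nn.LeakyReLU(0.1),
    )

block = reduction_block(512, 256, 512)  # e.g. 1x1x256 followed by 3x3x512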
2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [29]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [28]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224×224 to 448×448.
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}    (2)
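In Python this activation is a one-liner; the function name is ours:

import numpy as np

def leaky_relu(x):
    """Leaky rectified linear activation of Eq. (2): slope 0.1 below zero."""
    return np.where(x > 0, x, 0.1 * x)

print(leaky_relu(np.array([-2.0, 0.5])))  # [-0.2  0.5]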
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, \lambda_{coord} and \lambda_{noobj}, to accomplish this. We set \lambda_{coord} = 5 and \lambda_{noobj} = 0.5.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function:

\begin{aligned}
& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
+\; & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
+\; & \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
+\; & \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{noobj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
+\; & \sum_{i=0}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}    (3)

where \mathbb{1}^{obj}_{i} denotes if object appears in cell i and \mathbb{1}^{obj}_{ij} denotes that the jth bounding box predictor in cell i is "responsible" for that prediction.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is "responsible" for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
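A direct NumPy transcription of Eq. (3) may help. The responsibility masks (assigned by highest IOU, as described above) are taken as given inputs, and all shapes and names are our assumptions rather than the paper's code.

import numpy as np

S, B, C = 7, 2, 20
lambda_coord, lambda_noobj = 5.0, 0.5

def yolo_loss(pred_boxes, pred_cls, true_boxes, true_cls, obj_ij, obj_i):
    """Sketch of Eq. (3). Assumed shapes:
    pred_boxes, true_boxes: (S, S, B, 5) as (x, y, w, h, C)
    pred_cls, true_cls:     (S, S, C) conditional class probabilities
    obj_ij: (S, S, B), 1 where predictor j in cell i is responsible
    obj_i:  (S, S),    1 where an object appears in cell i
    """
    noobj_ij = 1.0 - obj_ij
    xy_err = ((pred_boxes[..., 0] - true_boxes[..., 0]) ** 2
              + (pred_boxes[..., 1] - true_boxes[..., 1]) ** 2)
    wh_err = ((np.sqrt(pred_boxes[..., 2]) - np.sqrt(true_boxes[..., 2])) ** 2
              + (np.sqrt(pred_boxes[..., 3]) - np.sqrt(true_boxes[..., 3])) ** 2)
    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    cls_err = ((pred_cls - true_cls) ** 2).sum(axis=-1)
    return (lambda_coord * (obj_ij * (xy_err + wh_err)).sum()  # coordinate terms
            + (obj_ij * conf_err).sum()                        # object confidence
            + lambda_noobj * (noobj_ij * conf_err).sum()       # no-object confidence
            + (obj_i * cls_err).sum())                         # classification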
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^{-3} to 10^{-2}. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^{-2} for 75 epochs, then 10^{-3} for 30 epochs, and finally 10^{-4} for 30 epochs.
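As a sketch, the schedule could be expressed as follows. The warm-up length and its linear shape are guesses, since the paper only says the rate is raised slowly over the first epochs; only the three plateau values and their durations are from the text.

def learning_rate(epoch):
    """Piecewise learning-rate schedule from Section 2.2 (warm-up assumed)."""
    if epoch < 5:                                 # assumed warm-up length
        return 1e-3 + (1e-2 - 1e-3) * epoch / 5   # slow rise from 1e-3 to 1e-2
    if epoch < 80:                                # 75 epochs at 1e-2
        return 1e-2
    if epoch < 110:                               # 30 epochs at 1e-3
        return 1e-3
    return 1e-4                                   # final 30 epochs at 1e-4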
To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = 0.5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
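A sketch of sampling these augmentation parameters, assuming symmetric ranges (the paper gives only the maximum factors; names are ours):

import random

def augment_params():
    """Sample the augmentations of Section 2.2: scaling/translation up to 20%
    of image size, exposure/saturation scaled by up to 1.5 in HSV space."""
    scale = 1.0 + random.uniform(-0.2, 0.2)    # up to 20% scaling
    tx = random.uniform(-0.2, 0.2)             # translation, fraction of width
    ty = random.uniform(-0.2, 0.2)             # translation, fraction of height
    exposure = random.uniform(1 / 1.5, 1.5)    # HSV value (exposure) factor
    saturation = random.uniform(1 / 1.5, 1.5)  # HSV saturation factor
    return scale, tx, ty, exposure, saturation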
2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
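The excerpt is cut off here. Figure 1 lists non-max suppression as the final step of the pipeline; a minimal greedy version over the predicted boxes, reusing iou() from the earlier sketch, might look like the following (the 0.5 overlap threshold is our assumption):

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: keep the highest-scoring box, drop
    boxes that overlap it too much, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep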

