You Only Look Once: Unified, Real-Time Object Detection (2016), J. Redmon et al.
You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
University of Washington, Allen Institute for AI, Facebook AI Research
http:/

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real time on a webcam please see our project webpage: http:/

Second, YOLO reasons globally about
the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end
training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. Formally we define confidence as Pr(Object) · IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) · Pr(Object) · IOU^truth_pred = Pr(Class_i) · IOU^truth_pred    (1)

which gives us class-specific confidence scores for each box.
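The grid parameterization above can be sketched in code. The following is a minimal illustration (not the paper's Darknet implementation) of how a ground-truth box center, assumed here to be normalized to [0, 1] image coordinates, maps to its responsible grid cell and to the (x, y, w, h) targets described above; `encode_box` is a hypothetical helper name:

```python
# Minimal sketch of YOLO's box encoding (not the paper's code), assuming
# box coordinates are already normalized to [0, 1] image units.
S = 7  # grid size used in the paper for PASCAL VOC

def encode_box(cx, cy, w, h, S=S):
    """Map a ground-truth box to its responsible cell and regression targets."""
    col = min(int(cx * S), S - 1)  # grid column containing the box center
    row = min(int(cy * S), S - 1)  # grid row containing the box center
    # (x, y): center offset relative to the bounds of that grid cell
    x = cx * S - col
    y = cy * S - row
    # (w, h): predicted relative to the whole image, so kept as-is
    return (row, col), (x, y, w, h)

cell, target = encode_box(0.5, 0.5, 0.2, 0.3)
```

A box centered at (0.5, 0.5) lands in cell (3, 3) of the 7 × 7 grid, with its center offset (0.5, 0.5) inside that cell.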
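Equation (1) can be illustrated with a small NumPy sketch. The shapes below assume the paper's PASCAL VOC setting (S = 7, B = 2, C = 20), and the random tensors merely stand in for real network outputs:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)
rng = np.random.default_rng(0)

box_conf = rng.random((S, S, B))    # per-box confidence: Pr(Object) * IOU
class_prob = rng.random((S, S, C))  # per-cell Pr(Class_i | Object)
class_prob /= class_prob.sum(axis=-1, keepdims=True)  # normalize per cell

# Eq. (1): Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU.
# Broadcasting shares each cell's single set of class probabilities
# across all B boxes in that cell, yielding one score per box per class.
scores = class_prob[:, :, None, :] * box_conf[:, :, :, None]  # (S, S, B, C)
```

These class-specific confidence scores are what the system later thresholds and feeds to non-max suppression (Figure 1).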