Hey there, everybody. I think we’re live now. I do believe we’re live on the wild and woolly world of Al Gore’s internet. It’s Thursday morning, it’s 9:00 a.m., and you know what that means: it is time for OpenCV Live. We’ve got a few folks popping in already over on YouTube. Love to see it, folks. We’re live on Twitch, LinkedIn, YouTube, and Zoom. I’m here with the homie from Roboflow. Is this your second or third episode of the show? I think second. Okay. Well, welcome back. We’re pleased to have you to talk about this awesome bit of technology from longtime OpenCV supporters Roboflow. RF-DETR has been a huge, huge thing over the last year especially, and lots of people were talking about it at CVPR this year. I’m like, “Hey, I know those guys. They do good stuff.” And so we decided to have him on the show to do a little chitchat. Please, as you join us, wherever you’re watching, let us know where you’re joining us from. I am coming to you from beautiful Tijuana, Mexico, here south of the border, and it’s looking to be a pretty dang nice day outside. Maybe we’ll take the dogs for a walk today. Looking forward to that. But let us know where you’re joining from. Looks like we got Nevada Tahas here. Mathagorum chiming in. We’ve got Jurgen on YouTube, who says, “Cheers. Awesome work since years.” And Jurgen’s coming from Munich. Right on. It’s an international affair here on OpenCV Live, as per our usual. Over on the Zoom chat, we’ve got Stefan from Vancouver, Canada. We’ve got Milos from Serbia. We’ve got Zack Crane admitting to be coming to us from Iowa, which is a brave move on Zack’s part. We appreciate the fact that you stumbled out of the cornfield to join the show today. We got Allan from Austin, Texas. Love to see it.
We’ll get started in just a minute here as we let people uh get into their viewing situation. And as you can probably hear one of the dogs yaking on something in the background there. Sorry about that. Poor uh Fluffy. She’s on some meds now, which are helping her out. We got uh Portugal. We’ve got Slovakia. We’ve got India. We’ve got Kiev. We’ve got the other Vancouver in Washington. Lots of folks in today. Yeah. Got Hungary. We’ve got Picash from California.
Uh we’ve got uh somebody joining on a train in Germany on the way home. Awesome. Right on. Thanks for making time for us in your commute, uh, Valentin. We appreciate it. We got India as well. Uh, somebody says, “I love you.” Oh, that’s very that’s very kind.
Thanks so much, everybody, for participating in our little “where are you joining from” segment this morning. It’s awesome to see all of you. It’s one of the best things about this show: the OpenCV audience and community are so freaking cool. Looks like we’ve got Caro from Pittsburgh, but he’s based in Austin. We’ve got Andre from Moscow, in Khimki, I think. We’ve got Molly. We’ve got St. Louis, Missouri. We’ve got Riyadh, we’ve got Chicago, we’ve got Birmingham, and we’ve got Olaf from Costa Rica. Thank you, guys.
Going to slightly tweak my microphone here. I think uh you’re getting a little too much background noise. Nobody needs to know that much about what’s happening here.
All right, I turn I turn turn it back on here. Hopefully, it should be better now. All right, so um I think now is a a great time to get started. Let me go ahead and solo myself and welcome everyone to the show.
Hey there. Hi there. Ho there, everybody, and welcome to OpenCV Live. It’s Thursday, 9:00 a.m., and it’s time for our show. We’ve got a great one for you today. We’re going to be talking about RF-DETR keypoints, the new amazing preview release from longtime OpenCV supporters Roboflow. This one is really exciting. It is proving to be potentially significantly more accurate than YOLO for various tracking instances. The one they’ve got now is keypoints, and we’ve got our friend from Roboflow to come talk about that with us today. But before we get started, I’ve got a few things that I need to talk to you about. The first of which is a reminder that OpenCV is a nonprofit organization that releases open-source software. As such, we depend on the support of our members and sponsors, all of which are on the screen right now. We want to give a big thank you and shout out to ARM, Futurewei, Google Summer of Code, Roboflow, Orbit, the British Machine Vision Association, JetBrains, Intuitivo, the Edge AI and Vision Alliance, OpenMV, Tangram Vision, Amped Software, Intuitivo, Rerun, Intrinsic, Beeris Dev, and Big Vision, because every company needs a big vision. If you want to be as cool as these companies are, the best way to do that is to scan the QR code at the bottom of your screen and sponsor OpenCV. Dang it, where’d it go? There it is. Scan the QR code at the bottom of the screen and sponsor OpenCV. You can donate as an individual. Or if you work for a large organization, it’s very possible that your boss can turn your donation to OpenCV into a 2x donation. For example, if you donate a hundred bucks and your company participates in donation matching through a system like, say, Benevity, which OpenCV is a part of, that 100 bucks can turn into 200 bucks in the blink of an eye just by talking to your boss about it. We hope you’ll do that.
But even if you can’t do that, there’s other ways you can support OpenCV, such as by buying OpenCV merchandise. We’ve got official t-shirts, hats, tote bags, etc., and bandanas for all of your pets and or children. You can also sponsor us on GitHub at uh github.com/sponsors/opencv.
Highly encourage you to do that. It is a great way to show your support and get a little bit of social capital out of having your name on the OpenCV supporters page, which I will read out a little bit later in the episode. So, please scan that QR code and help out OpenCV however you can. As I said, we’re a nonprofit that makes open-source software, and there are not a lot of us left, especially in the computer vision industry, where a lot of companies are even taking what was open source and making it closed source or restricting the source for various purposes. We don’t want to do that. We’re not going to do that. We are here for you. We are by the people, of the people, and for the people, and we appreciate every little bit of your support. We’re also taking questions from you in the audience. Please use the Q&A button if you’re watching on Zoom to ask your question at any time, or just post it in the chat if you’re watching us on YouTube, LinkedIn, Twitch, etc. I’ll be monitoring those chats and I’ll bring up those questions as the show progresses. We’ll also save a little bit of time at the back end to answer anything we didn’t get to. So, please do that. We’ve also, as always, got our trivia giveaway later on in the episode. I will be asking a trivia question based on today’s presentation, and the first person to answer that question correctly will win the OpenCV University course of their choosing. You can see what courses are on offer by going to opencv.org/university. We hope you will. It’s a great service and we have a lot of success stories on there for you to check out. Maybe you’ll be the next OpenCV University success story. So stay tuned for trivia and check out OpenCVU. That is about enough of my yappin’ today. I think a couple more folks are chiming in. We’ve got hello from Liberia. We’ve got Bavaria. Lots of the “-ia” countries chiming in here.
We’ve got Germany, Poland, got Austin, Tahas, and we’ve got Wesley from Oregon. Notice I pronounced it correctly, Wesley. Oregon. You’re welcome, buddy. All right. I think now is a great time to get started. My friend, introduce yourself. Remind everybody who you are and tell them what you’re going to be talking about today. Hi, everyone. My name is Peter. I work at Roboflow, and more or less all I do is open source. Today I’ll be talking to you about RF-DETR, and specifically about the new release that we made. We added keypoints this week, so that’s the main topic.
Indeed. Uh you can go ahead and share your screen using the button at the bottom of the screen there whenever you want and I’ll pop it up. Let me do that. Let me do that. Share screen.
Thar she blows. If you’ve got audio on here, I recommend turning it. We don’t have audio, but we have some videos. Smart man. Yeah, fingers crossed everything will work. Cool. So, I guess I’ll just proceed. Last year we released RF-DETR. It was, but no longer is, only an object detector. Right now we support more tasks, and that’s why we meet here today. For the object detection part, RF-DETR is a state-of-the-art object detector, beating other top choices for object detection both in terms of speed and accuracy. Here’s the benchmark on the COCO dataset. Where you would like to be here is in the top left corner; that would mean that you are a highly accurate and fast model for object detection. That’s what we aim for. But not only that. Another benefit of RF-DETR is... “let’s go NX.” Yeah, I agree. Another benefit of RF-DETR is that we are very good at fine-tuning. So here’s another benchmark that we created internally. Those of you who know Roboflow probably know that we have a lot of datasets on the platform. Those datasets are created and shared by our users. What we did is we took a look at what’s inside, collected 100 of those datasets, divided them into buckets, and decided to check how well different detectors can fine-tune on those datasets. It turned out RF-DETR is the top in that category as well. Here is the average score, but if we take a look at individual buckets, we can dive a little bit deeper. On average, you can see that, for example, if you choose between YOLO26 and RF-DETR at the same speed, you can gain around 2 mAP points on average. So nothing else changes.
All you did is swap one model for another, and during inference you get this accuracy boost more or less for free. But depending on the category of the dataset, you can get even more. One of my favorite categories is aerial. On aerial datasets, RF-DETR very often scores around five mAP points higher than those other object detectors. The reason for that is that RF-DETR uses a DINOv2 backbone. So it also uses an open-source model from Facebook as a backbone. That model was trained on a crazy, crazy amount of very diverse images, and that helps the model learn fast because it already saw a lot of those things in the past. That’s in contrast to other popular object detectors that are pre-trained on ImageNet or maybe on COCO: all they saw is a very, very narrow part of our life. DINO saw aerial images, saw medical images, saw a lot of other things. That knowledge is already in RF-DETR because it’s already in DINOv2, and during training we just get to the information that is already there. So that’s an example of what you can do with RF-DETR with aerial images or aerial videos. And the model trains a lot faster than other open-source models. It is slower per epoch, but you need, I don’t know, five epochs to get a very, very good accuracy that would probably take 25 or sometimes even 50 epochs to reach with other open-source models. So that’s a significant amount of compute to save. I mean, that’s no joke. Yeah, exactly. This is, I would say, one of the main benefits of the detector that we released: it’s really good at quickly getting to proper results.
And you can see, if you compare YOLO models and RF-DETR when you train, RF-DETR is usually a steep wall during the first five epochs, getting to very reasonable accuracy, and it takes YOLO a lot of time to get there. Another interesting thing is that you can get those results almost without any augmentations. Those of you who are familiar with how we did object detection for the past several years know that many of those training recipes were just a creative way to augment images, to introduce enough variance in your dataset that the model can learn and generalize later on. RF-DETR does not use augmentations at all and can still learn pretty fast because, like I said, that knowledge is already there. All we do is try to get to the information that was already in that backbone. Then, a few months later, we released the segmentation model, and the segmentation model is once again state-of-the-art when it comes to speed and accuracy. Especially when applied to sporting events from ESPN 8, “The Ocho.” Yeah, I mean, that’s one of the ways I like to play with those models for sure, applying them to sports. That one is very interesting: World Chase Tag. I believe that was professional tag, right? Wasn’t that the guys who participate? Yeah, those are parkour athletes, and they are doing crazy things. So it’s actually pretty hard to track them because they move in a very unpredictable way, but that’s a separate topic.
So yeah, we released the segmentation model, once again getting state-of-the-art accuracy, and a lot of people asked us if we plan to release a keypoints model. This week we released the preview version of the keypoints model. That’s what we did with the segmentation model in the past. Instead of just going all in and giving people all sorts of options, we prefer to give people a single checkpoint, ask them how they use it and what we can do better, and then do the proper release a few weeks later. So that’s definitely coming. We intend to give people a full range of checkpoints, I think four or five different sizes, but for now we released the largest one, the one that you can compare with the X size of popular YOLO models. Now here is a visualization. The interesting part about this model, I think, is this. Anybody who used keypoints in the past is very familiar with how keypoint detection works: models like YOLO will give you the coordinates of your skeleton. A skeleton contains multiple anchors, and for every anchor you will get x and y, and you will also get a float that represents the confidence of the model. It tells you how confident the model is that the given keypoint is present and is there. What we did is approach this problem slightly differently. Instead of giving you just that float, what we give you is a spatial confidence. So we give you the xy coordinate plus this spatial confidence, which is visualized as an ellipse. The broader the ellipse is, the less confidence the model has in the location of that point.
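To make the ellipse idea concrete, here is a minimal sketch of how a 2D positional uncertainty could be turned into ellipse parameters for drawing. It assumes the uncertainty is available as a 2x2 covariance matrix in pixel units; the talk does not say what format the RF-DETR preview actually outputs, so the function name `covariance_to_ellipse` and its inputs are hypothetical.

```python
import math


def covariance_to_ellipse(sxx, sxy, syy, n_sigma=1.0):
    """Return (semi_major, semi_minor, angle_rad) of the n-sigma ellipse.

    Assumes the uncertainty is the symmetric 2x2 covariance matrix
    [[sxx, sxy], [sxy, syy]]; this input format is an assumption,
    not the documented output of any particular model.
    """
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean = 0.5 * (sxx + syy)
    diff = 0.5 * (sxx - syy)
    root = math.hypot(diff, sxy)
    lam_major = mean + root
    lam_minor = max(mean - root, 0.0)  # clamp tiny negative rounding errors
    # Orientation of the major axis, measured from the x axis.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (n_sigma * math.sqrt(lam_major),
            n_sigma * math.sqrt(lam_minor),
            angle)


# A point that is twice as uncertain horizontally as vertically.
print(covariance_to_ellipse(4.0, 0.0, 1.0))
```

A wide, flat ellipse then reads as "the model knows the height of this point well but not its horizontal position", which is the directional information described in the talk.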
And of course the ellipse can have axes that are very close to each other in length, or not, and that can also represent how confident the model is about the presence of the given point in a specific direction, which is also pretty cool. You can use that information, and you can probably plug it into something like a Kalman filter downstream to extract even more information. But most importantly, it gives you a lot more information, because before, all you got was a float. Okay, if I do the thresholding at 0.3, all I know is that here are the points that the model is confident about, and I don’t know anything more. All I know is that it might be here, but the model doesn’t really know. What we have here is actually calibrated. So I can tell you, for example, that those dotted ones are at sigma 2, and the ones that have more than sigma 2 are the ones that are not visible. Also, if you look at COCO and benchmark that, you would learn that, if I’m not mistaken, 41% of the points are within sigma 1 and 86% of the points are within sigma 2. So that gives you pretty good information about how confident the model is that the point is actually within that location. And that information is super useful. I mean, anybody that’s used some of these keypoint tracker and detection tools knows how often they will lose confidence, or you’ll drop a point, or the point will move significantly. So that’s a huge improvement here. Yeah. Yeah.
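As a sanity check on the calibration figures quoted above: if the positional error really follows a 2D Gaussian, the share of points inside the k-sigma ellipse is 1 − exp(−k²/2), which gives about 39% for sigma 1 and about 86% for sigma 2, close to the numbers mentioned in the talk. The sketch below computes that, plus the Mahalanobis distance that decides whether a ground-truth point falls inside a given ellipse. The covariance input format is an assumption, and both function names are hypothetical.

```python
import math


def coverage_within(k):
    """Probability that a 2D Gaussian sample lies inside its k-sigma ellipse."""
    return 1.0 - math.exp(-0.5 * k * k)


def mahalanobis(dx, dy, sxx, sxy, syy):
    """Mahalanobis distance of the offset (dx, dy) from the predicted point.

    The covariance [[sxx, sxy], [sxy, syy]] is an assumed representation
    of the model's spatial confidence.
    """
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    # Quadratic form using the closed-form inverse of a 2x2 matrix.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))


print(round(coverage_within(1.0), 3))  # share of points expected within sigma 1
print(round(coverage_within(2.0), 3))  # share of points expected within sigma 2

# A ground-truth point 2 px to the right of a prediction whose horizontal
# standard deviation is 2 px sits exactly on the 1-sigma ellipse.
print(mahalanobis(2.0, 0.0, 4.0, 0.0, 1.0))
```

The same covariance could serve as the measurement-noise matrix of a Kalman filter, which is one way to read the downstream use suggested in the talk.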
So here’s an example. I’m just in my kitchen, rotating, and you can see that the moment when my side is towards the camera is the moment where, obviously, the model is least confident about the position of the point. You can see that, okay, I know the general location, but I’m not that confident; the location is not that narrow. Also interesting, if you go back a few slides... yeah, that might be interesting. So here there are also those ellipses, but we don’t see them because they are so small. Let me say it differently. When you annotate images of people, you can be a lot more certain about the location of an eye because it’s very easy to see, so when you annotate you just click. The same with the nose, because it’s pretty obvious where it is. But with hips or shoulders, especially with hips I would say, those ellipses are usually a lot more elongated, a lot larger. That’s because they are just way harder to annotate, so there is a lot higher level of variance within the training data, and ultimately that gets transferred into your model. When you train, you can see that the model learns: okay, when it’s an eye, it’s very easy for me to locate, I’m pretty sure it’s in this very, very narrow part of an image; with a hip, it’s hard to tell. There is different posture, different clothing; it’s probably somewhere over there. It learns that from the data. Another very interesting thing is that, like I said, all of that is learned, and that is important because when you apply this model and fine-tune it on another dataset, it will also learn the distribution from that dataset.
So it doesn’t matter if you have pose estimation or, a very popular use case that I’m using keypoint models for, locating characteristic points on a football field or basketball court. It will also learn that distribution. An interesting thing that people maybe don’t know is that a lot of those popular open-source keypoint models, let’s call them that, almost hardcode the information about the number of keypoints, the distribution of those keypoints, and the level of certainty into their architecture and their loss function. That certainly happens with some popular YOLO models. They literally add this information: 17 points, and eyes are usually this certain and hips are usually that certain. All of that is hardcoded into the architecture. So you can imagine that when you apply that model later on and fine-tune it on a completely different dataset, all of that information is definitely not helping; it is probably hurting your training and making it harder to train. Because RF-DETR just learns everything from the distribution of the data, it’s actually useful, and you can later use it during inference, but it also helps during training. Another interesting property of our transformer model, and it worked very similarly with detection and segmentation, is that you can actually use a different resolution as an input for the model. By default, the extra-large model that we released is 576 pixels square, but you can decide that you would like to either lower the input resolution or make it larger. Once you do that, this kind of curve is generated, because using a different resolution will impact the speed of your model. So that whole curve is created from a single checkpoint, which is located over here for the default configuration.
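One place where fixed per-keypoint certainty is visible is the COCO keypoint metric, Object Keypoint Similarity (OKS), which uses a constant per keypoint type: small for eyes, large for hips. The sketch below shows that metric with the standard COCO constants; the function name `oks` and its argument layout are illustrative, and whether a specific YOLO implementation reuses these exact constants in its loss is taken from the speaker rather than verified here.

```python
import math

# Per-keypoint constants from the COCO keypoint evaluation, in COCO order:
# nose, eyes (2), ears (2), shoulders (2), elbows (2), wrists (2),
# hips (2), knees (2), ankles (2).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visible, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between predicted and annotated keypoints.

    pred, gt: lists of (x, y) in pixels; visible: list of bools for labeled
    keypoints; area: object area in pixels squared.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), vis, sigma in zip(pred, gt, visible, sigmas):
        if not vis:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2  # the fixed, per-keypoint tolerance
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0


# The same 5 px error is punished harder on an eye (index 1) than a hip (index 11).
gt = [(0.0, 0.0)] * 17
for idx in (1, 11):
    pred = list(gt)
    pred[idx] = (5.0, 0.0)
    visible = [i == idx for i in range(17)]
    print(idx, round(oks(pred, gt, visible, area=10000.0), 3))
```

These constants describe human body keypoints specifically, which is why carrying them over to, say, court or field landmarks would not obviously make sense, and why learning the uncertainty from the data is presented as an advantage.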
But you can change that configuration during inference, and that can either increase or lower your accuracy and also impact your inference speed. So yeah, like with everything, it’s all about the data. You know, garbage in, garbage out. The classic adage still holds true. We’ve also got a special guest here. Please say hi to Fluffy. Fluffy. Fluffy is very interested in computer vision. Yeah, my dog just came back from the walk, I hear in the background. So he’s also somewhere over there, very interested in computer vision as well. And that’s pretty much it. If you would like to use RF-DETR, you can use it through our open-source package. The QR code that you see right now on the screen will lead to the repo. We highly appreciate any stars if you are there. And if you want to get even better performance, I highly encourage you to use the Roboflow version of RF-DETR, because the important thing is that our RF-DETR comes with NAS, neural architecture search, but the open-source version does not have that. The product version has that. So on average you can usually expect another 2, maybe 5, mAP boost just by using that NAS. I can tell you more about NAS if you want, but that’s pretty much it. Thank you for all that killer info on RF-DETR. We saw at CVPR tons of booths were just using this as their demo. There were a bunch of them where I was like, I recognize that model, I know exactly what you’re doing. This was, I think, before the keypoint preview, but still a really cool product. But it was not open source yet. So CVPR was two weekends ago. Last weekend I was at another conference, this time in Europe, and tons of people told me they’re using RF-DETR in their product, because what is important, and I haven’t said that yet, is that it’s Apache 2.
It’s a no-strings-attached Apache 2 license, which, like you said, is not very common in computer vision anymore. So it’s pretty... Yeah, these days it seems like... Yeah, I would love to hear that too. It’s a very permissive license. Apache 2 is great for putting out software that you want people to use. You can say, go use this for a commercial product, use this for an open-source product, do what you will with it. All you’ve got to do is tell people what you used. And I love that. So Roboflow also had another pretty big release, right? Supervisor had some upgrades. Is that the case? Supervision, I guess. Yeah. Supervision. Yeah, sorry. So what happened is, keypoints already existed. Supervision is our library that we use to power our product and our demos. For those of you who follow me and are aware, I build a lot of demos, that’s part of my job, and all of those demos are powered by Supervision. Keypoints were in Supervision before, but they were kind of an afterthought. Because we released the preview version of the keypoint model, we added tons of new features, new annotators, new utils, and improved the user experience around this particular part of the library. I already know we will do another release around keypoints soon, because we are building a lot right now with keypoints and all of those improvements are landing in Supervision. So that was the release. It actually powers our RF-DETR package as well. So when you use keypoints, you actually use Supervision internally; in the product you use Supervision, and the other libraries that we have use Supervision. So it’s kind of a workhorse that we have. Okay. I didn’t know that. That’s good information to have. Please scan that QR code at the bottom of the screen, folks.
Try out Roboflow today. They’ve got a great free plan. You can try out a bunch of their awesome tooling. Roboflow makes some of the absolute best tools in the business, if not the best tools in the business. And they’re a good group of folks; their customer support people are always really cool as well. Can’t recommend Roboflow enough. Today’s episode is in fact brought to you by Roboflow. Scan that QR code, sign up. Also, OpenCV gets a little kickback if you become a paying customer of Roboflow, which we encourage you to do. We think as soon as you see what Roboflow can do for you, you’ll want to start paying them for the privilege. And when you do that, OpenCV gets a little bit of money back. So, support open-source computer vision and get some great tooling as part of the deal. So, Peter, I’ve seen on LinkedIn quite a few good third-party demos. I know a lot of people are using RF-DETR. Can you talk a little bit about some of the cooler third-party implementations you’ve seen out there? What kind of cool stuff are people using this for in the wild? Sure. So first of all, because RF-DETR is, like I said, Apache 2, that makes it possible for other libraries to also include RF-DETR. So it’s now available in Transformers. If you would like to use it through the Transformers package, it’s possible. We are also working on making it super easy to use RF-DETR in mobile apps. Right now it’s available in the React Native ExecuTorch library. They have all the sizes and different tasks, even the keypoints task, available there. So if you build a mobile app, you can pretty much drag and drop your model into your library. So that’s from the open-source perspective. It’s getting a lot broader adoption.
Now, when it comes to industry, tons of people use it for all sorts of smart city use cases. So cases where you would like to know how many people or how many cars are in a given location or moved from point A to point B. Tons of use cases there. Sports, from what I heard, is another one. A lot of use cases working with CCTV cameras. A lot of use cases in manufacturing: a camera over a conveyor, and you need to count objects, you need to make sure that they are intact. Plenty of use cases around transport, so shipping yards, stuff like that; that’s also very popular. So generally, any place where you would like to build a product and have a model with an open-source license, but also get high speed and high accuracy, because that’s important. And that is pretty much it, I think. Yeah. Right on. And as I said, some cool videos over on the LinkedIn account. You’re a great LinkedIn follow because you’re always posting cool demos with a lot of this tech. This, this... I’m not an
I’m saying something. Yeah. Yeah. Yeah. It’s a careful line you’ve got to walk. It’s possible to be too excited about your own work, you know, and put it out too much, but I think you do a good job. We got a bunch of questions here from the audience. This generated a lot of discussion. Let’s see, first one here, we’ll do the most recent one, from cat 5D dev. First they say, “heart Roboflow.” We heart Roboflow as well. But they also ask: you mentioned the currently released pose estimation keypoints model would be the biggest available. Are there any plans to add hand or digit keypoints in the future? Absolutely, yes. The idea that we have is that we would like to add more sizes first of all. Like I said, the preview is for us to collect ideas from the community and look for any improvements that we can make. We already have several things that we will most likely change for the next release. So you can expect more sizes, but you can also expect that we will release keypoint models for different use cases. We’ll have pose estimation and we’ll have hand gesture keypoints as well.
I’m a little bit unsure about the license of that model, because the datasets that exist out there for hand gestures are okay for research but not for enterprise. So I’m unsure if we will be able to release those checkpoints under Apache 2 officially. It’s not because we don’t want to; it’s because we have a hard time locating the data. As you said, data is king, so it’s pretty important that the license is okay there. But we will certainly release checkpoints for hand keypoints, and if we find a dataset that is okay, then that will be an Apache 2 release as well. So you can expect all of that to happen in the next weeks or months. For sure we will release the pose estimation keypoints first. All of them. Got it. Sounds like there’s a lot coming down the pike. Looking forward to that. We’ve got a bunch more questions here. Thanks for the question, Cap 5D. Eric Feno asks a similar question: do you think about doing foot keypoints? Heel, big toe, small toe. Is that something that’s been discussed as well? So there are two separate things. There’s what I will be doing, and there is what the research team will be doing. My intention is totally to try that. There is a COCO dataset that contains this information. For reference, the keypoints model that we released has 17 keypoints on the body. That means that, for example, when it comes to hands, you only get a keypoint at your wrist. So wrist, shoulder, and, Jesus, I forgot what you call this... Elbow. Elbow. Thank you very much. I’m not a native speaker, and... You’re doing great.
...it’s my hour 11 of working. But elbow, and obviously hips, and obviously knees, and obviously ankles. So altogether, with the face keypoints, that’s 17. But there are other datasets out there, and very often those datasets have 21 points per hand and another six points per foot. That allows you to dramatically increase the accuracy of your model. I’m always thinking about sports use cases, when I’m doing football analysis, or soccer analysis depending on where you live. It’s pretty hard to do any accurate analysis with 17 keypoints, because all you know is where your ankle is, but if you have those other six points per foot, then you can do pretty accurate analysis. So, coming back to your question, I’m not sure if we will do an official release of that checkpoint, but I personally absolutely intend to fine-tune RF-DETR on that dataset and release that checkpoint, for everybody to just have fun and be able to do cooler stuff. Maybe we’ll do some sort of fine-tuning tutorial around this idea. I’m not sure yet, but I’m certainly super interested in doing that. Thank you for that. And we appreciate you extending your workday to educate us here. I was just explaining why I cannot speak English anymore. No, it’s okay. Not that I’m complaining or anything. You have a much better excuse than I do. I mean, I’m a native English speaker and I just got up, so... Yeah. You surprised me that you’re in Mexico, but... Yeah. I love Mexico. So, we’ve got a few more questions. I’m going to break up the questions here for one moment and do our trivia giveaway. Folks that are longtime watchers know there’s something we do on every single episode of this show, and it’s give away something to you out there in the audience. Today we’re giving away a free OpenCV University course.
You can see what courses are on offer by going to opencv.org/university or by scanning the QR code at the bottom of the screen, appearing directly underneath my face. That’s opencv.org/university, or just scan the QR code. Today’s trivia winner will win the OpenCV University course of their choosing. See all of them that are available there on the website. If you have won in the last couple of months, don’t answer. But do feel free to answer wherever you’re watching. We’re talking Twitch, LinkedIn, Zoom, and YouTube. The first person to answer will win today’s giveaway. Get ready to answer now. So, today we talked about some benchmarks and how the new Roboflow release is better than much of the top stuff out there at what it’s trying to do. One of those was MS COCO. What does MS COCO stand for? What does the acronym MS COCO stand for? Put the whole thing in the chat now and you will win an OpenCV University course.
I’m watching the chat. Don’t disappoint me today, folks.
All right, we got one. We got Zack Crane, our Iowan for the episode. It stands for Microsoft Common Objects in Context. Zack was the first one to get there. Congratulations, Zack Crane, watching on Zoom from Iowa. Please send one email to me, that’s Phil at opencv.org, with the name of the course you would like. Once again, see those courses by scanning the QR code at the bottom of the screen and choose one. Send me an email about it and we’ll make sure that you get that unlocked in your OpenCV University dashboard. Congrats, Zack. I think this is the first time you’ve won, Zack. You’ve been watching for a while. Love to see it. Thanks for joining us again. And we’ve got a bunch more questions here. So, let’s pop open the question tube once more.
You guys got that one pretty quick. We also had Lowi time on YouTube and Cat 5Dev on YouTube. You were this close, folks, but you just didn’t quite get there. Wow, so many questions. Okay, let’s see. We’ve got Muhammad Elbaz on LinkedIn asking, “How many classes can RF-DETR detect in object detection?”
Speaking of COCO, it’s pre-trained on COCO, so it can detect the standard 80 classes that you are probably familiar with from other open source object detectors. Like I said, the checkpoint that you get is trained on COCO, but the knowledge about other things is there in the model. So you can still fine-tune it, but out of the box, 80 classes.
Okay, thanks for that. You covered this question a little bit in the talk earlier, but what are the big differences between RF-DETR and YOLO?
That sounds like it’s not narrowed down to key points only, so I will answer generally, not necessarily only for key points. So what are the differences? Pretty significant ones, to be honest with you. It starts with the architecture. The YOLO architecture is a convolutional neural network, which has been pretty much a staple of computer vision for the past, I don’t even know, probably 10 years or so, maybe even more than that, in some way or another. And RF-DETR is the Roboflow detection transformer. That means that we borrow a lot from what is happening in the transformer space right now. So that’s the difference: the architecture. And that has consequences. For example, one consequence of that architecture is that ConvNets like YOLO require non-max suppression, which is a mechanism that you need to apply at the end to figure out which boxes are duplicates.
You know, which boxes you can remove, keeping only the most important ones. Transformer architectures don’t need that; they learn which boxes are important and which are not. So it’s a very unified architecture: everything happens in a single swoop. That means that, for example, when you export a model that requires additional NMS, let’s say to ONNX, you still need to apply non-max suppression at the end. When it comes to transformers, if you convert to ONNX, that mechanism of removing duplicates is very much embedded into the model. Of course, on the other side of the spectrum, we made this transformer to be extremely fast, but it’s still a transformer. So the model size is larger compared to ConvNets. But we are working on making that not a problem. For example, even though we have more parameters, not all of the parameters are used during inference, and so on. So there’s that. Maybe one more difference that I already covered, but I think is very important, is how we train. We use pre-trained backbones like DINO as a starting point for the model, whereas ConvNets just learn from scratch, from datasets like ImageNet or COCO. That means that they have only seen the images that are in ImageNet or COCO. If you ever look inside the COCO dataset, you would be very surprised about the quality of the data. It’s a dataset that was released a pretty long time ago, and it was largely crowdsourced. So it is not enterprise quality, and there are only 80 classes. So the model only learned what was there, whereas we use DINOv2, a model that was trained on massive, very broad datasets.
So the model knows a lot about the whole world, and that ends up being very useful when you fine-tune the model, because during training we just access information that is already in the model. As a result, for example, when you fine-tune RF-DETR on custom datasets, like aerial datasets or marine datasets or medical datasets, your RF-DETR model will usually score several mAP points higher than a COCO-only model. There are probably a lot more differences, but I think those are the most important ones.
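Editor’s note: the non-max suppression step mentioned in the answer above can be sketched in a few lines. This is a minimal greedy NMS in plain Python, not code from YOLO or RF-DETR; the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes around one object, plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive
```

With a detector that needs NMS, a step like this has to run after the exported model produces its raw boxes. The point made in the answer is that a detection transformer learns to avoid duplicates, so no such post-processing step is needed.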
Yeah, thanks for that. Great, thorough answer. We’ve got a couple more questions and a little bit more time to answer them.
Stefan should stay longer. Usually when there are questions, that’s a good sign. So, you know.
Yeah. Stefan Schneider on LinkedIn asks, “Since RF-DETR is transformer based, would it benefit from things like visual primitives, or is that only a thing for the training stage?”
Oh, I think that is the question where I’m tapping out. Because I know that Stefan follows me on LinkedIn, I’m happy to ask this question to our research team and give him back the answer. It’s probably above my pay grade. Sorry, guys.
Understood. That means it’s a good question. Thanks for that one.
It’s a good question, probably. Yes, but not for the open source, probably.
Understood. We’ve got one from Eric Feno, who asks about this uncertainty around key points: is it just the standard deviation of the heat map, or something else?
Yeah. So importantly, there were key point models in the past that used a heat map to describe how certain the model was about where the point was. Probably the most notable one was ViTPose. But if you visualized the heat maps that they generated, every point had the same heat map. At least that was the case with ViTPose. So it was not very informative. The only information you would get is that the further away you go in any direction from that point, the less certain the model was about the presence of the actual point. In our case it’s all learned, and it varies with direction, so we also provide you with information about how certain we are about the point being in direction X versus direction Y. It also differs between the points: like I said, our spatial confidence is different, for example, for eyes than for hips. That wasn’t the case with ViTPose.
ViTPose was equally certain or uncertain about the locations of points regardless of, for example, the distribution of those points in the dataset. So yeah, there are differences, like I said. And we actually don’t predict heat maps. We predict additional parameters and then just represent them visually as ellipses, but we don’t generate heat maps internally.
Got it. Thank you for that, and thanks for the question, Eric. We appreciate it. We’ve got another question here from YouTube: You mentioned that this was not trained with augmentations. Is that possible with Roboflow? Is it possible to apply adversarial training in Roboflow?
Absolutely. When you train RF-DETR, both in the open source repo and in the product version of the model, you can apply augmentations. The question is in some way important, because that wasn’t the case initially in the open source repo. Initially we were only releasing the bare-bones model with no augmentation pipeline. Right now we allow you to perform all popular augmentations available in Albumentations, I believe, one of those very popular open source libraries for augmentations. So you can add it, and it will probably give you some boost. But I can tell you, if you take a YOLO model and take out augmentations, it will probably drop by 20 mAP points or something like this. If you add augmentations on top of RF-DETR, you might expect probably a percent of mAP boost. So it’s not very crazy. But it will of course increase your training time. So if you care primarily about getting an accuracy boost, and less about how long it will train or how much GPU it will burn, then you can add it both in open source and in the product.
All right, thank you for that. Good to know.
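Editor’s note: going back to the key point uncertainty answer above, the idea of a separate confidence in direction X and direction Y, drawn as an ellipse, can be sketched as follows. This is not RF-DETR’s actual output format. The `KeypointWithUncertainty` class, the `sigma_x` and `sigma_y` fields, the axis-aligned ellipse, and the 2-sigma scale are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class KeypointWithUncertainty:
    """Hypothetical container: a 2D point plus one spread value per axis."""
    name: str
    x: float
    y: float
    sigma_x: float  # assumed spread along the image x axis, in pixels
    sigma_y: float  # assumed spread along the image y axis, in pixels


def confidence_ellipse(
    kp: KeypointWithUncertainty, n_sigma: float = 2.0
) -> Tuple[Tuple[float, float], float, float]:
    """Return (center, width, height) of an axis-aligned ellipse covering
    n_sigma spreads on each side of the point."""
    width = 2.0 * n_sigma * kp.sigma_x
    height = 2.0 * n_sigma * kp.sigma_y
    return (kp.x, kp.y), width, height


# Made-up values: an eye that is tightly localized and a hip that is
# less certain, mostly along the x axis.
eye = KeypointWithUncertainty("left_eye", 120.0, 80.0, 1.5, 1.5)
hip = KeypointWithUncertainty("left_hip", 100.0, 200.0, 6.0, 3.0)

for kp in (eye, hip):
    center, width, height = confidence_ellipse(kp)
    print(kp.name, center, width, height)
```

An isotropic heat map would correspond to the special case where `sigma_x` equals `sigma_y` and is the same for every key point, which is the behavior the speaker contrasts against.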
I feel like I learn something every time we have Roboflow on the show. We’ve got love for cricket, who asks: is this still 2D key points, or is there a way to convert to 3D?
It’s a 2D key point. I even spoke with the ML team. For now there’s no immediate direction that we have there.
Got it. Understood. There’s one longer one here from Simon, who asks: in the preview that you released, can we retrain the model, or are the weights frozen? Also, when fine-tuning the model, what data do we need to provide to the model? Do we need to provide sigmas together with key points, etc.? He was thinking about fine-tuning the model for the use case you mentioned, soccer fields.
Yeah. I’m just laughing because I’m literally training a model for that use case right now, because I’m thinking about maybe using that for the tutorial. But long story short, yes, you can use this model for fine-tuning. The preview model in the product allows you to fine-tune, and the preview model in the open source repo allows you to fine-tune. What do you need to have? I’m pretty sure we support the COCO dataset format for key points. I’m not sure if we support YOLO datasets for key points. You probably need to take a look into the repo, but we support one of those formats. And you don’t need to provide any sigmas. All of that is learned from the dataset. So all we need is a pretty standard representation of a key point dataset, and that’s it. Although, bear in mind, this was also the case with the preview version of the segmentation model.
It might be the case that, with the model you fine-tuned from the preview version of key points, you would need to use that preview architecture all the way. When we do the actual release, we will release four completely brand new checkpoints. So the preview model won’t be the X or the L from our final release. It will be a completely separate model, and it might be the case that it will be deprecated right at release time. So keep that in mind. But the preview segmentation model is still supported in the actual package. There might be some features in the ultimate model that we release that are not in the preview, I don’t know. But it’s totally trainable even right now. Actually, I would super encourage everybody to do that, because anything that you notice can help us make it better. So yeah, do it. Let us know how it went.
Yeah, please do. And if you try something out from Roboflow today from the show, tag OpenCV wherever you post it, on LinkedIn or wherever, and we’ll boost your post for sure. We’ve got one more question here, which is from Ner Nurmmitic: Will this model be a good choice not just for object detection, segmentation, and classification tasks, but also for anticipation of events in long-form videos as well?
So, an important thing to mention: we don’t support the classification task, and I don’t think we will. I also had a conversation with the research team, and we just think that there’s no point in creating another classification architecture. Models like ResNets or other models like this are already very good at classification. So we support object detection, segmentation, and key points. Just a comment on that part.
And when it comes to anticipating events or detecting events in long videos, no, I don’t think so. I mean, you can build pipelines around tasks like this, but I think that particular task will be better solved with VLMs. It highly depends on what your compute capacity is. If you want to deploy this kind of action or event detector on the edge, maybe in that case it makes sense to have a very lightweight model like RF-DETR that can do the heavy lifting, and a thin layer of logic on top of that. Maybe that’s the case, but I think long term that task will be handled by VLMs. And we are actually very much interested in that at Roboflow. Spoiler alert: we are building a leaderboard of VLMs to measure how well they handle tasks like event detection in videos. That’s probably coming in the very near future. So, exciting.
Yeah. Thanks for that. We also just got one more question coming in at the wire here from YouTube: Generally, transformers are used for LLMs and vision language models. Is RF-DETR vision only? Why is it benchmarked against CNN-based YOLO instead of VLMs like Qwen-VL, Gemma, etc.?
Interesting question. Okay, very interesting question. So actually, yeah, there were detection transformers before RF-DETR.
So let’s say it right away: the reason is scale and speed, and also the behavior of the model. LLMs and VLMs are autoregressive models. That means that they generate tokens, and those tokens can then be converted into words or coordinates for bounding boxes. But that is not the behavior of RF-DETR. RF-DETR is not an autoregressive model. So first of all, that’s why we don’t put them in the same category. All of those models use attention as part of the architecture, but there is a certain distinction in how the models behave and what the prediction looks like. LLMs and VLMs are just in a completely different category because they are autoregressive, and that also highly impacts their speed. I know that right now there are some VLMs that do it slightly differently, but most of those models predict one token at a time, which means that even if you would like to detect a single bounding box, you actually need multiple forward passes to get the coordinates and the confidence. We detect all of the bounding boxes at once, just like YOLO models do. And why we benchmark RF-DETR against YOLO models is because of the speed and accuracy ratio. Those are real-time detection models. Those are the models that you would deploy and expect to get tens or hundreds of FPS, both from YOLO models and from our detection transformer models. So I think that summarizes it well. It’s completely different behavior. The only common part is that they use attention. That’s why we separate them from VLMs and LLMs, and why we put them in the same bucket as YOLO is the speed and accuracy: they are clearly targeting the same use cases.
Yeah, that totally makes sense to me. Thank you for the thorough explanation.
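Editor’s note: the speed argument in the answer above comes down to counting forward passes. The toy sketch below only does that arithmetic and does not model any real VLM or RF-DETR. The function names and the assumption of 5 tokens per box (four coordinates plus one label or confidence token) are hypothetical.

```python
def autoregressive_forward_passes(num_boxes: int, tokens_per_box: int = 5) -> int:
    """Toy count for a model that emits one token per forward pass.
    tokens_per_box is an assumed figure, not taken from any real model."""
    return num_boxes * tokens_per_box


def set_prediction_forward_passes(num_boxes: int) -> int:
    """Toy count for a detector that outputs every box in one forward pass,
    regardless of how many boxes there are."""
    return 1


for n in (1, 10, 100):
    print(
        f"{n} boxes: "
        f"{autoregressive_forward_passes(n)} passes token by token, "
        f"{set_prediction_forward_passes(n)} pass all at once"
    )
```

Under these assumptions the token-by-token cost grows linearly with the number of objects in the image, while the single-pass cost stays constant, which is why the speaker groups RF-DETR with real-time detectors rather than with VLMs.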
So that’s all of our questions here. I’m going to take a brief moment to do a quick thank you to all of our sponsors on GitHub. One of the ways that you can support OpenCV is by sponsoring us on GitHub. You can go to github.com/sponsors/opencv and become a sponsor for just n bucks a month. We’ve got 17 of y’all on there right now, and I’m going to thank you individually now. We’ve got Zashon, we’ve got Alberta Beef. Great, great username. The homies at Roboflow, Bears, Dev, OpenCV bronze members as well, Axel T81, DJ Greenwood, Tac Tealoski, Techman, Dan Dagaru, we’ve got Chonx Fres Hugh. I did my best on that one. Please tell me how to actually pronounce that if you’re watching. We’ve got Stefan Sarnv, Luxronic AI, Alexander Voronov, Big Vision LLC, IPOP AI, Comet ML, Alexander Ismolof, and Tala Hussein. Thanks to all of you so very much for your support. You can be as cool as those people and get your names read out here on OpenCV Live by becoming an OpenCV sponsor on GitHub. That’s github.com/sponsors/opencv, or just go to the GitHub repository of OpenCV, click the little heart icon, and it’ll take you to the same page. Thanks so much, everybody. We really, really appreciate you. You are each individually the absolute best. And Peter, you’re also the best. Do you have any final thoughts for the audience here before we call it a day?
Yeah, I’m not sure if I’m the best, but I’ll take the compliments and say thank you. Final thoughts: I mean, guys, use RF-DETR. Let us know if it works for you. Let us know if anything breaks. Open issues on GitHub if anything is wrong or you have ideas on how to improve it. And yeah, let’s keep it rolling. Let’s make this open source, Apache 2.0 licensed detector as good as possible, together. So yeah, that’s it.
Thanks so much, dude. Learned a lot today. We hope you learned a lot today out there as well.
One more way you can support OpenCV, become a member of the OpenCV YouTube channel. You can find out how to do that by scanning the QR code or just clicking the button if you’re watching us on YouTube. That’s one of the ways you can support this essential nonprofit open-source software. We will see you next week. Um, same bat time, same bat channel, 9:00 am Pacific time with our guest, Dennis Baldwin of Drone Blocks.
Excuse me there. Our guest, Dennis Baldwin of Drone Blocks. Drone Blocks is making some really cool stuff for drones. We’re excited to be getting back to talking about some more robotics, especially autonomous vehicles. We hope you’ll join us for that as always, right here on Twitch, on YouTube, on LinkedIn, and on Zoom. Until then, take care of yourselves out there, folks. Take care of somebody else if you can. Use OpenCV5 and have a great day wherever you may be. Adios.