Please in the chat wherever you’re watching on Zoom, Twitch, LinkedIn Live, YouTube, or even for some reason on Facebook, let us know where you are joining from. We love to see the global OpenCV audience chiming in every Thursday morning here on OpenCV Live. I am coming to you today from beautiful 21. Um, and uh, Doc, where you coming from? San Diego, California. And it’s getting the usual places. The usual places.
I am fixing our Zoom feed real quick. Um, it seems like Okay.
Yes, that was the problem. Uh, sorry Z. Oh, I can’t apologize to them cuz they’re not involved. They’re not in here yet. But all right, Zoom is now live. We should be good everywhere. Um, I know this is super exciting to watch a guy push buttons you can’t see, but uh we do have an exciting episode for you today. We’ve got our very own Dr. Sat will be giving us a tutorial on the Quinn 2.5 VL multimodal language model from Alibaba Cloud. I think a lot of people out there that aren’t uh clued super clued in to the uh computer vision scene uh as it were may not realize that uh Alibaba has such a massive um uh AI infrastructure within the company at this point. Um, I think they know about them at all. Most out there know that the even cheaper version of Amazon basically
like knockoff electronics components. They know they’re they’re a massive company and they have and the thing Yeah. And the thing is that they are also, you know, the solutions they have chosen the open source path. So the weights are open source. It’s really uh you know they’re doing a very good job and uh one of their pet peeves is that you know our models are so good but somehow people don’t uh they they don’t reference us in papers right everybody outside the papers right everybody knows that the models are really good for some reason they they don’t get mentioned right so it’s one of those things that um sometimes it is just it is just sad that people don’t the attention that they they deserve. So we will we will give Quen some attention.
We know all about that here at OpenCV. Something we discuss quite frequently. Um Zoom looks like it’s up everybody on Zoom. Sorry about the late start here today. I am uh not in my wizard tower floating high above Selma in San Francisco, California, but in instead in Tana, Mexico um the happiest place on earth as Crusty the Clown famously called it. And it is that for me I must say every time I come here something else impresses me about it. Um especially the food the food in Tana is fantastic. If you ever get the chance um definitely cross the border and have a visit. It’s um uh a really interesting uh place with awesome food and and extremely cool people. Um but yeah, Zoomers, uh let me know. Let us know where you’re joining from because you missed my little early preamble there. Looks like we’ve got some folks coming in from Nigeria. We’ve got uh Baja. Hey, how you Baja? Uh UAE chiming in here. Um so yeah, let us know where you’re joining from. I know we started a little bit late and we’re also having maybe some small video issues, but it looks like we are in fact live everywhere and so we’ll uh uh get started here. I think Doc, if you’re ready to go, um let’s go ahead and do it. Uh do you want to talk a little bit about OpenCVU at the top? So uh let let me officially start it. Right. Hello everybody. Welcome to OpenCV live. In today’s episode, we are going to discuss Quen 2.5VL. This is a large visual language model where we will discuss what are the capabilities of these model and how to do image captioning, object detection using this model. Very powerful model as you will see. But before we uh start, we also have with us Phil Nelson who is the director of content and creative at OpenCV. He produces the show. If anything goes wrong, it is his fault. Hi Phil.
Good morning out there everybody. Yes, it’s me. It’s me P I P. I am the co-host with the co-host the second of the second. I am also your plus one and only. It’s Mr. Nelson if you’re nasty but you my dear friends can call me Phil. And I’m here to remind you of a few things that we do on every single episode of this year program. The first of which is a giveaway to you out there in the audience. Stay tuned later on in the episode. I will be asking a trivia question based on the presentation by our illustrious host Dr. Satia Malik. And the very first person to answer that question correctly will win the Open CV University course of their choosing. We’ve got a little bit more to say about OpenCV University in a few minutes. But we’re also taking questions from you in the audience. Use the Q&A button wherever you’re watching in the chat. Just type your question or if you’re on Zoom, hit that Q&A button. save those questions for us in about 40 minutes or so. We’ll do Q&A uh afterwards uh after we do our trivia segment. So stay tuned. We like to alert people to pay attention. Uh if you want that course, watch the episode. Remember what the doc says and you too could win. Um but to start us off here, let’s briefly discuss. We’ve got some uh some some big open CVU news here. I’ll bring the screen share up. Yeah. So the sale has started. We started early. That’s why it’s called early bird special. It is this early bird special is going to end uh tonight. So if you’re interested in OpenCV courses, this is the absolute best time to start. So uh the the thing that I recommend is uh you know it’s a collection. If you’re really serious about a career in AI, then uh the best option uh available for computer vision and AI is our computer vision and deep learning master program. So this is a collection of six courses and it basically has all the courses that we offer and there are two courses out of these six courses there are two courses that are super important. It is fundamentals of computer vision and image processing. This is a Python based course and we also have two deep learning courses. You can do u any one of them deep learning with PyTorch or deep learning with TensorFlow and KAS. And you know these are serious courses, right? These are not like uh your normal courses where you go and uh finish it in 7 days, right? These courses will take you uh solid like each course will take you about 3 months to complete. Uh and but but after you done that, you have a very solid foundation of what uh computer vision is and also about deep learning, right? So 3 and 1/2 months for computer vision course, 3 and 1/2 months for the deep learning course, that’s about 7 months and maybe uh 1 month for uh you know uh just just as a a cushion there. So in 8 months you have a very good shot of calling yourself um an AI expert. In fact, I can say without hesitation that this course is better than 99.9% of the courses out there, including top university courses. So, uh so you know this is a very good deal also right now if you look at uh the pricing 40% off that goes away tonight. So, uh if you’re interested, please uh go and check it out. This is for serious people, right? If you just want to dabble with in in computer vision then uh this may not be the right choice for you but if you are serious about getting expertise in computer vision and AI this is the absolute best program for you. uh in addition to this right so people who go and purchase this uh program during this webinar uh and and you can let me know in the chat section you purchase the webinar uh just you know give a screenshot or something uh or or just send us an email at coursesopencv.org board, I will send you a special gift, right? This is a webinar only special and this is basically a book by Frans. This uh of $48 value and uh this is this book is called deep learning with Python and we’ll give you this book for free, right? Uh but you have to purchase this is not included in this um in this webinar uh in this sale Labor Day sale but I have this very special webinar offer for you. If you uh if you purchase it during this webinar before this webinar is over and send send us an email at courses openc.org you will get that book included in this uh in this program. All right so uh let’s get started. Uh, we have a very exciting uh Oh, actually, sweet doc. Yeah. Do you have other things to cover before we get started? No, I think we should just get going here. I’m going to uh put take myself off cam and uh turn the show over to you, Doc. All right. So, uh let me sh uh let me share my screen. How do I share my screen?
Okay, right here. share screen and we are going to start with.
Can you guys see my screen?
Uh, can people see my screen?
Um, yeah, you’re all good, doc. Sorry about that. I was I was muted out for a sec. I’m having some AV difficulties here, but go for it. All right, no problem. So, okay. So, Quen 2.5VL. First of all, what is a vision language model? Right. Now, um a vision language model is basically uh a a model, right? We we some of you may know or most of you probably know what is a large language model. A large language model is an AI uh model that has been trained on internet scale uh data, right? So we basically took all the all the uh data from the internet uh all the text data from the internet and created a single AI, right? AI model that is so large uh that it it basically encodes all the information all the world information and you’re able to answer um talk to it, right? So uh that’s a large language model but of course we know that uh you know text can go only so far we also need to include images and videos and that’s where vision language model comes in. So you know if you look at traditional computer vision uh algorithms or neural networks like image classification let’s say you want uh you want to build an image classification model you give it an input and the neural network basically would tell you what the output is right it could in a standard classification problem the input is a dog and or input is an image and the output is a class label right and you have this neural network that you train and for training this neural network. You take several examples of dogs, cats, horses, whatever you want, whatever classes you want and uh you basically train the neural network. You have the ground truth also. Uh you also know which image contains uh what, right? So image of a dog, you have the label dog, image of a cat, you have the lab label cat and so on and so forth. And then you train uh a uh you know a neural network uh usually an image class uh class classification network like let’s say ResNet 50 and you it will be able to tell you information about those classes only right so let’s say you had 50 classes in your training set you would be able to get 50 classes you know you’ll be if if there is an image of that class it will be able to classify that okay this is a cat dog etc but this is a very limited way uh because it doesn’t know anything outside that class, right? And just imagine um if you give it let’s say there were no elephants in the class and you gave it u image of an elephant, you it would either classify it as one of those uh other uh 50 classes or you could have a catchall, you know, uh unknown class and that that’s the best it can do, right? it can cannot automatically uh figure out that it’s an elephant. However, if you had trained a visual language model, right, you u and a lot of these models are trained on image caption pairs. So they go on the internet, they find out, okay, this is an image, there’s a caption. So this vision language model because the language part is there, it gets world knowledge from the language part. It is able to infer what is what an elephant is. Even though uh you had never explicitly trained it to recognize elephants, the vision language model would be able to recognize elephants because it has been trained on internet scale uh image caption pairs. Right? So that’s the basic difference between u VLMs and uh traditional uh computer vision techniques like uh CNN’s. Now uh using CNN’s right the normal image classification uh techniques uh both worlds right so vision systems when we uh when we analyze an image using let’s say a convolutional neural network we get features which learn which know a lot about that particular image right so we learn those features and that’s the vision part so you are able to very closely analyze what is inside this image but at the same time the language part because it has been trained on world knowledge from the internet uh you’re combining that world knowledge with local information so the vision part is getting local features and this LLM is basically helping you uh get global context so you are able to do so many things which were completely it’s it almost looks like science fiction now right one uh big difference between traditional methods, right? Uh or it’s it’s funny I’m calling them traditional methods now. Uh but they are pretty new techniques, right? U so let’s say image traditional methods of of so long ago like 5 years. So u now think about image classification when we say image classification uh the input and the output they are constant right and only the neural network architecture changes. When I say that ResNet 50 is an image classification network, you would immediately know what is the input and what is the output. Okay. Similarly, when when I say um okay, we train an object detector, you know that the input is an image and the output is a collection of bounding boxes with labels. So there is no ambiguity in what an object detector does uh or what an image classification network does uh or a segmentation model does. the input and output. As soon as I say that I have an object detected, you are very sure what it actually means, right? But that’s not the case with VLMs. The VLMs, they could mean many different things, right? So, um for example, uh a few weeks back we had done um uh uh we we we had done an episode on uh on clip. So the architecture of clip the inputs that were given to clip the output clip gives is completely different from quen 2.5 VL they are completely except for the word VLM that they are both VLMs they are both visual language models there is no similarity between the two uh right the so this is something that you have to uh bear in mind that this is not like when when people talk about VLMs they can actually talk about many different things right. The only thing constant among them is that the input is um image plus uh you know sometimes even the uh input is not um is not the same. But if you look at modern uh you know the more capable VLMs usually you have an input which is image plus text or video plus text and the output is usually text right it can also be an image but for simplicity we can just keep it text right and so today we are going to talk about uh quen 2.5 VL uh now even though in this diagram I say that okay we can have image and uh text usually Everything is encoded in a prompt, right? So the prompt encodes the image, the text, the video, whatever you want to give it, it is encoded in a prompt and then we have the uh visual language model in between and the output is text. So we’ll show you the exact syntax how it is done um in uh using Jupyter notebooks. But let’s first look at the capability, right? What is this visual language model capable of? Uh what you’re looking at this input, right? This input has four images in the same image and you can ask quen 2.5VL what are these attractions right we are not giving one attraction at a time this is a complex image it has four uh different images uh embedded in it and u we are saying that please give the names in Chinese and English okay and if you do this it will output this kind of thing it would say the great pyramid of Giza and uh you know uh words in Chinese which I’m assuming are the great pyramids of Giza and uh similarly right uh it would say the top right uh we have the great wall of China bottom left the statue of liberty top right uh the terraota army now think about it we did not there was no training involved the model is so capable that you give this complex image it figures out the structure the layout of the image that it’s not a single image but a collection of four images And then it is able to say uh you know what are the various things right what are the various attracts. So this is a very powerful capability which uh you know just a few few years back uh it was it was not possible. In fact a year or two back it was not possible. Um now you may be thinking that oh I’m going to use VLMs for everything now because why would I use uh tradition uh why would I use image classification or object detector when I can just use throw a VLM at it. The problem is these are very heavy models, right? You cannot use them uh you can use some of these models on the edge, right? But they require uh they require compute and things that can be done. So don’t use a very big hammer to solve a small problem, right? Use uh the right tool uh for the right task. So there are things where you just don’t have enough data. If you have no data at all, then start with a VLM. And uh if you have data and you have to you have to solve a very specific task for example if you are uh if you’re building something for a manufacturing u inspection in a manufacturing plant then uh you know the kinds of defects right um and the VLM may not have that level of knowledge right that okay this is a defect because who knows right that manu that manufactured plant only you know that a defect looks like this and you may not have many examples of that defect, right? So, you are much better off training um in an an image classification uh network and solving that problem. So, use the right hammer for the task, right? The right tool for the task. Don’t uh you know uh because these things would be very expensive to run if you uh actually use them. Okay, so here’s another example uh what uh Quen 2.5VL is capable of doing. you’re looking at um a picture okay and you’re asking the VLM detect all motorcyclists in the image and return their locations in the form of the coordinates and then we are giving it the format right we want bounding boxes 2D this is the format it has a label and it has a sublabel the label is motorcyclist the sublabel is wearing helmet uh or not or not wearing helmet okay and if you run it on this particular image you will see this output right. So it has you can see that all the all the uh all the motorcyclists have been identified and um and it also you know the the coordinates have been identified and it says not wearing helmet, not wearing helmet, not wearing helmet and then wearing helmet. You would expect you know uh in in a street scene most of them would be wearing helmets and then there would be one which would be not wearing helmet but um I’m guessing that this is a scene from Asia. In fact when I was growing up they would um nobody nobody wore helmet uh while riding bikes. Um so it’s it’s just funny that you would not see this kind of scene in the in the US. Um but yes uh it it did find all the classes as well as uh the subclasses right and now you can use this model. Uh important thing here is you never trained this model on uh on helmet versus non not helmet. You never trained it explicitly to detect uh bikes. Right? So it has figured out by uh you know by the training process that these kinds of things are helmets, these kinds of things are bikes and it also knows how to localize uh this information. Right? So it is you know fascinating what it is able to do. It is almost like science fiction. One uh one thing I would like to point out is that some of these examples have been run uh the quen 2.5VL comes in many different flavors. So the biggest model uh was used to produce these examples. So sometimes when you use a small model which is which you can do on Google Collab, you may not get the same quality results because you know it’s a small model not as big uh for which these results were generated. U it also understands because of the language part it has um it has a sense of what uh the social structure is right. There are a lot of uh you know when when we uh as humans when we interact with other people there are a lot of you know social information that we absorb from the environment uh which which doesn’t need to be explicitly taught to us right the same thing happens with because of the scale at which these things are operating they understand uh the context right the social context also for example in this case uh you are given an image and it is said you know you you ask it what is which person is as a uh you know acting bravely in this case. There’s a lot going on in this image, right? It needs to know that uh first of all, it needs to completely understand that okay, there is a knife, there is a person, he’s threatening a woman and there is another person who is trying to save, right? But not only that, it also uh needs to know this the idea what bravery is, right? What courage is, it needs to have absorbed that idea as well, right? It is not just that you can locate all these things, right? That is that is a mechanical process, right? that is um that is doable right you can understand that but to identify that in this context this is the person who is acting bravely that is quite something that has it gives you a glimpse of intelligence right not AGI or anything but still a glimpse of intelligence however it makes this decision you may see say that oh it’s just doing some calculation yes our brains are also doing similar calculation but you can see that there is a glimpse of uh intelligence here. Okay. And uh another uh task which it is very good at in fact in our consulting we use uh quen 2.5 VL for u for document when whenever there’s a complex document parsing um and uh we have to use VLMs in that case u 2.5 VLM is our VLM of choice. So it’s very good at document parsing. You can in fact you know in this example you can uh say that okay convert this document into HTML format right that’s something but you can also say give me the summary of figure one right or what does figure 1 say right or where is this figure which does this right so you can have very complex uh document analysis that this thing does and it can also structure your document uh PDF especially it can structure your PDF extract information from the PDF very nicely. So if you’re thinking about building uh a document parser uh this may be uh a very good option. So um so this this is very interesting and then we uh it can also analyze videos. Okay. So this is a video u which uh where there you know they they want to localize particular events. So what’s happening? You can basically say uh put in uh this video and say that localize a series of active activity events in the video. Output this uh start and the time stamp for each um for each right. So let me show you. I think I’ll have to restart my sharing because it is on a different um uh it is on a go for right. We’ll make it work. Yeah. One second.
Let me share my entire screen instead of just Mhm. entire screen. I will share screen two. Share. Okay. So you can see that now, right? So in this video, right? After you give this prompt, it goes and says that okay, yep. At start time 21 seconds, a person removes a piece of meat from its packaging and cuts off the fat. So, okay, let me go to the video. Where is the video? Yeah, right here. So, let’s go to 21 seconds.
So, you can see at about 21 seconds, uh, let’s let’s start again. Sorry. So the person is removing 19 seconds the person is removing 21 seconds they removed it put it here and then they start cutting the meat right and the fat uh for my vegetarian friends uh I’m sorry u and for my uh non-vegetarian friends who are salivating I’m sorry as well uh but yeah so uh so basically it is able to identify uh various events in uh in the video, right? And it says that uh at at location 50, the person is seasoning the meat with salt and pepper. And if you go and look at about 50, let’s go here. So, it’s not actually salt and pepper. It’s uh oregano or some other u you can forgive it, right? At least it is a kind of spice that they are putting it on the on the on the meat. So these are the kinds of very sophisticated long video analysis that you can do with Quen. You don’t you don’t do much cooking, do you doc? Why is it salt and pepper?
Yeah, I’m just just wondering. Just wondering. Well, because I I’m not using the uh the correct language of cooking.
All right. So, uh video analysis, right? So it it does a very uh excellent video analysis. Now bear in mind that these results uh you will get this quality result if you’re using the biggest model right as the size of the model comes down you should pair down your expectation also. Okay um they uh it can also do sorry okay and it can also do agent action. Now in the context of VLMs uh agent usually means when you want to do something uh when you want to control your laptop screen or your mobile screen using uh using the outputs right you basically give it you take a screenshot of your laptop right um or your or your mobile uh screenshot give it to the VLM it will automatically figure out what the layout is and what you know what are the various ious apps you have and what are the various things going on and let’s say you start a browser it will know that oh you have you are in a browser just by looking at that image and it will be able to do the next task so you can say that oh make a reservation for me um at my favorite restaurant right whatever the restaurant is u so it will go uh to the browser it’s going to uh start the browser and then look for the favorite restaurant go near your location etc etc so that kind of thing is called agent action and uh this is able to do that as well. So uh okay so that’s let’s look at the usage right how do you actually use the VLM the quen 2.5 VL it is u you interact with it using a chat interface right the good thing is that you can chat with it several times you can give it the context um there are uh the chat looks like this right hopefully most people are familiar with uh JSON format right JSON is you know uh text format where uh you can you can pass structured data uh using this format. So here we first uh you know the chat interface looks something like this. Uh it is called chat.comp completion and we’ll see uh very specific examples. Uh you have the model you say okay I want this particular model and this model comes in many different flavors. This is a 7 billion model. You can also have the three billion uh parameter model and so on and so forth and you know based on how many parameters it has the quality will be different. Then we this is the crux right the main thing is the message that you’re going to pass to uh to Quen. There are two kinds uh of messages actually there are three kinds of messages but uh let’s let’s for for this we can just focus on uh one or two. So the first one is called the system right the system prompt is uh telling it at a high level what it is right it’s giving it an identity that you are an a helpful assistant right um and this is the kind of me the kinds of things you put in the system prompt is something that you would ask again and again. Okay. So, uh for example, when we do object detection, the system prompt could be you are uh you are an object detector that returns the response in a JSON format and then you can uh you can you can say what the format is the bounding box format etc. So once you set these things in the system prompt you don’t have to go and change uh say that again and again in the user prompt right. So user prompt is usually reserved for things that you want to get done immediately. Right? So the system gives it an identity and the user prompt is telling you know it is requesting that I want uh something done. So here you can see that the user prompt we are passing in an image URL and there are different types also the content that uh gets passed uh it can be an array of different things and you can you can this array could be many different sizes right you can pass in one image two images multiple images in any uh any sequence you want okay uh but uh when you’re passing image you can pass it an image URL and then you can also pass in a text. Okay. U so here you can say that okay this is the URL and all I want to know is what’s inside the image. So what is the text in this image? So you can send like that. You can see how simple the user interface is right using just this prompt right uh you can you can uh make it work. No other knowledge is required. So system as I had mentioned that it sets global behavior right and uh you you say things like you are an assistant that answers briefly returns currency in USD right let’s say you are doing receipt analysis right you got a receipt and uh you want to uh even if it is uh let’s say um a Japanese receipt you want the output to be in USD format right uh dollar format you can tell it in the beginning and it knows right then the system prompt sets the stage it gives it an identity and then you can say that users and uses JSON where a user asks for structured data. So um here’s another example of user right I already mentioned what it does so we’ll we’ll just uh skip over it but in case of the receipt example you can pass in the receipt and say how much did the latte cost give the answer as JSON this is not required because we have already put it in uh the uh system prompt uh but it doesn’t hurt if you’re asking the same thing uh again and it will give you the answer even if the receipt is Japanese. It is going to give you the answer in USD. Right? So very simple, right? You don’t you didn’t train the model. You didn’t do anything. This you took this model and uh just passed a prompt and it is give you able to give you an answer. And you can do this right today with uh you know uh on on your laptop uh as long as it has u enough uh GPU. I mean it will still work but it may be slow if you do not have a GPU but you can definitely do it in Google Collab and I’ll show you uh how it is done. Okay. So model card maybe maybe send the prompt to go go make a sandwich go get a some coffee or something when you come back it’ll be done right. Uh so uh this is you know a 3.5 3 billion parameter the uh model that we are going to show in our demo. It is a three billion parameter model and it is multimodal. Whenever people talk about multimodal in uh in in AI models it just means that it is capable of taking more than text right so text plus image or video that’s multimodal as well right and if it sometimes uh you know more sophisticated models would take in speech as well. So that’s uh truly multimodal uh produced by Alibaba uh cloud team which is also called the Quen team. The quen uh license uh you have to be a little careful about it is uh you know it’s a research license for non-commercial uh purposes u and the 3 billion and the 72 billion uh sizes they are not Apache 2 right but if you want to research it you want to um you can use it for a lot of things right uh when when you’re building a model to understand the model so that’s that’s good all Right. Uh and the weights are open source, right? So which means that you can actually use it in your uh notebooks etc. Now let’s start with image captioning. You you guys are able to see this um screen, right?
Um it should be showing a notebook now. See the we see the uh Google collab notebook. All right. Perfect. So let’s uh start with one example, right? We will start with um let me make sure that I I have finished. Yeah. Oh no, I had not finished the Okay, sorry. Uh let’s let’s keep going. Um one of the things that people may be thinking whether whether it will fit in your uh in your GPU or not. So quen 2.5 VL 3 billion right this is going to take about 3.5 to 7 billion um you know uh sorry 3.5 to 7GB which means that an RTX 3050 should be enough and you know the slides will show you um all the different categories that it will you know if you’re using a bigger model you basically need a GPU with a bigger memory. uh this these slides and everything right all this information is in our uh free VLM boot camp as well right so these slides are there you can register for our free VLM boot camp just search for opencv VLM boot camp and you will uh you will find uh this all this uh you know you can register for the free course and you should be able to uh start it or you can scan the QR code that’s up on your screen right now if you’re watching this show live and find it on the OpenCV universe city homepage navigation.
Yeah. Yeah. In the interest of time, actually I’m uh let me just uh skip some of this very quickly. It can do image captioning, object detection, we’ll see this as well. Universal recognition, which we saw an example of again, object detection, different type. So, uh we have pretty much covered this. Okay. Um okay. So, let’s uh let’s go look at image captioning. Here we are going to use a Google Collab and I’m going to use a 2.5 uh you know uh so if you’re using Google Collab for best results right you can go here and check what is the runtime you’re using so the runtime if you’re using uh collab pro plus or collab pro you will notice that uh you have a GPU option right A100 is the most powerful GPU GPU here uh followed Why I’m guessing I uh I always get confused between the two. I think u the T4 is the second powerful and L4 is the uh the least one but you need to use you should use a GPU if you have access to it otherwise things will be slower right so you can uh go and use this runtime. Uh another thing about collabs or any notebook in general is usually you know many people don’t use this feature. There is uh this table of content if the notebook is prepared properly which in our case uh we take great care to prepare the notebook. Uh you will see this table of content uh on the side which makes it very easy to go and uh navigate the notebook. All right. So the very first thing we need to do is install um Quenv utilities. Right. So we pip install this and uh we should be uh good. Uh one other thing is that 2.5 VL it’s going to fit in uh it it consumes about 6 GB of RAM in float 16 format and uh I’ll show you how uh to use it in float 16 format as well. Uh so you know most GPUs 3090 and above should be able to handle this. We start with uh standard imports. uh not going over these. Most of the people on this webinar should be a uh should should know most of these right. Uh the only thing of interest which can be new to you is uh we are importing right the transformers. So transformers is this library by hugging face which is an excellent uh package for Python package which has access you know using this you have access to pretty much all the you know large language models regular models all kinds of models uh in this in this package uh and it’s very nicely done so that you don’t have to do repetitive tasks again and again u all the things all the gotcha things where uh you suppose you did not do the right pre-processing etc is taken care for you. It does under the hood uh it does a very good job. So from there we are going to get uh this uh quen uh 2.5 VL conditional generation and uh we are you know this is a package we are importing and also autoprocessor. Now autoprocessor basically does um pre-processing. Let’s say you have text. The text you know large language model is not going to take text as input. So you need to tokenize it and then send it to u uh the large language model. Now tokenization as we had covered it in the previous uh previous episode while discussing clip it is basically converting uh text to a numerical representation. A token is roughly speaking roughly speaking it’s a word. Okay. So you can pass uh if you have a string you can convert it to numerical values for pretty much every word you will get one number and that process is called tokenization. Now you don’t have to do it because you have the autoprocessor which will do it for you. All right. And then we have you know standard imaging and visualization. And then we uh this one is the important one. uh we want to download quenv utils and we get the process vision info. So this basically does the post-processing uh uh this basically does the post-processing of the model. The model output may not be uh in the format that you want because um you know so so we can do some post-processing on this to clean up. All right. So the very first thing we need to do is uh we need to set the device. If if a GPU is available then we use the GPU and CUDA basically is u if you it’s specifically Nvidia GPU. CUDA is the library that uh is run on Nvidia GPUs and when whenever we want to refer to Nvidia GPUs we say CUDA right and if CUDA is not available if an Nvidia GPU is not available then use the CPU. Okay. Um and then we uh basically tell the model id this is you know 2.5 e 3 billion uh instruct uh for people who do not know whenever there is uh a model has the name instruct in it means that it is uh it is a chat-like interface right you can interact with the model in uh in a chat format. If the instruct is not there in uh a large language model, it means that it is a raw model, right? It is u most of the time you would not download those models. Those are for researchers who want to use the raw output and not uh this additional instruct model. Okay. Then we set up the model. We extract the model using this uh quen 2.5 for conditional generation dot from pre-trained. So look at this. how easy things are these days, right? You basically can extract this model uh by all the tools available. So you give it the model ID. Not only that, you also say that um I want to automatically figure out um what kind of data type this model should be in. Usually when you get the model that would be in float 32 format, right? But you don’t want uh things in float 32 uh because there are float 16 FP16 which will run much faster on the GPU without much loss in uh accuracy right and when you see auto it knows okay uh FP16 works very well on GPUs but pretty much you know you won’t see much uh much in in CPUs right basically it gets converted if you even if you have an FP16 uh data type for CPUs it gets converted ed to FP32. Uh so so this one does a good job of already downloading the model u in in the right format right in the FP16 model format if you have a GPU uh it also there is something called device map when you have multiple devices or the model is very large it uh first tries to use the GPU and if it cannot fit all the layers in the GPU it can use the CPU etc also so uh you just set it to auto so that it uh does the best job uh with the available hardware then we need the uh uh processor as I said that you need to do some pre-processing and u again you know the the syntax is so easy sometimes I feel that everything has been made so easy for um for the next generation uh how will they know what it takes to write write difficult code uh but anyways so you get the processor uh out and We will use this processor to prep-process the data before sending it to the model. And finally uh when you do model device, it does everything needed. It downloads the model as well as sends it to the device. Right? So this model needs to reside on the GPU. Um and and it does all of that thing. Um and you can see here uh it is downloading. Oops. So it’s uh it’s downloading all these models. uh not this model, sorry, all these files which are part of the same model. And here you can see that um there are you know uh there are two files model 0 uh 01 uhsafe tensors and 02.safe tensors and it is going to download both of them and use it in the load it in the right format. You don’t have to worry about uh any of this uh how how exactly it is done. Okay. Next we are ready to upload an image to this model and do some inference right we are uh we want to do image captioning. So we get an image uh this is we are downloading it from the URL and here we are opening the image and for for people uh this you know if you have programmed in Python this may must be familiar code we basically request basically downloads it’s used to download the URL but this downloaded uh you know data is not in the right format it it doesn’t look like a file system to the uh to the image uh you know uh opener So this io.bbyte uh bytes io basically it makes the data look like a file system as if you’re uploading from the if you’re loading the image from the file system and you can o open the image after you open the image. Shout out to the uh Python request library. One of the most useful libraries I think in in all of Pythonom. Yeah. Yeah. Yeah. That’s that’s absolutely true. Uh yeah. So you can download it in one line of code. Oh, one one other thing that we are doing right here is converting it to RGB format automatically so that um so that you you’re not um you know uh you don’t have you don’t get the image in BGR format and then you have to do it again right um okay so this is the image we have uh downloaded and we want to caption this image now as I me mentioned we have to create a message right and here we don’t even need the system prompt Right? We can just use it’s a simple problem. Uh we don’t even need to provide any system prompt and everything can be done using the user prompt. So message we assign a role to it which is user and the content we want to pass in the image and we want to pass in the text describe this image. That is it right? You uh that’s that’s what a caption is. It’s a description of the image. So we are instructing it to uh describe the image. Now let’s see uh you know um h oh yeah oh so this is sorry I was trying to think why did I write this code twice. This is part of the uh markdown and this is the actual code. Uh so you run this you store this in messages. And now uh now let’s let’s look at what the chat template looks like right u the a chat template is basically we have written everything in JSON format but this format is not something u the model recognizes u it it has a particular chat uh template right and we have to use this uh chat template and the template has things like right when it sees uh in the template it uh in the string it sees image start ro content etc. So you need to send it in this format. Um fortunately we don’t need to uh you know construct this manually. But just to give you an idea uh the JSON format that we created is not the one that would be used uh by the model. So we need to we need to change the format. Uh we still want to input in JSON format because that’s convenient but we want some code that will automatically change it. Right? So you can see here uh we do processor and processor has everything done for you right you don’t have to um you don’t have to worry about it and the the fun part with this proc processor is that when you you basically get it automatically right based on the on the model that you have specified. Uh so when you’re using transformer package you can pretty much use the processor in similar ways without knowing uh exactly how it is doing things right. So it’s very consistently done. So you apply the chat chat template we you pass in the uh message you say tok to tokenize is false and the reason we don’t want to tokenize it right now is in the next I I’ll show you in the next uh you know when we actually send it uh we we’ll turn on tokenization but for now we don’t want to show turn on tokenization I want to see if we if you turn on to tokenization we won’t be able to print these things would all look like numbers okay um add generation prompt right what is the generation prompt that also uh will be added and we are going to print it right. So you can see that image start system you are a helpful assistant. We did not explicitly say it but there is a system prompt which uh it automatically appends it. It knows that okay you did not put in the system prompt. I’m going to put in the system prompt myself. Okay. And then we have uh this next one which says uh you know describe the image. So we are uh sending all this information. So this you know you don’t need to know exactly what is the chat template. Um it may be useful for debugging if you have special characters or something can go wrong. Uh but most of the time you’re working with this kind of a JSON uh message. But it’s good to know that this is the input uh that uh that the model is expecting. Okay. Uh so now we have uh you know the uh basically we are using process vision info. This is a utility that basically if you look at the comment it says it walks through every message. It finds all the instances of image and video and then it applies uh quen visual pre-processing to ensure that the image is in the right format that pilimage uh uh format and then it returns two parallel lists right and uh of image inputs video inputs and then uh also you know u it pre-processes both the text as well as the image oh sorry this one is it pre-processes only the vision part Right. So uh you get image inputs and video inputs. It extracts from the message that you had created. And finally we are ready. Right. Uh we are going to pass in this is the input that is going to go to the model. Uh so we use the uh processor. We get the text prompt. Right? So this one is uh basically the reason for these brackets is that it’s a one element right? It is only one text prompt that we are sending. And then we have the image inputs. We send in the video inputs. Uh padding don’t worry about it. You know, set the padding to true. Right? It basically u yeah let me let me skip this thing. Uh set the padding to true. It is some uh little detail that uh it’s not very important here. And then we have uh return tensor in pt. So transformer it’s a python package and it will work with uh tensorflow as well uh as well. So we have to explicitly tell it to use uh pytorch right. So the tensors that it returns should be in pytorch format. Uh for people who are familiar with tens uh you know uh tensorflow you would know and and pytorch both you would know that the two formats are slightly different right. uh the in in in PyTorch the bat size comes first. The first element is the bat size followed by the number of colors followed by uh the width and height. Uh in TensorFlow the format is slightly different. Uh I cannot remember exactly. I have not used TensorFlow for a while but uh I think the uh it is height width and then channel and then finally the bat size. I cannot I I may be wrong on that. All right. So uh now we want to use okay once you have done this right this input you’re passing it to the model right uh to the device sorry and uh this device we have already set to GPU etc. So we had already passed the we had already transferred our model uh to the device but here we are going to pass the input we are going to uh push the input. Um next we have torch.nograd No grad whenever you’re using in PyTorch whenever we are using some sort of um whenever we want to use inference right we the model we want to make sure that the model is not um not generating all the gradients. So to make things faster this narrat is used uh otherwise it’s going to calculate the gradients and gradient calculation is necessary when you’re doing training but during inference when you’re just using the model that calculation is not necessary and it can unnecessarily slow down the model and uh so basically you uh you wrap it in torch.nograd and say model.generate generate and you pass in the inputs and you also say max new tokens is 64. So this gives you a sense of uh you know uh this caps the output to 64 tokens. Okay. Um so the caption would be you can say roughly 64 words maximum. Now just to give you an idea what these tokens look like, we can do inputs. Uh if you look at these input id zero, this is the token, right? Uh these numbers are basically tokens. Okay? And you will see um so they are they are fixed length tokens which means that after if they have used all the tokens they will start padding it, right? The padding equals true is basically it says that pad uh with with the with the same value which represents the end of the string uh if so that the length is the same. Okay. Um so okay so this this is just the input and now uh we are going to do batch decode right we uh the the output that we get right the generated ids that we get it is not in a format that we can read. So we want to convert this um we want to convert this to a caption that we can actually read. Okay. And uh you know there could be some special tokens that we want to skip. And if you look at this uh after all this thing is done uh and we are printing this this is the output you receive. The image depicts a serene and picture picturesque scene of a white dog sitting on a stone pathway near a stunning lake. The lake has crystal clear you know uh sounds very reasonable. So you can see uh very few lines of code and you are able to do something as sophisticated as image captioning. I was planning to go over the object detection notebook as well but I think we are uh we are running um running short short on time so we will skip that. But if you register for this course, you will be able to uh see the object detection notebook also. We cover that in the free course. So uh just do a Google search on u on opencv.org. Uh Google search on opencvl boot camp and you should be able to get it. Maybe if people really want it, we’ll do a whole we’ll do a whole episode on that. Uh yeah, that’s another option. are really interested, send it send an email to [email protected] and request it. Yeah. Um, all right. So, let me stop sharing my screen.
If anybody uh purchased the course in the last uh you know during this hour, please send us an email at coursesopencv.org. org so that we can send you uh the additional free gift which is a book by Francois uh deep learning by Python. It is not included in the current um thing you have. Uh I also want to say you know we also have a webinar um that we will do on uh let me share my screen. So we are running this um on the last day of our um on the last day of our um of this you know uh this event uh this sale event we will have uh this webinar that starts uh at 5:00 p.m. specific time and we will go over you know what the courses are everything we will go over this and you will also learn you know why why should you do uh computer vision now all sorts of things so sign up for this webinar at opencv.org/weinar org/webinar uh and uh you know uh this is the last day of the sale event. So we will basically be there for as long as it is necessary for to answer all the questions. Right. So uh yeah 5:00 p.m. Pacific time September 2nd 2025 uh sign up for this webinar. I’ll be there for uh I plan to be there till the end of uh the sale event which is midnight Pacific time. So we’ll start at 5:00 p p.m. Pacific time. Uh I’ll be there throughout we will be talking about many different things you know how to start a career but also about uh various other things right how do you actually do um how do you actually in in a real world scenario how do you go from uh various steps right model deployment this and that. So many of these things we will cover in those uh in that um in that webinar as well. So opencv.org/webinar.
Yes. And sign up. Um these things are free folks. It’s it’s amazing that they’re still free. Maybe uh Satia will start charging for these at some point. But that that day has not yet arrived. Um yeah and full disclosure full disclosure uh you know this uh this is the last day of u the sale right so there will be selling involved right there is information involved but there is also selling involved um I’m uh I just want to be upfront that it is not uh this kind of webinar which is only educational there it is educational plus selling uh you have been warned Hey, you know, we we try to be honest with you folks out there. So, uh, and speaking of honesty and speaking of free stuff, I think now is a great time to do our giveaway. So, I’m going to go ahead and in fact bring our chat up on the stream here. Um, so, uh, unfortunately I can’t put our Zoom chat up on the screen, but I am monitoring it in a separate window. Um, okay. So, the way this works is I’m going to ask a trivia question based on today’s presentation and the very first person to answer that question correctly in the chat either on LinkedIn, YouTube, Twitch, um, or on Zoom. Also, Satia, there’s a there’s a typo in your in your link there. Um, says webinar, not webinar.
Um the winner. Thanks for that chat. I appreciate it. Yeah. Um we will uh so the the way this works is I’m gonna ask a trivia question based on today’s presentation and the very first person to answer that correctly will win the OpenCV university course of their choosing. You can go to opencv.org/university
to see what courses are on offer. or you can scan the QR code that I’ve just put up on the screen right under our faces here if you’re watching on the live stream. Transform your career with computer vision deep learning and OpenCV courses at OpenCV University. Go ahead and scan that QR code. I’m going to remove the chat here for a moment and uh talk a little bit about our sponsors for today’s episode. But today’s episode is brought to you by OpenCV University as uh Doc talked about a little bit earlier. Um there is in fact a fantastic sale going on on OpenCVU. Um you can go to opencv.org/university
to get the Labor Day sale for just 17 hours 22 minutes and 52 seconds longer using code labor 40. That’s labor 40 for 40% off all OpenCV programs and courses. That is a huge chunk of savings. Don’t miss out on it. That’s opencv.org/university.
This episode is al also brought to you by and partners. OpenCV is a nonprofit organization that puts out free open-source software and as such we depend on the support of organizations just like the ones on your screen right now. Big shout outs to ARM, Qualcomm, RunPod, Intrinsic, Futureway, Jet Brains, Rooflow, Haxter, Seven Sense, Google Summer of Code, Open M, our newest bronze member, Orbic, Tang Vision, AMP Software, Intuitivo, Rerun.io, the Edge AI and Vision Alliance, and Big Vision. If your company uses OpenCV and wants your name to be listed amongst these industry titans, please send one email to Phil at opencv.org and we’ll get talking about it. Um, OpenCV membership has a program designed for companies of all sizes. Whether you’re a new scrappy startup or whether you’re a huge established company like ARM or Qualcomm or whether you’re somewhere in the middle like Rooflow, OpenCV membership has a plan that will work for you and you’ll be able to help support the library that you depend on.
Is also supported by our sponsors on GitHub. By the way, your screeners, we got 14 of them right now. What’s that? Sorry, the screen. Yeah, we’re having some local internet slowness here. It’s We’ll figure it out. Um, yes. So, uh, we’re also supported by our GitHub sponsors. You can go to github.com/sponsors/opencv
and be as cool as Christopher. Uh, man, this is this to me looks like maybe a Polish name. I don’t know if I can do this one. Piraski. Um, we’ve got DJ. We got Tegan Burns. We’ve got Live Pier. We’ve got the homies at Rooflow, double supporters of Open CV. Jesus Anna, DJ Greenwood, IPOP.AI, we’ll have to have them back on the show sometime soon. Comet ML,
Jonas Heinla, Alexander Ismolof, Tala Hussein, Nick Libertini, Blue J, and Ruters Laboratories. We’re also brought to you today by OpenCV’s official merch shop on at opencv.mmyspreadshop.com. We recently slashed prices on everything on the shop. It is now about 50% what it cost just a few weeks ago. For example, iconic statement um on a shirt if the page for me. We’ll we’ll see. We’ll see. $3 versus was about the the 40 that we had it set as last time. Our thinking with this this was, you know, if you want an open CV shirt, you really want for the import CV2 shirt on a very comfortable 100%. You’ve also got the open
clothing also tote bags, pins, fridge magnets. I love OpenCV in English simplified Chinese and an espanol yoamo. Uh OpenCV. Uh you can buy these things at once again opencv.mmyspreadshop.com or you can scan the QR code that I just put up on the screen. 23 bucks for a pretty sweet Open CV t-shirt. Um, since we dropped the prizes earlier this week, we’ve actually sold uh twice as many shirts in the last week than we sold in the last year. So, uh, feeling pretty good about it. Feeling pretty good about it. Really? Yeah. Okay. Well, uh, if you if you drop the price by this much, I think we are definitely not getting any donations really. But it is for open season. Yeah. You know, I mean, it’s it’s promo, right? I think I think at this point inside baseball I think we make about five bucks a shirt or something like that. So, you know, it’s not it’s not that it’s not that much, but it it is it is cool to see people like at your CVPRs or these conferences we go to wearing OpenCV gear. So, buy a shirt. Uh you are supporting OpenCV in your in your small way. Um and we really do appreciate all of these ways for folks to donate. Um before the end of this year, we will also produce um an OpenCV 25 year limited edition t-shirt. So stay tuned for that. Yeah. Or possibly even a challenge coin if you’re into that. Um we we were just talking about this. But okay, uh it is now time for the trivia giveaway. First person to answer in the chat wins the prize. Um, somebody in on YouTube already tried to answer the question before I asked it. Um, bad news. You’re wrong. But, uh, I appreciate your gumption. I really do. Um, I think, uh, uh, you know, trying to get in my head is is is not easy. Oh, wow. Somebody just said they’re buying 10 shirts for their staff at work. Hell yeah. Oh, wow. Thank you so much. Um, thank you, Carlos. We really appreciate that. Um, so we have talked about Quen uh 2.5BL today. Uh, if you’ve won in the last couple of months, please do not answer and give someone else a chance to win. By the way, uh, if you’re answering on Zoom, everyone, and not just hosts and panelists, if you’re watching anywhere else, just post it in the chat. Um, I’ve got both of these windows up right now. We’ve talked a lot about Quen 2.5VL today. in 2.5VL release. What month and year both parts of the answer to win here today?
You got me. Oh, you’re close, Lawrence. I think this is the first time you got me. I don’t know exact. Oh, hey, we got it. It looks like uh look looking here. I see uh Pom Pomudu Pamudu over on Zoom. uh just answered
January. That is indeed correct was uh January 28 official Alibaba blog. Um so congratulations. I know we’re having a little bit of sorry folks. Um but the good news is the laggginess applies to everybody. Zoom is getting the same lag as YouTube. We’re getting the same lag as Twitch. Um, and so, you know, it is what it is. Um, but yes, some of you folks out there also got the right answer. Um, Pamudu is our winner for this week’s trivia giveaway. Please go to opencv.org/university.
Pick out a course, send me an email that’s philopc.org
with the name of the course that you would like and we will make sure you get that added to your OpenCVU dashboard. Congratulations, Pamudu. We’ve got a few questions here, Doc, while we uh before we wrap up here. Um we got somebody asks, “Can you send t-shirts overseas to Angola?” I don’t know about Angola specifically, but if you go to the opencv.mmyspreadshop.com
and add something to your cart, it’ll have a list in the checkout of the of countries they do ship to. And I know it it’s it ships pretty fairly like almost all over the world where it’s that’s like possible to get stuff into basic basically uh without without you know grease and some palms as we as we’ve talked about on this show over the years. So uh thanks for your interest and uh hopefully we can get you a shirt. Um got a couple of questions here saved. Let me go ahead and I’m going to drop off my face so Doc can answer some of these questions himself here. Um, here’s a good one from earlier in the stream from uh from Shan. What are some common mistakes if an image is not classifying correctly and how can we fix them? Well, so uh if it is not a VLM, right? In a VLM, you can use a better prompt, right? So, you can say that let’s say uh it doesn’t happen uh but let’s say uh a simple example, right? uh let’s say it gets uh confused between uh between a dog and a wolf, right? Then uh you would run your you know the prompt you would change that pay special attention right to u dogs and wolves right so most of the things with uh you want to first use prompt engineering to get uh this image classifier if you’re using a VLM right if you’re using a VLM uh it is tricky because this is a large uh large model right you don’t want to jump into finetuning immediately uh so you want to do something using prompt engineering as much as you can. So you would give extra prompts for classes it is trying it it is getting mclassified with saying that pay special attention or you know uh make sure that you don’t do this uh do that. So it does uh those kind of prompting does affect the final output. Uh but you can also fine-tune the model uh to give better results. Right? So you can add image caption pairs to improve the quality uh of the output. But in standard image classification, right, the best way is to just fine-tune the model, right? Because those are not difficult to fine-tune. They’re they’re pretty uh easy to fine-tune. Just wherever it’s getting confused, let’s say it is getting confused between dogs and wolves. Take those examples, the hard examples, this is called hard negative mining. So you take the hard examples and put it in the training set and fine-tune, right? It’s as simple as that. find all the hard examples where the mis mclassification is happening and then uh fine-tune the model again and that pretty much fixes the problem most of the time.
So the in if you’re using standard image classification network the mistake you have made is that your training data is not rich enough right so you basically uh fix your training data
Phil,
where’s Phil?
Sorry guys, we are having some uh technical issues today because of uh uh because of the network.
Oh, let me see if I can I can find some questions.
All right. Uh can we okay says can we please have a session on quen image edit 2 point taken uh we’ll we’ll see we uh when we can fit that in. So we uh basically we have uh a very packed schedule for the next few weeks. So whenever we have some uh you know guest drop off or something happens then we we modify that uh session to a tutorial session. But uh that’s a that’s a uh you know I’ll I’ll look into it how much time it takes to fit that in and if it is in a one in 1 hour we can fit that in we’ll definitely try to do that. That’s a good request. Thank you.
All right. So I’m not sure what’s uh going on. uh right now with the with the network issue. So uh but we are close to uh you know our program already and I have a hard uh stop in just a few minutes. So uh let me conclude this uh show um and and thank you Phil as I said that if anything goes wrong it is his uh it is uh you know Phil organizes this show and if anything goes wrong today it came true finally. Oh, you’re back. Um, for some reason it it put my screen up as well. I don’t know why it did that. Um, yeah. Sorry about that, folks. We we had a uh a a total internet dropout here um in uh in in Tijuana. So, I think it’s it might be because it’s raining a little bit out there. Anyway, um thank you. Uh yeah, Camille. Um maybe we will have a a Quen image edit tutorial here. I’m pulling up the chat. I think we had we had at least one more good Unfortunately, we just lost Zoom. Sorry, folks. Um, yeah, I think we’re we’re just fizzling out here. It’s okay. I think uh we can conclude today’s session and uh the questions we can answer by email.
Son of a [ __ ] You’re Phil, you’re online.
All right. So, uh, Phil thinks that he is offline. Uh, but we are online and, uh, so I I just want to thank everybody who is here today for this uh, for this webinar. Thank you so much. And uh also our sale is going on. If you’re interested in the OpenC CV University courses which are your guide to become an AI engineer, please uh join today. This is one of the best deals you can find. And if you purchased a course a program during this webinar, please send us an email at coursesopencv.org and we will be able to uh you’ll you’ll receive the free book. All right. Uh that’s it. Thank you so much. Uh thank you guys. U I’m uh we have to conclude this because and unfortunately Phil is not here because of network issues but we’ll see you next week.