CVS Model for AI Vision:
Neuromodulation for Recognition and Classification of the Visual Field.
This article is a direct follow-up to a larger investigation into the act of seeing as a process of collapsing a field of ambiguity into meaning. The core of this specific article revolves around the possibility of turning collapse-verify-stamp into a computational and practical frame for AI vision.
Before I move further into the topic itself, I recommend taking a look at the original foundation from which these ideas compound. You can access the Visual Logic Framework directly from here.
WHAT IT MEANS TO SEE
To simplify the complexity of what it means to “see”, we must first establish a common ground to tread on.
Seeing means different things depending on which field is observing it.
Science, philosophy, art, education and technology all value and treat the concept differently.
As a matter of fact, in many cases, these fields disagree on a large variety of fundamental issues.
This article hopefully provides clarity, showing that there are far more agreements beneath the surface than actual conflicts.
The foundation for the CVS model (collapse-verify-stamp), and for what will be proposed here, is grounded in a history of scientific discoveries, technological applications and artistic methods, which are explored in my other articles.
This compressed principle of CVS is an attempt to transpose what can be applied from the biological model - the human aspect of vision - into an artificial form.
Following the tradition of my prior articles, we don’t start with numbers, we start with a question: If ambiguity is to be resolved, it has to be attended to. How does the information that vision is built from flow?
WHAT CAN BE SEEN?
EXTRACTION OF INFORMATION IN VISUALS
1. Light interaction with the environment
As I have mentioned in my earlier articles, light interacts with the environment.
To dive slightly deeper into that, we can simplify this stage thru physics alone.
Light is energy, and as photons encounter matter, they interact with it.
Depending on the matter itself, reactions happen on different scales.
For example:
Energy is converted to heat.
If you have ever worn a black shirt on a sunny warm day, you have noticed that your shirt gets hot.
What happens in reality is that dark or black materials don’t reflect much light; they absorb it.
The energy that the light carries doesn’t disappear - it converts to heat.
Every color and shade that you perceive is what is left over from these interactions.
Everything that light touches can be described mathematically and is largely predictable, since materials nearly always react to it in the same way under specific conditions.
Think of them as fingerprints.
That is how we recognize and differentiate metal, glass, organic fibers or stone: each has a different appearance.
Humans do not measure or calculate all possibilities; we extract only the ones within a specific range.
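The black shirt example can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `interact` and the reflectance values are hypothetical, chosen to show that the energy which is not reflected is absorbed, and that each material leaves its own "fingerprint" across wavelength bands.

```python
def interact(incident, reflectance):
    """Split incident energy per band into a reflected and an absorbed part.

    `reflectance` holds one fraction (0..1) per wavelength band.
    Energy is conserved: reflected + absorbed == incident for every band.
    """
    reflected = [incident * r for r in reflectance]
    absorbed = [incident * (1.0 - r) for r in reflectance]
    return reflected, absorbed


# Hypothetical reflectance "fingerprints" for (red, green, blue) bands.
# These numbers are illustrative, not measured values.
BLACK_SHIRT = (0.05, 0.05, 0.05)
WHITE_SHIRT = (0.90, 0.90, 0.90)

black_reflected, black_absorbed = interact(100.0, BLACK_SHIRT)
white_reflected, white_absorbed = interact(100.0, WHITE_SHIRT)

# The black shirt absorbs far more energy, which is what ends up as heat.
print(sum(black_absorbed), sum(white_absorbed))
```

What reaches the eye is only the reflected part; the absorbed part never becomes visual information.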
2. Light within human sensory thresholds
As mentioned, our brains don’t need to compute the world or calculate numbers to learn to recognize these differences.
Biology developed the perfect precision tools for the job.
As reflected light (the remaining energy after environmental interactions) enters our eyes, the receptors in our retinas don’t react to the full spectrum, but to specific ranges of wavelengths.
Depending on the luminance of the light sources (the exposure level), an adaptive threshold is applied.
The pupil dilates or constricts to capture the optimal ratio.
These thresholds create a sensory scale that adjusts to conditions, stripping away excess noise and filtering the stimuli.
We see different data in the dark than we see in the light.
Adaptive thresholds can be observed with a simple test in which different visual values are introduced against varying background exposure levels.
Between two absolute exposure levels, our perception of the values changes.
In one scenario, we might not be able to effectively distinguish the differences within Group A or B - the values appear nearly identical. In Group C, however, we can read the differences clearly.
If we change the exposure level to the opposite end of the spectrum, the phenomenon flips: now the differences in Group A are clear, while the values within Groups B and C appear identical.
If we shift the exposure level live, the dots almost appear animated.
That is how fast our eyes adjust to conditions.
This behavior is described by the Weber-Fechner law, which is tied to the JND (Just Noticeable Difference): the smallest change in a stimulus that we can detect, which grows with the baseline intensity of that stimulus.
It is the biological foundation from which image-editing tools such as contrast and histogram curves derive.
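The two halves of the Weber-Fechner relationship can be sketched as follows. This is a minimal sketch, not a model of the retina: the Weber fraction `WEBER_FRACTION`, the function names, and the idea of using the background exposure as the baseline intensity are all assumptions made for illustration.

```python
import math

# Assumed illustrative Weber fraction; real values depend on the stimulus.
WEBER_FRACTION = 0.02


def just_noticeable_difference(background, k=WEBER_FRACTION):
    """Weber's law: the smallest detectable change scales with the baseline."""
    return k * background


def is_distinguishable(a, b, background, k=WEBER_FRACTION):
    """True if two values differ by at least one JND at this baseline."""
    return abs(a - b) >= just_noticeable_difference(background, k)


def perceived_magnitude(intensity, threshold, c=1.0):
    """Fechner's law: perception grows with the log of intensity over threshold."""
    if intensity <= threshold:
        return 0.0  # below the detection threshold nothing is perceived
    return c * math.log(intensity / threshold)


# The same pair of values read differently against two exposure levels.
print(is_distinguishable(10.0, 11.0, background=20.0))   # low exposure
print(is_distinguishable(10.0, 11.0, background=200.0))  # high exposure
```

With these assumed numbers the pair is separable at the low exposure and merges at the high one, which is the same flip the dot test shows.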
The conditions of appearance are determined by Energy interaction (the field of possibilities) filtered thru Sensory thresholds.
This establishes the foundation - a puzzle for the brain to solve.
S = E / sensory_threshold > What can be seen.
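One way to read the formula is as a gate whose threshold follows the exposure level. The sketch below is an interpretation, not the implementation behind the CVS model: the name `sensory_gate`, the `noise_floor` and `ceiling` parameters, and the choice to set the threshold equal to the exposure are hypothetical.

```python
def sensory_gate(energy, exposure, noise_floor=0.05, ceiling=1.0):
    """Apply S = E / sensory_threshold with an exposure-driven threshold.

    energy      : incoming energy values (any non-negative numbers)
    exposure    : ambient exposure level; assumed here to BE the threshold
    noise_floor : ratios below this are treated as noise and dropped
    ceiling     : ratios above this saturate, like an overexposed sensor
    """
    threshold = max(exposure, 1e-9)  # guard against division by zero
    signal = []
    for e in energy:
        s = e / threshold
        if s < noise_floor:
            s = 0.0  # stripped away as noise
        signal.append(min(s, ceiling))
    return signal


scene = [0.01, 0.5, 4.0]
print(sensory_gate(scene, exposure=1.0))  # bright conditions
print(sensory_gate(scene, exposure=0.1))  # dark conditions
```

In bright conditions the dimmest value falls under the noise floor, while in dark conditions it becomes visible and the brighter values saturate. The same scene yields different data depending on exposure.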
WHAT IS SEEN?
Human brains don’t recompute the entire visual field and solve the puzzle frame-by-frame.
The brain predicts based on priors, which include the memorization of patterns, expectations and neural reactions.
We could think of this as going to a massive library.
The first time you visit, the sheer amount of shelves, books and different categories makes the experience overwhelming when you are trying to search for a specific book.
Some people begin by exploring what they can find, some use the terminals to locate a section, and some seek help from the librarians.
The more times you visit, the less complex it becomes.
If you worked there as a librarian, navigating the space would be obvious and routine.
This isn’t about the absolute memorization of every single item; it is pattern-based.
If the library gets renovated - even if every shelf, book and category remains identical, and only the physical placement of the shelves is changed - it will require relearning even for the professionals.
Muscle memory works in a similar way.
You learn to ride a bicycle, and it becomes an automated function, what we refer to as a learned skill.
In visuals, this is known as pattern recognition.
As our visual systems develop and our ability to extract information from our surroundings improves, we slowly begin memorizing what the information represents.
This is where the “energy fingerprints” play a significant role.
We quite often take it for granted, and don’t acknowledge how big of an impact visual patterns have on our sense of reality.
This might be due to a phenomenon of early development known as infantile amnesia.
We simply do not remember the first time we saw a mirror, or an object that was yellow.
As far as we remember, we have always known what metal is - and not just metal, but different types of metal.
We know what glass is, and the amount of refraction it creates, even if we don’t possess the vocabulary to explain what “refraction” means.
We actively recognize patterns until they become learned and consolidated.
But there is no actual “hard drive” or memory bank where the raw visual data is stored.
It is a learned attention mechanism.
The question then becomes: How are cognitive resources distributed within this learning mechanism?
After all, if we just randomly selected data, or tried to process everything in our visual field at once, it would overwhelm our neural processing. How do we know what to select, and when?
TWO OPPOSING FORCES OF CLASSIFICATION
Most of us have a memory of the funny mirrors at amusement parks.
When you look at yourself in one, it creates a distorted reflection.
We hold that in our memory likely because it is quite rare.
But as I have watched all three of my kids grow up, they all went thru a period where a regular mirror was the most fascinating thing in the world to play with.
When something grabs our attention this way, we focus on it naturally.
We study it, we play with it and we stare at it.
More importantly, we engage with it.
We think about it from several perspectives: we wonder how it works, and we imagine the possibilities we could use it for.
Throughout your life, every time you saw something for the first time, it triggered this exact same process.
Every material was once ambiguous and extraordinary to you.
Look at toddlers: They play with rocks, sand, grass, sticks, mud and water.
That is how you familiarized and trained your attention mechanism to recognize everything you see today.
But there was a time when those were all anomalies, and playing with them allowed you to be surprised by the discoveries that followed that play.
The assumption among the general population regarding neuromodulators (like dopamine) revolves around the idea that they are a “reward” mechanism - something that gives us a good feeling when we get things right.
But in fact, dopamine plays a much more mechanical role in that cascade.
Science dictates that neurons either fire, or they fall silent.
There is no backwards or “negative” firing, but a dip below the baseline firing rate can itself carry the negative signal.
These firing neurons cause chemical reactions that we identify as feelings or emotions, by way of a complex nervous system that is interconnected throughout.
So there are far more connections than are presented here, but from the standpoint of the role it plays within the CVS model, we will focus on one key quality.
Dopamine neurons fire on surprise events.
What is considered a surprise is generally an anomalous event - something that does not fit into our current expectation or prediction of the environment.
It is a signal that alters our attentional awareness and, by doing so, forces us to learn.
The outcome of that event may then get classified.
To simplify the concept: when an outcome is stable, expected or successfully verified, and has been consolidated, serotonin steps in.
Serotonin stabilizes the network, acting as the counterpart to dopamine’s exploratory push.
Many of the medications used to affect these reactions are traditionally suppressors.
Despite the common thought that one is adding “more” of a substance to feel better, the medication is often used to suppress an overactive firing mechanism.
Like using a fire extinguisher on a system that is burning too hot.
This stage of learned attention can be formalized as:
A = S x pattern_weights > What is seen.
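A rough sketch of how the two opposing forces could act on the pattern weights. Everything here is a simplifying assumption rather than the actual CVS code: the names `attend` and `modulate`, the constants, and the specific update rules are hypothetical. A surprise pulls a weight toward full attention (the dopamine-like push), while an expected input lets the weight settle toward a baseline (the serotonin-like stabilizer).

```python
BASELINE = 0.25  # assumed resting attention weight for familiar patterns


def attend(sensory, pattern_weights):
    """A = S x pattern_weights, applied element by element."""
    return [s * w for s, w in zip(sensory, pattern_weights)]


def modulate(pattern_weights, sensory, predicted,
             surprise_threshold=0.25, boost=0.5, settle=0.5):
    """Update the weights from the mismatch between input and prediction."""
    updated = []
    for w, s, p in zip(pattern_weights, sensory, predicted):
        surprise = abs(s - p)
        if surprise > surprise_threshold:
            # Dopamine-like: an anomaly pulls the weight toward 1.0.
            w = w + boost * (1.0 - w)
        else:
            # Serotonin-like: an expected input settles toward the baseline.
            w = w - settle * (w - BASELINE)
        updated.append(w)
    return updated


sensory = [1.0, 0.5]    # first channel is brighter than predicted
predicted = [0.5, 0.5]  # second channel matches the prediction
weights = modulate([0.5, 0.5], sensory, predicted)
print(weights)
print(attend(sensory, weights))
```

The surprising channel gains weight and dominates the attended signal, while the expected channel fades toward routine.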
FROM AMBIGUITY TO MEANING
Now that the foundations of seeing and recognizing the visual field are set, we arrive at the final step of the collapse-verify-stamp cascade.
Patterns themselves are not memories - at least, not in the way humans consciously define them. People rarely ask, “Do you remember my coffee machine?” without a reason.
If an object wasn’t related to a specific event, or if it wasn’t visually anomalous, our brains reject it as a candidate for long-term memory.
If there is no context around it and it never triggered an anomaly, we simply do not find it meaningful enough to store.
Our concept of memories is based on events, not “things”.
They are lived experiences.
We remember emotional states and moods that, in context, are tied to sounds, smells, feelings, tastes or visuals - either individually or combined across our senses.
We all have had experiences that seem to move us backward thru time.
A specific smell might instantly transport you to a childhood moment you thought you had forgotten.
Our sensory system is exceptional in this way.
It doesn’t store things long-term simply because they are visual anomalies; it stores them because the full experience was exceptional and attended.
It assigns a salience value that carries personal meaning, perhaps to you only.
Another example is when we try to learn a new skill. Usually, this requires persistence thru several failures.
Based on our track record, our expectation is failure, and more importantly, we consciously recognize that failure.
Once we finally succeed, our system reacts powerfully.
That success triggers a massive spike and gets pushed as a candidate for long-term memory.
But the key is what happens next: when you succeed the next time, it will not cause that same massive reaction.
Your expectation switches fast.
Once you know you can do it, failing actually causes you to feel regret much more dramatically.
As we continue succeeding, the skill becomes more and more routine, until it is so obvious that we do not react to it at all.
You will probably always remember one or two of those early successes, but most of them fade into automation.
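The fading reaction to repeated success can be illustrated with a simple prediction-error update (a delta rule of the kind used in reinforcement learning). This is an analogy and not a claim about how the CVS model is trained; the function name `run_trials` and the learning rate are assumptions.

```python
def run_trials(outcomes, expectation=0.0, learning_rate=0.5):
    """Track the prediction error for a sequence of outcomes.

    outcomes    : 1.0 for a success, 0.0 for a failure
    expectation : current belief that the next attempt succeeds
    Returns the list of prediction errors and the final expectation.
    """
    errors = []
    for outcome in outcomes:
        error = outcome - expectation         # the surprise signal
        errors.append(error)
        expectation += learning_rate * error  # belief moves toward the outcome
    return errors, expectation


# Three failures, four successes, then an unexpected failure.
errors, final_expectation = run_trials([0, 0, 0, 1, 1, 1, 1, 0])
print(errors)
# [0.0, 0.0, 0.0, 1.0, 0.5, 0.25, 0.125, -0.9375]
```

The expected failures cause no reaction, the first success produces the largest spike, each later success produces a smaller one, and a failure after the skill is learned produces a large negative error. That last value corresponds to the regret described above.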
In a similar way, our visual field becomes obvious.
What gets chosen - what gets collapsed from the infinite field of possibilities - will be based on fingerprints, familiarity, context, expectations or anomalies.
It is entirely meaning-driven.
To tie this together as a structural component for vision, the core of CVS would be:
E = energy
S = sensory
A = attention
M = meaning
S = E / sensory_threshold > What can be seen
A = S x pattern_weights > What is seen
M = A x memory_context > What has meaning
Each stage creates a multiplicative gate that doesn’t recompute everything; it operates only on what passed thru the prior stage.
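The three gates can be chained in a short sketch. It is a toy reading of the formulas, assuming scalar values per element and hypothetical cut-offs (`min_signal`, `min_attention`); it is not the Biovision architecture itself. The point it shows is that an element rejected at one stage is never computed at the next.

```python
def cvs_cascade(energy, sensory_threshold, pattern_weights, memory_context,
                min_signal=1.0, min_attention=0.5):
    """Run E -> S -> A -> M per element, stopping early at each gate."""
    meanings = []
    for e, w, c in zip(energy, pattern_weights, memory_context):
        s = e / sensory_threshold  # S: what can be seen
        if s < min_signal:
            meanings.append(0.0)   # below the sensory threshold, stop here
            continue
        a = s * w                  # A: what is seen
        if a < min_attention:
            meanings.append(0.0)   # not attended, stop here
            continue
        meanings.append(a * c)     # M: what has meaning
    return meanings


result = cvs_cascade(
    energy=[0.5, 4.0, 4.0, 8.0],
    sensory_threshold=2.0,
    pattern_weights=[1.0, 0.1, 0.5, 1.0],
    memory_context=[1.0, 1.0, 0.0, 0.5],
)
print(result)
# [0.0, 0.0, 0.0, 2.0]
```

The first element is too weak to be sensed, the second is sensed but not attended, the third is attended but carries no memory context, and only the fourth collapses into meaning.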
IMPLICATION
The original Biovision architecture, which was explored conceptually in the Visual Logic Framework, has now been developed in practice thru this CVS concept using 3D data and classification.
The current stage of the model’s architecture was initially tested and trained on ModelNet40, where it achieved a test accuracy of 90.98%.
Further development is currently underway on ScanObjectNN’s hardest variant (PB_T50_RS), where it has recently pushed past the 81% accuracy mark while running on a consumer-grade NVIDIA RTX 4070 Ti Ventus with 16GB of DDR5 Fury Beast RAM - moving beyond historical baselines such as PointNet++ and DGCNN, into the realm of heavily optimized data center architectures.
More about the explicit development stages and the performance will follow shortly.
The technical translation of this biological model into code has offered a revealing insight into how many variables move the gradients.
If you managed to read all the way thru, I would like to thank you for taking the time to do so.
It seems that in today’s era, there will be far fewer real readers and more crawlers doing the reading for you.
Sincerely,
Mikko
Note: some graphics and other material will be added shortly to accompany this article. It is just my weird habit of adding those visual ingredients as an update rather than doing it all at once.