Viola–Jones object detection framework

Viola–Jones object detection framework

The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones. It was motivated primarily by the problem of face detection, although it can be adapted to the detection of other object classes. In short, it consists of a sequence of classifiers. Each classifier is a single perceptron with several binary masks (Haar features). To detect faces in an image, a sliding window is computed over the image. For each image, the classifiers are applied. If at any point, a classifier outputs "no face detected", then the window is considered to contain no face. Otherwise, if all classifiers output "face detected", then the window is considered to contain a face. The algorithm is efficient for its time, able to detect faces in 384 by 288 pixel images at 15 frames per second on a conventional 700 MHz Intel Pentium III. It is also robust, achieving high precision and recall. While it has lower accuracy than more modern methods such as convolutional neural network, its efficiency and compact size (only around 50k parameters, compared to millions of parameters for typical CNN like DeepFace) means it is still used in cases with limited computational power. For example, in the original paper, they reported that this face detector could run on the Compaq iPAQ at 2 fps (this device has a low power StrongARM without floating point hardware). == Problem description == Face detection is a binary classification problem combined with a localization problem: given a picture, decide whether it contains faces, and construct bounding boxes for the faces. To make the task more manageable, the Viola–Jones algorithm only detects full view (no occlusion), frontal (no head-turning), upright (no rotation), well-lit, full-sized (occupying most of the frame) faces in fixed-resolution images. The restrictions are not as severe as they appear, as one can normalize the picture to bring it closer to the requirements for Viola-Jones. any image can be scaled to a fixed resolution for a general picture with a face of unknown size and orientation, one can perform blob detection to discover potential faces, then scale and rotate them into the upright, full-sized position. the brightness of the image can be corrected by white balancing. the bounding boxes can be found by sliding a window across the entire picture, and marking down every window that contains a face. This would generally detect the same face multiple times, for which duplication removal methods, such as non-maximal suppression, can be used. The "frontal" requirement is non-negotiable, as there is no simple transformation on the image that can turn a face from a side view to a frontal view. However, one can train multiple Viola-Jones classifiers, one for each angle: one for frontal view, one for 3/4 view, one for profile view, a few more for the angles in-between them. Then one can at run time execute all these classifiers in parallel to detect faces at different view angles. The "full-view" requirement is also non-negotiable, and cannot be simply dealt with by training more Viola-Jones classifiers, since there are too many possible ways to occlude a face. == Components of the framework == A full presentation of the algorithm is in. Consider an image I ( x , y ) {\displaystyle I(x,y)} of fixed resolution ( M , N ) {\displaystyle (M,N)} . Our task is to make a binary decision: whether it is a photo of a standardized face (frontal, well-lit, etc) or not. Viola–Jones is essentially a boosted feature learning algorithm, trained by running a modified AdaBoost algorithm on Haar feature classifiers to find a sequence of classifiers f 1 , f 2 , . . . , f k {\displaystyle f_{1},f_{2},...,f_{k}} . Haar feature classifiers are crude, but allows very fast computation, and the modified AdaBoost constructs a strong classifier out of many weak ones. At run time, a given image I {\displaystyle I} is tested on f 1 ( I ) , f 2 ( I ) , . . . f k ( I ) {\displaystyle f_{1}(I),f_{2}(I),...f_{k}(I)} sequentially. If at any point, f i ( I ) = 0 {\displaystyle f_{i}(I)=0} , the algorithm immediately returns "no face detected". If all classifiers return 1, then the algorithm returns "face detected". For this reason, the Viola-Jones classifier is also called "Haar cascade classifier". === Haar feature classifiers === Consider a perceptron f w , b {\displaystyle f_{w,b}} defined by two variables w ( x , y ) , b {\displaystyle w(x,y),b} . It takes in an image I ( x , y ) {\displaystyle I(x,y)} of fixed resolution, and returns f w , b ( I ) = { 1 , if ∑ x , y w ( x , y ) I ( x , y ) + b > 0 0 , else {\displaystyle f_{w,b}(I)={\begin{cases}1,\quad {\text{if }}\sum _{x,y}w(x,y)I(x,y)+b>0\\0,\quad {\text{else}}\end{cases}}} A Haar feature classifier is a perceptron f w , b {\displaystyle f_{w,b}} with a very special kind of w {\displaystyle w} that makes it extremely cheap to calculate. Namely, if we write out the matrix w ( x , y ) {\displaystyle w(x,y)} , we find that it takes only three possible values { + 1 , − 1 , 0 } {\displaystyle \{+1,-1,0\}} , and if we color the matrix with white on + 1 {\displaystyle +1} , black on − 1 {\displaystyle -1} , and transparent on 0 {\displaystyle 0} , the matrix is in one of the 5 possible patterns shown on the right. Each pattern must also be symmetric to x-reflection and y-reflection (ignoring the color change), so for example, for the horizontal white-black feature, the two rectangles must be of the same width. For the vertical white-black-white feature, the white rectangles must be of the same height, but there is no restriction on the black rectangle's height. ==== Rationale for Haar features ==== The Haar features used in the Viola-Jones algorithm are a subset of the more general Haar basis functions, which have been used previously in the realm of image-based object detection. While crude compared to alternatives such as steerable filters, Haar features are sufficiently complex to match features of typical human faces. For example: The eye region is darker than the upper-cheeks. The nose bridge region is brighter than the eyes. Composition of properties forming matchable facial features: Location and size: eyes, mouth, bridge of nose Value: oriented gradients of pixel intensities Further, the design of Haar features allows for efficient computation of f w , b ( I ) {\displaystyle f_{w,b}(I)} using only constant number of additions and subtractions, regardless of the size of the rectangular features, using the summed-area table. === Learning and using a Viola–Jones classifier === Choose a resolution ( M , N ) {\displaystyle (M,N)} for the images to be classified. In the original paper, they recommended ( M , N ) = ( 24 , 24 ) {\displaystyle (M,N)=(24,24)} . ==== Learning ==== Collect a training set, with some containing faces, and others not containing faces. Perform a certain modified AdaBoost training on the set of all Haar feature classifiers of dimension ( M , N ) {\displaystyle (M,N)} , until a desired level of precision and recall is reached. The modified AdaBoost algorithm would output a sequence of Haar feature classifiers f 1 , f 2 , . . . , f k {\displaystyle f_{1},f_{2},...,f_{k}} . The details of the modified AdaBoost algorithm is detailed below. ==== Using ==== To use a Viola-Jones classifier with f 1 , f 2 , . . . , f k {\displaystyle f_{1},f_{2},...,f_{k}} on an image I {\displaystyle I} , compute f 1 ( I ) , f 2 ( I ) , . . . f k ( I ) {\displaystyle f_{1}(I),f_{2}(I),...f_{k}(I)} sequentially. If at any point, f i ( I ) = 0 {\displaystyle f_{i}(I)=0} , the algorithm immediately returns "no face detected". If all classifiers return 1, then the algorithm returns "face detected". === Learning algorithm === The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of M = 162336 possible features, and it would be prohibitively expensive to evaluate them all when testing an image. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost to both select the best features and to train classifiers that use them. This algorithm constructs a "strong" classifier as a linear combination of weighted simple “weak” classifiers. h ( x ) = sgn ⁡ ( ∑ j = 1 M α j h j ( x ) ) {\displaystyle h(\mathbf {x} )=\operatorname {sgn} \left(\sum _{j=1}^{M}\alpha _{j}h_{j}(\mathbf {x} )\right)} Each weak classifier is a threshold function based on the feature f j {\displaystyle f_{j}} . h j ( x ) = { − s j if f j < θ j s j otherwise {\displaystyle h_{j}(\mathbf {x} )={\begin{cases}-s_{j}&{\text{if }}f_{j}<\theta _{j}\\s_{j}&{\text{otherwise}}\end{cases}}} The threshold value θ j {\displaystyle \theta _{j}} and the polarity s j ∈ ± 1 {\displaystyle s_{j}\in \pm 1} are determined in the training, as well as the coefficients α j {\displaystyle \alpha _{j}} . Here a simplified version of the lea

PhotoWorks (ray tracing software)

PhotoWorks is a raytrace rendering program created by Dassault Systèmes SolidWorks Corporation, formerly supplied as a photorealistic rendering add-in for SolidWorks. The program is based on the Mental Ray rendering engine. It has a library of scenes and materials that can be used with user-created SolidWorks files to create still frame images within the SolidWorks GUI. Since the 2011 release of SolidWorks, PhotoWorks has been replaced by the PhotoView 360 rendering utility. A 2010 review comparing PhotoWorks with three other rendering programs for SolidWorks (including PhotoView 360) gave the program high marks for render speed and built-in materials, but low marks for realism and user interface. Appearance File Type: .p2m

Control communications

In telecommunications, control communications is the branch of technology devoted to the design, development, and application of communications facilities used specifically for control purposes, such as for controlling (a) industrial processes, (b) movement of resources, (c) electric power generation, distribution, and utilization, (d) communications networks, and (e) transportation systems.

Digital intermediate

Digital intermediate (DI) is a motion picture finishing process which classically involves digitizing a motion picture and manipulating the color and other image characteristics. == Definition and overview == A digital intermediate often replaces or augments the photochemical timing process and is usually the final creative adjustment to a movie before distribution in theaters. It is distinguished from the telecine process in which film is scanned and color is manipulated early in the process to facilitate editing. However the lines between telecine and DI are continually blurred and are often executed on the same hardware by colorists of the same background. These two steps are typically part of the overall color management process in a motion picture at different points in time. A digital intermediate is also customarily done at higher resolution and with greater color fidelity than telecine transfers. Although originally used to describe a process that started with film scanning and ended with film recording, digital intermediate is also used to describe color correction and color grading and even final mastering when a digital camera is used as the image source and/or when the final movie is not output to film. This is due to recent advances in digital cinematography and digital projection technologies that strive to match film origination and film projection. In traditional photochemical film finishing, an intermediate is produced by exposing film to the original camera negative. The intermediate is then used to mass-produce the films that get distributed to theaters. Color grading is done by varying the amount of red, green, and blue light used to expose the intermediate. The digital intermediate process uses digital tools to color grade, which allows for much finer control of individual colors and areas of the image, and allows for the adjustment of image structure (grain, sharpness, etc.). The intermediate for film reproduction can then be produced by means of a film recorder. The physical intermediate film that is a result of the recording process is sometimes also called a digital intermediate, and is usually recorded to internegative (IN) stock, which is inherently finer-grain than original camera negative (OCN). One of the key technical achievements that made the transition to DI possible was the use of 3D look-up tables, which could be used to mimic how the digital image would look once it was printed onto release print stock. This removed a large amount of guesswork from the film-making process, and allowed greater freedom in the colour grading process while reducing risk. The digital master is often used as a source for a DCI-compliant distribution of the motion picture for digital projection. For archival purposes, the digital master created during the digital intermediate process can be recorded to very stable high dynamic range yellow-cyan-magenta (YCM) separations on black-and-white film with an expected 100-year or longer life. While still subject to the natural degradation of any analog chemical master, this archival format, long used in the industry prior to the invention of DI, was considered valuable for providing an archival medium that is independent of changes in digital data recording technologies and file formats that might otherwise render digitally archived material unreadable in the long term. A "film intermediate" is an analog variation of a digital intermediate, where a project shot on digital video is printed onto film stock and transferred back to digital video to emulate film. The term was coined after it was used on the Oscar-winning 2012 short film "Curfew". The process was also used on the films Dune (2021) and The Batman (2022). == History == Telecine tools to electronically capture film images are nearly as old as broadcast television, but the resulting images were widely considered unsuitable for exposing back onto film for theatrical distribution. Film scanners and recorders with quality sufficient to produce images that could be inter-cut with regular film began appearing in the 1970s, with significant improvements in the late 1980s and early 1990s. During this time, digitally processing an entire feature-length film was impractical because the scanners and recorders were extremely slow and the image files were too large compared to computing power available. Instead, individual shots or short sequences were processed for visual effects. In 1992, Visual Effects Supervisor/Producer Chris F. Woods broke through several "techno-barriers" in creating a digital studio to produce the visual effects for the 1993 release Super Mario Bros. It was the first feature film project to digitally scan a large number of VFX plates (over 700) at 2K resolution. It was also the first film scanned and recorded at Kodak's just launched Cinesite facility in Hollywood. This project based studio was the first feature film to use Discreet Logic's (now Autodesk) Flame and Inferno systems, which enjoyed early dominance as high resolution / high performance digital compositing systems. Digital film compositing for visual effects was immediately embraced, while optical printer use for VFX declined just as quickly. Chris Watts further revolutionized the process on the 1998 feature film Pleasantville, becoming the first visual effects supervisor for New Line Cinema to scan, process, and record the majority of a feature-length, live-action, Hollywood film digitally. The first Hollywood film to utilize a digital intermediate process from beginning to end was O Brother, Where Art Thou? in 2000 and in Europe it was Chicken Run released that same year. The process rapidly caught on in the mid-2000s. Around 50% of Hollywood films went through a digital intermediate in 2005, increasing to around 70% by mid-2007. This is due not only to the extra creative options the process affords film makers but also the need for high-quality scanning and color adjustments to produce movies for digital cinema. == Milestones == 1990: The Rescuers Down Under – First feature-length film to be entirely recorded to film from digital files; in this case animation assembled on computers using Walt Disney Feature Animation and Pixar's CAPS system. 1992: Visual effects supervisor and producer Chris F. Woods creates a VFX studio to produce the visual effects for the 1993 film Super Mario Bros. It was the first 35mm feature film to digitally scan a large number of VFX plates (over 700) at 2K resolution, as well as to output the finished VFX to 35mm negative at 2K. 1993: Snow White and the Seven Dwarfs – First film to be entirely scanned to digital files, manipulated, and recorded back to film at 4K resolution. The restoration project was done entirely at 4K resolution and 10-bit color depth using the Cineon system to digitally remove dirt and scratches and restore faded colors. 1998: Pleasantville – The first time the majority of a new feature film was scanned, processed, and recorded digitally. The black-and-white meets color world portrayed in the movie was filmed entirely in color and selectively desaturated and contrast adjusted digitally. The work was done in Los Angeles by Cinesite utilizing a Spirit DataCine for scanning at 2K resolution and a MegaDef color correction system from UK Company Pandora International 1998: Zingo - The first feature film to use digital color correction via digital intermediate in its entirety. The work was performed at the Digital Film Lab in Copenhagen, using a Spirit Datacine to transfer the entire film to digital files at 2K resolution. The digital intermediate process was also used to perform a digital blowup of the film's original Super 16 source format to a 35mm output. 1999: Pacific Ocean Post Film, a team led by John McCunn and Greg Kimble used Kodak film scanners & laser film printer, Cineon software as well as proprietary tools to rebuild and repair the first two reels of the 1968 Beatles' film Yellow Submarine for re-release. 1999: Star Wars: Episode I – The Phantom Menace - Industrial Light & Magic (ILM) scanned the entirety of the visual effects-laden film for the purposes of digital enhancement and the integration of thousands of separately filmed elements with computer generated characters and environments. Outside of the approximately 2000 effects shots that were digitally manipulated, the remaining 170 non-effects shots were also scanned for continuity. However, after the digital shots were manipulated at ILM, they were filmed out individually and sent to Deluxe Labs where they were processed and color timed photochemically. 2000: Sorted - The first feature-length, color 35mm motion picture to fully utilize the digital intermediate process in its entirety from inception to completion. The film was produced at Wave Pictures' digital intermediate film facility in London, England. It was scanned at 2K resolution with 8 bits color depth per color / per pixel using a pin registered, liquid gate Oxberry

Communications system

A communications system is a collection of individual telecommunications networks systems, relay stations, tributary stations, and terminal equipment usually capable of interconnection and interoperation to form an integrated whole. Communication systems allow the transfer of information from one place to another or from one device to another through a specified channel or medium. The components of a communications system serve a common purpose, are technically compatible, use common procedures, respond to controls, and operate in union. In the structure of a communication system, the transmitter first converts the data received from the source into a light signal and transmits it through the medium to the destination of the receiver. The receiver connected at the receiving end converts it to digital data, maintaining certain protocols e.g. FTP, ISP assigned protocols etc. Telecommunications is a method of communication (e.g., for sports broadcasting, mass media, journalism, etc.). Communication is the act of conveying intended meanings from one entity or group to another through the use of mutually understood signs and semiotic rules. == Types == === By media === An optical communication system is any form of communications system that uses light as the transmission medium. Equipment consists of a transmitter, which encodes a message into an optical signal, a communication channel, which carries the signal to its destination, and a receiver, which reproduces the message from the received optical signal. Fiber-optic communication systems transmit information from one place to another by sending light through an optical fiber. The light forms a carrier signal that is modulated to carry information. A radio communication system is composed of several communications subsystems that give exterior communications capabilities. A radio communication system comprises a transmitting conductor in which electrical oscillations or currents are produced and which is arranged to cause such currents or oscillations to be propagated through the free space medium from one point to another remote therefrom and a receiving conductor at such distant point adapted to be excited by the oscillations or currents propagated from the transmitter. Power-line communication systems operate by impressing a modulated carrier signal on power wires. Different types of power-line communications use different frequency bands, depending on the signal transmission characteristics of the power wiring used. Since the power wiring system was originally intended for transmission of AC power, the power wire circuits have only a limited ability to carry higher frequencies. The propagation problem is a limiting factor for each type of power line communications. === By technology === A duplex communication system is a system composed of two connected parties or devices which can communicate with one another in both directions. The term duplex is used when describing communication between two parties or devices. Duplex systems are employed in nearly all communications networks, either to allow for a communication "two-way street" between two connected parties or to provide a "reverse path" for the monitoring and remote adjustment of equipment in the field. An antenna is basically a small length of a conductor that is used to radiate or receive electromagnetic waves. It acts as a conversion device. At the transmitting end it converts high frequency current into electromagnetic waves. At the receiving end it transforms electromagnetic waves into electrical signals that is fed into the input of the receiver. several types of antenna are used in communication. Examples of communications subsystems include the Defense Communications System (DCS). === Examples: by technology === Telephone Mobile phone Tablet computer Television Telegraph Edison Telegraph TV cable Computer === By application area === The term transmission system is used in the telecommunications industry to emphasize the intermediate media, protocols, and equipment in the circuit, rather than particular end-user applications. A tactical communications system is a communications system that (a) is used within, or in direct support of tactical forces (b) is designed to meet the requirements of changing tactical situations and varying environmental conditions, (c) provides securable communications, such as voice, data, and video, among mobile users to facilitate command and control within, and in support of, tactical forces, and (d) usually requires extremely short installation times, usually on the order of hours, in order to meet the requirements of frequent relocation. An Emergency communication system is any system (typically computer based) that is organized for the primary purpose of supporting the two way communication of emergency messages between both individuals and groups of individuals. These systems are commonly designed to integrate the cross-communication of messages between are variety of communication technologies. An Automatic call distributor (ACD) is a communication system that automatically queues, assigns and connects callers to handlers. This is used often in customer service (such as for product or service complaints), ordering by telephone (such as in a ticket office), or coordination services (such as in air traffic control). A Voice Communication Control System (VCCS) is essentially an ACD with characteristics that make it more adapted to use in critical situations (no waiting for dial tone, or lengthy recorded announcements, radio and telephone lines equally easily connected to, individual lines immediately accessible etc..) == Key components == =

Free boundary condition

In image processing, the free boundary condition is the convention used when applying a convolution kernel to a digital image in which pixel locations that lie outside the image boundaries are interpreted as having a value of zero.[1] The question of what value to assign out-of-bounds pixels may arise, for instance, when applying a 3×3 kernel to the corner pixel in an image.

Open Sound Control

Open Sound Control (OSC) is a protocol for networking sound synthesizers, computers, and other multimedia devices for purposes such as musical performance or show control. OSC's advantages include interoperability, accuracy, flexibility and enhanced organization and documentation. Its disadvantages include higher bandwidth requirements, increased load on embedded processors, and lack of standardized messages/interoperability. The first specification was released in March 2002. == Motivation == OSC is a content format developed at CNMAT by Adrian Freed and Matt Wright comparable to XML, WDDX, or JSON. It was originally intended for sharing music performance data (gestures, parameters and note sequences) between musical instruments (especially electronic musical instruments such as synthesizers), computers, and other multimedia devices. OSC is sometimes used as an alternative to the 1983 MIDI standard, when higher resolution and a richer parameter space is desired. OSC messages are transported across the internet and within local subnets using UDP/IP and Ethernet. OSC messages between gestural controllers are usually transmitted over serial endpoints of USB wrapped in the SLIP protocol. == Features == OSC's main features, compared to MIDI, include: Open-ended, dynamic, URI-style symbolic naming scheme Symbolic and high-resolution numeric data Pattern matching language to specify multiple recipients of a single message High resolution time tags "Bundles" of messages whose effects must occur simultaneously == Applications == There are dozens of OSC applications, including real-time sound and media processing environments, web interactivity tools, software synthesizers, programming languages and hardware devices. OSC has achieved wide use in fields including musical expression, robotics, video performance interfaces, distributed music systems and inter-process communication. The TUIO community standard for tangible interfaces such as multitouch is built on top of OSC. Similarly the GDIF system for representing gestures integrates OSC. OSC is used extensively in experimental musical controllers, and has been built into several open source and commercial products. The Open Sound World (OSW) music programming language is designed around OSC messaging. OSC is the heart of the DSSI plugin API, an evolution of the LADSPA API, in order to make the eventual GUI interact with the core of the plugin via messaging the plugin host. LADSPA and DSSI are APIs dedicated to audio effects and synthesizers. In 2007, a standardized namespace within OSC called SYN, for communication between controllers, synthesizers and hosts, was proposed. == Design == OSC messages consist of an address pattern (such as /oscillator/4/frequency), a type tag string (such as ,fi for a float32 argument followed by an int32 argument), and the arguments themselves (which may include a time tag). Address patterns form a hierarchical name space, reminiscent of a Unix filesystem path, or a URL, and refer to "Methods" inside the server, which are invoked with the attached arguments. Type tag strings are a compact string representation of the argument types. Arguments are represented in binary form with four-byte alignment. The core types supported are 32-bit two's complement signed integers 32-bit IEEE floating point numbers Null-terminated arrays of eight-bit encoded data (C-style strings) arbitrary sized blob (e.g. audio data, or a video frame) An example message is included in the spec (with null padding bytes represented by ␀): /oscillator/4/frequency␀,f␀␀, Followed by the 4-byte float32 representation of 440.0: 0x43dc0000. Messages may be combined into bundles, which themselves may be combined into bundles, etc. Each bundle contains a timestamp, which determines whether the server should respond immediately or at some point in the future. Applications commonly employ extensions to this core set. More recently some of these extensions such as a compact Boolean type were integrated into the required core types of OSC 1.1. The advantages of OSC over MIDI are primarily internet connectivity; data type resolution; and the comparative ease of specifying a symbolic path, as opposed to specifying all connections as seven-bit numbers with seven-bit or fourteen-bit data types. This human-readability has the disadvantage of being inefficient to transmit and more difficult to parse by embedded firmware, however. The spec does not define any particular OSC Methods or OSC Containers. All messages are implementation-defined and vary from server to server.