# FathomNet: A global image database for enabling artificial intelligence in the ocean

### FathomNet seed data sources and augmentation tools

FathomNet has been built to accommodate data contributions from a wide range of sources. The database was initially seeded with a subset of curated imagery and metadata from the Monterey Bay Aquarium Research Institute (MBARI), the National Geographic Society (NGS), and the National Oceanic and Atmospheric Administration (NOAA). Together, these data repositories represent more than 30 years of underwater visual data collected by a variety of imaging technologies and platforms around the world. The data currently contained within FathomNet do not include the entirety of these databases, and future efforts will involve further augmenting FathomNet with image data from these and other resources.

#### MBARI’s video annotation and reference system

Since 1988, MBARI has collected and curated underwater imagery and video footage through its Video Annotation and Reference System (VARS39). This video library contains detailed footage of the biological, geological, and physical environment of the Monterey Bay submarine canyon and other areas, including the Pacific Northwest, Northern California, Hawaii, the Canadian Arctic, Taiwan, and the Gulf of California. Using eight different imaging systems (mostly color imagery and video, with more recent additions that include monochrome computer vision cameras14) deployed from four remotely operated vehicles (ROVs MiniROV, Ventana, Tiburon, and Doc Ricketts), VARS contains approximately 27,400 h of video from 6190 dives and 536,000 frame grabs. These dives are split nearly evenly between observations in benthic (from the seafloor to 50 m above it) and midwater (from the upper surface of the benthic environment to the lower boundary of the lighted shallower waters, or $$\sim$$ 200 m) habitats. Image resolution has improved over the years from standard definition (SD; 640 $$\times$$ 480 pixels) to high definition (HD; 1920 $$\times$$ 1080 pixels), with 4K resolution (3840 $$\times$$ 2160 pixels) starting in 2021. Additional imaging systems managed within VARS, including a low-light camera1, the I2MAP autonomous underwater vehicle imaging payload, and DeepPIV71, are currently excluded from data exported into FathomNet. In addition to imagery and video data, VARS synchronizes ancillary vehicle data (e.g., latitude, longitude, depth, temperature, oxygen concentration, salinity, transmittance, and vehicle altitude), which are included as image metadata for export to FathomNet.

Of the 27,400 h of video footage, more than 88% has been annotated by video taxonomic experts in MBARI’s Video Lab. Annotations within VARS are created and constrained using concepts entered into the knowledge database (or knowledgebase; see Fig. S1), which is approved and maintained by a knowledge administrator using community taxonomic standards (i.e., WoRMS35) and input from expert taxonomists outside of MBARI. To date, there are more than 7.5 M annotations across 4300 concepts within the VARS database. By leveraging these annotations and existing frame grabs, VARS data were augmented with localizations (bounding boxes) using an array of publicly available72,73 and in-house74,75,76 localization and verification tools via supervised, unsupervised, and/or manual workflows77. More than 170,000 localizations across 1185 concepts are contained in the VARS database; due to MBARI’s embargoed concepts and dives, FathomNet contains approximately 75% of these data at the time of publication.
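As a concrete illustration of the localization format, the following sketch computes how much of a frame a bounding box occupies (the quantity visualized in Fig. 4a, b). The field names are assumptions for illustration, not the exact VARS/FathomNet export schema:

```python
# Minimal sketch of a bounding-box localization record; field names are
# illustrative, not the exact VARS/FathomNet export schema.

def box_area(localization):
    """Pixel area of a bounding box given as x, y, width, height."""
    return localization["width"] * localization["height"]

def frame_fraction(localization, image_width, image_height):
    """Fraction of the full frame the box occupies."""
    return box_area(localization) / (image_width * image_height)

loc = {
    "concept": "Bathochordaeus mcnutti",  # label from the VARS knowledgebase
    "x": 412, "y": 230,                   # top-left corner, pixels
    "width": 380, "height": 240,          # box extent, pixels
}

# An HD frame is 1920 x 1080 pixels (see above).
print(round(frame_fraction(loc, 1920, 1080), 3))  # 0.044
```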

#### NGS’s benthic lander platforms and tools

The National Geographic Society’s Exploration Technology Lab has been deploying versions of its autonomous benthic lander platform (the Deep Sea Camera System, DSCS) since 2010, collecting video data from locations in all ocean basins42. Between 2010 and 2020, the DSCS was deployed 594 times, collecting 1039 h of video at depths ranging from 28 to 10,641 m in a variety of marine habitats (e.g., trench, abyssal plain, oceanic island, seamount, arctic, shelf, strait, coastal, and fjord). Videos from deployments are subsequently ingested into CVision AI’s cloud-based collaborative analysis platform Tator73, where they are annotated by subject-matter experts at the University of Hawaii and OceansTurn. Annotations are made using a Darwin Core-compliant protocol with standardized taxonomic nomenclature according to WoRMS78, and adhere to the Ocean Biodiversity Information System (OBIS79) data standard formats for image-based marine biology42. At the time of publication, 49.4% of the video collected using the DSCS has been annotated. In addition to this analysis protocol, animals have also been localized using a mix of bounding-box and point annotations. Due to these differences in annotation style, only the bounding-box subset, 2,963 images and 3,256 annotations from the DSCS, has been added to the FathomNet database.
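For illustration only, a Darwin Core-style occurrence record for a single lander observation might look like the following minimal sketch. The term names are standard Darwin Core, but the values and the particular subset of terms are invented here, not the actual DSCS protocol:

```python
# Minimal sketch of a Darwin Core-style occurrence record for one DSCS
# observation. The term names are standard Darwin Core; the values and the
# exact subset of terms used by the DSCS protocol are invented here.
occurrence = {
    "scientificName": "Coryphaenoides armatus",  # nomenclature per WoRMS
    "basisOfRecord": "MachineObservation",
    "eventDate": "2018-06-14",
    "decimalLatitude": -17.75,
    "decimalLongitude": -168.45,
    "minimumDepthInMeters": 5210,
    "maximumDepthInMeters": 5210,
}

# Check that a few commonly required terms are present before export
# (an illustrative subset, not the full OBIS requirement list).
required = {"scientificName", "basisOfRecord", "eventDate",
            "decimalLatitude", "decimalLongitude"}
print(required <= occurrence.keys())  # True
```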

#### NOAA’s Office of Ocean Exploration and Research video data

The National Oceanic and Atmospheric Administration (NOAA) Office of Ocean Exploration and Research (OER) began collecting video data aboard the RV Okeanos Explorer (EX) in 2010 but, due to the volume of the video data, retained only select clips until 2016, when deck-to-deck recording began. Because the EX is NOAA’s first dedicated exploration vessel, all video data collected are archived and made publicly accessible from the NOAA National Centers for Environmental Information (NCEI)80. This specialized access depends upon standardized ISO 19115-2 metadata records that incorporate annotations. The dual remotely operated vehicle system, ROVs Deep Discoverer and Seirios45, contains 15 cameras: 6 HD and 9 SD. Two camera streams, typically the main HD cameras on each ROV, are recorded per cruise. The current video library includes over 271 TB of data collected over 519 dives since 2016, including 39 dives with midwater transects. The data were collected during 3938.5 h of ROV time, 2610 h of bottom time, and 44 h of midwater transects. These data cover broad spatial areas (from the Western Pacific to the Mid-Atlantic) and depth ranges (from 86 to 5999.8 m). Ancillary vehicle data (e.g., location, depth, pressure, temperature, salinity, sound velocity, oxygen, turbidity, oxidation-reduction potential, altitude, heading, main camera angle, and main camera pan angle) are included as metadata.

NOAA-OER originally crowd-sourced annotations through volunteer participating scientists, and in 2015 began supporting expert taxonomists to more thoroughly annotate collected video. That same year, NOAA-OER and partners began the Campaign to Address Pacific Monument Science, Technology, and Ocean NEeds (CAPSTONE), a 3-year campaign to explore US marine protected areas in the Pacific. For this single campaign, expert annotators at the Hawaii Undersea Research Laboratory45 generated more than 90,000 individual annotations covering 187 dives (or 36% of the EX video collection) using VARS39. At the University of Dallas, Dr. Deanna Soper’s undergraduate student group localized these expertly generated annotations for two cruises comprising 37 dives (or 7% of the EX collection) from CAPSTONE, producing 8165 annotations across 2866 images using the Tator annotation tool73. These data form the initial contribution of NOAA’s data to FathomNet.

### Computation of FathomNet database statistics

Drawing several metrics from the popular ImageNet and COCO image databases22,23, along with additional comparisons to iNat201725, we can generate summary statistics that characterize the FathomNet dataset. These measures benchmark FathomNet against these resources, underscore how it differs, and reveal unique challenges of working with underwater image data.

#### Aggregate statistics

Aggregate FathomNet statistics were computed from the entire database, accessed via the REST API in October 2021 (Figs. 4, 5). To visualize the amount of contextual information present in an image, we estimated the number of concepts and instances as a function of the percent of the full frame they occupy (Fig. 4a, b), with FathomNet data split taxonomically (denoted by x) to visualize how the data break down into biologically relevant groupings. The taxonomic labels at each level of a given organism’s phylogeny were back-propagated from the human annotator’s label based on designations in the knowledgebase (Fig. S1). If an object was not annotated down to the relevant level of the taxonomic tree (e.g., species), the next closest rank name up the tree was used (e.g., genus). The average numbers of instances and concepts are likewise split by taxonomic rank (Fig. 4c). The percent of instances of a particular concept, and how they are distributed across all images, is shown in Fig. 4d.
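The label back-propagation described above can be sketched as a walk up a parent map. The toy knowledgebase below is an assumption for illustration, not the actual VARS knowledgebase:

```python
# Sketch of back-propagating an annotator's label up a taxonomic tree.
# The toy parent map below stands in for the VARS knowledgebase (Fig. S1).
PARENT = {
    "Bathochordaeus mcnutti": ("genus", "Bathochordaeus"),
    "Bathochordaeus": ("family", "Oikopleuridae"),
    "Oikopleuridae": ("order", "Copelata"),
}
RANKS = ["order", "family", "genus", "species"]

def lineage(concept, rank):
    """Walk up the parent map, returning {rank: name} for a concept."""
    out = {rank: concept}
    while concept in PARENT:
        rank, concept = PARENT[concept]
        out[rank] = concept
    return out

def label_at_rank(concept, rank, target):
    """Label at the target rank; if the annotation is coarser than the
    target, fall back to the next closest rank up the tree."""
    lin = lineage(concept, rank)
    for r in reversed(RANKS[: RANKS.index(target) + 1]):
        if r in lin:
            return lin[r]
    return concept

print(label_at_rank("Bathochordaeus mcnutti", "species", "family"))  # Oikopleuridae
print(label_at_rank("Bathochordaeus", "genus", "species"))           # Bathochordaeus
```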

#### Concept coverage

Coverage, an indication of the completeness of an image’s annotations, is an important consideration for FathomNet. Coverage is quantified as average recall, demonstrated over 50 randomly selected images at each level of the taxonomic tree (between order and species; Fig. S1) for a benthic and a midwater organism, Gersemia juliepackardae and Bathochordaeus mcnutti, respectively (Fig. 5a). This is akin to examining the precision of annotations as a function of synset depth in ImageNet22. FathomNet images with expert-generated annotations at each level of the tree, including all descendant concepts, were randomly sampled and presented to a domain expert, who evaluated the existing annotations and added missing ones until every biological object in the image was localized. Recall was then computed for the target concept and all other objects in the frame. The false detection rate of existing annotations was negligible (much less than 0.1% for each concept).
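The recall computation can be sketched as follows, treating annotations as sets of localized objects (a minimal sketch; the data structures are illustrative):

```python
# Sketch of the coverage (recall) computation: existing annotations are
# compared against the expert-completed set for the same images.
def recall(existing, completed):
    """Fraction of all true objects that the existing annotations found.
    Both arguments are sets of (image_id, object_id) pairs; `completed`
    is the union of existing and expert-added annotations."""
    if not completed:
        return 1.0
    return len(existing & completed) / len(completed)

existing = {("img1", 1), ("img1", 2), ("img2", 1)}
completed = existing | {("img2", 2), ("img2", 3)}  # expert added two misses
print(recall(existing, completed))  # 0.6
```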

#### Pose variability: iconic versus non-iconic imagery

The data in FathomNet represent the natural variability in pose of marine animals, including both iconic and non-iconic views of each concept. A subject’s position relative to the camera, its relationship with other objects in the frame, the amount it is occluded, and the imaging background are all liable to change between frames. When the average image is computed across a concept, a class with high pose variability (non-iconic) yields a blurrier, more uniformly gray result than a group of images with little pose diversity (iconic)22. We computed the average image from an equivalent number of randomly sampled images across two FathomNet concepts (medusae and echinoidiae) and the closest associated synsets in ImageNet (jellyfish and starfish), shown in Fig. 5b.
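The average-image computation can be sketched with NumPy (a minimal sketch; the toy frames stand in for randomly sampled FathomNet or ImageNet images):

```python
import numpy as np

def average_image(images):
    """Pixel-wise mean of equally sized images (H x W x 3, uint8).
    High pose variability (non-iconic classes) drives the mean toward a
    blurry, uniform gray; iconic classes retain visible structure."""
    stack = np.stack([img.astype(np.float64) for img in images])
    return stack.mean(axis=0).astype(np.uint8)

# Toy demo: averaging a white frame and a black frame yields mid-gray.
frames = [np.full((4, 4, 3), 255, np.uint8), np.zeros((4, 4, 3), np.uint8)]
print(average_image(frames)[0, 0, 0])  # 127
```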

### FathomNet data usage and ecosystem

To grow the FathomNet community, we have created additional resources that enable contributions from data scientists, marine scientists, and ocean enthusiasts alike. Along with the FathomNet database, machine learning models trained on the image data can be posted, shared, and subsequently downloaded from the FathomNet Model Zoo (FMZ63). Community members can not only contribute labeled image data, but also provide subject-matter expertise to validate submissions and augment existing labels via the web portal34. This is especially helpful when images do not have full-coverage annotations. Finally, additional resources include code62, blogs60, and a YouTube channel61 that contain helpful information about engaging with FathomNet.

### Estimating and contextualizing FathomNet’s value

The two most commonly used image databases in the computer vision community, ImageNet and COCO, are built from images scraped from publicly available internet repositories. Both were annotated through crowd-sourcing via Amazon’s Mechanical Turk (AMT) service, where workers are paid per label or image. The managers of these data repositories have not published the collection and annotation costs of their respective databases; however, we can estimate these costs by comparing the published number of worker hours with compensation suggestions from AMT optimization studies. The dollar values recommended by a study generating computer vision training data81 and one generating scientific annotations82 are in keeping with several meta-analyses of AMT pay scales, which suggest that 90% of HIT rewards are less than $0.10 per task and that average hourly wages are between $3 and $3.50 per hour83,84. The original COCO release contains several different types of annotations: category labels for an entire image, instance spotting for individual objects, and pixel-level instance segmentation. Each of these tasks entails a different amount of attention from annotators. Lin et al.23 estimated that the initial release of COCO required over 70,000 Turker hours. If the reward was set to $0.06 per task, category labels cost $98,000, instance spotting $46,500, and segmentation $150,000, for a total of about $295,000. ImageNet currently contains 14.2M annotated images, each one observed by an average of 5 independent Turkers. At the same category-label-per-hour rate as COCO, the dataset required $$\sim$$ 76,850 Turker hours. Assuming a HIT reward of $0.06, ImageNet cost $852,000. These estimates do not include the cost of image generation, intellectual labor on the part of the managers, hosting fees, or compute costs for web scraping.
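The arithmetic behind these estimates can be reproduced directly from the figures quoted above (a sketch; treating the ImageNet reward as paid once per image is an assumption that matches the quoted total):

```python
# Reproducing the crowd-sourcing cost arithmetic quoted above.
HIT_REWARD = 0.06  # assumed dollars per task

# COCO: the three annotation stages at the quoted stage costs.
coco_total = 98_000 + 46_500 + 150_000  # category + spotting + segmentation
print(coco_total)  # 294500, i.e. about $295,000

# ImageNet: 14.2M annotated images; one reward per image reproduces the
# quoted figure.
imagenet_images = 14_200_000
print(round(imagenet_images * HIT_REWARD))  # 852000
```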

Fine-grained, taxonomically correct annotation is difficult to crowd-source on AMT57. The initial release of FathomNet annotations thus relies on domain experts at the institutions generating the images. The annotation cost for one technician in MBARI’s Video Lab is $80 per hour. Expert annotators require approximately 6 months of training before achieving expert status in a new habitat, and continue to learn taxonomies and animal morphology on the job. The bounding boxes for FathomNet require different amounts of time in different marine environments; midwater images typically have fewer targets, while benthic images can be very dense. Based on the Video Lab’s initial annotation efforts, an experienced annotator can label $$\sim 80$$ midwater images per hour, for a $1 per image cost. The same domain experts were able to label $$\sim 20$$ benthic images per hour, or about $4 per image. The 66,039 images in the initial upload to FathomNet from MBARI are approximately evenly split between the two habitats, costing $$\sim$$ $165,100 to generate the annotations. At this hourly rate, ImageNet would cost $$\sim$$ $6.15 M to annotate. We believe these costs are in line with other annotated ocean image datasets. True domain expertise is expensive and reflects the value of an individual’s training and contribution to a project. In addition to the intellectual costs of generating FathomNet, ocean data collection often requires extensive instrument development and many days of expensive ship time. To date, FathomNet largely draws from MBARI’s VARS database, which comprises 6190 ROV dives and represents $$\sim$$ $143.7 M worth of ship time. Including these additional costs underscores the value of FathomNet, especially to groups in the ocean community that are early in their data collection process.
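The expert-annotation arithmetic can likewise be reproduced from the quoted hourly rate and per-habitat throughputs (the per-image costs follow directly from those figures):

```python
# Reproducing the expert annotation cost arithmetic quoted above.
RATE = 80      # dollars per hour for one Video Lab technician
MIDWATER = 80  # images labeled per hour in sparser midwater scenes
BENTHIC = 20   # images labeled per hour in denser benthic scenes

per_midwater = RATE / MIDWATER  # 1.0 dollar per image
per_benthic = RATE / BENTHIC    # 4.0 dollars per image

# 66,039 images, approximately evenly split between the two habitats.
images = 66_039
total = images / 2 * (per_midwater + per_benthic)
print(round(total))  # 165098, i.e. about $165,100

# ImageNet re-annotated at the same expert hourly rate.
print(76_850 * RATE)  # 6148000, i.e. about $6.15M
```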