whether a text belongs to a certain category.
These algorithms sometimes make reference to the
position of a word within a text or its relative proximity
to other words. The weights are usually determined
by machine learning methods (often described as “artificial
intelligence"). In this procedure, which resembles
an automated form of trial and error, a system is given
a “training set” consisting of documents
preclassified into two or more groups, along with
a set of features that might be useful
in classifying them. The system then “learns”
rules that assign weights to those features according
to how well they work in classification, and assigns
each new document to a category with a certain probability.
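To make this procedure concrete, the following sketch (written in Python using the scikit-learn library; the example documents, category labels, and choice of classifier are invented for illustration and are not drawn from any party in this case) shows a small training set, learned feature weights, and the probabilistic assignment of a new document:

    # A minimal sketch of supervised text classification: a "training set"
    # of documents preclassified into two groups, word features, learned
    # weights, and probabilistic assignment of a new document to a category.
    # (Example documents and labels are invented for illustration.)
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    train_docs = [
        "explicit adult content warning",     # hypothetical "blocked" example
        "gardening tips for spring flowers",  # hypothetical "allowed" example
        "adult material not for minors",
        "local library reading program",
    ]
    train_labels = ["blocked", "allowed", "blocked", "allowed"]

    # Features: simple word counts. This representation ignores the position
    # of a word within a text and its proximity to other words, which, as
    # noted above, some systems also take into account.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)

    # "Learning" assigns a weight to each word feature according to how
    # well it separates the two groups.
    model = LogisticRegression()
    model.fit(X, train_labels)

    # A new document is assigned to each category with a certain probability.
    new_doc = ["spring program for adult readers"]
    probs = model.predict_proba(vectorizer.transform(new_doc))
    print(dict(zip(model.classes_, probs[0])))

A production system would train on far larger document sets and richer features, but the trial-and-error character of the procedure is the same: weights are tuned to whatever features happen to separate the training groups.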
Notwithstanding their “artificial intelligence”
description, automated text classification systems
are unable to grasp many distinctions between types
of content that would be obvious to a human.
And of critical importance, no presently conceivable
technology can make the judgments necessary to determine
whether a visual depiction fits the legal definitions
of obscenity, child pornography, or material that is harmful to minors.
Finally, all the filtering software companies deposed
in this case use some form of human review in their
process of winnowing and categorizing Web pages, although
one company admitted to categorizing some Web pages
without any human review. SmartFilter states
that “the final categorization of every Web
site is done by a human reviewer.” Another
filtering company asserts that of the 10,000 to 30,000
Web pages that enter the “work queue”
to be categorized each day, two to three percent
are automatically categorized by its PornByRef
system (which applies only to materials classified
in the pornography category), and the remainder are
categorized by human review. SurfControl also
states that no URL is ever added to its database without
human review.
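A rough calculation using only the figures quoted above shows how much of the daily queue still falls to human reviewers:

    # Arithmetic on the quoted work-queue figures: 2-3% of the 10,000 to
    # 30,000 pages entering the queue each day are categorized automatically
    # by PornByRef; the remainder must be reviewed by humans.
    for pages in (10_000, 30_000):
        for pct in (0.02, 0.03):
            auto = pages * pct
            print(f"{pages:>6} pages/day at {pct:.0%} automatic: "
                  f"{auto:>4.0f} automatic, {pages - auto:>6.0f} human-reviewed")

On these figures, automatic categorization handles only about 200 to 900 pages per day, leaving roughly 9,700 to 29,400 for human review.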
Human review of Web pages has the advantage of allowing
more nuanced, if not more accurate, interpretations
than automated classification systems are capable
of making, but suffers from its own sources of error.
The filtering software companies involved here have
limited staffs, ranging from eight to a few dozen people,
available to review Web pages by hand. The reviewers
employed by these companies base
their categorization decisions on both the text and
the visual depictions that appear on the sites or
pages they are assigned to review. Human reviewers
generally focus on English-language Web sites and
are not required to be multilingual.
Given the speed at which human reviewers must work
to keep up with even a fraction of the approximately
1.5 million pages added to the publicly indexable
Web each day, human error is inevitable. Errors
are likely to result from boredom or lack of attentiveness,
overzealousness, or a desire to “err on the side
of caution” by screening out material that might
be offensive to some customers, even if it does not
fit within any of the company’s category definitions.
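The scale mismatch underlying this finding can be made concrete with a back-of-envelope sketch; the staff sizes come from the record described above, while the review rate of one page per minute over an eight-hour day is an assumption chosen purely for illustration:

    # Back-of-envelope: how much of the ~1.5 million pages added to the
    # publicly indexable Web each day could a small review staff cover?
    # Staff sizes are from the text; the review rate is assumed.
    PAGES_ADDED_PER_DAY = 1_500_000
    PAGES_PER_REVIEWER_PER_DAY = 60 * 8  # assumed: 1 page/minute, 8-hour day

    for staff in (8, 36):  # "between eight and a few dozen" reviewers
        reviewed = staff * PAGES_PER_REVIEWER_PER_DAY
        share = reviewed / PAGES_ADDED_PER_DAY
        print(f"{staff} reviewers: {reviewed:,} pages/day "
              f"({share:.2%} of new pages)")

Even under these generous assumptions, a few dozen reviewers could examine only about one percent of the pages added each day, which is the pressure under which the errors described above arise.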
None of the filtering companies trains its reviewers
in the legal definitions of obscenity,
child pornography, or material that is harmful to minors, and none
instructs reviewers to take community standards into
account when making categorization decisions.