This section is a hybrid between a glossary and a FAQ. It explains some
of the terms used in building this ranking and the meanings we give them.
Database size. The number of records in a search engine's database
that are publicly accessible from external sources. Robots do not all
crawl the Web at the same time or with identical procedures; in addition,
post-crawling processes and commercial requirements result in very
different databases. The current size, composition and evolution of
these figures are relevant to webometric analysis.
Delimited search. A key feature of search engines that makes
cybermetric analysis possible. A delimiter operator has a specific syntax
and meaning that can differ among engines. It returns the number of
records (web pages) that satisfy a certain condition, filtering the
results according to strings in the address (URL) or other
characteristics of the page (language, format). Of special relevance is
the link delimiter, which can be combined with site or similar operators
to calculate inlinks.
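
As a minimal sketch of how such delimited queries are assembled, the Python
helper below builds query strings from an operator and a value. The function
name delimited_query and the domain example.edu are hypothetical; the site and
linkdomain operators are the ones mentioned in this glossary, and their
availability and exact syntax may differ from engine to engine.

```python
from typing import Optional


def delimited_query(operator: str, value: str,
                    exclude_site: Optional[str] = None) -> str:
    """Build a delimited search string such as 'site:example.edu'.

    Hypothetical helper: it only formats text and does not contact
    any search engine.
    """
    query = f"{operator}:{value}"
    if exclude_site is not None:
        # A leading minus sign excludes pages hosted on the given domain.
        query += f" -site:{exclude_site}"
    return query


# Pages inside a domain.
print(delimited_query("site", "example.edu"))
# Links pointing to a domain, excluding the domain's own pages.
print(delimited_query("linkdomain", "example.edu", exclude_site="example.edu"))
```

The number of results an engine reports for each string is the figure used in
the analysis; the helper only shows how the strings are composed.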

Discipline differences. The ranking does not assign any subject
classification to the units, so a formal thematic analysis is not
possible at the moment. However, there are important differences in
academic focus within our universities database that should be taken
into account. Research-focused universities are mixed with teaching
institutions, and a group of discipline-oriented organizations (mainly
pedagogy, medicine and theology) is also present.
Formal characteristics. As there is neither universal document control
nor formal guidelines for building web pages, the Webspace shows a huge
diversity of formal aspects, including obvious malpractices. Some authors
have focused on these to propose new indicators such as link density;
link quality, expressed as the ratio of non-working links; missing tags,
including ones as relevant as title or metadata; or updating frequency.
None of these characteristics are taken into account in our rankings,
but they should be considered for micro-analysis.
Geographical biases. We use several search engines in our ranking
because of the geographical bias observed in some of them. We do not
know whether this is due to topological or traffic problems in the
network (some East Asian countries are usually poorly covered) or to
crawler behaviour, nor whether the biases remain constant over time.
Alexa's biases prevent us from adding popularity data to our rankings.
Institutional domains. The basic unit of our analysis is the common URL
domain shared by all the websites of an institution. Unfortunately some
organizations maintain two or more equivalent domains without marking
one as preferred. Another concern is that some second-level departments
maintain completely different domains. We usually maintain two entries
for institutions with two equivalent top-level domains. We intend to
merge the results of smaller domains with those of the main one in the
near future, but it is a difficult task.
Invocation. The presence of the name of an institution or a researcher
in a Web page. The global presence is the number of times the name
appears on the Web, and it can be calculated easily by putting the name
in quotation marks in the search engines. This figure is sometimes
referred to as the number of times the name is cited on the Web. Some
authors refer to it as Web visibility, although we prefer to reserve
that word for link visibility. This indicator usually favours large,
well-known, old institutions regardless of their real effort to achieve
a relevant Web presence.
No invocation measure was used in our ranking, mainly because it is not
possible to assign a unique, unambiguous universal name to every institution.
Invisible Web. Traditionally, the information available through
gateways or search interfaces that is not accessible to the search
engines’ robots. It is a huge part of the Internet's content, including
library catalogues, bibliographic and alphanumeric databases and even
some document repositories. In recent years some engines, especially
Google, have made a great effort to index these records, and in fact
several databases are more or less covered in their systems (e.g.
PubMed is partially indexed by Google). Our ranking does not consider
the Invisible or Deep Web, and we encourage transforming it into
crawler-friendly information.
Language. English is the “lingua franca” of scientific communication
and it is also the language of a significant fraction of Internet users.
Institutions in non-English-speaking countries that publish only in
their mother tongue achieve lower visibility than those with
multilingual websites.
Link motivation. A major concern in link analysis is the motivation
behind the creation of a link. Previous studies suggest that “sitations”,
the hypertextual equivalent of bibliographic citations, are still rare.
We think this situation will improve when more papers become available
on the Web, but we consider other reasons for linking very useful for
describing scholarly communication. Informal linking is a powerful
source of information about the intellectual, economic and political
connections of academic and scientific activities.
| CATEGORY | CASE | COMMENTS |
| Sitation | Link to paper or document | Generally in pdf/ps/doc format |
| Teaching/learning | Link to course materials | Mainly html pages but also pdf, doc or ppt |
| Research oriented | Resources index | Portal type |
| | Software repository | |
| | Research projects sites | |
| | Conferences, seminars or meetings pages | |
| | Raw data | Including media files if applicable |
| Personal | Self archive | Pre or post prints, but also unpublished material |
| | Team or colleagues pages | |
| | Blog | |
| | Third parties (non-research) | |
| Institutional | Parent institution | And related ones |
| | Funding organization | |
Link popularity. Another term for link visibility that has been used
extensively. We prefer to reserve popularity for the number of visits.
Although it is not yet implemented in the Ranking, we intend to consider
number of visits, or popularity, as a relevant factor in future rankings.
Open access. The movement to distribute openly the scientific
production of, at least, publicly funded researchers is facing tougher
opposition than expected. A strong commitment to open access initiatives
will be clearly reflected in our rankings.
Personal pages. A frequently heard statement about the quality of web
contents concerns the information provided by the personal pages of
students or staff members. A lot of free space hosted on university web
servers is used for personal purposes, and it is generally thought to
hold low-quality information or material unrelated to academic work.
Data suggest that a large number of small websites are crowding the
institutional domains, but most of them are interesting enough to merit
consideration. Some “personal” pages are in fact research group sites,
while others are institutional (scientific societies, electronic
bulletins, conference sites). True personal pages cover both extremes
of the content range, from people offering only CVs to others providing
very large collections of information on their academic or research
topics, with links to personal repositories of documents. A striking
pattern is the absence of links to the websites of colleagues or other
institutions.
Quality. We advise against using the rankings as a global or partial
indicator of quality. Impact or visibility better describes our aims,
but only in the particular context of promoting open and universal
access to scientific activities and results through the Web.
Ranking. As their main objective is purely commercial, current search
engines do not offer stable, reliable or trustworthy results for
webometric purposes. The situation has improved in recent years, but
there are still important biases and a worrisome instability. This is
the reason we use relative positions rather than absolute values in our
analysis.
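
The idea of working with positions instead of raw counts can be illustrated
with a short Python sketch. The function relative_positions, the domain names
and the figures are hypothetical and serve only as an illustration; this is
not the procedure actually used to build the ranking.

```python
from typing import Dict


def relative_positions(counts: Dict[str, int]) -> Dict[str, int]:
    """Convert raw counts into positions (1 = highest count).

    Domains with equal counts share the same position.
    """
    ordered = sorted(counts.values(), reverse=True)
    # The index of the first occurrence equals the number of larger values.
    return {domain: ordered.index(count) + 1 for domain, count in counts.items()}


# Made-up figures: the second run simulates unstable engine counts.
first_run = {"a.example.edu": 120000, "b.example.edu": 118000, "c.example.edu": 45000}
second_run = {"a.example.edu": 125000, "b.example.edu": 119000, "c.example.edu": 43000}

print(relative_positions(first_run))
print(relative_positions(second_run))
```

In this made-up case the counts change between runs while the positions stay
the same, which is why positions are less sensitive to instability in the
reported numbers.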
Rich files. A general term comprising a rather heterogeneous group of
file types, mainly those intended to represent self-contained enriched
documents, such as MS Word doc, Adobe Acrobat pdf or PostScript ps. In
our analysis we also included MS PowerPoint ppt and excluded xls, latex
and tex. Rich files are relevant because they are used for scholarly
communication: authors usually distribute their papers and presentations
in these formats. Certainly some of these types are used extensively for
bureaucratic purposes (forms, administrative documents, internal
reports), but these can explain only a small percentage of the large
numbers observed in domains with extensive repositories.
Several other file types can be considered rich files, and even raw
formats like txt are used for distributing academic content, but their
individual contribution is too low to be considered.
Rounding. Google and Yahoo offer rounded results, ending in ,000,
which means an error rate of the order of 2 to 5%. Moreover, the number
provided by Yahoo on the first page is about another 4-5% higher than
the one shown on the following pages, which tend towards the “correct”
number.
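
The size of the rounding effect can be estimated with a small Python sketch.
It assumes that an engine rounds to the nearest thousand, which is an
assumption made here for illustration and is not documented behaviour of any
engine; the function name is hypothetical.

```python
def max_relative_rounding_error(reported: int, step: int = 1000) -> float:
    """Upper bound on the error, relative to the reported figure,
    when a count has been rounded to the nearest `step`."""
    if reported <= 0:
        raise ValueError("reported count must be positive")
    # Rounding to the nearest step moves a value by at most half a step.
    return (step / 2) / reported


for reported in (10_000, 25_000, 100_000):
    print(reported, f"{max_relative_rounding_error(reported):.1%}")
```

Under this assumption the bound is 5% for a reported figure of 10,000 and 2%
for 25,000, and it shrinks as the reported figure grows.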
Search Engine. The software that searches an index and returns
matches. Search engine is often used synonymously with spider and index,
although these are separate components that work with the engine. Only
four engines are useful for quantitative analysis, as they have a large,
independent, self-crawled database and their retrieval system allows
results to be filtered by URL-related delimiters:
Google www.google.com
Yahoo Search search.yahoo.com
Bing www.bing.com
Exalead www.exalead.com/search
Self archiving. Self-archiving involves depositing a free copy of a digital document on the World Wide Web in order to provide open access to it.
The term usually refers to the self-archiving of peer-reviewed research journal and conference articles, as well as theses, deposited in the author's own institutional
repository or open archive for the purpose of maximizing their accessibility, usage and citation impact. This practice is common among the most prolific authors and in certain
disciplines. Globally, however, only a minority of authors support this option. As many of these papers are published as rich files (pdf, ps or doc), this practice
notably increases the performance of an institution in our rankings.
Size. The size of an institutional domain is the combined number of
pages of all the websites within that domain, including html and
non-html formats that can be treated as pages. From a practical point
of view, size is the number returned by a search engine for a query like
site:domain. This indicator is central to our rankings and is also used
by other authors as the denominator in Web Impact Factor calculations.
However, pages vary widely according to different criteria, including
content size measured in bytes. For example, one page may contain a pdf
monograph of several hundred pages totalling several Mb of text and
images, while another consists only of the phrase “page under
construction”. Global size could be an interesting indicator and we
expect to provide it for selected websites.
Stability. From the early days, the instability of search results in
general, and of the number of results reported in particular, has been
a subject of special concern. Certainly the Web is a highly dynamic
system, growing at an incredible pace, but crawlers also change their
specifications and schedules unexpectedly. A world
crawling round can last from 15 to 45 days and in
this meantime.
Visibility. In the context of this ranking, the term refers to link
visibility: the number of external inlinks received by an institutional
domain. The most commonly used syntax for this request in search
engines is:
linkdomain:webometrics.info -site:webometrics.info
Web cost. Maintaining a very large presence on the Web can be quite
costly, requiring specific funding and human resources, but the total
cost is far below that of any other publication method and the potential
audience is truly global. One way to undertake large projects is to
distribute the effort, so that individual graduate students, professors
or researchers, scientific teams and other administrative units have an
autonomous web presence. A content-rich page should include a wide
variety of objects, including images and other media files, a certain
number of navigational links and a selected group of external outlinks.
That can require a huge effort, which can only be undertaken if these
tasks are evaluated like other academic and scientific activities.
Web Impact Factor. The most cited cybermetric indicator, although its
usage is not universal because of several shortcomings. It is defined as
the ratio between the external inlinks received by a website and the
number of web pages comprising that website. Some authors have suggested
modifying the denominator, using alternative measures of the size of
the institution based on non-Internet data such as the number of
potential authors (staff, professors, graduate students), economic
wealth (funding, projects) or bibliometric data (papers in journals).
Our ranking is derived from the WIF, with a 1:1 ratio established between
visibility and size.
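
The ratio that defines the Web Impact Factor can be written as a one-line
calculation. The Python sketch below uses a hypothetical function name and
made-up figures; it illustrates only the basic definition given in this
entry, not the way visibility and size are combined in our ranking.

```python
def web_impact_factor(external_inlinks: int, pages: int) -> float:
    """Ratio of external inlinks received by a site to its number of pages."""
    if pages <= 0:
        raise ValueError("the number of pages must be positive")
    return external_inlinks / pages


# Made-up example: 50,000 external inlinks to a site of 200,000 pages.
print(web_impact_factor(50_000, 200_000))
```

The modified versions suggested by other authors would replace the pages
argument with another measure of institutional size, such as staff numbers
or papers published.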