GEOGRAPHIC AND DISCIPLINARY DISTRIBUTION OF THE BRAZILIAN'S PHD COMMUNITY: PATTERNS OF THE SCIENTIFIC COLLABORATION STRUCTURE

The study of national academic characteristics is essential for understanding national scientific production and for creating effective science policy. Using a dataset of more than 3.2 million Brazilian curricula, we explore the academic community of PhDs working in Brazil in order to identify characteristics of the whole national network and of the network of each knowledge area. We used metrics from social network analysis and text mining techniques, and analyzed the patterns of collaboration between areas and the regional distribution of PhDs. The results show different general characteristics of the PhDs working in each Brazilian state and knowledge area, according to the social and economic characteristics of each of the five Brazilian regions. Different interaction profiles were described. At one extreme is the less connected network of Linguistics, Letters and Arts, in which each researcher is related, on average, to fewer than three other PhDs; at the other is Agricultural Sciences, in which each researcher is related, on average, to more than nine other PhDs of the network. It is clear that, apart from the capitals and one or another major city, the Northeast Region is devoid of PhDs, a situation that is particularly problematic for the most destitute region of Brazil.


Introduction
The assessment of the academic social network of any country is a complex task that involves the treatment of a large volume of data and requires different kinds of studies. Brazil is the fifth largest country in the world and the fifth most populous. Additionally, the country is composed of five major regions (North, Northeast, Central West, Southeast, and South), which differ greatly in economic and cultural terms, in literacy levels and access to higher education, in their history of occupation and colonization, etc.
For example, the Northern Region is mostly occupied by the Amazon Forest, has a large portion of its population composed of indigenous people and their descendants, and has social and [...]
One real obstacle to assessing the full scientific production of any given country is the need to gather hundreds of information sources, given the diversity of document types, disciplines and scope (national or international). However, this challenge is reduced by the existence of a national database of academic curricula, managed by a system called the Lattes Platform. It gathers information about research groups, academic curricula, and academic institutions.
Scientific collaboration is influenced by several factors, such as geographical proximity and the relationships established among researchers. In addition, although studies on scientific collaboration often focus on co-publication, scientific collaboration should be considered beyond research results (Vanz and Stumpf, 2010). It permeates the whole research cycle, from its conception (when, for example, a process of supervision begins) to its development in the execution of a project.
The information needed to establish the different kinds of relationships is typically dispersed. In this sense, the information provided in the academic curricula helps to overcome various obstacles: the problem of homonyms, the dispersion of bibliographic output across different sources of information, the current affiliation of the researcher, the advisor-advisee relationship, the areas of interest, as well as collaboration with other researchers.
The Lattes Platform is an online system maintained by the Brazilian National Council for Scientific and Technological Development (CNPq) to gather academic and professional information from researchers who work in Brazil. The Lattes database records the past and current activities of Brazilian researchers and is used not only by CNPq but also by other federal and state institutions and development agencies.
Today, there are more than five million curricula registered in the Lattes Platform, and many studies use this database as their primary source of data. However, there are some challenges in using this dataset: (a) it is not possible to download the entire database; instead, each curriculum must be individually downloaded in HTML or XML format; (b) many fields are filled in manually by the owner of the curriculum and lack standardization; (c) more than half of the curricula were last updated more than one year ago; and (d) the volume of information is enormous (the XML files occupy more than 140 gigabytes, and there are millions of curricula, tens of millions of publication records, tens of millions of academic degree records, millions of records of interest in knowledge areas, and so forth).
Digiampietri, Luciano Antonio; Mugnaini, Rogério; Trucolo, Caio; Delgado, Karina Valdivia; Mena-Chalco, Jesus Pascual; Köhler, André Fontan. Geographic and Disciplinary Distribution of the Brazilian's PHD Community: patterns of the scientific collaboration structure. // Brazilian Journal of Information Science: Research Trends. 13:4 (2019) 113-131. ISSN 1981-1640
This paper aims to describe, analyze and evaluate the Brazilian scientific/academic community, through its professors, researchers and other professionals who hold at least a Ph.D., along two dimensions. We describe the distribution of PhD holders over areas of interest as well as over the five major Brazilian regions; the data were analyzed and can be visualized for the federal units as well as for the concentration of PhDs in the main cities. The assessments are based on well-known metrics from social network analysis (Wasserman and Faust, 1994; Camacho, Kim and Trawinski, 2015).
The remainder of this paper is organized as follows: the next subsection contains a brief review of the related work; Section 2 presents the methodology; Section 3 presents the findings and discussion; Section 4 discusses the limitations of the research; and Section 5 provides the conclusions.

Related Work
In recent years, several studies have analyzed the Brazilian academic social networks on a micro level. There are still relatively few studies that assess the academic social network of the whole country, composed of researchers who work in different knowledge areas. Some of them (Melo, 2011; Digiampietri et al., 2014; Mena-Chalco, 2013; Mena-Chalco et al., 2014; Tuesta et al., 2015; Lima et al., 2015; Silva et al., 2017; Dias and Moita, 2018; Damaceno et al., 2019) use the information registered in the Lattes Platform. Tuesta et al. (2015) examine the advisor-advisee relationship for researchers in the area of Exact and Earth Sciences in Brazil and its eight subareas. The authors identified a positive correlation between the duration of the cooperation between advisee and advisor and the productivity of the advisees. Additionally, they analyzed the gradual decrease in intellectual dependence between the advisor and the advisee. Based on the curricula registered in the Lattes Platform, Damaceno et al. (2019) were able to draw a broad overview of the advisor-advisee relationship and to measure the level of interdisciplinarity between areas of knowledge, among other points. Mena-Chalco et al. (2014) analyzed the Brazilian coauthorship network built using data from more than one million curricula from the Lattes Platform. The authors analyzed the network using different graph metrics and constructed subnetworks according to the knowledge areas declared in the curricula. They assessed the structure of these networks and the social behavior of the researchers in the different areas.
In a PhD thesis, Melo (2011) characterized the Brazilian scientific community considering three aspects: productivity, internationalization, and visibility. The author examined the curricula published in the Lattes Platform of 51,080 PhDs who participate in research groups (the groups are also registered in the Lattes Platform). The aspect of internationalization was further studied by Mugnaini, Leite and Leta (2012), who compared the internationalization profiles of two subgroups of the 51,080 PhDs in the Lattes Platform: those who published at least one article in a Web of Science journal, and those whose names were not present in that database.
According to Dias and Moita (2018), although individuals with PhDs account for only 5.38% of the curricula registered in the Lattes Platform, they are responsible for 74.51% of journal papers and 64.67% of communications published in the proceedings of technical-scientific events. Besides, individuals with PhDs tend to have more up-to-date curricula and to have at least one registered publication.
Dias and Moita (2018) also draw attention to the fact that PhDs are responsible for advising master's and doctoral students in the stricto sensu graduate programs in Brazil.
Another information source used in academic social network analysis around the world is the DBLP, a platform that provides bibliographic information on major computer science journals and proceedings, with information on more than 2.3 million papers. Freire and Figueiredo (2011) used data from DBLP to analyze the Brazilian coauthorship network. They grouped the researchers according to the graduate program where they worked and compared the network metrics with a ranking attributed by the Brazilian Coordination for the Im-

Methodology
The present work is divided into five activities: (a) data gathering; (b) sample selection; (c) information extraction; (d) metrics' calculation; and (e) analysis of the results.
Data Gathering: All data used in this work were extracted from the Lattes Platform curricula. In July 2013, the search service of this platform was queried in order to retrieve the identifiers of all the curricula. About 3.2 million identifiers were retrieved, and all the XML files of the respective curricula were downloaded.
[...] Applied Social Sciences, Engineering, and Linguistics, Letters and Arts). These metrics are useful to understand the structure and behavior of the academic and professional communities composed of the PhDs from each of the knowledge areas. All the networks produced correspond to undirected and unweighted graphs, in which each node corresponds to a PhD and each edge (link) corresponds to a co-authorship relation between two PhDs.
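The graph representation described above can be illustrated with a minimal sketch. It assumes that the relations have already been extracted as pairs of PhD identifiers; the function names and the toy data are hypothetical and are not taken from the original pipeline.

```python
def build_coauthorship_graph(phd_ids, coauthor_pairs):
    """Build an undirected, unweighted graph as a dict mapping node -> set of neighbours."""
    known = set(phd_ids)
    adjacency = {phd: set() for phd in known}
    for a, b in coauthor_pairs:
        # Skip self-loops and pairs involving someone outside the selected sample.
        if a == b or a not in known or b not in known:
            continue
        # Sets collapse repeated collaborations into a single unweighted edge.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def average_degree(adjacency):
    """Mean number of distinct collaborators per node (isolated nodes included)."""
    if not adjacency:
        return 0.0
    return sum(len(neighbours) for neighbours in adjacency.values()) / len(adjacency)


# Hypothetical toy data: four PhDs, with one collaboration recorded several times.
phds = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "A"), ("B", "C"), ("A", "B")]
graph = build_coauthorship_graph(phds, pairs)
print(average_degree(graph))
```

Storing neighbours in sets is one simple way to keep the graph unweighted, since the number of joint papers between two PhDs is deliberately discarded.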
Analysis of the Results: The academic social networks were comparatively analyzed, as were the patterns of collaboration between areas and the regional distribution of PhDs in Brazil.

Findings and Discussion
In the Lattes Platform, it is possible to register from zero to six areas of interest (or areas of expertise). Of the 156,421 PhDs studied in this paper, 6,511 did not register any area of interest, 103,378 registered only one area of interest, and 46,532 registered more than one area of interest.
Table 3 (appendix) presents the networks' metrics. Each metric is described as follows, along with a discussion of the metric values for each of the analyzed networks.
The number of nodes corresponds to the number of PhDs in each network, and the number of edges corresponds to the number of relationships (coauthorship, advisee-advisor, and collaboration in scientific projects) between PhDs. In these networks, there are several nodes with degree zero, i.e., nodes without any relationship. The column "Nodes with degree greater than zero" presents the number of PhDs who have at least one relationship in the network; only these PhDs (nodes) were used in the calculation of the other metrics in Table 3.
[...] path length was found in the Agricultural Sciences network (4.77); on the other hand, the highest was found in the Humanities network (7.55).
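As a rough illustration of how such metrics can be obtained, the sketch below drops the nodes with degree zero, finds the giant component by breadth-first search, and averages the shortest path lengths within it. Averaging over the giant component is an assumption made here, since the text does not state how disconnected nodes were handled, and the toy graph is hypothetical.

```python
from collections import deque


def connected_components(adjacency):
    """Return the connected components of an undirected graph as a list of node sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def average_path_length(adjacency, component):
    """Mean shortest-path length over all ordered pairs of distinct nodes in a component."""
    total = 0
    pair_count = 0
    for source in component:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
        pair_count += len(distance) - 1
    return total / pair_count if pair_count else 0.0


# Hypothetical toy network: a path A-B-C, a separate pair D-E, and an isolated node F.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": {"E"}, "E": {"D"}, "F": set()}
connected = {node: nbrs for node, nbrs in toy.items() if nbrs}  # degree greater than zero
giant = max(connected_components(connected), key=len)
print(len(connected), sorted(giant), average_path_length(connected, giant))
```

One breadth-first search per node is enough for unweighted graphs, but it costs on the order of nodes times edges, so a network with more than a hundred thousand nodes would probably need sampling or an optimized library.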

Figure 4 presents the graph of the giant component of the whole network (considering just the PhDs that registered only one area of interest). It is possible to note concentrations of nodes for the majority of the knowledge areas. It is also possible to observe the mixing of nodes from different areas (different colors) in some regions of the graph and, with more intensity, at the borders between areas: for example, between Agricultural Sciences (red nodes) and Biological Sciences (cyan nodes), and between Engineering (light green nodes) and Exact and Earth Sciences (orange nodes).
Of the eight areas of interest, Linguistics, Letters and Arts is the most singular. The combination of the lowest percentage of nodes in the giant component with the lowest average degree shows an area in which the PhDs still usually work alone or with few collaborators. This notion of atomization of research is reinforced by the fact that the maximum clique size is just 5, the lowest of all areas of interest. In Figure 4, Linguistics, Letters and Arts appears on the periphery of the scientific community; this is reinforced by the small size of this area of interest in Brazil.
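The maximum clique size mentioned above is the size of the largest group of PhDs in which every pair is directly connected. A minimal sketch of one way to compute it is given below, using the Bron-Kerbosch algorithm with pivoting; this is an illustrative choice, not necessarily the procedure used in the study, and the toy graph is hypothetical.

```python
def max_clique_size(adjacency):
    """Size of the largest clique in an undirected graph (dict: node -> set of neighbours)."""
    best = 0

    def expand(clique_size, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # The current clique cannot be extended, so it is maximal.
            best = max(best, clique_size)
            return
        # Pivot on the node covering the most candidates to prune the branching.
        pivot = max(candidates | excluded, key=lambda n: len(adjacency[n] & candidates))
        for node in list(candidates - adjacency[pivot]):
            expand(clique_size + 1, candidates & adjacency[node], excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(0, set(adjacency), set())
    return best


# Hypothetical toy network: a triangle A-B-C plus D, who collaborates only with C.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(max_clique_size(toy_network))
```

Finding a maximum clique is NP-hard in general, so this exact search is only practical because collaboration networks tend to be sparse; the recursion depth is bounded by the size of the largest clique.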

Limitations
The research presents three main limitations. First, it focuses on a single country, Brazil; this limits its applications and conclusions. Second, as presented, the data extraction was performed in 2013, because the Lattes Platform no longer permits the same kind of extraction. Moreover, this research is part of a broad study that took years to complete. Third, the data collected from the Lattes Platform are entered by professors, researchers and professionals themselves; there is no guarantee that the data are accurate and correct. Yet past studies indicate that the data are, in general, reliable.

Conclusions
This paper analyzed the Brazilian Academic Social Network composed of the PhDs whose professional address is in Brazil.
Nine networks were constructed and analyzed: one corresponding to all the PhDs working in Brazil and one for each of the eight main knowledge areas. It was possible to observe a great heterogeneity in the distribution of PhDs according to their area of interest. This distribution follows the social and economic characteristics of each of the five Brazilian regions.
As expected, there is a greater concentration of PhDs in the capitals of the federal units, as well as in their major cities, with rare exceptions.
The interaction between researchers of different knowledge areas was also observed. There is a strong relationship between researchers from the same area and a smaller, but also important, interaction among researchers from related areas (such as Applied Social Sciences and Humanities, or Engineering and Exact and Earth Sciences). Different interaction profiles were assessed in the networks. Less connected networks were observed in areas such as Linguistics, Letters and Arts, in which publications with few authors are common and each researcher is related, on average, to fewer than three other PhDs. On the other hand, much more connected networks could be found in areas such as Agricultural Sciences, in which each researcher is related, on average, to more than nine other PhDs of the network.
Finally, there are significant differences in the regional distribution of PhDs and in the predominance of areas of interest across the regions and states of Brazil. For example, it is clear that, apart from the capitals and one or another major city, the Northeast Region is devoid of PhDs, a situation that is particularly problematic for the most destitute region of Brazil.
As future work, we intend to study the research subjects extracted from the papers published by PhDs working in Brazil, including the analysis of the temporal evolution of collaborations as well as the main research topics. We also intend to investigate the PhDs who declared that they work in more than one knowledge area, as well as to investigate interdisciplinary collaborations in detail.