Abstract
Introduction. This document provides an overview of an archive composed of four sections.
[1] An introduction (this document) which describes the scope of the project
[2] Yearly folders, from 2002 until 2010, containing the coarse Microsoft Access datasets plus the surveys used to collect information for each year. The word “coarse” does not mean the information in the Microsoft Access dataset was left uncorrected; it was corrected, but some mistakes and inconsistencies remain, such as with data on age or education. Furthermore, the coarse dataset provides disaggregated information for selected topics that appear only as summary statistics in the clean dataset. For example, in the coarse dataset one can find the different illnesses afflicting a person during the past 14 days, whereas in the clean dataset only the total number of illnesses appears.
[3] A letter from the Gran Consejo Tsimane’ authorizing the public use of de-identified data collected in our studies among Tsimane’.
[4] A Microsoft Excel document with the unique identification number for each person in the panel study.
Background. During 2002-2010, a team of international researchers, surveyors, and translators gathered longitudinal (panel) data on the demography, economy, social relations, health, nutritional status, local ecological knowledge, and emotions of about 1400 native Amazonians known as Tsimane’ who lived in thirteen villages near and far from towns in the department of Beni in the Bolivian Amazon. A report titled “Too little, too late” summarizes selected findings from the study and is available to the public at the electronic library of Brandeis University:
https://scholarworks.brandeis.edu/permalink/01BRAND_INST/1bo2f6t/alma9923926194001921
A copy of the clean, merged, and appended Stata (V17) dataset is available to the public at the following two web addresses:
[a] Brandeis University:
https://scholarworks.brandeis.edu/permalink/01BRAND_INST/1bo2f6t/alma9923926193901921
[b] Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan (only available to users affiliated with institutions belonging to ICPSR)
http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/37671/utilization
Chapter 4 of the report “Too little, too late” mentioned above describes the motivation and history of the study, the difference between the coarse and clean datasets, and topics which can be examined only with coarse data.
Aims. The aims of this archive are to:
· Make available in Microsoft Access the coarse de-identified dataset [1] for each of the seven yearly surveys (2004-2010) and [2] one Access dataset based on quarterly surveys done during 2002 and 2003. Together, these two sets of data form one longitudinal dataset of individuals, households, and villages.
· Provide guidance on how to link files within and across years, and
· Make available a Microsoft Excel file with a unique identification number to link individuals across years
The datasets in the archive.
· Eight Microsoft Access datasets with data on a wide range of variables. Except for the Access file for 2002-2003, each Access file contains information for a single year. Within any Access dataset, users will find two types of files:
o Thematic files. The name of a thematic file contains the prefix tbl (e.g., 29_tbl_Demography or tbl_29_Demography). The file name (sometimes in Spanish, sometimes in English) indicates the content of the file. For example, in the Access dataset for one year, the micro file tbl_30_Ventas has all the information on sales for that year. Within each micro file, columns contain information on a variable and the name of the column indicates the content of the variable. For instance, the column heading item in the Sales file would indicate the type of good sold. The exact definition of the variable is in the Word document and, often, in the Access file as well. The rows of a micro file inside the Access dataset represent the entity measured: villages, households, or individuals.
o Look-up files. The name of a look-up file is prefixed with tlk. These are stand-alone files with codes for some topics. For instance, the look-up file “tlk plants” has the numerical code for a plant and its name in Spanish, Tsimane’, or both. The same look-up file might be used by several thematic files within and across years. For example, data on sales and expenditures might each refer to “tlk plants” if people bought (expenditure file) or sold plants (sales file). The look-up tables were built gradually upon earlier versions, with later versions maintaining the codes of earlier versions. If garlic was coded as 402 in the first survey year, it was still coded as 402 in 2010. The look-up tables for the last year of the study (2010) are more comprehensive than the look-up tables for earlier years.
· Spanish surveys are in Microsoft Word. In each year we used at least three separate core surveys to collect data on villages (e.g., prices), households (e.g., area farmed), and individuals (e.g., sales, anthropometrics). In some years we include stand-alone surveys that were part of, say, the individual and household module; these stand-alone surveys have been folded into the Access dataset, and the Word document used in the survey is also included. There is a direct correspondence between modules in the Word version of the survey and the Access dataset. For example, a section of the Word survey might include the heading Agriculture (35_tbl_agriculture), followed by all the questions about agriculture. In the Access dataset, there will be a folder “35_tbl_agriculture” that should mirror the survey and contain the same questions, with the answers to each question put in a different column.
· One Excel file named Deidentified_unique_subject_identification_numbers. This identification number is unique to the person, doesn’t change between years, and is needed to track the changes of a person over the nine years of the study.
Topics to keep in mind. All the data was collected before the advent of electronic devices for gathering and checking survey data in the field in real time. As a result, there are mistakes in the coarse datasets and probably in the clean datasets as well. Documentation becomes simpler and clearer for more recent years. Below are listed some topics to keep in mind as the data is used:
· Trust the wording of the survey question in Word, not Access. We are including Microsoft Word documents of the surveys used each year. There were usually three surveys: one for cities or towns (ciudad), one for communities (comunidad), and one for individuals and households combined (hh_ind). Users of the dataset will note that the wording of a question sometimes differs between the Word document and the Access datasets. When this happens, trust the Word document since this is what surveyors used.
· The convention to develop a unique identification number for villages, households, and individuals in a year. The convention was as follows:
o Village: Each village was assigned a unique identification number of at most three digits. The look-up Access file called “tlk comunidades” is included each year and has the name and numerical code for each village. The look-up file lists more villages than were in the panel study because the additional villages were part of other studies. The name of the variable with a unique village identification number sometimes varies between years but is often called “vid” or “VillCode” and is typically among the first 1-3 variables listed in a village-level survey and micro-level file for villages within the Access database for a year. Although the name of the variable for a village might vary between years, it is the same within a year; for instance, all files with community-level variables for 2008 will have “vid” to identify the village.
o Household: This is a unique identification number for a household in a yearly survey and is composed of up to five digits. The first three digits refer to the community (as described above) and the last two digits are unique to the household. Example: In community 15, the third and 20th households would be coded as household 1503 and household 1520. These codes would appear in any file that contains household-level information (e.g., agriculture, house quality). What was said about inconsistencies in the naming convention of villages also applies to the naming convention of the variable identifying a household. In some years, the household identification number might be called HHId, in other years hhid.
o Individuals in a year: Two digits are added to the unique identification number of a household to arrive at the unique identification number of an individual. From the unique subject identification number in a year one can extract the unique village and unique household identification numbers. The unique identification number of a person in one year can be summarized as a number with up to 7 digits, VVVHHII (V = village, H = household, I = individual). Remove the last two digits of the subject unique identification number and you are left with VVVHH, the unique identification number of a household. Remove the last four digits of the subject identification number (or the last two digits of a unique household identification number) in a year and you are left with the unique village identification number. The unique identification number of an individual irrespective of survey year is discussed later.
o Identifying household heads and gender from the yearly identification number of individuals. The last digit of a subject’s identification number in a year indicates the person’s gender: odd numbers are reserved for males and even numbers for females. Household heads are identified with the last two digits: 31 = male household head and 30 = female household head. This convention allows one to use the subject identification number in one year to identify the gender of a person and who are the household heads; the convention should not be used with the unique fixed subject identification number constant across years for a subject.
· Files for topics measured among entities at one level do not necessarily have the unique identification number of higher-level entities in which lower-level entities were nested. For example, a file on anthropometrics measured at the level of the individual will have the unique identification number for the person but will not necessarily have the household or village identification number for the individual. These last two identification numbers can be generated by extracting the appropriate digits from the identification number of the individual following the rules discussed above.
· Inconsistencies between the unique identification number in the dataset and the computed identification number. Suppose a dataset has a unique identification number for both the individual and the household. What happens if you retrieve the household identification number from the subject identification number and find that the number you created differs from the household identification number included in Access? In these cases, it is probably safer to trust the household or village identification number you generate.
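The yearly numbering convention described above (VVVHHII) can be illustrated with a short Python sketch. This is only one way to apply the rules, not code used by the study team. The function and field names are hypothetical, and the example identification numbers are constructed from the convention rather than taken from the data.

```python
from typing import NamedTuple


class YearlyId(NamedTuple):
    village: int     # VVV: what remains after dropping the last four digits
    household: int   # VVVHH: what remains after dropping the last two digits
    person: int      # II: the last two digits
    sex: str         # "male" if the last digit is odd, "female" if even
    is_head: bool    # True when the last two digits are 31 or 30


def decode_yearly_id(subject_id: int) -> YearlyId:
    """Split a yearly subject identification number (up to 7 digits)."""
    if not 0 < subject_id <= 9_999_999:
        raise ValueError(f"not a valid yearly subject ID: {subject_id}")
    person = subject_id % 100
    return YearlyId(
        village=subject_id // 10_000,
        household=subject_id // 100,
        person=person,
        sex="male" if person % 2 == 1 else "female",
        is_head=person in (30, 31),
    )


def household_matches(subject_id: int, recorded_household_id: int) -> bool:
    """Check a household ID stored in Access against the one computed
    from the subject ID. When they differ, the text above suggests
    trusting the computed value."""
    return decode_yearly_id(subject_id).household == recorded_household_id


# Illustrative ID: village 15, household 03, male household head (31).
print(decode_yearly_id(150331))
```

A call such as household_matches(150331, 1520) returns False, which flags a row whose recorded household number disagrees with the one implied by the subject number. Remember that this decoding applies only to the yearly identification number, not to the fixed identification number in the Excel file.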
· Inconsistency in naming convention of the same thematic file between years. Example: One year, a file in Access might be called “Wealth” and in another year “Riqueza”. Despite the difference in names, the files contain the same type of variables, which you can verify by examining the survey document in Word.
· Data for 2002-2003: Quarterly. For the years 2004-2010, inclusive, data comes packaged discretely by year in a stand-alone file, with one Microsoft Access file for each year. Data for 2002-2003 was part of a five-quarter mini panel study that extended from May 2002 until August 2003. Once each quarter, surveyors collected data from individuals, households, and villages. These were the same entities surveyed during 2004-2010. Without attrition or new entrants, the dataset should contain five observations per individual. The quarterly dataset for 2002-2003 has not been split by year in this Archive. One can use this dataset to examine:
o Quarterly or seasonal changes. The variable for quarter in the Access dataset has mistakes, so it is safer to use the date variable called “Fecha” or “RecordDate” and create the variable for quarter to examine quarter-to-quarter or seasonal changes. None of the 2004-2010 surveys allow analysis of seasonal changes since information was typically collected during the dry season (roughly from May until August/September).
o Yearly changes. This approach would be appropriate if a user wished to append this quarterly dataset to the 2004-2010 yearly datasets. One thing to keep in mind as you reshape the 2002-2003 quarterly data to create a yearly panel for 2002-2010 is the following: Select one observation per subject for 2002 and one observation per subject for 2003. Since a subject was surveyed multiple times in each year, it is best in some cases to select observations from May-August/September of each year because yearly surveys for 2004-2010 took place during the dry season (May-August/September). In this way, one could rule out survey timing as the source of any difference in results between 2002-2003 and later years.
o Time allocation. This is the only time we used scans or spot observations of behavior. All this information is in a file called “Barridos”.
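The quarterly and yearly uses of the 2002-2003 data described above can be sketched in Python as follows. This is a minimal illustration under several assumptions: the records have already been exported from Access into dictionaries, the date field holds a Python date, the subject column is called subject_id (a hypothetical name), quarters are taken to be calendar quarters, and the dry season is taken to be May through September.

```python
from datetime import date

# Assumption: the source gives the dry season only roughly
# (May until August/September); here it is May through September.
DRY_SEASON_MONTHS = range(5, 10)


def calendar_quarter(d: date) -> str:
    """Label a date with its calendar quarter, e.g. '2002Q2'.
    Assumption: calendar quarters; the survey rounds may not align."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


def one_dry_season_obs_per_year(records, id_key="subject_id", date_key="Fecha"):
    """Keep the earliest dry-season record for each subject and year.
    Subject-years with no dry-season record are left out."""
    chosen = {}
    for rec in sorted(records, key=lambda r: r[date_key]):
        d = rec[date_key]
        if d.month not in DRY_SEASON_MONTHS:
            continue
        # setdefault keeps the first (earliest) record seen for the key
        chosen.setdefault((rec[id_key], d.year), rec)
    return list(chosen.values())


# Made-up records for one subject across five survey rounds.
rounds = [
    {"subject_id": 150331, "Fecha": date(2002, 6, 10)},
    {"subject_id": 150331, "Fecha": date(2002, 9, 15)},
    {"subject_id": 150331, "Fecha": date(2002, 12, 1)},
    {"subject_id": 150331, "Fecha": date(2003, 3, 5)},
    {"subject_id": 150331, "Fecha": date(2003, 6, 20)},
]
yearly = one_dry_season_obs_per_year(rounds)
print([(r["Fecha"].year, calendar_quarter(r["Fecha"])) for r in yearly])
```

Keeping the earliest dry-season record is an arbitrary choice; a user might instead prefer the record closest to the typical survey month of later years.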
· Age and education. The variables for age and schooling have a lot of mistakes because many Tsimane’ do not know their age and often guessed or rounded answers when responding. Chapter 5 of “Too little, too late” discusses these issues. In the clean dataset you will find improved versions for these variables. See Appendix C of Chapter 4 for how the improved versions for the variables age and education were created.
· Missing or poorly documented files. There is at least one case of a topic (perceptions of climate change) where information was collected in the survey, but the data never made it to the Access file. There are also topics, like CRP and parasitic infections, where the data appears in Access but the documentation is absent because the data was sometimes generated by laboratories after the survey. For questions about CRP data contact Professor T. McDade or W. Leonard (Department of Anthropology, Northwestern University) and for questions about parasites contact Professor Susan Tanner (Department of Anthropology, University of Georgia).
Linking files in a year and across years.
· In a year. Once the numbering conventions for subject, household, and village identification numbers are in order, it is straightforward to merge files within a year.
· Across years. This procedure is slightly more complicated for a couple of reasons.
o The name and definition of variables might have changed between years. Before appending files across years, rename variables whose names changed and note any differences in meaning where the definition changed. The Word documents of the surveys should help identify and clarify differences in definition and naming for variables between years. Recall also that many variables were only measured in some years, so there will often be missing values for these variables once the files have been appended.
o Unique identification number of subjects. If you are interested in tracking the same individual over time, use the unique subject identification number for a person in the Excel file Deidentified_unique_subject_identification_numbers. Since the identification number of a subject in one year could be different in other years, linking a subject’s identification number in a year (in the coarse datasets) to their unchanging identification number across years will require some programming expertise, but the raw information you will need is in the Excel file.
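The two steps above, harmonizing variable names and attaching the fixed identification number, can be sketched in Python. This is an illustration only. The layout assumed for the Excel file (one row per year with columns year, yearly_id, and unique_id) is a guess and must be checked against the actual file; the column name yearly_id, the variable weight, and all the values shown are made up.

```python
# Assumption: rows as they might look after exporting the Excel file.
crosswalk_rows = [
    {"year": 2004, "yearly_id": 150331, "unique_id": 1001},
    {"year": 2005, "yearly_id": 150431, "unique_id": 1001},
]

# Variable names that differ between years, mapped to one common name.
RENAME = {"HHId": "hhid", "VillCode": "vid"}


def harmonize(record):
    """Return a copy of the record with variable names made consistent."""
    return {RENAME.get(name, name): value for name, value in record.items()}


def build_lookup(rows):
    """Map (year, yearly ID) to the fixed ID that is constant across years."""
    return {(r["year"], r["yearly_id"]): r["unique_id"] for r in rows}


def append_years(yearly_files, lookup, id_column="yearly_id"):
    """Stack records from several years and attach the fixed ID.
    Variables not measured in a year are filled with None."""
    panel = []
    for year, records in sorted(yearly_files.items()):
        for record in records:
            row = harmonize(record)
            row["year"] = year
            # None marks a subject missing from the crosswalk
            row["unique_id"] = lookup.get((year, row[id_column]))
            panel.append(row)
    columns = sorted({name for row in panel for name in row})
    return [{name: row.get(name) for name in columns} for row in panel]


# Made-up yearly files: the household variable is named differently
# in each year, and weight was recorded only in 2004.
yearly_files = {
    2004: [{"yearly_id": 150331, "HHId": 1503, "weight": 60.0}],
    2005: [{"yearly_id": 150431, "hhid": 1504}],
}
panel = append_years(yearly_files, build_lookup(crosswalk_rows))
for row in panel:
    print(row)
```

In this made-up example the same person has a different yearly number in 2004 and 2005 but carries the same unique_id in both rows, which is what allows the person to be tracked across years.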
Complementary datasets archived elsewhere. During and after the longitudinal study, the research team did research with Tsimane’ in other communities and with other indigenous groups. The complementary datasets and where they can be found are shown below:
[a] Randomized control trial on village income inequality. The coarse and clean datasets are at the following address of the electronic library of Brandeis University:
The clean dataset of this RCT was submitted to ICPSR for review and storage in December 2022, but at the time of this writing (March 2022) has not yet been approved and made available to the public.
[b] The effect of roads on well-being among indigenous populations in the Bolivian Amazon. The coarse and clean datasets are available to the public at the following address of the electronic library of Brandeis University:
The clean dataset of this study was submitted to ICPSR for review and storage in March 2022.
Approval. The Gran Consejo Tsimane’, the governing body of Tsimane’, approved making this and other datasets from other studies among Tsimane’ publicly available provided datasets were de-identified. A PDF of the letter of consent from the Gran Consejo Tsimane’ is included with this archive.