Repository for Data 441
The data for Gabon came from the 2000 DHS survey, completed by Direction Générale de la Stat. des Etudes Economiques. The data for the households has 6203 observations, with 1172 columns of data. Some of the data I used were:
Each row of data represents one household, and each row also contained the data about gender, age, and education for each member of the household. The individual data was denoted after the column name for the data category. For example, the age of the first member of a household would be in a column named “hv105_01”, and the second member in a column named “hv105_02”. Each row holds up to 48 members of the same household.
In order to locate each household, the raw household data from DHS was given locations using the rpoint function with the real adm0 population as a way to distribute the households according to the real population density. Then, the survey data was randomly sampled so that it would reflect the population of Gabon. The error of this method was calculated by comparing the number of observations in our sample to the sum of the weights, and this error was calculated to be 0.1857%.
I faced serious computational issues at the adm0 level when attempting to pivot from households to individual persons. The general approach was to pivot gender, age, and education individually first, and then merge them together into one dataframe. The individual pivots were not computationally expensive with about 30,000 observations each. However, merging the dataframes together ultimately took too much memory for my personal computer. I initially tried to circumvent this by using data.tables instead, but again the end result was that the data was too large to work with. I changed my approach by cutting as much redundant data as I could from my dataframes, and was able to pivot to 33,266 individuals.
Locating the households at the adm2 level for Bendjé mostly followed the same process as for Gabon as a whole. This time, the error was calculated to be 0.2019%. The following is a planar point pattern (PPP) generated for Bendjé.
And also a confirmation of the projection below.
The overall accuracy of the synthetically generated population is about the same level of accuracy for both adm0 and adm2 level, and looking at the distribution of size of households for Gabon compared to Bendjé shows that they are pretty similar. There are slight variations, and I think to increase the accuracy I could increase the sample size in the future.
Compared to an entirely random synthetic population, mine more closely approximates reality. Even just using the real population distribution as a function for how to give synthetic household locations will make an enormous difference. The household size distribution also closely resembles the real world data.
I also created heatmaps showing the correlation between size of household, age, gender, and education for the household data. One heatmap is created from raw data, and the other percentizes the data. From the charts, we can see that the percentized has more potential to make predictions.
In order to make predictions about education levels, the data was split 60/40 into training and test data sets. A multinomial regression model and ranger implementation of random forest were trained on the training data.
Comparing the ROC of the multinomial regression and ranger models respectively, we can see that they both had similar results. Looking at the area under the curve (AUC) of both graphs, none of them are very high which indicates that there is not a high degree of separability. Both models are slightly better than flipping a coin in distinguishing between classes, and the models are about equal in their ability to predict the classes.