The purpose of this data visualization is to show the density of a measure over time. The measure is an occurrence of a criminal or non-criminal activity or incident reported by the University of Washington Police Department. The density is the frequency of the measure at a geocoded location over a period of time. What is shown here is, essentially, where officers were called to during the summer of 2013.

This is an example of a CartoDB Torque map. Used alone or together with other maps or information, a map like this can reveal patterns of human activity, possibly for the first time. This map, used in conjunction with a previously completed Tableau viz, is an example of that. The patterns and information easily discernible on these maps without any narrative or explanation would be impossible to discern looking at the same data in tabular format. (The Tableau visualization of this same data is available here on my Tableau Public web site. Feel free to check it out!)

I think it would be interesting to organize the data by day of the week over the period of a year. What are the busiest days of the week, on average, and what types of incidents occur most frequently during different days of the week?

This post started out as a description of the workflow to use in creating your first map on CartoDB. It became too complicated to explain in text alone, and would have been especially unhelpful given the richness of the CartoDB dashboard and their tutorials. Instead, I'll comment on CartoDB and share some observations after having worked with the technical details. I'll also include a map that I created after just a few hours of learning their service.

Suffice it to say that I was so thrilled with my initial exploration that I signed up for the Magellan Plan as soon as my 14-day trial expired. Included below is the first useful map I created. The data used is the billionaires data set I created previously and used in a viz published on Tableau Public.

Achieving this success took about 10 hours of work over four sessions using only the CartoDB dashboard. The first nine hours were spent using a point geometry map as the learning example. I wanted a choropleth all along, and I was able to complete and publish the choropleth in about one additional hour. Here's a summary of the workflow used to create the above choropleth map:

  1. Create your account
  2. Upload your data set
  3. Upload the world borders shape file
  4. Use SQL to join your data set and the world borders shape file
  5. Create the map
  6. Create the visualization
  7. Publish the visualization

When loading your data set into a table on your CartoDB account, the default geometry is a point. For your first map, keep it simple and create a bubble chart using the default point geometry. Only after you understand the workflow should you move on to the polygon geometry and create a choropleth map. For a point map, skip steps 3 and 4.

CartoDB includes a simple world borders shape file with the free trial. The simple borders are fine to kick around with and use in SQL on a free account. But if you want to create beautiful visualizations and you're going to sign up for a plan, go straight to a world borders shape file that contains richer detail. CartoDB might already have this available; I didn't check. I used the world borders shape file that is freely available from the Thematic Mapping website here.

I like to be able to control the initial zoom. The URL provided in the CartoDB embed code does contain a parameter for the initial zoom; the default value is 1. In the above viz, I changed the initial value to 3. More on the API and the parameters available for use in the URL can be found here.

The maps are easy to share.

The billionaires data set I previously created contains information from 1996 through 2013. I tried creating a Torque visualization with this data set; it didn't work well and I abandoned the effort. A feature CartoDB should add, once they're done blowing minds with their implementation of Torque visualizations, is a standard motion chart.

It takes several hours to gain an understanding of the dashboard-based workflow. Be prepared to create and delete tables…and be prepared to lose your SQL along the way, until you learn how not to lose it!

Wrap

The first takeaway is that a service for creating beautiful maps is available now. The free trial is limiting and is really just a taster; you'll quickly outgrow it if you are at all interested in seriously beautiful maps. To their credit, and based on my first-hand experience, the service is good and they are open to comments and suggestions for features. The tutorials are helpful, the code is open source and available on Github, and plan pricing is fair.

Second, if you want to be more creative or solve problems not addressed by the CartoDB dashboard, you will want to know some JavaScript or be willing to learn it. There are dashboard components where the JavaScript or CSS can be edited through the dashboard interface. Otherwise, you will want to complete your visualization as best you can in the dashboard and then go to the editor to manually edit your JavaScript or CSS.

Introduction

Theory informs us that tides are complex, influenced by the geography of the undersea surface at the measuring station, the gravitational forces of the Sun and Moon, and the oceanic basin in which a station is located.[1], [2], [3] This analysis attempted to determine whether there is a season in which the tidal range for Northern Hemisphere observing stations is greatest. The measure, tidal range, was defined as the difference in height between consecutive high and low waters.[4] The questions I sought to answer:

  1. Is the tidal range in the Northern Hemisphere greatest during a solstice or an equinox?[5] If so, which one?
  2. If a difference was detected, does the season-to-season change fit a linear or non-linear pattern?

Theory says that the distance of a station from the Equator is just one factor in predicting a tide, and hence the tidal range. As a result, I did not compare stations directly to one another; doing so would have required a sufficiently large and random set of stations and was beyond the scope of this analysis. Instead, I aggregated water level observations to form a mean for a group of stations situated at a similar latitude, or region, and analyzed that mean across seasons.

It might not actually be the seasons that explain variation in tidal range, but rather the tilt of the Earth's axis; I used season here as a proxy for the degree of tilt.

Materials and methods

I performed secondary data analysis on water level data collected and curated by the National Oceanic and Atmospheric Administration (NOAA).[6] Because tide data is carefully collected and curated, I did not expect variance in the data due to collection or curation methodology. The datum I used was MLLW (mean lower low water), with observations taken at six-minute intervals.

Data for the 14 days before and 14 days after each solstice and equinox was obtained. Each of the three data sets contained 29 days of data for each of the approximately 15 stations. The 29-day window was chosen to approximately match one complete lunar cycle for each season.
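As an illustration of how each 29-day window might be retrieved, the sketch below pulls six-minute MLLW water levels for one station from the NOAA CO-OPS data API. The endpoint, parameter names, and the example station ID and dates are my assumptions based on NOAA's API documentation, not part of the original workflow; verify them against the current documentation before use.

# Retrieve 29 days of six-minute MLLW water levels for one station from the
# NOAA CO-OPS data API. Endpoint and parameter names are assumptions based
# on the NOAA documentation; verify before use.
getWaterLevel <- function(stationId, beginDate, endDate) {
    url <- paste0(
        "https://tidesandcurrents.noaa.gov/api/datagetter",
        "?product=water_level",
        "&datum=MLLW",
        "&units=metric",
        "&time_zone=gmt",
        "&format=csv",
        "&station=", stationId,
        "&begin_date=", beginDate,  # yyyymmdd
        "&end_date=", endDate       # yyyymmdd
    )
    read.csv(url, stringsAsFactors = FALSE)
}

# Hypothetical example: a 29-day window centered on the 2013 summer solstice.
# seattle <- getWaterLevel("9447130", "20130607", "20130705")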

The geodata for each station was converted from degrees, minutes, and seconds (DMS) to decimal degrees using a freely available service.[7] Conversion of this data is not algorithmically difficult and I assumed no variance was introduced into the data as a result of errors in the conversion.
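The conversion itself is simple arithmetic. A minimal sketch in R, written for this write-up rather than taken from the service I used:

# Convert a coordinate from degrees, minutes, and seconds to decimal degrees.
# Pass southern latitudes and western longitudes with negative degrees.
# (A coordinate with exactly zero degrees would need its sign handled separately.)
dmsToDecimal <- function(degrees, minutes, seconds) {
    sign(degrees) * (abs(degrees) + minutes / 60 + seconds / 3600)
}

# Example: 47 degrees 36 minutes 35 seconds North is approximately 47.6097.
# dmsToDecimal(47, 36, 35)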

The map figure shows the locations and names of the stations selected for use in the analysis. The meaningful classification was by latitude. No regard was given to longitude; the hemispheric division in the figure was an artifact of the R technique used for coding locations.

My null hypothesis was that the tidal range for a region was the same during each of the seasons. Since I did not have a prior expectation that tidal range varies from season to season, this was the appropriate statement. I formed a null hypothesis and then tested the data. At the end of testing, I was able to show whether or not the station data provided evidence against the null hypothesis.

The statement of null hypothesis and testing procedure was as follows:

  • H_0: \mu_1 = \mu_2 = \mu_3
  • H_a: at least one \mu_i differs from the others
  • Fit the data to a linear model and perform a one-way ANOVA to obtain test statistics
  • If significance was found, proceed to identify the groups that differ and attempt to explain why they differ

The respective values of \mu in each test were the mean of the difference between high and low tides for one 29-day period, for one region, for one season. Season was the factor (categorical predictor) used in the ANOVA.
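As a minimal sketch of this procedure in R, assuming a data frame with one row per station per season holding the mean tidal range (the object and column names below are illustrative, not the actual code used):

# One-way ANOVA of mean tidal range on season, run separately for each region.
# tidalRanges is assumed to have columns region, season, and meanRange.
runSeasonAnova <- function(regionData) {
    fit <- lm(meanRange ~ season, data = regionData)
    anova(fit)  # F statistic and p-value for the season factor
}

# results <- lapply(split(tidalRanges, tidalRanges$region), runSeasonAnova)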

An assumption I made was that water levels for the autumnal equinox and vernal equinox are the same. As a result, I acquired and used data only for the vernal equinox.

Map of tidal range station locations

Station locations were carefully selected. Where I had personal knowledge of a location, I made a judgment on whether or not to include the station. Where I did not have personal experience with the location of a station, I relied on Google Maps to identify physical features that might introduce variance into the observations. For example, stations near the Columbia Bar, near the entrance to San Francisco Bay, or at the head of a fjord in Alaska were disregarded as candidate sites. The ideal site was one with unimpeded ocean water flow. For this reason, the Harvest Oil Platform, situated off the coast of Lompoc, CA, was considered ideal.

Counting a pairing of tides means that I started with the first high tide in the data for the station. It was possible that the acquired data did not include the same count of high and low tides. I took this into account by avoiding strict pairing of tides and instead calculating the range as the difference between the mean high water and the mean low water. (See the Appendix for acknowledgement of an error in this approach.)
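In code, the range for one station and one 29-day window reduces to a difference of means over the output of the buildWaterLevels() function listed in the Appendix; stationWaterLevel below stands in for that station's vector of six-minute readings.

# Tidal range for one station and one 29-day window: mean high water minus
# mean low water, rather than differencing individual pairs of tides.
tides <- buildWaterLevels(stationWaterLevel)
tidalRange <- mean(tides$highWater) - mean(tides$lowWater)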

Results

I began the review of my results by examining the data. The data shown in this table was aggregated from approximately 105,000 observations. The size of the resulting data set used in the linear models was only 39 records, 15 records for the low- and mid-latitude regions and 13 for the high latitude region. This table of descriptive statistics shows the data ordered by region and season:

To aid the reader in gaining some familiarity with the data, the columns can be interactively sorted by mouse clicks on the column header.

Interesting patterns are revealed when the data are plotted in a boxplot of season by region. I can see quite clearly that in this data set, the median tidal range increases as regions move from low latitude to high latitude. The interquartile range (IQR) of the high-latitude tidal range was far wider than that of either of the other two regions. Very clearly, there was far more variation in the water level readings obtained for the stations in Alaska. I discuss these characteristics of this data set further in the Discussion section.

Tidal range by season and region

The test statistics from each ANOVA are as follows:

  • Region low latitude: F = 0.1204, with a p-value = 0.89.
  • Region mid latitude: F = 0.6631, with a p-value = 0.53.
  • Region high latitude: F = 0.0058, with a p-value = 0.99.

With one exception, the diagnostic plots (residuals against fitted values and a normal Q-Q plot of the residuals) appear consistent with the assumptions of normality and similar variance across seasons. The exception was the high-latitude region, which contains data from stations in Alaska.
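These diagnostics are the standard residual plots; a minimal sketch, assuming fit is the fitted lm object for one region:

# Residuals against fitted values, and a normal Q-Q plot of the residuals.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted")
abline(h = 0, lty = 2)

qqnorm(residuals(fit))
qqline(residuals(fit))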

It's possible that the non-normal distribution for the high-latitude region was explained in part by the stations I selected. The stations include Prudhoe Bay and Nome, AK, sites known to have frozen ocean water for several months of the year. I do not know what impact sea ice has on a station's ability to record the water level. Further, the three remaining sites for the high-latitude region are located along the eastern shore of the Gulf of Alaska, an area known to have a large tidal range.

Summarizing my results, the p-values for each of the three regions are very high, indicating that differences of the observed size would not be unusual if season had no effect. I therefore fail to reject the null hypothesis and conclude that there was not enough evidence to suggest a relationship between season and tidal range.

Discussion

Looking closely at the number of observations (column n) in the table of data at the beginning of the Results section, six stations have a lower than expected number of tidal range pairings. I discuss this in the Appendix.

Notwithstanding, the approach was valid and the results show no evidence of an effect; I am not able to conclude that season has an effect on tidal range.

The boxplot introduced in the Results section above shows an apparent pattern between latitude and tidal range. The pattern is shown here:

Tidal range by season

Visually inspecting this chart provides further confidence in my result that season does not explain the variance in tidal range. But the primary reason for sharing this chart is that there appears to be a pattern of the tidal range increasing as the distance from the Equator increases, i.e., distance from the Equator would appear to explain a significant portion of tidal range. However, I advise caution with this line of reasoning, as there is not necessarily a correlation between latitude and tidal range. In the data selected for the high-latitude region here, I know that three of the stations have a particularly large tidal range compared to the readings obtained at an average station.

An interesting next step in investigating this data would be to analyze the distribution of the times between tidal events. Natural phenomena often occur according to distributions that are well understood and can be modelled. I would like to analyze this data set to see whether the distribution of tidal events over time follows such a distribution.

The ANOVA tables should be included in the next round of edits to this report.

The two factors could have been handled with a single two-way ANOVA including a region × season interaction, rather than running a one-way ANOVA three times.
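For reference, a sketch of what that single model might look like, using the same illustrative object and column names as above:

# Two-way ANOVA with a region x season interaction, fit to the full data set
# in place of three separate one-way ANOVAs.
fitTwoWay <- lm(meanRange ~ region * season, data = tidalRanges)
anova(fitTwoWay)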

Sources and references

  1. Laplace theory of tides: http://en.wikipedia.org/wiki/Theory_of_tides
  2. Tidal nodes: http://en.wikipedia.org/wiki/Amphidromic
  3. Discussion of earth tides: http://en.wikipedia.org/wiki/Earth_tide
  4. See range of tide, National Oceanic and Atmospheric Administration (NOAA) glossary: http://tidesandcurrents.noaa.gov/glossary.html#R
  5. Discussion on tidal datum: http://tidesandcurrents.noaa.gov/publications/tidal_datums_and_their_applications.pdf
  6. Source data, retrieved by station location: http://tidesandcurrents.noaa.gov/
  7. Conversion of DMS to decimal for station locations: http://andrew.hedges.name/experiments/convert_lat_long/

Appendix

The custom code written for this analysis is included below. As discussed in the Discussion section, further work is necessary to control for fluctuations in water level near maxima and minima.

The approach taken in this analysis was to determine the local maxima and minima by comparing the mean of x future data points in the series to the current observation, then skipping y observations to avoid falsely identifying the next maximum or minimum. The ideal values for the two parameters x and y need further evaluation, and region perhaps needs to be taken into account.

# Build a vector of high water levels and a vector of low water levels. One value
# in each vector represents a high or low tide, respectively.
# Parameters: Vector of water levels.
# Preconditions: Vector passed in is numeric. No validation is performed.
# Post conditions: None.
# Returns: List containing the vector of high water levels and the vector of
# low water levels.
buildWaterLevels <- function(waterLevel) {
    lowWater <- rep(0, 60)
    highWater <- rep(0, 60)
    i <- 1  # index for high water
    j <- 1  # index for low water
    k <- firstLowTide(head(waterLevel, 500))  # index for water level
    lowWater[j] <- waterLevel[k]
    j <- j + 1
    lookForHighTide <- TRUE  # TRUE when searching for a high tide, FALSE for a low tide
    moreObs <- TRUE
    while (moreObs) {
        if (lookForHighTide) {
            if (waterLevel[k + 1] >= waterLevel[k]) {
                k <- k + 20  # skip ahead to move past fluctuations that might produce a false high/low tide
                if (k < length(waterLevel) && waterLevel[k + 1] < waterLevel[k]) {
                    highWater[i] <- waterLevel[k]
                    i <- i + 1
                    lookForHighTide <- FALSE
                }
            } else {
                k <- k + 1  # advance past a downward fluctuation so the loop cannot stall
            }
        } else {
            if (waterLevel[k + 1] <= waterLevel[k]) {
                k <- k + 20
                if (k < length(waterLevel) && waterLevel[k + 1] > waterLevel[k]) {
                    lowWater[j] <- waterLevel[k]
                    j <- j + 1
                    lookForHighTide <- TRUE
                }
            } else {
                k <- k + 1  # advance past an upward fluctuation so the loop cannot stall
            }
        }
        if (k >= length(waterLevel)) {  # >= so waterLevel[k + 1] is never read past the end
            moreObs <- FALSE
        }
    }
    # Trim both vectors to the same length, dropping the unused zero entries.
    n <- min(i, j) - 1
    highWater <- highWater[seq_len(n)]
    lowWater <- lowWater[seq_len(n)]
    return(list(highWater = highWater, lowWater = lowWater))
}

# Identify the first low tide in a set of water level data.
# Parameters: Vector of water levels.
# Preconditions: Vector passed in is numeric. No validation is performed.
# Post conditions: None.
# Returns: Index value of the first low tide.
firstLowTide <- function(waterLevel) {
    k <- 1
    higher <- TRUE
    # Walk forward while the water level is falling or flat; stop at the
    # first rise, which marks the first low tide.
    while (higher && k < length(waterLevel)) {
        if (waterLevel[k + 1] <= waterLevel[k]) {
            k <- k + 1
        } else {
            higher <- FALSE
        }
    }
    return(k)
}

All the code will eventually be made freely available in a repo on Github under user sculpturearts.

When working with maps, it's a common task to fill a map based on some classification and to use color to associate countries in the same classification. A code table that includes both the ISO 3166 codes and the United Nations codes is not available…so I made one. The World Health Organization (WHO) regions weren't part of my original plan, but I ended up including them, too.

The table is freely available for download here.

This visualization is an example of the usefulness of the combined table. The source data was in ISO 3166 but we wanted to emphasize the United Nations region category.

How to use the table

The data is saved in tab-delimited format. I recommend first importing the table into Excel and exploring it. A feature you will see right away if you look at Afghanistan is that the country is listed twice. Scrolling to the right, you'll see that the United Nations designates Afghanistan as a country in both the "Least developed countries" and the "Landlocked developing countries" categories. You can either treat the key as one of the alpha-2, alpha-3, or num-3 codes and remove the record containing the U.N. category you don't want to use, or keep both records and retain the flexibility to display Afghanistan as both a landlocked developing country and a least developed country. Using count distinct on a key field elsewhere in your application will still provide a correct count of the countries you have codes for.
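As an example of working with the duplicates in R, the sketch below loads the table and counts distinct countries. The file name and the category value used for filtering are placeholders, so check the table itself for the exact strings.

# Load the tab-delimited code table. The file name is a placeholder.
codes <- read.delim("iso-un-who-codes.txt", stringsAsFactors = FALSE)

# Countries such as Afghanistan appear on more than one row because they fall
# into more than one U.N. category, so count distinct keys rather than rows.
length(unique(codes$isoAlphaThree))

# To keep one row per country, filter to the U.N. category you need. The
# column and value below are illustrative; inspect the table to confirm them.
# ldc <- codes[codes$unSelectName == "Least developed countries", ]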

Most of the code values and descriptions will be familiar and uncontroversial. Other information is not established as a convention. For example, the U.N. acknowledges that “there is no established convention for the designation of developed and developing countries or areas.” Refer to the footnotes at the U.N. source link for further information.

Missing value indicator: Values that are not defined by the sponsor, or not applicable, are indicated with a code value of --, two dashes.

The oceans are included. There is no international convention for ISO codes for oceans. The ISO 3166 standard allows for the implementation of X* series codes to meet user requirements.

A code is included for high altitude.

For the oceans and high altitude, only the three fields isoCountryName, shortEnglishName, and isoAlphaTwo are implemented in the code table. All remaining values are coded with the missing value indicator.

Both the ocean codes and the high altitude code can be safely deleted without impacting any of the country designations.

Column descriptions

Name Type Description
isoCountryName char(44) The ISO country name.
shortEnglishName char(44) A friendly, short name in English. ISO provides a friendly English name. Some were still a bit awkward and I took liberties to make these even more friendly. I used best judgment, but you might want to make some adjustments depending on your application.
isoAlphaTwo char(2) ISO alpha two.
isoAlphaThree char(3) ISO alpha three.
isoNumThree char(3) I’ve seen this information implemented as an integer, e.g., Afghanistan’s ISO num three is 4 instead of 004. I followed the ISO convention. Implement as you need to for your application.
whoRegion char(4) Not all nations are WHO member states. As a result, not all nations in this list are associated with a WHO region. Countries not coded by the sponsor are coded here with the missing value indicator.
unWorldThree char(4) The UN world region value. All countries have the value 001.
unRegionThree char(3) The UN num three code value. As above with the ISO num 3, these code values are implemented here as released by the sponsor, with the 0 fill character.
unRegionName char(8) Name of the UN region.
unRegionSubregion char(30) The UN subregion code value.
unRegionSubregionName char(25) The name of the subregion. Countries not coded by the sponsor are coded here with the missing value indicator.
unSelectThree char(3) One of three code values. Countries not coded by the sponsor are coded here with the missing value indicator.
unSelectName char(31) Description of the select categorization.
unDevelopedThree char(3) UN code value for the developed region category. Countries not coded by the sponsor are coded here with the missing value indicator.
unDevelopedName char(32) Developed type. Countries not coded by the sponsor are coded here with the missing value indicator.
unDevelopedRegionName char(64) Developed regions name. Review the source information at the UN site to understand how this categorization is implemented. This name should be edited to make sense in the context of your application. Countries not coded by the sponsor are coded here with the missing value indicator.

Sources

ISO Online Browsing Platform: https://www.iso.org/obp/ui/#home
United Nations Statistics Division, Standard country and area codes classification: http://unstats.un.org/unsd/methods/m49/m49regin.htm
WHO regions: http://www.who.int/about/regions/en/index.html

An extension of this table would include adding one or more of the other WHO region groupings.

I will try to keep the code table current.

Contact

You are welcome to submit a suggestion or bug report as a comment.

Here’s a Tableau visualization I prepared using publicly available data of incident log information for the police department at the University of Washington.

The next steps involve refining the classification of the Department’s 50 different call types and understanding whether additional call types are encountered during particularly busy periods, such as the regular academic quarters or football games.