Archive for the ‘expression frameworks’ Category

Regular Expressions: Patterns & Rules

Tuesday, January 11th, 2011

Image from AJ Brustein, used under CC license

Over the past week I’ve become acquainted with Regular Expressions, also known as regex by those familiar with this powerful tool. I have only begun to scratch the surface but I am already blown away by the power of this text parsing engine. Here I will provide a brief tutorial on how to use this tool with Processing, and share some useful resources for anyone interested in learning more about it.

What is Regex?
According to Wikipedia’s definition, regex is a formal language that “provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters… A regular expression can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.”

In other words, regex provides a set of rules that enables users to search for and extract pieces of text from a larger text source (be it a file or a data variable). I used regex to extract specific bits of information – such as dates, titles, and body copy – from over 300 posts to one of my blogs.

Regex has been incorporated into many different computer languages. PHP, Ruby, C++, Python, and Java all offer regex in slightly different flavors. For my project I used the regex functionality that is available in Processing.

Learning Regex
Now I’ll dive into a few technical examples to show you how it works. Here is a link to a good quick start guide that helped me get started; the information I cover below all comes from this site. The way regex works is that you define a search pattern that is used by the regex engine as a query. Here is an overview of the most common elements of a regex pattern (a short Processing sketch after the list shows several of them in action):

  • Literal Characters: these are the simplest type of regex pattern; they provide a straightforward match. For example, the pattern “cat” will match any instance of “cat”, whether standalone or in words such as “catalog” and “catastrophe”.
  • Character Classes: these patterns will match only one out of several characters. Character classes are identified by square brackets. For example, the pattern “[bc]at” will match any instance of “cat” or “bat”, whether standalone or in words such as “battery” and “catalog”.
  • Shorthand Character Classes: these characters will match one character of a specified type, such as digits “\d”, word characters “\w”, or whitespace characters “\s”. Using a capital letter, such as “\D”, creates a negated match that matches any character that is not of the specified type.
  • Non-Printable Characters: these characters enable you to add non-printable characters to your regex pattern. For example, “\n” specifies a line feed while “\r” identifies a carriage return. Regex also supports the syntax “\xFF” to specify an ASCII character using a hexadecimal number.
  • The Caret: creates negated matches when used within square brackets. Similar to the way that capital letters function for the shorthand character classes, the caret allows the specification of characters that should not be matched. For example, “[^b]at” will match sequences that contain the characters “at” preceded by any character other than “b”.
  • The Dot: matches any character except line breaks. For example, the pattern “.at” will match any instance of “at” that is preceded by a non-line-break character. It is important to note that this pattern will not match “at” all by itself. The dot is very powerful and can easily cause unwanted matches, so use it sparingly.
  • Anchors: these characters specify a location within a text source. The “^” specifies the beginning of a text source or line of text (depending on the regex mode), while “$” specifies the end of a text source or line of text. To match positions at the beginning or end of words, use the word boundary “\b” instead.
  • Alternations: the “|” character provides the equivalent of a logical “or”. It is used to specify multiple alternative patterns for a match. For example, “bat|cat” will match either “bat” or “cat”.
  • Optional: enables you to specify optional characters for a given match. For example, “c?at” has an optional “c” so it will match the pattern “at” or “cat” whether standalone or within a bigger word.
  • Repetition: the “*” and “+” characters can be used to enable a pattern to be matched repeatedly. They differ in that “*” matches the preceding element zero or more times, while “+” matches it one or more times. For example, the pattern “ca*t” will match “ct”, “cat”, or “caat”, while the pattern “ca+t” will match “cat” or “caat” but not “ct”.
  • Greedy and Lazy Patterns: the difference between greedy and lazy patterns is that greedy patterns (which are the standard) will expand the match as much as possible, while lazy patterns will keep the match as short as possible. Adding the “?” character after a quantifier makes the match lazy. For example, for the string “cathat” the pattern “c[a-z]+t” will match “cathat”, while the pattern “c[a-z]+?t” will only match “cat”.
  • Groups: by using “()” we can specify groups. This is useful because we can apply quantifiers to these groups, and many regex-based functions are able to return the values of these groups separately (this was very useful for my sketch).
  • Lookarounds: these patterns are similar to anchors in that they enable you to find a position that is defined by a pattern without including that pattern in the overall match. For example, the lookbehind “(?<=c)at” will match the “at” in “cat” but it will not match a standalone “at”, because it requires a preceding “c” without consuming it.
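
To make these rules concrete, here is the minimal Processing sketch mentioned above. It runs a few of the patterns from the list against sample strings using Processing’s built-in match() function, which returns null when a pattern is not found; the sample strings are just placeholders for illustration.

void setup() {
  testPattern("catalog", "cat");       // literal characters
  testPattern("battery", "[bc]at");    // character class: matches "bat"
  testPattern("hat", "[^b]at");        // negated class: matches "hat"
  testPattern("ct", "ca*t");           // "*" allows zero "a"s: matches "ct"
  testPattern("ct", "ca+t");           // "+" requires at least one "a": no match
  testPattern("cathat", "c[a-z]+t");   // greedy: matches all of "cathat"
  testPattern("cathat", "c[a-z]+?t");  // lazy: matches only "cat"
}

// prints the first match of a regex pattern in a string, or "no match"
void testPattern(String text, String regex) {
  String[] m = match(text, regex);
  println(regex + " on \"" + text + "\": " + (m == null ? "no match" : "\"" + m[0] + "\""));
}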

Here is a link to Regex Pal, a great tool for testing out regex patterns.

Regex and Processing
Regex can be used with several functions within Processing. The two functions that I used to extract and clean text from the web page were:

matchAll(text, “regex”): this Processing function takes a source string and a regex pattern, and returns a two-dimensional array with the matching patterns and sub-patterns (captured groups). For example, if applied to the text “cat cat hat” the pattern “[ch](at)” would return the following array: [0][0] “cat” and [0][1] “at”, [1][0] “cat” and [1][1] “at”, [2][0] “hat” and [2][1] “at”.
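
Here is a minimal sketch of that exact example; the printed indices mirror the array layout described above.

void setup() {
  String[][] matches = matchAll("cat cat hat", "[ch](at)");
  // matchAll() returns null if the pattern is not found at all
  if (matches != null) {
    for (int i = 0; i < matches.length; i++) {
      // matches[i][0] holds the full match, matches[i][1] the captured group
      println("[" + i + "][0] " + matches[i][0] + "  [" + i + "][1] " + matches[i][1]);
    }
  }
}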

string.replaceAll(“regex”, “new text”): this Java String method takes a regex pattern as a first argument, and replaces all instances of this pattern with the new text provided as a second argument. (Note that the similarly named replace() method treats its first argument as literal text, not as a regex.)
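
And a quick sketch of replaceAll(), using the same toy strings:

void setup() {
  String text = "cat bat hat";
  // the first argument is a regex, so "[cb]at" matches both "cat" and "bat"
  String cleaned = text.replaceAll("[cb]at", "hat");
  println(cleaned);  // prints "hat hat hat"
}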



Patterns in Language

Friday, October 29th, 2010

Treemap of My Language from Last Week

For the period of one week I used this keylogging tool to capture all of the language I used while on my laptop. This investigation was an attempt to gain insights regarding my thoughts and moods – it was inspired by our focus on language in the class Rest of You. Once the data had been captured, I cleaned up the resulting text file to remove some of the most common words that I knew would skew the analysis (e.g. “the”, “that”, “this”).

As luck would have it, during this same week we reviewed a treemap visualization library in Expression Frameworks. This treemapping library for Processing was created by Martin Wattenberg and Ben Bederson and adapted by Ben Fry. It provides a simple (ok, maybe not so simple) framework for developing treemap visualizations from hierarchical data. In the case of my data, the hierarchy was created by the frequency of each word.
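
The preprocessing behind that hierarchy is straightforward to sketch. Here is a minimal Processing version that loads a text file, skips common words, and tallies frequencies with a HashMap; the file name and stop-word list are placeholders, and wiring the resulting counts into the treemap library is left out.

import java.util.HashMap;

// hypothetical input file and stop-word list, for illustration only
String[] stopWords = { "the", "that", "this", "and", "a", "of" };
HashMap<String, Integer> counts = new HashMap<String, Integer>();

void setup() {
  String raw = join(loadStrings("keylog.txt"), " ").toLowerCase();
  // break the text into words, dropping punctuation and whitespace
  String[] words = splitTokens(raw, " .,;:!?\"'()\t\n");
  for (String word : words) {
    if (isStopWord(word)) continue;
    Integer c = counts.get(word);
    counts.put(word, c == null ? 1 : c + 1);  // tally each word's frequency
  }
  println(counts.size() + " distinct words counted");
}

boolean isStopWord(String word) {
  for (String s : stopWords) {
    if (word.equals(s)) return true;
  }
  return false;
}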

After some time struggling to understand how the library worked, I was able to generate some visualizations that looked interesting. I am still a long way from truly understanding the nuances available in this library, but I have succeeded in tweaking it to add dynamic coloring and to limit the words that are displayed. Here is a link to my code on github.

From this treemap, you can tell that last week I was working a lot on stuff related to my CSA, data and sensors, heart rates, and time. Two interesting directions that I plan to explore in future keylogging efforts include looking at the moods of my words, as was done by Patricia Adler and Patrick Hebron, and exploring the process of writing on a computer, as Josh Clayton did in a brilliant manner.


Designing the Structure of Data Objects

Sunday, October 24th, 2010

The more programming I do, the more I understand the importance of the structure (or information architecture) defined in my code [not to mention, the nerdier the titles of my posts begin to sound]. I have never doubted that this was true, especially after working for many years doing marketing for technology companies such as IBM, Microsoft, and HP. That said, I have only recently gotten a first-hand understanding of how structure can limit or liberate the potential of my sketches (and also drive me crazy).

Recently, while working on multiple data visualization projects, I’ve had to start thinking about the structure for capturing, managing, and analyzing data. I’m not talking here about database structures (those are much more complex); I am referring to the structure of objects that hold data. There are two main approaches that I have been using in my sketches, each with its own benefits and drawbacks.

Let’s begin by envisioning a data table. Each column in this table holds a data field, while each row holds related entries in each field. Now let’s add one additional bit of complexity: most data visualization projects require that we read, manage, and integrate data from multiple different sources.

So here are the approaches that I’ve used so far (I’m sure there are many others that I haven’t explored): (1) entry-based objects that hold all data related to a given entry; (2) field-based objects that hold all entries related to a given field.

So which approach is better? Neither; it all depends on what you are trying to do. The first approach is often more effective for complex data sets that include different types of information, such as integers, floats, strings, and links. A benefit of this approach is that it enables the integration of data from multiple sources into the same object. The downside is that you need to know the structure of the data in advance.
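
Here is a bare-bones sketch of the first approach; the field names are hypothetical, loosely modeled on my sensor projects.

// Approach 1: one object per entry (row); the fields are fixed in advance.
// These field names are hypothetical.
class SensorEntry {
  long timestamp;   // when the readings were taken
  float gsr;        // galvanic skin response reading
  int lightLevel;   // ambient light reading
  String note;      // activity note for this entry

  SensorEntry(long timestamp, float gsr, int lightLevel, String note) {
    this.timestamp = timestamp;
    this.gsr = gsr;
    this.lightLevel = lightLevel;
    this.note = note;
  }
}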

The second approach provides a bit more flexibility because you can create a generic “field” data type that can be used to read an undefined number of fields (as long as they have the same data type). If you want to create one application that reads different sets of data with similar data types, this approach may be best.

Another benefit of this approach is that it makes it easier to compare multiple different values within a given field, since they are all contained within the same object.
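
And a bare-bones sketch of the second approach: one generic object per field (column) that can be reused for any number of fields, as long as they share a data type (floats, in this hypothetical version).

// Approach 2: one object per field (column); any number of fields can share
// this class as long as their entries have the same data type.
class DataField {
  String name;      // the field's label
  float[] values;   // every entry in this column

  DataField(String name, float[] values) {
    this.name = name;
    this.values = values;
  }

  // comparing entries within a field is easy: they all live in one object
  float maxValue() {
    return max(values);
  }
}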


Developing a Framework for Visualizing Biometric Data Across Time

Wednesday, October 20th, 2010

I know the title of this journal is as long as it is boring – I am not feeling too creative today. Nonetheless, I am quite happy that I have recently made progress on a bunch of data viz projects that I have been playing with during the past couple of weeks. Let’s take it one step at a time, so today I will focus my efforts on sharing a video of a simple visualization that I created to map data from a data collection session I did a few weeks back.

Here is a link to one of my previous journal entries where I briefly talked about this endeavor.

My efforts during this session focused on tracking my galvanic skin response, the light level in the room where I sat, and the proximity of others. I also took notes regarding my activities during this two-hour period. Unfortunately, I have not yet been able to integrate this information into my visualization. As you can see, my visualization is essentially composed of three bar charts coupled with a timeline and a navigation bar running along the top of the screen.

Over the coming weeks I will integrate information about my activities in the area between the navigation bar and the timeline. This information will contain brief descriptions and images of the content I was consuming (and the people with whom I was interacting). This part of the project is very important because over the next several months I hope to keep detailed track of my biometric data and activities, with the goal of creating a personal mood map.

The ultimate goal of this project is to help me lead a more fulfilling life by making me more aware of the things that make me happy, sad, annoyed, elated, frustrated, and so on. At this point I am trying to put together some of the scaffolding to make this a reality. I have made good progress on the biometric circuits and I am continuing to improve the data visualization elements of this project. However, I still need to work on creating a process that will enable me to effectively capture data regarding my activities and moods.

Here is a link to the code for this visualization (it is available on GitHub). You can download the code directly from this link: https://julioterra@github.com/julioterra/Time_Bio_Mapping.git – Stay tuned, more on all of these threads will be posted soon…


Graphing Data, Identifying Patterns

Saturday, October 9th, 2010

Over the past two weeks I have been playing around with visualizing the data collected with my galvanic skin response, light and proximity sensors. This process has been much harder than I expected but well worth the insights that have been surfacing.

The difficulties that I encountered began with problems associated with reading in the data files, which had almost 1 million entries. That is right, I had captured almost 1 million readings from each of my sensors. In order to deal with this volume I decided to average the readings, and I have been forced to use an input buffer to read in the data (which I have not yet fully implemented); a rough sketch of this approach follows below.
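
Here is what I mean by the buffered approach, assuming one numeric reading per line (the file name and window size are placeholders): rather than pulling every entry into memory at once with loadStrings(), createReader() reads the file line by line and each block of readings is collapsed into a single average.

import java.util.ArrayList;

ArrayList<Float> averages = new ArrayList<Float>();
int windowSize = 1000;  // collapse every 1,000 readings into one average

void setup() {
  BufferedReader reader = createReader("sensor_data.txt");  // hypothetical file
  float sum = 0;
  int count = 0;
  try {
    String line;
    while ((line = reader.readLine()) != null) {
      sum += float(trim(line));  // one reading per line (assumed format)
      count++;
      if (count == windowSize) {
        averages.add(sum / windowSize);
        sum = 0;
        count = 0;
      }
    }
    reader.close();
  } catch (IOException e) {
    e.printStackTrace();
  }
  // note: any leftover partial window at the end of the file is ignored here
  println(averages.size() + " averaged readings loaded");
}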

Once I was able to work around the problems associated with reading in the data (which are not yet fully resolved), I moved on to playing around with the visualization. I have decided to stick to a traditional bar graph because I am focusing on cleaning up the code so that my future data capturing initiatives will be easier to handle. Next week I plan to test visualization approaches that provide an animated, time-based view of the data. Here is a link to my code and the first half of the data.

Enough about the technical difficulties. From the visualization exercise I noticed that I tend to get excited when people come close to me. Talking to people on the phone, even my parents, was much less exciting than having someone nearby. I also got aroused (and stressed) by some NY Times op-eds that I read about the upcoming election and the state of our current economy. Hopefully I will be able to incorporate this type of data into my visualization over the next week as well.


Transforming Data into Information

Monday, September 20th, 2010

Over the past week I have been exploring a wide variety of data sources and potential visualization approaches for my first project in the Expression Frameworks class. After much consideration, there are five ideas on my short list. These ideas differ in content and concept, and I hope to be able to explore more than just one of them. Here they are in random order.

Record Breaking Tornadoes – data visualization example by Tiffany Farrant, used under Creative Commons license

Idea 1: Investigation of the correlation between environmental degradation, agricultural development, and increases in energy consumption. I envision this project as an interactive screen-based visualization that would aim to highlight the toll our agricultural practices and rampant energy consumption have taken on the health of our planet. The data for this piece would be taken primarily from the World Bank website. The following three datasets would make up the foundation for this piece: Environment Data, Agriculture & Rural Development, and Energy & Mining.

A potential extension of this piece would be to project the impact that clean energy technologies and new modes of agricultural production could have on helping us maintain the health of our environment. To create these two additional lenses I would need to identify data sources regarding the potential impact of alternative energy sources such as solar, biofuels, wind, and biomass, as well as energy conservation and efficiency technologies and practices. From an agricultural perspective, I would consider investigating integrated models, urban and vertical farms, as well as natural systems farming.

Idea 2: Exploration of my moods and emotions over a period of several weeks and comparison of my personal experience with that of other New Yorkers. To bring this project to life I would develop a biometric tracking device that would track my excitement level and location throughout the day and would consider wearing a device that tracks my sleep patterns at night. The daytime tracking module would automatically log my heart rate, galvanic skin response, and GPS location. For sleep monitoring I would consider using a device such as the Zeo.

To augment the data captured automatically via these devices I would need to create a system that enables me to log activities and my mood. One of my peers, Shahar, had a great idea for meeting this requirement – creating an iPhone app that solicits input from me at randomized times. This would help limit my personal bias of providing updates only when I am feeling strong emotions.

One related idea that interests me is the possibility of tracking my state of mind during my yoga practice, and then comparing it to my state of mind during other activities that are stressful, mundane, exciting, etc.

Idea 3: Explore the value of Community Supported Agriculture in NYC. To achieve this I would need to compile data from existing sources, augmented by independent research. Just Food is an organization that has a lot of basic data regarding CSAs, including lists of CSAs along with their locations, foods offered, producer relationships, founding dates, and a basic list of services provided.

Additional information would need to be collected from independent research and/or calculated from available information. For example, it would be interesting to understand the pick-up locations used by these communities, the variety of vegetables provided to each community, the number of core group members involved in running these communities, the amount of work-shift hours delivered by members, etc.

Idea 4: On-site physical and sound expressions of energy consumption at NYU buildings. I recently found out that NYU has a data center that can provide students with access to various types of information about the NYU campus, including energy and resource usage. For this project I envision an intervention-type installation in several NYU buildings. The piece would actively interact with people who pass by it and would work best in residential locations – though it could also work in other locations.

Ideally, I would like access to live data so that the pieces could reflect near real-time energy consumption and provide the opportunity for direct feedback between the building inhabitants and the visualization. However, it could also be re-envisioned to work with data that is updated on a daily basis. I also see the possibility of extending this intervention to include game elements that would allow the piece itself to act as a medium for people in different buildings to interact with one another.

Idea 5: Exploration of the world of graffiti based on online content such as videos, pictures and text. The data for this project would be harvested from various online sources such as video and picture sites, as well as blogs and twitter feeds. The idea would be to create a map that links places, artists, and pieces to one another based on numerous potential sources of connections – e.g. places, work, styles, fans, etc.