Hi. I’m Mark, and I’m a FLUNKy. FLUNK stands for the Functional Language User’s Network in Kalamazoo. We meet the second Wednesdays of the month at 5:30 at the Biggs|Gilmore office in downtown Kalamazoo (261 E Kalamazoo Ave).
One of our recent problems involved writing a program that would determine if a given chemical equation was balanced or not. The input was the equation in string format (all of the subscripted numbers were made normal to simplify the parsing), and the program simply had to return True or False for each. If your high school chemistry is failing you, “balanced” means having the same number and type of atoms on each side of a reaction. For example, the equation
2Na + 2H2O = 2NaOH + H2
is balanced, while this equation
C6H12O6 = 3C2H2 + 3O2
is not (there are 6 hydrogen atoms missing from the right side).
My basic solution to the problem was to first break up the input string into its constituent parts, or tokens, and then count the number of atoms on each side of the equation. Atoms on the left side of the equals sign would be added to the total, and atoms on the right would be subtracted. The equation would be declared balanced if all of the atom tallies were 0 at the end.
A token of goodwill
Before I could tokenize the original string, I needed to decide what actually counted as a token – what kinds of characters do I actually care about? I settled on this list:
- A number
- A string (of 1 or more characters)
The plus sign would denote the end of one molecule, and the beginning of another. The equals sign would denote the end of the left side of the equation, and the beginning of the right. These transitions would prove useful later when we’re tallying up atoms.
A “number” could be a coefficient preceding a molecule, or a subscript following an atom. The precise meaning of each number would be determined later, when the placement of the number relative to the atom was taken into account (see the next section for the specifics).
A “string” would be single atom. Some atoms like Hydrogen, Oxygen, and Nitrogen are abbreviated with a single, upper case letter (H, O, and N, respectively). Most of the rest, such as Sodium, Helium, and Silver are abbreviated with 2 letters – the first uppercase and the second lowercase (Na, He, and Ag, respectively). When we first reviewed the solutions to this problem, someone brought up that there is a small, but growing, number of atoms that have three letter abbreviations. Starting with element 110, the elements are abbreviated with 3 letters – the first uppercase, and the following two lowercase: Ununnunium (Uun), Ununbium (Uub), and so on. Once I got my solution to work for 1- and 2-letter elements, extending it to 3-letter elements turned out be, well, elementary.
Here is my Tokenize method:
// This will recursively tokenize our original equation
let rec Tokenize acc = function
// end of list; terminates; tokens are pulled off in reverse order, so we re-reverse the list here
|  -> List.rev acc
// Ignore whitespace
| whiteSpace :: restOfList when Char.IsWhiteSpace(whiteSpace) ->
Tokenize acc restOfList
// Look for triple letter atoms
| firstChar :: secondChar :: thirdChar :: restOfList when Char.IsUpper(firstChar) && Char.IsLower(secondChar) && Char.IsLower(thirdChar) ->
Tokenize (Token.String(firstChar.ToString() + secondChar.ToString() + thirdChar.ToString()) :: acc) restOfList
// Look for double letter atoms
| firstChar :: secondChar :: restOfList when Char.IsUpper(firstChar) && Char.IsLower(secondChar) ->
Tokenize (Token.String(firstChar.ToString() + secondChar.ToString()) :: acc) restOfList
// Look for single letter atoms
| firstChar :: restOfList when Char.IsUpper(firstChar) ->
Tokenize (Token.String(firstChar.ToString()) :: acc) restOfList
// Look for triple digit numbers
| firstDigit :: secondDigit :: thirdDigit :: restOfList when Char.IsDigit(firstDigit) && Char.IsDigit(secondDigit) && Char.IsDigit(thirdDigit) ->
Tokenize (Token.Number(System.Int32.Parse(firstDigit.ToString() + secondDigit.ToString() + thirdDigit.ToString())) :: acc) restOfList
// Look for double digit numbers
| firstDigit :: secondDigit :: restOfList when Char.IsDigit(firstDigit) && Char.IsDigit(secondDigit) ->
Tokenize (Token.Number(System.Int32.Parse(firstDigit.ToString() + secondDigit.ToString())) :: acc) restOfList
// Look for single digit numbers
| firstDigit :: restOfList when Char.IsDigit(firstDigit) ->
Tokenize (Token.Number(System.Int32.Parse(firstDigit.ToString())) :: acc) restOfList
// Look for +
| plusSign :: restOfList when plusSign.Equals('+') ->
Tokenize (Token.PlusSign :: acc) restOfList
// Look for =
| equalsSign :: restOfList when equalsSign.Equals('=') ->
Tokenize (Token.EqualsSign :: acc) restOfList
// All other characters, just terminate
| _ -> 
This is invoked with this line:
let Tokens = Tokenize  (List.ofSeq equation)
Where “equation” is my original string, converted to a sequence of characters. The Tokenize routine performs pattern matching by breaking off the first few characters from the current sequence, and examining them. When it finds a rule that matches, it creates a token out of those characters, adds that token to the accumulator, and recurses with the rest of the string. Each “rule” here is denoted with a pipe ( | ). Let’s walk through them one by one:
- The default case is where the incoming sequence is empty. If that’s the case, it simply returns the accumulator, acc. This case gets hit when we’ve parsed the entire original equation.
- The next rule skips past any whitespace in the equation. In this case, it strips the whitespace out, and recurses with the rest of the string.
- The next three rules look for atoms. First, we look to see if the next three characters are an uppercase letter followed by two lowercase letter. If that test fails, then look for one uppercase letter and one lowercase letter. If that test files, then look for just an uppercase letter. In all three of these, we make a String token out of the letters found, add that token to the accumulator, and recurse with the rest of the original string. The order of these three rules is important. If they were reversed, the “N” of a sodium (Na) atom would be treated as its own token.
- The next three rules look for numbers. My solution only supports 1-, 2-, and 3- digit numbers. One of the other solutions I saw (in Scheme, if I remember correctly) was able to incorporate a regular expression into the pattern matching so that any length number would be found correctly. When I wrote this, I couldn’t figure out how to weave the regex into the pattern matching rules. The individual digits numbers are converted to a string, merged and converted to a Number token, and then added to the list.
- The next rule looks for the plus sign. A PlusSign token is added to the list here.
- The next rule looks for the equals sign. An EqualsSign token is added to the list here.
- The final rule looks for any other character in the string, and terminates if any are found.
So, Tokenize will taken a string like this
2Na + 2H2O = 2NaOH + H2
and break it up into a list of tokens:
2 Na + 2 H 2 O = 2 Na O H + H 2
I am a beautiful and unique snowflake
The next step was to get a unique list of the different atoms in the entire equation. That turned out to be one of the most elegant lines of code in the entire solution:
let UniqueAtoms = Tokens |> List.filter (fun t -> (t.GetType().Name.Equals("String"))) |> Set.ofList |> Set.toList
The |> operator is kind of like a pipe ( | ) in Unix – it takes the output of the previous function, and passes it in as the input of the next one. With that in mind, let’s break this line down:
- Start with Tokens (the list of all tokens created in the previous step).
- Run it through List.filter, and find all of the tokens that have the name of “String”. These are the atoms.
- Convert that filtered list to a Set…
- And then convert it back to a List. Store this in UniqueAtoms.
The real magic here is the last two steps – converting the list to a set and back again automatically de-dups the list. Voila! A list of unique atoms.
Now it’s time to count. We want to count up the number of atoms on each side of the equals sign, taking into account the molecule coefficients and the atom subscripts. I decided that I would maintain a list of key-value pairs, one pair for each atom. The key would be the atom token, and the value would be the total number of that kind of atom in the equation (from both sides). That list is called AtomCounts, and it is populated with this line:
// Next, create a list of key/value pairs, to hold the count for each atom.
// Initialize the counts by calling SumAtoms for each element
let AtomCounts = [ for atom in UniqueAtoms do yield (GetTokenValue atom, SumAtoms atom 0 1 1 Tokens) ]
This is another dense line, so we’ll take it in pieces:
- This line creates a list using the square brackets [ ].
- It populates that list using a “for” statement that walks through every “atom” in the UniqueAtoms list.
- It yields key-value pairs (also known as tuples) where the key is the value of the “atom” token, and the value is the net sum of that kind of atom in the equation.
To get the value of the “atom” token (meaning the string like “Na” or “H” that represents the atom itself), I wrote a small function called GetTokenValue:
let GetTokenValue myToken =
match myToken with
| Number n -> n.ToString()
| String s -> s
| _ -> ""
The SumAtoms function takes 5 parameters:
- The atom token that we’re looking for.
- A “0” for the initial tally for this atom.
- A “1” for the initial sign. While we’re counting atoms on the left of the equals sign, the sign will be positive. When we hit the equals sign, we’ll flip this to “–1” so that we subtract instead of add. (Remember, the goal is to get all of the atoms to tally to 0; that will mean the equation is balanced.)
- A “1” for the initial multiplier. The “multiplier” in this case corresponds to the molecule coefficient. If we have a molecule like H2O, then we have 2 hydrogen atoms and 1 oxygen, and there is an implied multiplier of “1” here. If we have a molecule such as 4H2O, then the “4” is the multiplier for this atom, and as a result we have 8 hydrogen atoms and 4 oxygen.
- The “Tokens” list, which is what we’ll parse through.
The SumAtoms function is as follows:
// This will recursively count up the number of each atom in the token list
// NOTE: The list that this operates on is the last parameter passed in; it is not shown in the parameter list here
let rec SumAtoms myAtom acc sign multiplier = function
// If we're done processing tokens, then just return
|  -> acc
// If we find our atom-with-subscript, increment the accumulator accordingly
| currentToken :: currentSubscript :: restOfList when currentToken = myAtom && currentSubscript.GetType().Name.Equals("Number") ->
SumAtoms myAtom (acc+(multiplier * Convert.ToInt32(GetTokenValue currentSubscript) * sign)) sign multiplier restOfList
// If we find our atom-without-subscript, increment the accumulator accordingly
| currentToken :: restOfList when currentToken = myAtom ->
SumAtoms myAtom (acc+(multiplier * sign)) sign multiplier restOfList
// If we find a different atom-with-subscript, skip it and keep moving
| currentToken :: currentSubscript :: restOfList when currentToken.GetType().Name.Equals("String") && currentSubscript.GetType().Name.Equals("Number") && currentToken <> myAtom ->
SumAtoms myAtom acc sign multiplier restOfList
// If we find a different atom-without-subscript), skip it and keep moving
| currentToken :: restOfList when currentToken.GetType().Name.Equals("String") && currentToken <> myAtom ->
SumAtoms myAtom acc sign multiplier restOfList
// Once we hit the equals sign, change the sign to negative, and reset the multiplier
| currentToken :: restOfList when currentToken = EqualsSign ->
SumAtoms myAtom acc -1 1 restOfList
// If we hit a coefficient, it becomes the new multiplier
| currentToken :: restOfList when currentToken.GetType().Name.Equals("Number") ->
SumAtoms myAtom acc sign (Convert.ToInt32(GetTokenValue currentToken)) restOfList
// Once we hit a plus sign, reset the multiplier
| currentToken :: restOfList when currentToken = PlusSign ->
SumAtoms myAtom acc sign 1 restOfList
// Ignore all other tokens
| _ -> acc
The basic idea is the same as with the Tokenize function – break off pieces of the list, and compare it to the rules defined. When we find a match, process it and recuse with the rest of the list. Let’s look at the rules one by one:
- If the list we’re processing (Tokens, in this case) is empty, then just return the accumulator, “acc”, which has the tally of the net number of this atom in the equation.
- If we find the target atom with a subscript after it, add “multiplier * subscript * sign” to the accumulator. This adds (or subtracts if we’re on the right side of the equation) the total number of this atom in this molecule. Recurse with the rest of the list, leaving the sign and multiplier values alone.
- If we find the target atom without a subscript after it, then add “multiplier * sign” to the accumulator. Recurse with the rest of the list, leaving the sign and multiplier values alone.
- If we find a different atom, skip it, and recurse with the rest of the list, leaving the sign and multiplier values alone.
- If we hit the equals sign, recurse with the rest of the list, but this time change the sign to –1. The multiplier is left alone.
- If we hit a coefficient, recurse with the rest of the list, but this time change the multiplier to the newly-found coefficient. The sign is left alone.
- If we hit a plus sign, then it means that we finished one molecule, and are about to start another. Recurse with the rest of the list, but reset the multiplier to 1, and leave the sign alone.
- If we hit any other token, stop and return the accumulator.
The last step is to see if all of our tallies are 0. If they are, it means that the equation is balanced:
GoodSoFar <- List.forall(fun (key,value) -> value = 0) AtomCounts
This iterates through the AtomCounts list, and makes sure every value is 0. If it is, then GoodSoFar is set to True. This single, Boolean value is returned from the function.
I’ve already mentioned one of the limitations of this solution – it only supports 1-, 2-, and 3-digit numbers.
A second limitation to this solution is it won’t handle “functional groups”. This is a different kind of notation that arranges the atoms in the molecule to show the structure of the molecule better. The poster-child molecule was boric acid: B(OH)3. The “3” here is actually a coefficient that applies to everything in the parentheses. For my solution to handle this correctly, you have to rewrite it as BO3H3. Modifying my solution to handle this would require new rules in Tokenize to recognize the parens correctly, and then new rules in SumAtoms to treat the trailing number properly.
These aren’t the only two. I remember seeing even more complicated notation in high school and elsewhere, so this solution definitely has a way to go being general.
This was the third or fourth F# program that I wrote, but it was the first time that I felt like I started to grasp some of the core concepts of a functional language – recursion and immutables. The latter refers to a variable that gets created and initialized once, and then never changes value after that. It becomes extremely useful when the program is running in parallel – no more synchronization or locking of data variables is needed.
This was also the first time I tried invoking an F# program from a VB unit test. All of the logic above was developed test-driven, and my unit tests consisted of various test equations. This was an important step for me to take because I think it will be much more like that I would write a library in F# and invoke it from a VB/C# application, rather than building a completely solution in F#.
My complete solution, including the unit tests, can be found in the archive “FSharp_BalancingChemicalEquations.zip” here: http://tinyurl.com/3qwvgo9. It was developed with Visual Studio 2008, with F# installed (the latter can be downloaded for free here: http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/release.aspx). The unit tests were written with NUnit 2.5.2.