Automatically substituting missing data

  • Thread starter Michael DiCostanzo
  • Start date
M

Michael DiCostanzo

I'm doing a project for a client where I have inherited someone else's
data which is kept in Excel files. Here's what I want to do.

The data is updated each hour. Sometimes we get no data or bad data
for short periods of time. A sample of the data is shown below.

In this example, there is no data for hours 5-6. Since we have two
hours worth of missing data, we are required to substitute for the
missing data by taking the average of the data two hours before and
two hours after the period when the data is missing. (These are the
rules we have to live by.)

In this case, the numbers would be calculated as follows:

(7.60 + 7.55 + 7.45 + 7.50) / 4 = 7.525

The value of 7.525 is then substituted into the empty cells for hours
5-6 as shown in the Corrected_Data column.


Hour Raw_Data Corrected_Data
1 7.51 7.51
2 7.45 7.45
3 7.60 7.60
4 7.55 7.55
5 7.525
6 7.525
7 7.45 7.45
8 7.50 7.50

If there were 3 hours of missing data, then we would have substituted
for the missing data by taking the average of the data three hours
before and three hours after the period when the data is missing. I
currently do the substitution manually. I really want to append the
raw data into a spreadsheet and then calculate the Corrected_Data
automatically.

Can anyone tell me how to do that in Excel? As I mentioned, I've
inherited this system, so I have to live with it for now. We're
looking to replace it and if the answer is that this can't be done in
Excel, that's OK. I want to transition to a newer system as smoothly
as possible.

Thanks in advance
 
F

Frank Kabel

Hi Michael
try the following approach (using helper columns to make the formulas a
little bit less complex)
- Add a column B as helper column and enter the following array formula
in B2 (this is your second data row as row 1 always should contain
data). Formula ist entered with CTRL+SHFT+ENTER
=IF(A2="",MIN(IF(A3:$A$20<>"",ROW(A3:$A$20),20)),"")
copy down for all rows

- add the helper column C and enter the following formula in C2 (also
array entered)
=IF(A2="",MAX(IF(A$1:A1<>"",ROW($A$1:A1),0)),"")
and copy down

- if you like you can hide columns B+C
- now enter the following formula in D2 (this is the column with your
corrected data)
=IF(A2="",SUM(INDIRECT("A" & C2 & ":A" & C2-(B2-C2-2)),INDIRECT("A" &
B2 & ":A" & B2 + (B2-C2 - 2)))/(B2-C2-1),"")
and copy down




Notes:
this approach won't work in all circumstances. e.g.
- the first row is empty (as there is no data before
- the last row ist empty
- There are not enough rows before or after
 
F

Frank Kabel

Hi Nigel
no need for VBA, can be done with formulas :)
Though there're some missing spots in the OP's specification
 
G

Greg Wilson

This was rather rushed but I think it does what you want.

Note that there is a potential complication not accounted
for, specifically: If a period of missing data exceeds
the gap between the next period of missing data then the
forward averaging will include blank cells thus returning
an inappropriate result. This can be corrected for but it
will make the code substantially more complicated. Your
description didn't take this into account so I assume it's
not very likely. If you need this corrected for you'll
have to specify the rules.

You may have to correct for wordwrap corruption of the
code.

Assumptions:
1) Hourly data is in Column A
2) Raw data is in Column B
3) Corrected data is to be placed in Column C

Sub CorrectData()
Dim BlankRng As Range, AvgRng As Range
Dim Rng As Range, C As Range, CC As Range
Dim Rw As Long, i As Integer
Dim Avg As Single
i = 0
Rw = Cells(Rows.Count, "A").End(xlUp).Row
Set Rng = Range(Cells(1, 2), Cells(Rw, 2))
For Each C In Rng
If C = "" Then
i = i + 1
ElseIf i > 0 Then
C.Offset(, 1) = C
Set AvgRng = Range(Range(C.Offset(-i - 1), _
C.Offset(-2 * i)), Range(C, C.Offset(i - 1)))
Avg = Application.Average(AvgRng)
Set BlankRng = Range(C.Offset(-i), C.Offset(-1))
For Each CC In BlankRng
CC.Offset(, 1) = Avg
Next
i = 0
Else
C.Offset(, 1) = C
End If
Next

End Sub

Be advised I'm just a VBA student.

Regards,
Greg
 
F

Frank Kabel

Hi Nigel
I agree that the OP has to clarify his specification :)
But what do you mean with 'Still end with missing data elements'. At
least for his example this works (though I'm also wondering how such
source data could be created...)

Greetings
Frank
 
F

Frank Kabel

Hi Nigel
ah, now I understand and yes you're corect (e.g. if the first entry is
missing or the second and third)
 
M

Michael DiCostanzo

Thanks all of you for your quick responses. Let me answer some of the
questions you've raised.

First, there is plenty of historical data (2 years worth), so we're
never going to have a situation where there is no data at the start.
For the purposes of this exercise, you can assume that we have plenty
of data at the beginning.

Second, I need to clarify the rules of substitution better. For
periods of missing data, N [where N is the number of hours of missing
data (< 24 hours)] we take the average of the readings N hours before
and N hours after the period of missing data. We can also have more
than one period of missing data in a 24 hour period.

So in the example I gave, we could have the following situation:

Hour Raw_Data Corrected_Data
1 7.51 7.51
2 7.45 7.45
3 7.60 7.60
4 7.55 7.55
5 7.525
6 7.525
7 7.45 7.45
8 7.50 7.50
9 7.25 7.25
10 7.30 7.30
11 7.45 7.45
12 7.30 7.30
13 7.30 7.30
14 7.30 7.30
15 7.30 7.30
16 7.45 7.40
17 7.50 7.50
18 7.475
19 7.475
20 7.50 7.50
21 7.45 7.40
22 7.30 7.30
23 7.30 7.30
24 7.30 7.30

For hours 18-19, the missing data would be calculated as (7.45 + 7.5 +
7.5 + 7.45)/4 = 7.475.

In this situation, the calculation has to be done twice. I usually
substitute the missing data a day or two later. In this way I assure
myself that I have sufficient data after the periods of missing data
to do the calculation.

Third, you're going to ask, what about periods of missing data that
are longer than 24 hours? The rules that we must follow state that in
this case we must use the maximum 1 hour value from the previous 30
days, not counting the current day. That value is substituted for all
periods of missing data greater than 24 hours.

Obviously, we try to avoid that situation. I don't expect that we'll
get periods of missing data longer than 24 hours now because I'm
paying very close attention to the quality of the data now.

What I'd like to do is at least handle periods of missing data less
than 24 hours. I'd like to put the raw data into an Excel file and
then copy the formula down to handle the missing data for me.

Thanks again for your response, the formula you provided does work.


Michael
 
T

Tom Ogilvy

So the question is answered or are you still looking for an answer?

--
Regards,
Tom Ogilvy



Michael DiCostanzo said:
Thanks all of you for your quick responses. Let me answer some of the
questions you've raised.

First, there is plenty of historical data (2 years worth), so we're
never going to have a situation where there is no data at the start.
For the purposes of this exercise, you can assume that we have plenty
of data at the beginning.

Second, I need to clarify the rules of substitution better. For
periods of missing data, N [where N is the number of hours of missing
data (< 24 hours)] we take the average of the readings N hours before
and N hours after the period of missing data. We can also have more
than one period of missing data in a 24 hour period.

So in the example I gave, we could have the following situation:

Hour Raw_Data Corrected_Data
1 7.51 7.51
2 7.45 7.45
3 7.60 7.60
4 7.55 7.55
5 7.525
6 7.525
7 7.45 7.45
8 7.50 7.50
9 7.25 7.25
10 7.30 7.30
11 7.45 7.45
12 7.30 7.30
13 7.30 7.30
14 7.30 7.30
15 7.30 7.30
16 7.45 7.40
17 7.50 7.50
18 7.475
19 7.475
20 7.50 7.50
21 7.45 7.40
22 7.30 7.30
23 7.30 7.30
24 7.30 7.30

For hours 18-19, the missing data would be calculated as (7.45 + 7.5 +
7.5 + 7.45)/4 = 7.475.

In this situation, the calculation has to be done twice. I usually
substitute the missing data a day or two later. In this way I assure
myself that I have sufficient data after the periods of missing data
to do the calculation.

Third, you're going to ask, what about periods of missing data that
are longer than 24 hours? The rules that we must follow state that in
this case we must use the maximum 1 hour value from the previous 30
days, not counting the current day. That value is substituted for all
periods of missing data greater than 24 hours.

Obviously, we try to avoid that situation. I don't expect that we'll
get periods of missing data longer than 24 hours now because I'm
paying very close attention to the quality of the data now.

What I'd like to do is at least handle periods of missing data less
than 24 hours. I'd like to put the raw data into an Excel file and
then copy the formula down to handle the missing data for me.

Thanks again for your response, the formula you provided does work.


Michael



"Frank Kabel" <[email protected]> wrote in message
Hi Michael
try the following approach (using helper columns to make the formulas a
little bit less complex)
- Add a column B as helper column and enter the following array formula
in B2 (this is your second data row as row 1 always should contain
data). Formula ist entered with CTRL+SHFT+ENTER
=IF(A2="",MIN(IF(A3:$A$20<>"",ROW(A3:$A$20),20)),"")
copy down for all rows

- add the helper column C and enter the following formula in C2 (also
array entered)
=IF(A2="",MAX(IF(A$1:A1<>"",ROW($A$1:A1),0)),"")
and copy down

- if you like you can hide columns B+C
- now enter the following formula in D2 (this is the column with your
corrected data)
=IF(A2="",SUM(INDIRECT("A" & C2 & ":A" & C2-(B2-C2-2)),INDIRECT("A" &
B2 & ":A" & B2 + (B2-C2 - 2)))/(B2-C2-1),"")
and copy down




Notes:
this approach won't work in all circumstances. e.g.
- the first row is empty (as there is no data before
- the last row ist empty
- There are not enough rows before or after
 
G

Greg Wilson

What is the rule when N is greater than the nonblank hours
following a blank period. In this case, averaging forward
N values would include blank values:-

Hour Raw_Data Corrected_Data
1 7.51 7.51
2 7.45 7.45
3 7.60 7.60
4 7.55 7.55
5 ?
6 ?
7 ?
8 ?
9 7.25 7.25
10 7.30 7.30
11 7.45 7.45
12 7.375
13 7.30 7.30
14 7.30 7.30
15 7.30 7.30
16 7.45 7.40
17 7.50 7.50
18 7.475
19 7.475
20 7.50 7.50
21 7.45 7.40
22 7.30 7.30
23 7.30 7.30
24 7.30 7.30

The macro approach doesn't require hidden columns if you
prefer. I updated my macro to include the scenario where
there is blank data for more than 24 hours. In this case
it will substitute the maximum value for the 720 hrs (i.e.
24 * 30) preceeding the blank period. Tested under highly
simplified conditions. Paste to a standard code module
and correct for wordwrap if necessary.

Sub CorrectData()
Dim BlankRng As Range, SrcRng As Range
Dim Rng As Range, C As Range, CC As Range
Dim Rw As Long, i As Integer
Dim SubsVal As Single
i = 0
Rw = Cells(Rows.Count, "A").End(xlUp).Row
Set Rng = Range(Cells(1, 2), Cells(Rw, 2))
For Each C In Rng
If C = "" Then
i = i + 1
ElseIf i > 24 Then
C.Offset(, 1) = C
Set SrcRng = Range(C.Offset(-i - 1), _
C.Offset(-i - 720)) '720 = 24 * 30 days
SubsVal = Application.Max(SrcRng)
Set BlankRng = Range(C.Offset(-i), C.Offset(-1))
For Each CC In BlankRng
CC.Offset(, 1) = SubsVal
Next
i = 0
ElseIf i > 0 Then
C.Offset(, 1) = C
Set SrcRng = Range(Range(C.Offset(-i - 1), _
C.Offset(-2 * i)), Range(C, C.Offset(i - 1)))
SubsVal = Application.Average(SrcRng)
Set BlankRng = Range(C.Offset(-i), C.Offset(-1))
For Each CC In BlankRng
CC.Offset(, 1) = SubsVal
Next
i = 0
Else
C.Offset(, 1) = C
End If
Next

Regards,
Greg
 
R

remarkl

Here's another approach. Suppose your data is in a2:a12.

in B2, enter, normally, and not as an array:

=IF($A2<>"",ROW()-ROW($A$2)+1,B1)

in C2, enter, normally, and not as an array:

=IF($A2<>"",ROW()-ROW($A$2)+1,C3)

in D1, enter, normally, and not as an array:

=AVERAGE(INDEX($A$2:$A$18,B2):INDEX($A$2:$A$18,C2))

And copy down.

The theory is that you can calculate each cell by "averaging" the tw
values nearest it, which in the case of a cell that has a value, ar
itself and itself. I tested this on a set of numbers with a gap o
one hour and two hours, and it seemed to work. As with the othe
solutions, however, it won't help if there is no first value or las
value, but that's to be dealt with later. ("ve attempted to attach
file, but I'm new here and have no idea if I'm doing the righ
thing...

Larry Kramer
(Newbie

Attachment filename: avger.xls
Download attachment: http://www.excelforum.com/attachment.php?postid=46716
 
Top