HDFStore appending for mixed datatypes, including NumPy arrays #3032
mixed means mixed 'pandas' data types (e.g. string/int/float/datetimes). You basically have two ways to go here:

1) Think of this as OBT (one big table)
pros:
cons:

2) A main table and sub-tables
pros:
cons:

I don't mind making the change to support pure objects (in 1), but given your data description I think you'd be much better served by 2) (and possibly some wrapper code) |
I misspoke a bit. There is an object type, but you cannot store it in a table (you can store it in a fixed store, basically a non-searchable node). As for the cons of 2): you can simply use sub-nodes if you want, so the data stays 'together'; pseudo code here:
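The pseudo code itself was lost when the thread was scraped; here is a minimal sketch of the sub-node layout being described, assuming a modern pandas with PyTables installed (node names like `mouse_1` are illustrative, not from the original):

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'mouse_data.h5')

# Scalar, queryable measurements go in an appendable table node;
# the bulky image block goes in a fixed (non-searchable) sibling node.
scalars = pd.DataFrame({'velocity': [0.1, 0.7, 0.3],
                        'temperature': [21.0, 21.2, 21.1]})
images = pd.DataFrame(np.random.rand(3, 16))  # 3 frames, flattened 4x4 pixels

with pd.HDFStore(path, mode='w') as store:
    store.append('mouse_1/scalars', scalars, data_columns=True)  # searchable
    store.put('mouse_1/images', images)                          # fixed store

with pd.HDFStore(path) as store:
    fast = store.select('mouse_1/scalars', where='velocity > 0.5')
    imgs = store['mouse_1/images'].loc[fast.index]  # fetch matching frames
```

The query runs only against the small scalar table; the image node is touched only for the rows that matched.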
|
That's probably the right way to do things for me. I always access the data in an "all-at-once" pattern, where I grab all the images for a mouse, compute some statistic (which may be time-dependent, which is why I need them all), and then save that statistic out. I then will only query on the statistic, and not the image. I'll go ahead and implement this. Thank you very much for your expertise and input. On Mar 12, 2013, at 6:17 PM, jreback [email protected] wrote:
|
great.. and the bottom could also be (to make your writing faster), assuming that when you read an image/spine you read the entire one, e.g. you don't need to search within an image itself.
also, this is obviously a parallelizable problem (keep in mind that you CANNOT write in parallel, but you CAN read) |
And one last question, what's the most expedient way to run the line
if image is an ndarray? The call is currently telling me
|
just wrap it with a DataFrame (it's a 2-d ndarray); for 1-d use a Series
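For example (shapes are illustrative):

```python
import numpy as np
import pandas as pd

image = np.arange(12).reshape(3, 4)   # a 2-d ndarray: 3 rows x 4 pixels
df = pd.DataFrame(image)              # 2-d ndarray -> DataFrame
row = pd.Series(image[0])             # 1-d ndarray -> Series

# either can now be stored, e.g. store.put('images/frame_0', df)
```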
|
And you can even do
and select out just the ones you'd like. Perfect. Thank you much. |
great...you are welcome |
FYI... to select out the 'rows' of your image (and this will be very fast, done totally in-kernel in HDF5):
and you always have an index, so the following is the same as above
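The selection snippets were lost in extraction; a sketch of what the comment describes, assuming PyTables is installed (the node name `image` is illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
image = pd.DataFrame(np.arange(20).reshape(5, 4))

with pd.HDFStore(path, mode='w') as store:
    store.append('image', image)
    # the where clause is evaluated in-kernel by HDF5/PyTables,
    # so only the matching rows are read from disk
    top = store.select('image', where='index < 2')
    # every table has an index, so this selects the same rows
    same = store.select('image', where='index >= 0 & index < 2')
```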
|
Re your comment:
This seems like the most viable option to me at this point, after wrangling a bunch of strange data.
If you could point me to the location required for this support, it would be MUCH appreciated. |
unfortunately I don't believe tables will support an object type. What is the data organization issue? |
I'm a research scientist, and I'm trying to record the essential variables that I measure in a set of pilot experiments. I'm recording movies of an animal and taking scalar measurements about the environment the animal is in, like temperature, humidity, etc. The data is heterogeneous: for each frame of the video during the experiment, we also have a combination of scalars, arrays and strings that describe what occurred during that frame. Because these are pilot experiments, we don't necessarily know yet what is essential to measure. We need to be able to query over all of the scalar columns to do analysis, and we need to be able to append data.

The main problem as I see it, with respect to the current capabilities of HDFStore, is heterogeneous data. I need to be able to append and query the scalar data, but each row is necessarily associated with image data. If you have a good way to organize pointers with some wrapper code, so that my colleagues and I don't have to think about this kind of data scattering, that's fine. The key point here is reducing our mental workload, so that we can think more about the research question and less about the data.

One thing to note about our heterogeneous data: if we store a column "images" that contains a 2D array, you can be guaranteed that every row with an image column will have the same-size 2D array. Does this uniformity help at all? I do wish that DataFrames would take advantage of this, so if I did
it would return a 3D numpy array, as opposed to an array of objects. Do you think that's a possibility? |
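The snippet the commenter refers to was lost, but the wished-for behavior can be achieved manually today: when every row holds a same-shape 2-D array, `np.stack` turns the object column into a true 3-D array (the `images` column name is taken from the discussion; values are invented):

```python
import numpy as np
import pandas as pd

frames = [np.full((2, 2), i) for i in range(3)]    # three 2x2 "images"
df = pd.DataFrame({'velocity': [0.1, 0.7, 0.3]})
df['images'] = frames                              # stored as an object column

# collapse the object column into a (n_rows, height, width) ndarray
stacked = np.stack(df['images'].to_numpy())
```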
It sounds like you have two issues, one of which is HDFStore support for storing object data. Let me try to help with an example of "rolling-your-own" via monkey-patching (the original snippet used Python 2's `urllib` and the long-removed `irow`; modern equivalents are shown here):

```python
import pandas as pd
from urllib.request import urlopen

def f(self):
    # do sql/http/whatever fetch here, based on row data
    img_size = int(100 + 200 * self['velocity'])
    image = urlopen("http://placekitten.com/%s" % img_size).read()
    # can just return the raw bytes
    # return image
    # or even have it display inline in IPython frontends
    from IPython.display import Image
    return Image(image, format='jpeg')

pd.Series.silvester = f

# each row you pick is a Series, which now has a `silvester()` method
# (`irow(0)` in the original is `.iloc[0]` in modern pandas)
df[df['velocity'] < 0.5].iloc[0].silvester()
``` |
If there were a |
That's brilliant. I had absolutely not thought of placing the method for retrieving the pointer in the array. Do you have a recommendation of how to pack this up so the query occurs out-of-core? I have dozens of gigabytes of metadata alone. |
Not sure what's "in core" here. Or just use the filesystem. |
Try this out; this implements the solution that I pointed to above. It's pretty straightforward and should get you started. Let me know.
Here's the script
|
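The script itself wasn't captured in this extraction; a rough reconstruction of the idea it implements (class and node names are my own invention), assuming pandas with PyTables installed:

```python
import os
import tempfile

import numpy as np
import pandas as pd

class AnimalStore:
    """Queryable scalars in appendable tables, images in fixed sub-nodes."""

    def __init__(self, path):
        self.path = path

    def store(self, animal, scalars, images):
        with pd.HDFStore(self.path) as s:
            s.append('scalars/%s' % animal, scalars, data_columns=True)
            s.put('images/%s' % animal, pd.DataFrame(images))

    def query(self, animal, where):
        with pd.HDFStore(self.path) as s:
            rows = s.select('scalars/%s' % animal, where=where)
            imgs = s['images/%s' % animal].loc[rows.index]
        return rows, imgs

path = os.path.join(tempfile.mkdtemp(), 'animals.h5')
db = AnimalStore(path)
db.store('mouse_1',
         pd.DataFrame({'velocity': [0.2, 0.9, 0.4]}),
         np.random.rand(3, 8))           # 3 frames, flattened pixels
slow, slow_imgs = db.query('mouse_1', 'velocity < 0.5')
```

Only the scalar table is scanned by the query; the image node is read just for the matching row indices.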
Yep, you're a beast. |
To answer your last question, yes, if the images are all the same shape then this is the way to store them. Creating a panel (in the store method of Animals)
Retrieving the images (in the query method)
Showing the images Panel
The selected panel (which corresponds to the mice we selected in this example)
|
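The Panel snippets above didn't survive extraction, and `Panel` itself was removed in pandas 1.0; the same-shape guarantee can be exploited today with a plain 3-D NumPy array instead (a sketch, not the original code):

```python
import numpy as np
import pandas as pd

scalars = pd.DataFrame({'velocity': [0.2, 0.9, 0.4]})
images = np.random.rand(3, 4, 4)     # (frame, height, width), all same shape

mask = scalars['velocity'] < 0.5     # query the scalar table...
selected = images[mask.to_numpy()]   # ...then slice the frames; still 3-D
```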
@alexbw close this? |
A pandas array I have contains some image data, recorded from a camera during a behavioral experiment. A simplified version looks like this:
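The original example frame was lost when the issue was scraped; a simplified frame of this kind might look like the following (column names `velocity`, `image`, and `spine` come from the text; shapes and values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'velocity': [0.12, 0.85, 0.30]})
df['image'] = [np.zeros((4, 4)) for _ in range(3)]  # per-frame 2-d image
df['spine'] = [np.zeros(16) for _ in range(3)]      # per-frame 1-d array

slow = df[df['velocity'] < 0.5]   # querying the scalar column is easy in memory
```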
I understand I can't query over the `image` or `spine` entries. Of course, I can easily query for low-velocity frames. However, there is a lot of this data (several hundred gigabytes), so I'd like to keep it in an HDF5 file and pull up frames only as needed from disk.
In v0.10, I understand that "mixed-type" frames now can be appended into the HDFStore. However, I get an error when trying to append this dataframe into the HDFStore.
I'm working with a relatively new release of pandas:
It would be immensely convenient to have a single repository for all of this data, instead of fragmenting just the queryable parts off to separate nodes.
Is this possible currently with some work-around (maybe with record arrays), and will this be supported officially in the future?
As a side note, this kind of heterogeneous data ("ragged" arrays) is incredibly widespread in neurobiology and the biological sciences in general. Any extra support along these lines would be incredibly well-received.