TFDV sometimes erroneously sets min_fraction to 1.0 #58

Closed
cyc opened this issue Apr 11, 2019 · 2 comments
@cyc
Contributor
cyc commented Apr 11, 2019

I have not been able to reproduce this with a small local test case, and the case where I hit the error involves a proprietary dataset that I can't share. The issue is essentially this: I have two TFRecord shards of a dataset, one where a feature is always present and another where the same feature is always missing (that is, the feature name is not even populated in the tf.Example). If I run GenerateStatistics on these shards on Google Cloud Dataflow, the resulting stats file claims that there are 0 missing entries for that feature.

For instance, for the feature in question, the stats file in JSON format shows:

        {
          "numStats": {
            "commonStats": {
              "totNumValues": "181129", 
              "numNonMissing": "181129", 
              "maxNumValues": "1", 
              "numValuesHistogram": {},
              "avgNumValues": 1.0, 
              "minNumValues": "1"
            }, 
            "numZeros": "181071", 
            "histograms": [],
            "stdDev": 0.017891652618962306, 
            "max": 1.0, 
            "mean": 0.00032021377029630815
          }, 
          "name": "partially_missing_feature"
        }, 

Whereas a fully present feature looks like this:

        {
          "numStats": {
            "commonStats": {
              "totNumValues": "382166", 
              "numNonMissing": "382166", 
              "maxNumValues": "1", 
              "numValuesHistogram": {},
              "avgNumValues": 1.0, 
              "minNumValues": "1"
            }, 
            "numZeros": "101146", 
            "histograms": [],
            "median": 1.0, 
            "stdDev": 0.4301544503153421, 
            "max": 1.0, 
            "mean": 0.6821856555482622
          }, 
          "type": "FLOAT", 
          "name": "fully_present_feature"
        }, 

Please let me know if this is expected behavior.
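
To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not TFDV code and says nothing about how TFDV computes or merges statistics internally; the shard contents and the `count_presence` helper are hypothetical stand-ins. It only shows the counts one would expect when a feature is present in every example of one shard and absent from every example of the other.

```python
# Hypothetical stand-ins for two shards. Each example is modeled as a
# dict mapping feature name -> list of values (loosely like a tf.Example).
shard_a = [
    {"partially_missing_feature": [0.0]},
    {"partially_missing_feature": [1.0]},
]
shard_b = [{}, {}, {}]  # the feature name is absent from every example


def count_presence(shards, feature):
    """Return (num_examples, num_non_missing, num_missing) for `feature`."""
    num_examples = 0
    num_non_missing = 0
    for shard in shards:
        for example in shard:
            num_examples += 1
            if feature in example:
                num_non_missing += 1
    return num_examples, num_non_missing, num_examples - num_non_missing


total, present, missing = count_presence(
    [shard_a, shard_b], "partially_missing_feature"
)
print(total, present, missing)
```

In this toy setup the three examples from the second shard count as missing, so a stats file reporting 0 missing entries would not match the data.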

@paulgc
Member
paulgc commented Apr 11, 2019

@cyc This appears to be a bug. The "numMissing" of the "partially_missing_feature" should be 201037.
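
The 201037 figure can be checked from the two stats entries quoted above. This is a small sketch that assumes `fully_present_feature` appears in every example, so its `numNonMissing` equals the total example count. The variable names are illustrative, not TFDV fields.

```python
# Counts copied from the two stats entries quoted in this issue.
# Assumption: fully_present_feature occurs in every example, so its
# numNonMissing is taken as the total number of examples.
num_examples = 382166      # numNonMissing of fully_present_feature
num_non_missing = 181129   # numNonMissing of partially_missing_feature

expected_num_missing = num_examples - num_non_missing
presence_fraction = num_non_missing / num_examples

print(expected_num_missing)  # 201037
print(presence_fraction)
```

Under that assumption the feature is present in a bit under half of the examples, which is at odds with the min_fraction of 1.0 mentioned in the issue title.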

@paulgc
Member
paulgc commented Apr 18, 2019

@cyc This bug should be fixed with 9d98c7a

Let us know if you observe any issues.
